In Search of a Conversational User Interface

for Personal Health Assistance

Mathias Wolfgang Jesse

, Claudia Steinberger

and Peter Schartner

Digital Age Research Center, University of Klagenfurt, Universitätsstraße 65-67, Klagenfurt, Austria

Department of Applied Informatics, University of Klagenfurt, Universitätsstraße 65-67, Klagenfurt, Austria

Keywords: Health Technologies, Innovative Interfaces, Conversational User Interface, Voice Assistant, Ehealth, AAL.

Abstract: Conversational user interfaces (CUI) cause a paradigm shift in the interaction between user and machine. The

machine is operated via structured dialogues or partly or entirely via human language. Voice assistants that

understand and process spoken natural language are increasingly being used. The currently available conver-

sational technologies (CTs) for voice assistants range from well-known commercial technologies to quite

well-known open source platforms. The suitability of a CT for a particular application domain depends heavily

on its specific requirements. In this paper, we focus on the selection of CTs for the development of CUIs for

the elderly to assist them in their health management. We (1) propose criteria for CT selection in the domain

of personal health management for the elderly, (2) analyze commercial and open source representatives ac-

cording to these criteria and (3) we evaluate the most suitable candidates for CUI development.

1 INTRODUCTION

Language is one of the most powerful forms of com-

munication between people. Technological develop-

ments in recent years, especially in the field of artifi-

cial intelligence (AI), natural language processing

(NLP) and pervasive computing, have opened new

possibilities to realize human-to-computer interfaces.

New functions and tools have emerged that are essen-

tial for computer-aided processing of spoken natural

language. Conversational user interfaces (CUIs)

caused a paradigm shift in the human-to-computer in-

teraction (Hoy, 2018; Këpuska & Bohouta, 2018; Sid-

dike et al., 2018).

Particularly intensive work was done on technol-

ogies to understand and process spoken language

(Kinsella & Mutchle, 2019; Olson & Kemery, 2019;

Siddike et al., 2018; SUMO Heavy Industries, 2019).

Currently available conversational technologies

(CTs) for voice assistant development and deploy-

ment span from familiar commercial representatives

like Apple Siri, Amazon’s Alexa, Google Assistant or

Microsoft Cortana to quite well-known open source

technologies like Rhasspy or Mycroft. These CTs

https://orcid.org/0000-0001-6018-2780

https://orcid.org/0000-0002-5111-2286

https://orcid.org/0000-0002-5964-8480

have developed rapidly in recent years and they offer

different development possibilities and features.

Nowadays, one can use these technologies to extend

existing or build own voice assistants and offer these

functionalities on various platforms.

Voice assistants can be helpful to the user in dif-

ferent ways. The users who benefit most are those

who are restricted to use a graphical user interface

(GUI). This group includes disabled persons or per-

sons in special working or living situations, users with

low digital skills or users simply preferring to use

their voice to control applications. Furthermore, the

use of voice enables novice users to directly interact

with a device without having to adapt to a GUI. More

experienced users can utilize a CUI for a faster and

more direct interaction (Huang et al., 2001).

In addition, in comparison to a GUI the interaction

with a CUI can be made more dynamic and flexible,

adapting better to the cognitive abilities of the user

(Siddike et al., 2018; Wolters et al., 2015). However,

not every technology is equally suitable for every

problem. The appropriateness of a CT depends on the

requirements of an application domain. The scope of

724

Jesse, M., Steinberger, C. and Schartner, P.

In Search of a Conversational User Interface for Personal Health Assistance.

DOI: 10.5220/0010384907240732

In Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2021) - Volume 5: HEALTHINF, pages 724-732

ISBN: 978-989-758-490-9

this paper lies within the scope of active assisted liv-

ing (AAL) support for the elderly. The focus is on an-

alyzing available CTs to develop CUIs for personal

health assistance for the elderly. The paper investi-

gates the applicability of CTs in this domain with spe-

cial requirements e.g. on data protection and security,

multimodality and special needs of the target group.

The contribution of this paper is to (a) propose cri-

teria for the evaluation of CTs for the eHealth appli-

cation domain, to (b) analyze representatives of com-

mercial CTs, open source CTs and open source text-

to-speech and voice recognition components availa-

ble on the market to these criteria and (c) to evaluate

their suitability to develop and deploy multimodal so-

lutions for the personal eHealth management for the

elderly.

The paper is structured as follows: Chapter 2 in-

troduces the research methodology applied in this

work. Chapter 3 provides an overview of the state of

art in CTs. Chapter 4 outlines our evaluation criteria

and explains how they were determined. Chapter 5

selects technologies for further evaluation. Chapter 6

evaluates selected CTs against the defined evaluation

criteria and proposes the most appropriate for the

scope of personal health assistance for elderly people.

Chapter 7 summarizes the results and gives an out-

look on further work.

2 RESEARCH METHODS

To obtain an overview of the state of the art in CT, a

literature review was conducted. We used the key-

words “language assistant”, “voice assistant”,”voice

interface”, “cognitive assistants”, “intelligent per-

sonal assistants”, and “conversational user interface”

combined with “comparison” and “analysis” to

search for papers in Google Semantic Scholar,

Springer Link, IEEE Xplore and ScienceDirect. In ad-

dition, we searched relevant conference series and

journals for articles that fall within the scope de-

scribed above. Furthermore, we used references of the

retrieved articles to find further works on the topic.

The research also had to be extended to the official

websites and tutorials of relevant CTs. This resulted

in a set of around 50 sources.

The next step was to pre-select, group, install and

test the identified CTs with basic use cases. The goal

was to get a basic understanding how to use these

CTs.

As domain experts, we performed a qualitative

analysis and selected relevant categories and criteria

for our domain. The CT candidates were then evalu-

ated qualitatively.

3 STATE OF THE ART

Rapidly evolving technologies make CTs a constantly

changing field. Recent industry studies conducted by

Sumo Heavy Industries (2019), Olson & Kemery

(2019), and Siddike et al. (2018) provide an insight of

commercial CT representatives. Voice commerce is

becoming increasingly important for companies. The

market leaders in this field are Amazon’s Alexa and

Google Assistant with the highest amount of market

share (Sumo Heavy Industries, 2019). The most fre-

quently mentioned commercial representatives are

Amazon’s Alexa, Google Assistant, Apple’s Siri, Mi-

crosoft’s Cortana and Samsung’s Bixby. In addition

to these commercial providers, there are open source

providers on the market like Mycroft, Snips and

Rhasspy. In contrast to commercial providers, these

open source solutions can be analyzed more thor-

oughly (Kinsella & Mutchle, 2019).

In addition, there are open-source text-to-speech

and speech recognition libraries such as CMUSphinx,

MaryTTS, and Kaldi that can be used by developers

to create CTs from scratch (Gaida et al., 2014). A dis-

advantage of these libraries is that they do not offer

any development environment nor predefined func-

tionalities. Therefore, although a comparison with in-

tegrated CTs is incomplete with respect to some cri-

teria, we include them in our analysis.

Only few research papers are focusing the devel-

oper’s point of view on a CT. Studies conducted by

SUMO Heavy Industries (2019), Siddike et al. (2018)

or Olson & Kemery (2019) pay special attention to

the end-user's perspective. But from the developer’s

point of view, many factors of a CT can influence the

selection, implementation and deployment process.

In the context of our literary research we could not

find any work which focuses on the application do-

main and the end-user needs and follows a criteria-

based approach to select an appropriate CT. Besides

security and privacy also aspects like the required end

user language, trust in the provider or costs are rele-

vant. This paper contributes to fill this gap and shows

criteria which are applicable and evaluable for the ap-

plication domain of AAL support for the elderly.

4 ANALYSIS CRITERIA

In order to find suitable criteria, the criteria from

other studies (e.g. Bedford-Strohm (2017), Candid

(2017), Kinsella & Mutchle (2019), Olson & Kemery

(2019), Siddike et al. (2018) and SUMO Heavy In-

dustries (2019)), and the criteria for health applica-

tions (Fraunhofer-Gesellschaft, 2020) were gathered

In Search of a Conversational User Interface for Personal Health Assistance

725

and combined to remove duplicates. These criteria

were then clustered into three main categories: usa-

bility for the users, alignment with the needs of the

target group and implemented or available security

and privacy mechanisms. After briefly introducing

the category, each criterion is described in the follow-

ing in more detail. The resulting list of criteria is not

exhaustive, but serves as a tool for selecting CTs for

the problem domain.

4.1 Usability

Criteria that focus on the usability of the complete ap-

plication life cycle are necessary to measure how con-

venient the interactions between the user and the ap-

plication are. Without a high usability, any interaction

with it will turn out to be tedious and not optimal for

any user. This category focuses on the convenience

and comfort that is given by the different voice assis-

tants. As such all the criteria, except for “Intelli-

gence” (U6), were deduced with the help of existing

literature (Capgemini, 2018; Claessen et al., 2017;

Kinsella & Mutchle, 2019; SUMO Heavy Industries,

2019; Wolters, 2015). These, to some extent, already

implement such categories, but never consider all of

them at the same time.

U1 – Ease of Installation: This criterion checks

the effort of commissioning a voice assistant by ana-

lyzing both the installation as well as then configura-

tion of the device. The easier the process is designed,

the better the voice assistant is rated compared to the

others.

U2 – Extensibility: After the initial configuration

and installation have been finished, users may want to

extend the existing functionality (e.g. stores) in an in-

tuitive and easy way. When comparing the CTs, the

intended ways of adding functionality are analyzed

and ranked according to their difficulty.

U3 – Comfort of Dialogues: The biggest benefit

of a conversational user interface is that they enable

interaction through voice, whereas “classical” CT is

based on text, which has to be typed. For the voice

interaction to be as natural and intuitive as a normal

human-to-human conversation, certain characteristics

have to be achieved.

U4 – Dynamic Interaction: Without flexibility in

the process of developing an application, voice assis-

tants cannot be as dynamic as they need to be. An ex-

ample for such a restriction is that only the user – and

not the voice assistant – is able to start a conversation.

The more difficult it is to achieve the desired out-

comes, the less likely a voice assistant will turn out to

be the most usable from the user’s point of view.

U5 – Accuracy of Queries: When users interact

with the conversational user interface, accuracy is

necessary to enable the best experience. Especially, in

a field like eHealth, if spoken information is misinter-

preted, this can cause severe problems that eventually

threaten life.

U6 – Intelligence: Intelligence in this context is

understood as the ability of a conversational user in-

terface is to react correctly to a wide range of situa-

tions. In this case, intelligence was measured in a

way, where assistants were asked a set of questions.

The results are evaluated with respect to the correct

understanding of the sentence and if a suitable answer

was provided.

U7 – Multimodality: With this criterion the capa-

bility of a voice assistant to handle more than one way

of interaction and representation is evaluated. Graph-

ical user interfaces (GUIs) will not be completely re-

placed by CUIs in the foreseeable future. On the con-

trary, user tests (Këpuska & Bohouta, 2018) made it

obvious that many things can be controlled much

more easily with the help of a GUI than only with

speech input and output. Multiple forms of interaction

(e.g. by means of a microphone, a camera or a dis-

play) can significantly increase the usability of a

voice assistant, specifically in the case of entering

larger amounts of data or by providing additional vis-

ual aid (Këpuska & Bohouta, 2018) or for applying

different authentication methods. Therefore, the score

increases with the number of ways of interaction.

4.2 Target Group: The Elderlies

Criteria for this category were derived from a persona

that was specific to the mentioned target group. The

underlying observations are described here shortly.

Based on the demographic changes, the amount of el-

derly people in the population increases constantly.

Additionally, the amount of age-related diseases rises

and therefore also the number of caretakers which are

needed. With many positions in this field being unoc-

cupied or underpaid, this could create a situation in

which a shortage of caretakers becomes inevitable.

Patients want to stay autonomous and want to stay in

their own homes as long as possible. Providing assis-

tive technology can help to fulfill this desire and even

more can enable the patients to take actively part in

the care-taking process. With the help of unobtru-

sively integrated voice assistants, it is possible to give

the elderlies more control and a way to naturally par-

ticipate in eHealth related fields. With this infor-

mation it was possible to select the following criteria:

T1 – Costs: The aspect of costs must be covered.

The target group is defined in a way that they strife

HEALTHINF 2021 - 14th International Conference on Health Informatics

726

for a reasonable price for the assistant. To ensure, that

all costs are compared on the same basis, they will be

analyzed over a timespan of a year, as reoccurring

costs need to be considered as well.

T2 – Language Support: Speaking and under-

standing the language most commonly spoken in the

target group is very important concerning the ac-

ceptance of voice assistant technology. In the best

case, this allows the target group an unproblematic in-

teraction with no misunderstandings because of lan-

guage barriers.

T3 – The Companies’ Trustworthiness: The

trust, users have in the company deploying the CUI

has found a lot of attention (Kinsella & Mutchle,

2019). Without trust, the users are reluctant to pur-

chasing or even using a specific voice assistant. As it

is difficult to measure how trustful a company is in

the eyes of a user of the target group, this criterion

focuses on the active contributions of the companies

to reduce trust issues.

T4 – Smart Home Integration: Although, CUIs

come with a wide array of functionality, the imple-

mentation itself can be limiting. As such it is of inter-

est, if the voice assistant technology is capable of be-

ing integrated in already existing smart homes, or if it

is possible to build a smart home with the assistant as

a base. This benefits the target group, as other smart

devices can be integrated as well.

T5 – Number of Applications: The number of

applications reflects the range of different uses the

voice assistant can have. The term “applications” in

this context means software that can be installed be-

forehand or after setting up the voice assistant. For

analysis, the number of predefined applications in the

corresponding app-market is counted. Besides the

number of applications, the variety of these is the sec-

ond factor positively influencing the score.

T6 – Support of Other Devices: This criterion

covers all the possible ways for deploying the CUI.

Every form of end-device is considered, such as

smartwatches, smart TVs, smartphones and proprie-

tary devices. The greater the diversity, the better the

assistant is rated in terms of this criterion.

4.3 Security and Privacy

As applications in the area of eHealth and AAL pro-

cess personal, and especially medical data (i.e. critical

data with respect to GDPR, (European Parliament,

2016, Article 9), special attention must be paid to the

security and the privacy of this data. Besides laws and

regulations, security and privacy of medical infor-

mation have a high priority, as users do not want their

data to be leaked, misused or even sold (Kinsella &

Mutchle, 2019). In order to address these concerns,

the following criteria focus on some of the main is-

sues gathered throughout the literature research (e.g.,

Alepis & Patsakis (2017), Candid (2017), Gong &

Poellabauer (2018), Kang et al. (2019), Kinsella &

Mutchle (2019), Olson & Kemery (2019) and Tiro-

panis et al. (2015)). Again, these criteria are to some

extent already observed by these papers and are reoc-

curring across them. Nonetheless, never are all cate-

gories considered at the same time.

S1 – Domain Specific Infrastructure: To ensure

the highest control over the processed data, the use of

personal infrastructure is recommended. This adds a

layer of security, as permissions, authentication, and

hardware can be controlled by the user itself, or a

trusted admin. The necessary interfaces must be pro-

vided by the voice assistants, as well as no restrictions

by the system must be mitigating the overall control

of the infrastructure. Specifically, the location of the

database is checked and evaluated.

S2 - Location of Voice Commands: When a user

issues a command, this command may be stored in the

process. Possible locations for the stored data may be

locally on the device, on a server or in a (vendor-spe-

cific) cloud. As it is personal/medical data, the pre-

ferred outcome would be that information is not

stored permanently. Ideally, the temporary data is de-

leted, after the request has been processed.

S3 - Location of Intelligence: Obviously, natural

language queries issued by the user must be digitized

to be processed. To achieve this, a voice assistant can

either directly process the query on the device or

transmit the voice data to an external infrastructure,

where an additional way for attacks is created.

S4 – Security of Data Transmission: Ideally, no

data is sent to remote services, and all data is pro-

cessed in the local infrastructure or the CUI itself. But

if data is sent over the (untrusted) internet to remote

devices (e.g. cloud-based infrastructure), it must be

protected in terms of confidentiality, integrity and au-

thenticity. Otherwise, the data can be exfiltrated, ma-

nipulated or forged by attackers.

S5 – Access to Stored Information: This crite-

rion focuses on the data intruders would have access

to, if they are able to physically seize the assistant.

Forms of extracting data would be to use the inter-

faces that allow direct access to stored data or by

speaking to the assistant and receiving information

that way.

S6 – Authentication: As not every person should

have complete access to all a voice assistant’s func-

tionality, this criterion checks the ways of authentica-

tion, a voice assistant offers. Typically, passwords or

In Search of a Conversational User Interface for Personal Health Assistance

727

PINs are used, but it is also possible for a voice assis-

tant to distinguish between different voices.

S7 – Compliance with GDPR: With the GDPR

taking effect in 2018, it is necessary to consider if the

voice assistants align with the current law.

Although not every aspect of a technology can be

covered with these criteria, a basis for domain-spe-

cific evaluation has been established with these crite-

ria. The extension by more criteria could increase the

benefits even further. In addition, it may be necessary

to weight different criteria for specific use cases,

which is explained in more detail in Chapter 6.

5 CT IN THE EHEALTH DOMAIN

Elderlies typically have moderate digital literacy, are

often not very financially resilient, sensitive to their

personal data and do not want to constantly change

the used technology.

Before analyzing the CTs mentioned in chapter 3

to the criteria stated in chapter 4, we checked them

against the hard criteria “costs” and “sustainability on

the market”. As not every technology fits into the

scope of this paper, this selection was necessary. In

addition, not enough information could be obtained

on all the technologies mentioned, so that a repre-

sentative set of CTs had to be found for an analysis:

Amazon’s Alexa and Google Assistant prove to

have the highest usage rate. They therefore had to be

considered as they are sustainably represented in the

market, affordable and quite well documented.

Apple’s Siri and Microsoft Cortana make up a

high volume of devices compared to other providers.

But Cortana is not further analyzed in depth as Mi-

crosoft announced in early 2019, that they will not

compete with other voice assistant providers any

longer (Novet, 2019) and have no intention to develop

their voice assistant in the future. Siri on the other

hand has a different drawback. Apple is still compet-

ing with other CT providers but restricts the devices

Siri and her extensible functionality can be developed

and deployed on (Apple, 2020b). With a modular and

open system in mind, the users cannot be constrained

to use a singular form of device type. Apple devices

also tend to be relatively expensive, whereas Google

and Alexa both offer cheaper solutions (Amazon,

2020a; Apple, 2020a; Apple, 2020b; Google, 2020b).

Samsung’s Bixby only runs on Samsung devices

and occupies only a very small market share. This

limits the applicable end devices too much and there-

fore Bixby is not further investigated in this study.

Can open source CTs compete with commercial

providers? To answer this question Mycroft and

Rhasspy were added to the study, though they occupy

only a small market share. Mycroft was chosen, as it

is one of the only open-source solutions, which comes

with an already build-in hardware. The special feature

of Mycroft is that all the natural language components

are modular and allow the developer to change them

independently (Mycroft, 2020b).

Rhasspy on the other hand offers no prebuilt

hardware and only comes in the form of a set of ser-

vices which can be freely changed. The advantage of

Rhasspy is that it is open source software that allows

the developer to create an assistant that runs entirely

locally on an end device. This fact makes it interest-

ing for use in eHealth and thus included in the com-

parison. Note that Snips was previously analyzed in

the work of Jesse (2019) but is not freely available

anymore and not considered anymore.

In addition, to these complete technology pack-

ages for the development of CUIs, also the speech-

recognition toolkit CMUSphinx and the text-to-

speech library MaryTTS have to be mentioned. Both

offer a lot of freedom, but they are not integrated de-

velopment technologies and must be configured and

built into a solution by a developer. MaryTTS itself is

not recommended to be run on low-end computers

(e.g. Raspberry Pi) and therefore must be installed

separately on a server (CMUSphinx, 2020a; DFKI,

2020).

It has turned out that Kaldi, CMUSphinx and

MaryTTS are used as a component of Rhasspy. Hence

Kaldi was not pursued further, and the focus was laid

on CMUSphinx and MaryTTS as to minimize redun-

dancies in the study (Coucke et al., 2018).

In order to be able to compare CMUSphinx and

MaryTTS in combination with the complete technol-

ogy solutions, a demo assistant was developed to sim-

ulate how the two libraries would perform. The spec-

ification of the demo voice assistant was taken from

Jesse (2019) and used to evaluate the criteria.

6 EVALUATION

This chapter describes the evaluation results of the se-

lected voice assistants based on the set of criteria pro-

posed in Chapter 4. Important results are described in

more detail in order to provide the necessary infor-

mation why a certain result was achieved. More gen-

eral findings are only discussed briefly but depicted

in the Tables 1, 2 and 3. The evaluation is based on

the work of Jesse (2019) and put into a new perspec-

tive. Note, that for some criteria an implementation of

CMUSphinx and MaryTTS was used (combination).

This is because without the fictional application no

HEALTHINF 2021 - 14th International Conference on Health Informatics

728

evaluation of CMUSphinx and MaryTTS could have

been conducted. Overall, the same basis is used, and

a smart speaker is compared where necessary.

A scoring system was applied, which ranked the

different voice assistants on a scale from 1 (optimal

solution) to 5 (being the worst out of the observed

technologies). Criteria, where no ranking was needed,

were evaluated on the criterion being meet (✓) or not

(⨯). If no suitable information was obtainable or

could be found a dash (-) was used.

6.1 Results on Usability

Table 1: Results on usability.

Usability U1 U2 U3 U4 U5 U6 U7

Alexa ✓ ✓ 1 2 2 2 3

Google Assistant ✓ ✓ 1 2 2 1 2

Mycroft ✓ ✓ 3 2 - - 1

Rhasspy ⨯ ⨯ 2 1 1 - 1

Combination - ⨯ - 1 - - 1

U1 – Ease of Installation: Alexa, Google Assistant,

and Mycroft offer a more user-oriented approach in

comparison to the other assistants. Rhasspy has a

quite complicated installation process, which has to

be performed by a technically skilled person. Con-

cerning CMUSphinx and MaryTTS, no reasonable

evaluation was possible because the libraries do not

define the process of installation.

U2 – Extensibility: As seen before, Alexa,

Google Assistant and Mycroft try to be as user-cen-

tered as possible and therefore allow them to add

functionality. On the other hand, Rhasspy and

CMUSphinx and MaryTTS do not allow for any func-

tionality to be added after the initial setup.

U3 – Comfort of Dialogue: Alexa and Google

Assistant offer a high level of comfort during conver-

sations. Rhasspy offers a similar functionality but has

no follow-up mode, so that the wake-word must be

repeated for every step in a dialog. Mycroft misses a

multitude of these features. Concerning the combina-

tion, no reasonable evaluation was possible, as it is

defined through the implementation of the developer.

U4 – Dynamic Interaction: As Klade (2019)

showed, all commercial CTs come with the same

amount of restrictions. They do not allow an out-of-

the-box initiation of a conversation by the assistant

itself. The user must trigger each conversations and

reminder become cumbersome. Rhasspy and the two

libraries have an advantage, as they do allow for the

application to be designed by the developer. There-

fore, it is possible to initiate a conversation by those.

U5 – Accuracy of Queries Understood: Finding

a comprehensive study that covers all selected CTs

was not possible, therefore the one covering the most

assistants were selected. The study of Coucke et al.

(2018) covered the NLP components offered by

Snips, Google, Alexa, Facebook and Microsoft. The

results point towards Snips being the most accurate,

which is also implemented in Rhasspy.

U6 – Intelligence: The study conducted by

Loupventure (2018) ranked the Google Assistant as

the CTs with the highest intelligence. They tested the

assistants on the correct interpretation of a phrase and

on providing suitable answers. A study testing all ob-

served CUIs could not be found.

U7 – Multimodality: All mentioned CTs allow

for multimodality. Open source approaches offer a

higher number of possible interfaces than proprietary

ones. Actually, open source solutions can be used to

create any interaction the developer can imagine, pro-

vided the developer is able implement it. If the natural

language components of the Google Assistant are

used, they also enable a multitude of possible interac-

tions. Alexa on the other hand is more restrictive in

the offered ways of interaction.

6.2 Results on the Target Group

Table 2: Results on target group.

Target group T1 T2 T3 T4 T5 T6

Alexa 1 ✓ 3 1 1 1

Google Assistant 2 ✓ 3 1 2 1

Mycroft 4 ⨯ 2 2 3 3

Rhasspy 3 ✓ 1 2 4 2

Combination 5 ✓ 1 2 - 2

T1 – Costs: The commercially available low-end

smart speakers are cheaper than any open-source

smart speaker solution. Rhasspy and the two libraries

require a Raspberry Pi and additional hardware. This

hardware cumulates a significant amount of money

and they are therefore ranked lower.

T2 – Language Support: In order to fulfill this

criterion in this study, German must be supported.

Studies using this analysis may adjust the selected

language according to their needs. All conversational

interfaces, except for Mycroft, work with German.

Mycroft does not officially support German.

T3 – The Companies’ Trustworthiness: In com-

parison to Alexa and Google Assistant, which collect

information on the user to improve their services, My-

croft, Rhasspy, and CMUSphinx and MaryTTS do

not per default. However, Mycroft offers an opt-in

policy that allows users to decide, whether their data

can be used to train the technology.

T4 – Smart Home Integration: When a CT is

deployed on an end device, it is possible to combine

In Search of a Conversational User Interface for Personal Health Assistance

729

and connect other technology (applications) to it. My-

croft, Rhasspy, and CUI libraries enable this tenden-

tially better, as they are developed with an open-

source approach in mind. There can be changes made

to the main application and the interfaces

(CMUSphinx, 2020a; Mycroft, 2020b, Rhasspy,

2020). This enables the creation of CTs that are much

more flexible than proprietary solutions.

T5 – Number of Applications: The better known

a technology is, the more applications are in its mar-

kets. Hence, the market leaders offer a higher variety

and number of applications to be installed than other

open-source or less known solutions.

T6 – Support of Other Devices: In terms of sheer

amount of deployable end devices, Alexa and Google

Assistant offer the biggest variety of already finished

solutions. As they sell a variety of different devices,

they outperform the other CUIs in that regard. These

other CUIs (Mycroft, Rhasspy, etc.) mainly focus on

providing the means for developers to create a CT.

6.4 Results on Security and Privacy

Table 3: Results on security and privacy.

Security S1 S2 S3 S4 S5 S6 S7

Alexa ✓ 4 3 2 1 1 ✓

Google Assistant ✓ 3 4 2 1 1 ✓

Mycroft ✓ 2 2 2 2 2 ✓

Rhasspy ✓ 1 1 1 2 2 ✓

Combination ✓ 2 2 1 2 2 ✓

S1 – Domain Specific Infrastructure: All observed

CUIs allow the integration of infrastructure. Open-

source solutions pose an inherently more difficult

process than the other ones, as everything must be im-

plemented by the developer.

S2 – Location of Voice Commands: Rhasspy is

the only solution that can fully processes the voice

commands on the device and never transmits or stores

requests. Mycroft and CMUSphinx and MaryTTS do

not store voice commands explicitly, but do transfer

them to a remote system for processing. Alexa and

Google Assistant use the received voice commands

and store them for later use and training.

S3 – Location of Intelligence: As seen before,

Rhasspy can process everything locally, which makes

it the best choice for this criterion. Mycroft needs to

send data to a remote system for certain NLP compo-

nents to work. Alexa provides basic functionality

without using remote services, but otherwise sends all

voice data to a cloud. Google Assistant needs to send

all information to a server because it cannot process

any of them on the device.

S4 – Security of Data Transmission: Like the

two previous criteria, Rhasspy does not require any

security measures for data transmission because no

information is transmitted. The other CTs encrypt and

secure their data transfers appropriately (SSL/TLS),

but still do not have the ability to work fully on the

device to remove a potential point of attack. There-

fore, they must be ranked lower than Rhasspy for this

criterion.

S5 – Access to Stored Information: Alexa and

Google Assistant mainly use proprietary devices with

no accessible local storage. Any application associ-

ated with such a commercial solution must be author-

ized by the user. Open source solutions have the dis-

advantage that they have a base like a Raspberry Pi.

S6 – Authentication: Concerning Authentica-

tion, Alexa and Google Assistant stand out, because

they both have a built-in way of authentication but

can be extended with others (Auth0, 2020). Hence,

they are rated higher. The open-source technologies

do not offer any built-in authentication, but it can be

implemented by developers.

S7 – Compliance with GDPR: All CTs claim to

align with GDPR and other laws concerning the pro-

tection of personal data. It can be emphasized that

Rhasspy mentions its offline functionality directly on

their landing page, whereas other technologies do not

mention if they align to the GDPR. Therefore, it was

necessary to analyze their security statements and

policies, in which this information could be found.

7 SUMMARY AND CONCLUSION

In this article, we have defined evaluation criteria to

select the most appropriate dialogue-oriented AI tech-

nology for the area of personal health management of

the elderly. Based on these criteria, we evaluated and

ranked the leading commercial and open source CTs

for this application domain.

At a first glance, Alexa and Google Assistant per-

form very well in this evaluation. But in terms of sen-

sitive health data, security and data protection in the

application domain under consideration, they are not

an alternative for our problem domain. Overall,

Rhasspy showed to be superior in regard to the crite-

ria of “Accuracy of understood queries” (U5), “The

companies’ trustworthiness” (T3), “Location of voice

commands” (S2), “Location of intelligence” (S3) and

“Security of transmission” (S4). This suggests that

Rhasspy is the best choice, especially in terms of se-

curity and data protection. Since the technology is

able to work completely “out of the cloud”, no other

analyzed technology can outperform Rhasspy in our

HEALTHINF 2021 - 14th International Conference on Health Informatics

730

eHealth domain. This makes Rhasspy the best CT

choice to create a voice assistant focused on working

with critical health data.

Nevertheless, criteria like “Ease of installation”

(U1), “Extensibility” (U2) and “Number of applica-

tions” (T5) show the downside of Rhasspy. Here, it is

clearly outperformed by Alexa and Google Assistant.

However, since the functionality and the functioning

of these technologies are constantly changing, it is

necessary to carry out regular re-evaluations.

The research presented in this paper was carried

out as part of the AYUDO project (FFG Projektdaten-

bank, 2020), which is intended to help the elderly im-

proving or preserving their health using a CUI. Be-

yond that project the criteria can also form a basis for

the evaluation of CTs to be used in other application

domains. Here, some specific criteria will have to be

added to the existing categories, and/or the weight of

single criteria has to be adapted to the requirements

of the application domain.

REFERENCES

Alepis, E., Patsakis, C. 2017. Monkey Says, Monkey Does:

Security and Privacy on Voice Assistants. In IEEE Ac-

cess 5, pp. 17841–17851. DOI: 10.1109/AC-

CESS.2017.2747626.

Amazon.com, Inc. (2020a, Nov. 20). Built-in Devices.

www.amazon.com/b?node=15443147011.

Amazon.com, Inc. (2020b, 11/2020). Create Skills for

Alexa-Enabled Devices with a Screen. https://devel-

oper.amazon.com/de/docs/custom-skills/create-skills-

for-alexa-enabled-devices-with-a-screen.html.

Apple Inc. (2020a, Nov. 20). Apple HomePod. www.ap-

ple.com/de/shop/buy-homepod/homepod.

Apple Inc. (2020b, Nov. 20). Apple Homepage. www.ap-

ple.com.

Auth0. (2020, Nov. 20). Auth0 Overview.

https://auth0.com/docs/getting-started/overview.

Candid W., (2017). A guide to the security of voice-acti-

vated smart speakers. Symantec Corporation.

Capgemini. (2018). Conversational Commerce. Why Con-

sumers Are Embracing Voice Assistants in Their Lives.

Claessen, V., Schmidt, A., Heck, T. 2017. Virtual

Assistants. In: Humboldt-Universität zu Berlin.

CMUSphinx (2020a, Nov. 20). Tutorial For Developers.

https://cmusphinx.github.io/wiki/tutorial/.

Coucke, A., Saade, A., Ball, A., Bluche, T., Caulier, A.,

Leroy, D. et al., (2018). Snips Voice Platform: an em-

bedded Spoken Language Understanding system for

private-by-design voice interfaces.

DFKI GmbH. (2020, Nov. 20). Mary Text To Speech

http://mary.dfki.de/documentation/overview.html.

European Parliament 2016. General Data Protection Regu-

lation. eur-lex.europa.eu/eli/reg/2016/679/oj, pp. 1–88.

FFG Projektdatenbank. (2020, Nov. 20). AYUDO.

https://projekte.ffg.at/projekt/3311832.

Fraunhofer-Gesellschaft. (2020, Nov. 20). Appkri –

Kriterien für Gesundheits-Apps. https://ehealth-ser-

vices.fokus.fraunhofer.de/BMG-APPS/catego-

ries/Alle.

Gaida, C., Lange, P., Petrick, R, Proba, P., Malatawy, A.,

Suendermann-Oeft, D., (2014). Comparing open-

source speech recognition toolkits. In: Technical Report

of the Project OASIS.

Gong, Y., Poellabauer, C. 2018. An Overview of Vulnera-

bilities of Voice Controlled Systems.

Google LLC. (2020a, Nov. 20). Development. https://de-

velopers.google.com/actions/overview.

Google LLC. (2020b, Nov. 20). Google Store.

https://store.google.com.

Hoy, M., (2018). Alexa, Siri, Cortana, and More: An Intro-

duction to Voice Assistants. In Medical Reference Ser-

vices Quarterly 37. DOI:

10.1080/02763869.2018.1404391.

Huang, X., Acero, A., Hon, H., (2001). Spoken Language

Processing: A Guide to Theory, Algorithm, and System

Development. 1st. Upper Saddle River, NJ, USA: Pren-

tice Hall PTR.

Jesse, M. W. 2019. Analysis of voice assistants in eHealth.

Master Thesis. DOI: 10.13140/RG.2.2.14691.37924.

Kang, B. B., Jang, J., Cho, G., Choi, J., Kim, H., Hyun, S.,

Ryoo, J. 2019. Threat Modeling and Analysis of Voice

Assistant Applications. In Information Security Appli-

cations. Cham: Springer International Publishing.

Këpuska, V., Bohouta, G., (2018). Next-generation of vir-

tual personal assistants. 2018 IEEE 8th Annual Compu-

ting and Communication Workshop and Conference.

Kinsella, B., Mutchle, A., (2019). Smart Speaker Consumer

Adoption Report.

Klade, J. 2019. Sprachassistenz zur Patientenunterstützung.

Master thesis.

Loupventure. (2018). Annual Smart Speaker IQ Test.

Mycroft AI, Inc. (2020a, Nov. 20). Privacy Policy.

https://Mycroft.ai/embed-privacy-policy/.

Mycroft AI, Inc. (2020b, Nov. 22). Software and Hardware.

mycroft-ai.gitbook.io/docs/mycroft-technologies.

Novet, J. (2020, Nov. 20). Newsreport

www.cnbc.com/2019/02/05/google-no-longer-consid-

ers-microsoft-cortana-a-competitor.html.

Olson, C., Kemery, K., (2019). Voice report. From answers

to action: customer adoption of voice technology and

digital assistants. Microsoft, Bing.

https://about.ads.microsoft.com/en-us/insights/2019-

voice-report.

Rhasspy. (2020, Nov. 24). Documentation.

https://rhasspy.readthedocs.io/en/v2.4.20/.

Siddike, Md. A., Spohrer, J., Demirkan, H., Kohda, Y.,

(2018). People’s Interactions with Cognitive Assistants

for Enhanced Performances. SKIM: Voice Tech Trends

2018.

SUMO Heavy Industries. (2019). 2019 Voice Commerce

Survey.

Tiropanis, T., Vakali, A., Sartori, L., Burnap, P., McMillan,

D., Loriette, A. 2015. Living with Listening Services:

In Search of a Conversational User Interface for Personal Health Assistance

731

Privacy and Control in IoT. Internet Science. Cham:

Springer International Publishing.

Wolters, M. K., Kelly, F., Kilgour, J., (2015). Designing a

spoken dialogue interface to an intelligent cognitive as-

sistant for people with dementia. In Health Informatics

Journal 22 (4), pp. 854–866.

HEALTHINF 2021 - 14th International Conference on Health Informatics

732