AH-CID: A Tool to Automatically Detect Human-Centric Issues in App Reviews

Collins Mathews, Kenny Ye, Jake Grozdanovski, Marcus Marinelli, Kai Zhong, Hourieh Khalajzadeh, Humphrey Obie and John Grundy

Faculty of Information Technology, Monash University, Melbourne, Australia

HumaniSE Lab, Monash University, Melbourne, Australia

{hourieh.khalajzadeh, humphrey.obie, john.grundy}@monash.edu

Keywords:

Human-centric Issues, App Reviews, Machine Learning, End-user, Human-centred Design.

Abstract:

In modern software development, there is a growing emphasis on creating and designing around the end-user.

This has sparked the widespread adoption of human-centred design and agile development. These concepts

intersect during the user feedback stage in agile development, where user requirements are re-evaluated and

utilised towards the next iteration of development. An issue arises when the amount of user feedback far

exceeds the team’s capacity to extract meaningful data. As a result, many critical concerns and issues may fall

through the cracks and remain unnoticed, or the team must spend a great deal of time analysing the data, time that

could be better spent elsewhere. In this paper, a tool is presented that analyses a large number of user reviews

from 24 mobile apps. These are used to train a machine learning (ML) model to automatically generate the

probability of the existence of human-centric issues, to automate and streamline the user feedback review

analysis process. Evaluation shows an improved ability to ﬁnd human-centric issues of the users.

1 INTRODUCTION

Software developers aim to deliver efﬁcient and satis-

factory solutions to their end-users. However, fulﬁll-

ing the expectations of their diverse end-users is not

a straightforward task. Software systems are prone

to security and data breaches, massive cost over-

runs and project slippage, hard-to-deploy, hard-to-

maintain, and even dangerous solutions and hard-to-

use software (Grundy et al., 2020). These issues can

unintentionally arise due to the lack of understand-

ing of human-centric issues during the software en-

gineering process (Hartzel, 2003; Miller et al., 2015;

Stock et al., 2008; Wirtz et al., 2009). These human-centric issues (HCIs) include the issues diverse users face due to the lack of consideration of their age, gender, culture, physical and mental impairments, socio-economic status, and so on. When HCIs, such as age, gender, disability and language, are ignored, the ability of affected users to interact with the system may be severely impacted (Grundy et al., 2020).

Software engineers are typically very different

from most end users - dominated by men; relatively

young; afﬂuent; mostly proﬁcient in English; having

less severe physical and mental impairments, and so

on (Grundy et al., 2020; Grundy et al., 2021). The

gap between developers and end-users leads to a

lack of understanding of human-centric issues by

the developers. Some app users share their concerns

through app reviews, and therefore many mobile ap-

plications receive a huge number of user reviews.

These reviews collectively provide a lot of useful in-

formation to the development team from the end users

of the product. One set of insights that can be drawn

from user app reviews is the human-centric issues ex-

perienced by the users. However, the time and effort

required to extract meaningful insights from such a

large data source may exceed the capacity of a development team (Mao et al., 2005). Therefore, a tool that uses Machine Learning (ML) to analyse and quantify HCIs would greatly aid developers in designing and improving systems around the end user (Mao et al., 2005).

We collected a large number of reviews from 24

apps in different categories such as parking, social

media, COVID 19, education, ﬁtness, and apps de-

veloped for people with dyslexia. The reviews were

classiﬁed by using eight human-centric tags identiﬁed

during our analysis as: Disability, Age, Emotional

(emotional impacts of the app), Language, Gender,

Mathews, C., Ye, K., Grozdanovski, J., Marinelli, M., Zhong, K., Khalajzadeh, H., Obie, H. and Grundy, J.

AH-CID: A Tool to Automatically Detect Human-Centric Issues in App Reviews.

DOI: 10.5220/0010576503860397

In Proceedings of the 16th International Conference on Software Technologies (ICSOFT 2021), pages 386-397

ISBN: 978-989-758-523-4

© 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

Location (or culture), Privacy, and Socio-economic

Status. The app reviews were initially tagged using

a semi-automated keyword-based tool (Obie et al.,

2021). They were then manually checked, revised

and used in training an ML model. We adopted a

binary relevance (BR) transformation method with a

base classiﬁer of support vector machine (SVM) to

determine the percentage likelihood of the text input

to contain any of the 8 specified labels. Utilising this ML model yields promising results, and our performance evaluations indicate a positive trend toward automating the user-feedback process as a viable alternative to manual analysis of reviews.

The remainder of this paper is organised as fol-

lows. Section 2 details the motivation behind our

study. Section 3 provides an overview of our ap-

proach and tool. Section 4 shows the tool usage ex-

ample, and Section 5 discusses our evaluation results.

Section 6 discusses and reﬂects on the key ﬁndings.

Section 7 summarises key differences to the related

research. Finally, Section 8 draws conclusions and

proposes future work.

2 MOTIVATION

Software systems are created through production

pipelines and techniques aiming to efﬁciently deliver

solutions to the users. However, many software solu-

tions are developed by professionals who do not inti-

mately understand the human-centric needs of their

users and do not have the tools to ﬁnd and access

this information (Grundy et al., 2020). This results in

software solutions not meeting the users' needs, often leading to dissatisfaction and extra costs if the client wishes to resolve these issues. From a software production perspective, it is in the producers' best interest to gauge HCIs quickly and accurately, in order to improve the quality of their software system.

Human-centred Design: Human-centred design is a methodology in which product developers create solutions in conjunction with the people whom they want to impact (Farooqui et al., 2019). Having a tool that can gather HCIs from reviews could greatly improve the efficiency of human-centred design workflows. Thus, instead of creating a product for people, the goal is to create the product with people throughout continuous iterations. Farooqui et al. stated that most developers underestimate the importance of user experience and usability and merely focus on developing more features or content. Many of the human-centred design workflows discussed by (Farooqui et al., 2019) used iterative production where prototypes were built and users were asked to give their feedback. This continuous research allowed designers to create in-depth personas of their end users, giving developers guidance on understanding the real user requirements (Farooqui et al., 2019).

Diversity of Users: End user diversity is often

not sufﬁciently considered in the development pro-

cess by many software development teams (Grundy

et al., 2021). Spending time with the potential users

and understanding the context in which they use soft-

ware can greatly improve the understanding of prob-

lems and use cases when developing solutions. How-

ever, for general use software, it is not always clear

who the users are. Online reviews provide a diverse

dataset, as users express their concerns and opinions

regarding the system. This provides a platform for

users that were not originally considered in the de-

sign process to effectively communicate their needs.

The challenging aspect for developers is synthesising

large amounts of reviews into meaningful data for im-

proving their original solution. In addition to this, de-

velopers must ensure that the HCIs of minority users

do not get lost in the process. Having a tool which can

condense this data into usable information regarding

HCIs would aid in the inclusion of a diverse set of

users (Farooqui et al., 2019).

Developer Empathy: Empathy is an extremely

important skill for developers in order to create effec-

tive software solutions. The ability to show concern,

empathy, and a positive attitude may inﬂuence an en-

gineer’s overall career success and technical compe-

tence (Levy and Hadar, 2018). A reciprocal relation-

ship between emotional intelligence or social cogni-

tion and reasoning about the mechanical properties of

a system is shown by (Jack et al., 2013). Being able to connect developers and users consistently and effectively is critical to achieving technical excellence. Developers who can visualise their impacts on users are

more likely to stay motivated and develop a more tai-

lored software experience. As a result, software teams

can build their emotional intelligence in tandem with

technical expertise.

Process & Cost: Both developers and stakeholders are consistently looking for ways to improve the effectiveness of the development process while avoiding the costly changes that come with large-scale restructuring (Farooqui et al., 2019). Farooqui et al. describe the evaluation of human-centred design and its associated benefits compared to a conventional software development process. A tool that consolidates user feedback without requiring a manual review process will help ensure that software systems under development meet the requirements of the end user by highlighting HCIs. This can be done at every stage of development, greatly reducing the time-consuming process of such a review.

3 OUR APPROACH

We have developed a tool, Automated Human-Centric Issues Detector (AH-CID), for detecting human-centric discussion in app reviews. Based on our literature review, the BR problem transformation method with an SVM multi-label learning approach was selected as the best approach for our problem. The AH-CID development process is shown in Figure 1. It is composed of a machine learning component and a software development component conducted in parallel. Details of the stages are summarised below.

Dataset: For the training of the model, a multil-

abel binary marked dataset was required. A total of

171,048 user reviews were collected from 24 apps:

Firefox Browser, parking (Cellopark, Paystay), so-

cial media (Pinterest, Tiktok), COVID 19 (COVID-

safe, Aarogya Setu India, NZ Covid Tracer), ed-

ucation (Moodle), ﬁtness (Fitbit), and apps devel-

oped for people with dyslexia (Dyslexia Reading

Test, Eye games Dyslexia, Speechify, SmartyNote

Notepad, Augmenta11y, Mighty Fonts, Omoguru,

Ridit, Dyslexia learn letters, Nessy Learning, KOBI,

TintVision, Dyslexic Font for FlipFont, Cool Fonts

Text Free, HexaDyslexia).

To cover human-centric issues related to the end-users' age, gender, language, culture, physical and mental impairments, emotions, and privacy concerns, we considered the following eight human-centric tags:

• Disability: reviews related to issues with the app usage due to the end-users' physical and mental impairments
• Age: issues related to the age of the end-users, including elderly and very young users
• Emotional: the negative emotional impacts that an app has on the end-users
• Language: not considering the language of different users accessing the app, such as translating the app into different languages and considering the issues due to translation
• Gender: not taking into account gender-specific features in the design of the app
• Location (or culture): issues related to the location that the users access the app from
• Privacy: privacy-related concerns that the end-users raise through app reviews
• Socio-economic Status: issues related to the cost and technology required to be able to use the app

Due to the large number of app reviews, a semi-automated keyword-based tool (Obie et al., 2021) was first utilised to filter out any samples that did not contain at least one keyword related to the labels. Keywords include all the synonyms and antonyms of the labels and can be accessed online through (Mathews et al., 2021). A total of 8,965 samples remained across all the apps. This combined dataset was then manually labelled by five of the authors to ensure the reviews discuss HCIs related only to the 8 categories. A total of 1,315 samples were identified as app reviews discussing these human-centric issues. A sample subset of app reviews discussing HCIs can be seen in Table 1.
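The keyword-based filtering step can be sketched as follows. This is a minimal illustration in Python and not the tool of (Obie et al., 2021): the keyword sets, the `KEYWORDS` mapping and the `matched_tags` helper are hypothetical, and the real keyword list is the one published in (Mathews et al., 2021).

```python
import re

# Hypothetical keyword sets for three of the eight tags; the published
# keyword list is considerably larger.
KEYWORDS = {
    "Age": {"old", "older", "elderly", "young", "kids"},
    "Privacy": {"privacy", "tracking", "breach"},
    "Language": {"language", "translation", "english"},
}


def matched_tags(review: str) -> set[str]:
    """Return the tags whose keyword set shares at least one word with the review."""
    words = set(re.findall(r"[a-z]+", review.lower()))
    return {tag for tag, keys in KEYWORDS.items() if words & keys}


reviews = [
    "This is too quick for older people!",
    "Total breach of my privacy and I am deleting the app.",
    "Crashes every time I open it.",
]

# Keep only reviews that mention at least one keyword. These candidates still
# need manual checking, because a keyword hit is not proof of an HCI.
candidates = [(r, matched_tags(r)) for r in reviews if matched_tags(r)]
```

A filter of this kind only narrows the pool of reviews; the manual labelling described above is what decides whether a candidate actually discusses a human-centric issue.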

Text Processing: We then carried out pre-

processing tasks including stop words removal and

stemming to each sample in our dataset to provide

cleaned text that can be vectorised for the multi-label

classiﬁer. The CountVectorizer and TﬁdfTransformer

classes from the Scikit-Learn Python library were

used to vectorise the data into Term-Frequency and

then Term Frequency-Inverse Document Frequency

features, respectively.

Model Training: The training of the model was

handled by Scikit-Multilearn in conjunction with

Scikit-Learn Python libraries. Scikit-Multilearn’s Bi-

naryRelevance classiﬁer was instantiated with Scikit-

Learn’s svm.SVC learning method to provide the basis

for the model to be trained from. The vectorised data

was placed into a 3:1 train/test split and the training

dataset was passed into the classiﬁer. The resultant

learning model to be used for AH-CID was pickled

via the use of Pandas into a pickle (.pkl) ﬁle to be

loaded into the tool. The learning model was then

tested on the testing dataset via metrics from Scikit-

Learn’s metrics package.
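The training step can be approximated with scikit-learn alone. The sketch below is not the authors' code: it swaps Scikit-Multilearn's BinaryRelevance for scikit-learn's `OneVsRestClassifier`, which likewise fits one independent binary SVM per label when given a binary label matrix. The toy reviews, the two-label subset and the linear kernel are assumptions, and the vectoriser is fitted on the training split only, whereas the paper vectorises before splitting.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

LABELS = ["Age", "Privacy"]  # two of the eight tags, to keep the toy data small

# Hypothetical labelled reviews: (text, [Age, Privacy]) with 1 = tag present.
samples = [
    ("too quick for older people", [1, 0]),
    ("my elderly mother cannot read the tiny text", [1, 0]),
    ("young kids find the menus confusing", [1, 0]),
    ("the app tracks my location", [0, 1]),
    ("it leaked my personal details", [0, 1]),
    ("camera access without my permission", [0, 1]),
    ("older users worry the app tracks their data", [1, 1]),
    ("crashes on startup", [0, 0]),
] * 5  # repeated to give the SVMs enough rows to fit

texts = [text for text, _ in samples]
y = np.array([tags for _, tags in samples])

# 3:1 train/test split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.25, random_state=0
)

# One binary SVM per label; probability=True enables predict_proba.
model = make_pipeline(
    CountVectorizer(stop_words="english"),
    TfidfTransformer(),
    OneVsRestClassifier(SVC(kernel="linear", probability=True, random_state=0)),
)
model.fit(X_train, y_train)

# One probability per (review, label) pair.
proba = model.predict_proba(X_test)

# Persist the fitted model. The paper pickles via Pandas; the standard
# library's pickle module is used here to keep the sketch small.
restored = pickle.loads(pickle.dumps(model))
```

Because each label gets its own classifier, the probabilities in a row are independent and need not sum to 1.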

User Interface: A web-based user interface was built with Django, which follows the model-template-view architectural pattern. The user can choose either text entry or CSV file upload. To check a single sample, they may use the text entry; to investigate multiple samples, they may upload a CSV file with these samples. User input

is validated and read into a Pandas DataFrame object,

where the pre-processing and vectorisation is han-

dled as above. The trained BR model is used to pre-

dict from the user input the resultant probabilities for

the multi-label classiﬁcation using Scikit-Multilearn’s

predict_proba method. If the user entry consisted of

a CSV ﬁle, a new results ﬁle is saved with the ini-

tial dataset given by the user, concatenated with the

resultant probabilities. The resulting prediction vari-

ables are then passed as Django context variables to

the frontend.

Figure 1: AH-CID tool development process. [Flowchart. Machine Learning Component: dataset extraction (n=171,048); dataset filtering with keyword-based tool; dataset selection (n=8,965); dataset manual labelling; human-centric dataset (n=1,315); text preprocessing; vectorisation; model training; saved model; metrics analysis. Website development component: requirements clarification; prototype development; user interface creation; user interface update; final working tool; deployment (EC2 instance).]

The ChartJS JavaScript package is utilised to dis-

play the results of the multilabel classiﬁcation to the

user with dynamic bar chart visualisations. If the

user entry consisted of a CSV ﬁle, multiple bar charts

would be displayed dependent on the user input, along

with a download link to the aforementioned results

ﬁle. Any caught errors in user input validation would

be handled by a new HTML page displaying the na-

ture and description of the error, communicated be-

tween frontend-backend via Django context variables.

Deployment: The Django full-stack application was deployed on an AWS EC2 instance. Port 80 was opened as an inbound rule for all sources, which allows HTTP requests over the internet and makes AH-CID publicly accessible. Nginx, Gunicorn and Supervisor were used to host the Django application. Supervisor controls a Gunicorn process (started on launch) that targets the Django application and binds it to a Unix socket located at the root directory. Nginx then points to the same Unix socket and serves it as an HTTP server, allowing public internet access to the entire application.

4 AH-CID TOOL USAGE EXAMPLE

When AH-CID loads, the user is shown a home page, as in Figure 2, and can select between two options in the form of tabs, i.e., “Text Entry” and “CSV Entry”. In the “Text Entry” tab, the user may enter Latin-1 encoded text into the text area and select the GENERATE button. The tool will then display another page showing the resulting label probabilities for the text that the user has entered, as shown in Figure 3.

In the “CSV Entry” tab, the user may enter a CSV

ﬁle of the data that they wish to obtain label proba-

bilities for. There are three optional form parameters

and two required form parameters for the user to ﬁll.

The two required form parameters are Samples Col

and CSV File. The three optional form parameters are

Likes Col, Dislikes Col and Rating Col. Descriptions

for each form parameter can be seen in Figure 4.

An example of ‘covidsafereviewsresults.csv’

dataset can be seen in Table 2. We take the content

column (6th indexed column) as Samples Col param-

eter, thumbsUpCount column (8th indexed column)

Table 1: Example of app reviews discussing human-centric issues.

User review | Human-centric issue
I downloaded the app while another colleague downloaded it nearby. I was very shocked to see that his app was pre-populated with all my personal details despite different emails addresses and bluetooth disabled. Total breach of my privacy and I am deleting the app. Absolutely hopeless. | Privacy
I m disappointed I can t install it on an older Android phone. That must cut out a portion of the population | Socio-Economic
Where is input choice for women s cycles of women with hysterectomy Where are choices for DISABLED WOMEN who can do partial body workouts only This is NOT an inclusive app offers little variety at all for those outside the fully physically able bodied normal people INCREDIBLY DISAPPOINTING as it s such a HUGELY MISSED OPPORTUNITY | Gender, Disability, Emotional
App cannot be found if your Google account is registered in another country, so how are you tracing visitors? | Location
Does not accept norwegian keyboard as default when LL THE apps. Have to use spaceward to swap. Worked fine before. EXTREMLY FRUSTRATERING????. | Language, Emotional
I can’t log in because by the time I find the code on the phone it’s too late to enter it. This is too quick for older people! | Age

Table 2: Example data of covidsafereviewsresults.csv.

reviewId | userName | userImage | content | score | thumbsUpCount | reviewCreatedVersion
xx:XXxxXX | John Smith | https://example.com/userimage | The new UI is terrible! Not user friendly at all. Especially for older folks who I would install it for on their devices. I want to revert to older versions on all of my and client devices and hope it does not pose a security risk. | 1 | 97 | 68.10.1

as Likes Col parameter and score column (7th indexed

column) as Rating Col parameter. The user may then

select the GENERATE button once they are satisﬁed

with the form parameters. AH-CID will then display

another page showing different metrics about the data

in the CSV ﬁle as shown in Figure 5.

Depending on the active tab selected underneath Options, bar charts will be shown representing the composition of the data based on the probabilities given by the metric outlined in the active tab. In Figure 5, a bar chart of total label probabilities is shown based on the count of samples. In Figure 6, a categorical bar chart of total label probabilities is shown based on the rating each sample was given. The user may also select the ‘Use Threshold?’ checkbox to base the bar charts on a probability threshold instead, so that only samples with a label probability at or above the threshold are considered.
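One plausible reading of the two chart modes is sketched below: without a threshold, each bar is the sum of a label's probabilities over all samples; with a threshold, each bar is the number of samples whose probability for that label is at or above it. The function names and the probability values are hypothetical, and the tool's exact aggregation is not specified in the text.

```python
LABELS = ["Disability", "Age", "Emotional", "Language"]

# Hypothetical predicted probabilities, one row per review.
probabilities = [
    [0.01, 0.99, 0.07, 0.00],
    [0.60, 0.20, 0.55, 0.10],
    [0.05, 0.50, 0.90, 0.02],
]


def summed_probabilities(rows):
    """Bar heights without a threshold: probabilities summed per label."""
    return [sum(col) for col in zip(*rows)]


def counts_above_threshold(rows, threshold=0.5):
    """Bar heights with a threshold: reviews at or above it, per label."""
    return [sum(p >= threshold for p in col) for col in zip(*rows)]


totals = dict(zip(LABELS, summed_probabilities(probabilities)))
counts = dict(zip(LABELS, counts_above_threshold(probabilities)))
```

The thresholded view is easier to read as "how many reviews raise this issue", while the summed view also keeps weak signals that fall below the cut-off.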

The user may also download the probability la-

belled data by selecting the ‘Download labelled’ link,

which will prompt the user to save a CSV ﬁle. An ex-

ample of this downloaded CSV dataset can be seen in

Table 3. Any errors using AH-CID, such as inputting

incorrect column parameters in ‘CSV Entry’, will re-

sult in an error page being shown with a description of

the error, such as “Column values are out of bounds”.
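The column-parameter check behind such an error can be sketched with the standard library. The `read_columns` helper, its zero-based indexing and the sample data are hypothetical; only the error message is taken from the text, and the tool itself reads input into a Pandas DataFrame.

```python
import csv
import io


def read_columns(csv_text, samples_col, likes_col=None, rating_col=None):
    """Return (samples, likes, ratings) read from zero-indexed columns.

    Hypothetical helper: raises ValueError when a requested column index
    does not exist in the header row. Optional columns left as None are
    returned as None.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    wanted = [samples_col, likes_col, rating_col]
    for col in wanted:
        if col is not None and not 0 <= col < len(header):
            raise ValueError("Column values are out of bounds")

    def column(col):
        return None if col is None else [row[col] for row in body]

    return tuple(column(col) for col in wanted)


data = "content,score,thumbsUpCount\nThe new UI is terrible!,1,97\n"
samples, likes, ratings = read_columns(data, samples_col=0, likes_col=2, rating_col=1)
```

Validating the indices before any prediction is made lets the interface report the problem on an error page instead of failing midway through a large file.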

5 EVALUATION

The contents of our dataset included labeled reviews

of 24 apps with a combined sample size of n=1,315.

The dataset was put in a 3:1 train(n=986)/test(n=329)

split. The model was then trained 100 times with

different training/testing dataset splits with the aver-

aged metrics of hamming loss, accuracy and micro F1

score used to evaluate the effectiveness of the model.

Accuracy refers to how well the set of labels pre-

dicted (for the test subset) exactly matched with the

corresponding set of true labels for the test subset.

Hamming loss refers to the fraction of labels that are

incorrectly predicted. F1 score can be interpreted as

the harmonic mean of precision and recall.

Micro F1 score is an extension of F1 score which

involves calculating metrics globally by counting the

Figure 2: AH-CID user interface.

Figure 3: Text entry results page for AH-CID.

total true positives, false negatives and false positives.
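The three metrics defined above can be computed by hand on a small example. The sketch below uses plain Python with hypothetical labels and predictions; the paper itself uses Scikit-Learn's metrics package.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def micro_f1(y_true, y_pred):
    """F1 from true positives, false positives and false negatives pooled over all labels."""
    pairs = [(a, b) for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    tp = sum(a == 1 and b == 1 for a, b in pairs)
    fp = sum(a == 0 and b == 1 for a, b in pairs)
    fn = sum(a == 1 and b == 0 for a, b in pairs)
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical ground truth and predictions: 4 reviews x 3 labels.
y_true = [[1, 0, 0], [0, 1, 1], [0, 0, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]]

print(subset_accuracy(y_true, y_pred))  # 0.5
print(hamming_loss(y_true, y_pred))     # 0.25
print(micro_f1(y_true, y_pred))         # 0.666...
```

Micro F1 does not count true negatives, so it can sit well below accuracy when most label slots are 0.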

Given our model and both training and testing

datasets, the following metrics were found:

accuracy score = 0.914

hamming loss = 0.021

micro F1 score = 0.63

Based on these metrics, the following can be concluded. Given a sample from the test dataset:

• the model has a probability of ≈91.4% of assigning all labels (as either 1 or 0) correctly
• the model has a probability of ≈2.1% of assigning a single label (as either 1 or 0) incorrectly
• the model has a moderate-high effectiveness in assigning the samples with the correct label

While the accuracy score and hamming loss reflect well on the effectiveness of the learning model, the micro F1 score also indicates that the dataset is relatively balanced and contains no inherent biases as used in the model. The results for four apps are shown in Figure 7, with the X axis showing the 8 human-centric categories. For COVIDSafe and Firefox, the total label probabilities are shown based on the count of samples (Y axis). For the Fitbit and Paybyphone apps, the ‘Use Threshold?’ checkbox is selected with a threshold of 0.5, so the bar charts instead show the number of app reviews that have a label probability at or above the threshold (Y axis). The results show that the model is effective in predicting human-centric discussion in text.

Figure 4: CSV entry example for AH-CID.

Figure 5: CSV entry results page with ‘By Count’ option selected.

6 DISCUSSION

6.1 What are the Common HCIs in App Reviews?

As Figure 7 indicates, the human-centric issues vary among different apps. For example, privacy and location were the main concerns among users of the COVID 19 and Firefox Browser mobile apps, which consequently affect their emotions, making them frustrated or disappointed. Fitbit and Paybyphone, on the other hand, seem to be developed in a gender- and language-biased way. These findings indicate the need for more research on human-centric aspects in app development, which we hope our AH-CID tool will facilitate.

COVIDSafe, being an app created and maintained by the Australian Government (AustralianGovernment, 2020), prompts discussion about the storage and handling of the user’s location and data by the government. This was expressed by many users within the dataset, as seen in Table 4. The Firefox Browser app is marketed as a leading browser in privacy protection. A segment of the app description on the Google Play store is captured below.

“Firefox for Android browser gives you effortless

privacy protection with lighting-fast page loads.

Enhanced Tracking Protection automatically blocks

over 2000 known online trackers from invading your

privacy and slowing down your pages.” (GooglePlay,

2020)

Table 3: Example of Results CSV file.

Text | Likes | Ratings | Disability | Age | Emotional | Language | Location | Privacy | Socio-economic Status
The new UI is terrible! Not user friendly at all. Especially for older folks who I would install it for on their devices. I want to revert to older versions on all of my and client devices and hope it does not pose a security risk. | 97 | 1 | 0.011843356798142966 | 0.9939560908192195 | 0.0659187575451285 | 0.001699474286364846 | 0.004933006584214907 | 0.47160126164387484 | 0.0027709282290950773

Figure 6: CSV entry results page with ‘By Rating’ option selected.

Table 4: Subset of COVIDSafe app reviews concerning privacy.

Content

Now you require my location This was my biggest fear about downloading the app I need to protect my whereabouts So you have made this app useless to me Uninstalling

After the latest update this APP has now become a TRACKING APP It no longer just requires Bluetooth It now requires Vocation Data The government is now tracking you so if you don’t want that which you shouldn’t then uninstall this app It is now a risk to your privacy and your security

As such, Firefox could be assumed to attract users

more concerned with their privacy, and updates that

threaten to take away this security provoke reviews

detailing users’ frustrations, as seen with the subset

shown in Table 5.

6.2 How Can We Create a Tool That Accurately Detects and Categorises Human-centric Issues in Text?

AH-CID is an initial attempt to create a tool that accurately detects and categorises human-centric issues in text. The web-based user interface is user-friendly and can be used by developers. The model on which the tool is based is deemed effective in categorising human-centric discussion in text. With the metrics from our model, it may be inferred that our dataset has no bias towards any of its labels, hence the model is effective in predicting human-centric discussion in text. This research provides a baseline and aims to encourage future research on more effective and accurate automatic tools to detect human-centric issues in app reviews.

Figure 7: Results of the training dataset passed through AH-CID for four different apps, with and without setting the threshold.

Table 5: Subset of Firefox Browser app reviews concerning privacy.

Content

AWU NEW UPDATE!!! ONE REBOOTS EVE TIME I The AND SERC OR ANYTHING!!!!! Updated review after your response: It had nothing to do with my speed, this app shut down my phone every time I tried to open it after the last update and was doing unsettling things to my phone which I did not like or felt safe with. Moved to another web browser as I no longer trust your app after the latest update.

Never gave camera access to this app and I am using pour camera phone, all of sudden camera popped up and closed for a second while browsing. Then checked app permission, it had camera access which I never gave. No privacy in this app. It is not safe. Installed and formatted mobile.

Doses as a browser, but spams in notifications with political, pro-censorship agenda, even if notifications are disabled. Ca not be trusted anymore.

6.3 How AH-CID Can Help Developers and End-users?

Results from (Alshayban et al., 2020) reflect the importance of accessibility-related awareness in making app developers ambassadors of accessibility in their organisations. As discussed in the introduction, the differences between developers and end-users in terms of HCIs, including age, gender, culture, and disabilities, make it difficult for developers to understand the end-users’ needs (Grundy et al., 2021). AH-CID aims to fill this gap by helping developers automatically detect the human-centric issues that end-users report through app reviews. This would ultimately help end-users get access to easy-to-use software tailored to their human-centric needs.

6.4 Threats to Validity

Our study has several limitations, as reported below.

6.4.1 Internal Validity

There are internal threats to the validity of our results.

The ﬁrst one is the selection of our 24 studied apps.

We considered a wide range of apps with large-scale

and vulnerable end-users to be able to cover as many

human-centric issues as possible. Moreover, since the

manual labeling of all the app reviews was not possi-

ble, we adopted a keywords-based tool to select the

relevant reviews, and therefore this introduces bias

toward our keywords list. However, we have made

our keywords list available to encourage further future

work on it. Furthermore, manual analysis of the app

reviews may have introduced some bias. However,

the main objective of this paper is to build a dataset

that can be used to train our machine learning tool,

which can be trained on other datasets in future. Finally, we selected the BR transformation method with a base classifier of SVM as the suitable machine learning algorithm for our problem, based on a review of the existing algorithms. This paper is the first attempt at using a machine learning tool to detect human-centric issues in app reviews and, in future, other algorithms can be explored and compared with our method as the baseline.

6.4.2 External Validity

Our findings may not generalise to all types of app reviews. Moreover, not all users

share their human-centric issues through app reviews,

and therefore we might have missed some essential

human-centric issues. Furthermore, the identiﬁed cat-

egories of human-centric issues are exclusive to the

apps we selected and the app reviews we analysed.

This can encourage future research on exploring other

apps and looking into other human-centric aspects

that may not have been covered in our work.

7 RELATED WORK

There have been several works on the mining and

classiﬁcation of user reviews to understand the feed-

back from users and to provide information to soft-

ware developers to help in the evolution and mainte-

nance of their software applications.

Li et al. developed a framework for analysing user reviews to understand user satisfaction as an input to support software evolution (Li et al., 2010). A related study utilised topic modelling and sentiment analysis to extract useful topics for the purpose of requirements engineering (Carreño and Winbladh, 2013). To minimise the effort required in analysing user reviews, Di Sorbo et al. introduced a tool for summarising app reviews into explanatory summaries for developers (Di Sorbo et al., 2016).

Moreover, in the area of review classification, Panichella et al. introduced a taxonomy for classifying reviews using a hybrid of natural language processing techniques, text analysis, and sentiment analysis (Panichella et al., 2015). Similarly, another study applied probabilistic methods to classify reviews into four categories, namely, bug reports, feature requests, user experiences, and ratings (Maalej and Nabil, 2015). Finally, the reflection and violation of human values in app reviews was explored by Obie et al. (2021). Their results show that a quarter of the 22,119 analysed app reviews contain perceived violations of human values in mobile apps, supporting the recommendation to use app reviews as a potential source for mining values requirements in software projects (Obie et al., 2021). Our work complements the studies discussed above, as we also aim to streamline the feedback review analysis process. However, we focus on the automatic detection and classification of human-centric issues in app reviews.

When it comes to multi-label learning, there are two main families of methods: problem transformation methods and algorithm adaptation methods. Problem transformation methods transform a multi-label problem into one or more single-label problems, which are then solved with standard single-label (binary or multi-class) classifiers. Conversely, algorithm adaptation methods extend existing learning algorithms to handle multi-label data directly (Madjarov et al., 2012). However, problem transformation methods are simpler and more general when categories are stochastically independent of each other. The human-centric discussion labels that we have elected to use share no or only minor dependencies, which justifies preferring problem transformation methods.

Belyakov et al. (2020) utilised several ML techniques to classify bug reports and service support requests. In their study, the support vector machine yielded the highest accuracy, measured as the proportion of true positives and true negatives over the total number of instances. Rahmawati and Khodra (2015) compared multiple combinations of feature selection and multi-label classification approaches to determine which combination was most effective in categorising Indonesian news articles. They found that the problem transformation algorithms binary relevance and calibrated label ranking, using a support vector machine (SVM) as their base learning algorithm, obtained the highest F1 scores. Algorithm adaptation models were significantly improved by feature selection methods, which included information gain, symmetrical uncertainty and correlation coefficient, while problem transformation approaches saw only small increases.
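
For reference, the two measures mentioned in this paragraph have the following standard per-label definitions, where TP, TN, FP and FN count true positives, true negatives, false positives and false negatives. How the cited studies aggregate these across labels (for example micro or macro averaging) is not stated here, so the formulas are given only in their basic binary form.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
\[
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```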

Despite large improvements in the algorithm adaptation models, problem transformation approaches exhibited greater F1 scores (Rahmawati and Khodra, 2015). Binary relevance (BR), used by Cherman et al. (2011), has low computational complexity in comparison with other problem transformation methods. BR trains one binary classifier per label, so its cost scales linearly with the number of labels, making it most practical on datasets with a small number of labels. However, the BR method is limited by its strong assumption of independent labels.
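
The binary relevance idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AH-CID implementation: a tiny perceptron stands in for the SVM base classifier, and the class names, feature vectors and label columns are all hypothetical.

```python
class Perceptron:
    """Tiny binary base learner (a stand-in for an SVM in this sketch)."""

    def __init__(self, n_features, epochs=10):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.epochs = epochs

    def predict_one(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def fit(self, X, y):
        for _ in range(self.epochs):
            for x, target in zip(X, y):
                err = target - self.predict_one(x)
                if err:
                    # Classic perceptron update on a misclassified example.
                    self.w = [wi + err * xi for wi, xi in zip(self.w, x)]
                    self.b += err
        return self


class BinaryRelevance:
    """Train one independent binary classifier per label column."""

    def __init__(self, make_base):
        self.make_base = make_base  # factory returning a fresh base learner
        self.models = []

    def fit(self, X, Y):
        n_labels = len(Y[0])
        # One fit per label: cost grows linearly with the number of labels,
        # and no model ever sees the other labels (independence assumption).
        self.models = [
            self.make_base().fit(X, [row[j] for row in Y])
            for j in range(n_labels)
        ]
        return self

    def predict(self, X):
        return [[m.predict_one(x) for m in self.models] for x in X]


# Hypothetical data: 3 binary features per review, 2 labels per review.
X = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
Y = [[1, 0], [0, 1], [1, 1], [0, 0]]

br = BinaryRelevance(lambda: Perceptron(n_features=3)).fit(X, Y)
predictions = br.predict(X)
```

Because each label gets its own model, any correlation between labels is ignored, which is the limitation noted above.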

While we have discussed different approaches to multi-label learning, the main contribution of this work is not the technical approach applied. Rather, it draws attention to the critical human-centric issues affecting users' experience of software applications, and to the use of app reviews as a valuable proxy for detecting these issues.

8 CONCLUSION

This paper presents AH-CID, a novel tool for detecting and analysing human-centric issues in text, allowing developers to ascertain which issues are adversely affecting their diverse user base. Using a machine learning approach, a model was constructed and deployed with 91.4% accuracy and a 2.1% Hamming loss. The training data was also balanced, resulting in a moderate-to-high F1 score.
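
As a reminder of what a Hamming loss figure measures, the following sketch computes it on a small made-up example, alongside subset accuracy. The label matrices are hypothetical, and subset accuracy is shown only as one common multi-label definition of accuracy; this section does not state which definition underlies the reported figure.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted exactly."""
    exact = sum(t == p for t, p in zip(y_true, y_pred))
    return exact / len(y_true)


# Hypothetical ground truth and predictions: 4 reviews, 3 labels each.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 0]]

loss = hamming_loss(y_true, y_pred)         # 1 wrong slot out of 12
accuracy = subset_accuracy(y_true, y_pred)  # 3 exact matches out of 4
```

A single wrong label lowers subset accuracy by a whole review but raises Hamming loss by only one slot, so the two measures can differ considerably on the same predictions.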

In the future, we plan to investigate a larger set of app reviews and human-centric issues using our tool. Additionally, an empirical study with users on their perception of human-centric issues is another area for our future work. We also plan to extend the tool with a user-facing review feature, so that end-users can detect human-centric issues (HCIs) in different apps and select and download apps more wisely from a human-centric perspective.

ACKNOWLEDGEMENTS

Support for this work from ARC Laureate Program FL190100035 and ARC Discovery DP200100020 is gratefully acknowledged.

REFERENCES

Alshayban, A., Ahmed, I., and Malek, S. (2020). Accessibility issues in android apps: State of affairs, sentiments, and ways forward. In Proceedings of the ACM/IEEE 42nd International Conf. on Software Engineering, ICSE ’20, page 1323–1334, New York, USA. Association for Computing Machinery.

AustralianGovernment (2020). Background to covidsafe. In https://covidsafe.gov.au/background.html.

Belyakov, S., Bozhenyuk, A., Kacprzyk, J., and Rozenberg, I. (2020). Intelligent planning of spatial analysis process based on contexts. In International Conf. on Intelligent and Fuzzy Systems, pages 10–17. Springer.

Carreño, L. V. G. and Winbladh, K. (2013). Analysis of user comments: An approach for software requirements evolution. In ICSE.

Cherman, E. A., Monard, M. C., and Metz, J. (2011). Multi-label problem transformation methods: a case study. CLEI Electronic Journal, 14(1):4–4.

Di Sorbo, A., Panichella, S., Alexandru, C. V., Shimagaki, J., Visaggio, C. A., Canfora, G., and Gall, H. C. (2016). What would users change in my app? summarizing app reviews for recommending software changes. In FSE.

Farooqui, T., Rana, T., and Jafari, F. (2019). Impact of human-centered design process (hcdp) on software development process. In 2019 2nd International Conf. on Communication, Computing and Digital systems (C-CODE), pages 110–114. IEEE.

GooglePlay (2020). Firefox browser: fast, private & safe web browser - google play.

Grundy, J., Khalajzadeh, H., and Mcintosh, J. (2020). Towards human-centric model-driven software engineering. In ENASE, pages 229–238.

Grundy, J., Khalajzadeh, H., McIntosh, J., Kanij, T., and Mueller, I. (2021). Humanise: Approaches to achieve more human-centric software engineering. In Evaluation of Novel Approaches to Software Engineering: 15th International Conf., ENASE 2020, Prague, Czech Republic, May 5–6, 2020, Revised Selected Papers 15, pages 444–468. Springer International Publishing.

Hartzel, K. (2003). How self-efficacy and gender issues affect software adoption and use. Communications of the ACM, 46(9):167–171.

Jack, A. I., Dawson, A. J., Begany, K. L., Leckie, R. L., Barry, K. P., Ciccia, A. H., and Snyder, A. Z. (2013). fmri reveals reciprocal inhibition between social and physical cognitive domains. NeuroImage, 66:385–401.

Levy, M. and Hadar, I. (2018). The importance of empathy for analyzing privacy requirements. In 2018 IEEE 5th International Workshop on Evolving Security & Privacy Requirements Engineering (ESPRE), pages 9–13. IEEE.

Li, H., Zhang, L., Zhang, L., and Shen, J. (2010). A user satisfaction analysis approach for software evolution. In PIC, volume 2.

Maalej, W. and Nabil, H. (2015). Bug report, feature request, or simply praise? on automatically classifying app reviews. In RE.

Madjarov, G., Kocev, D., Gjorgjevikj, D., and Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern recognition, 45(9):3084–3104.

Mao, J.-Y., Vredenburg, K., Smith, P. W., and Carey, T. (2005). The state of user-centered design practice. Communications of the ACM, 48(3):105–109.

Mathews, C., Ye, K., Grozdanovski, J., Marinelli, M., Zhong, K., Khalajzadeh, H., Obie, H., and Grundy, J. (2021). AH-CID: A Tool to Automatically Detect Human-Centric Issues in App Reviews [Data set]. https://doi.org/10.5281/zenodo.4475066.

Miller, T., Pedell, S., Lopez-Lorca, A. A., Mendoza, A., Sterling, L., and Keirnan, A. (2015). Emotion-led modelling for people-oriented requirements engineering: the case study of emergency systems. Journal of Systems and Software, 105:54–71.

Obie, H. O., Hussain, W., Xia, X., Grundy, J., Li, L., Turhan, B., Whittle, J., and Shahin, M. (2021). A first look at human values-violation in app reviews. In ICSE-SEIS.

Panichella, S., Di Sorbo, A., Guzman, E., Visaggio, C. A., Canfora, G., and Gall, H. C. (2015). How can I improve my app? classifying user reviews for software maintenance and evolution. In ICSME.

Rahmawati, D. and Khodra, M. L. (2015). Automatic multilabel classification for indonesian news articles. In 2015 2nd International Conf. on Advanced Informatics: Concepts, Theory and Applications (ICAICTA), pages 1–6. IEEE.

Stock, S. E., Davies, D. K., Wehmeyer, M. L., and Palmer, S. B. (2008). Evaluation of cognitively accessible software to increase independent access to cellphone technology for people with intellectual disability. Journal of Intellectual Disability Research, 52(12):1155–1164.

Wirtz, S., Jakobs, E.-M., and Ziefle, M. (2009). Age-specific usability issues of software interfaces. In Proceedings of the IEA, volume 17.
