Towards More Reliable Text Classification on Edge Devices via a
Human-in-the-Loop
Jakob Smedegaard Andersen and Olaf Zukunft
Hamburg University of Applied Sciences, Department of Computer Science, Hamburg, Germany
Keywords:
Text Classification, Interactive Machine Learning, Time Efficiency.
Abstract:
Reliably classifying huge amounts of textual data is a primary objective of many machine learning applica-
tions. However, state-of-the-art text classifiers require extensive computational resources, which limit their
applicability in real-world scenarios. In order to improve the application of lightweight classifiers on edge
devices, e.g. personal work stations, we adapt the Human-in-the-Loop paradigm to improve the accuracy
of classifiers without re-training by manually validating and correcting parts of the classification outcome.
This paper performs a series of experiments to empirically assess the performance of the uncertainty-based
Human-in-the-Loop classification of nine lightweight machine learning classifiers on four real-world classifi-
cation tasks using pre-trained SBERT encodings as text features. Since time efficiency is crucial for interactive
machine learning pipelines, we further compare the training and inference time to enable rapid interactions.
Our results indicate that lightweight classifiers with a human in the loop can reach strong accuracies, e.g.
improving a classifier’s F1-Score from 90.19% to 97% when 22.62% of a dataset is classified manually. In
addition, we show that SBERT-based classifiers are time efficient and can be re-trained in < 4 seconds using
a Logistic Regression model.
1 INTRODUCTION
Maximizing the accuracy of automatic classifiers is
a key goal of machine learning (LeCun et al., 2015).
State-of-the-art text classifiers reach remarkable ac-
curacy across many domains (Devlin et al., 2018;
Sachan et al., 2019; Yang et al., 2019). In particular, transformer-based classifiers such as BERT (Devlin et al., 2018) or XLNet (Yang et al., 2019) have been demonstrated to be the best performing approaches in
many text classification tasks. However, such strong
classifiers are usually highly complex and consist
of millions of parameters limiting their applicability
on weak computational infrastructure, i.e. edge de-
vices. The increasing energy consumption of state-of-
the-art classifiers also creates environmental concerns
(Strubell et al., 2019; Schwartz et al., 2020). If such
strong models are not applicable, practitioners are ex-
cluded from their application and have to switch to
less resource-intensive models that come at the cost
of less reliable outcomes. Corazza et al. (Corazza
et al., 2020), for example, report an F1-Score of 82%
for detecting hate speech in online forums using a
traditional Word-Embedding-based classifier, which
might not satisfy the demand of forum providers.
The need for reliable and trustworthy classifiers has recently gained increasing attention (Kendall and Gal, 2017;
Holzinger, 2016; Sacha et al., 2015). Human-in-
the-Loop machine learning (Holzinger, 2016) aims to
overcome the obstacles of pure automatic classifiers
by involving domain experts in the machine learning loop. Letting experts correct classification out-
comes during their daily work, e.g. Journalist-in-the-
Loop (Karmakharm et al., 2019), is a promising way
to increase the accuracy of classification outcomes
without re-training (Pavlopoulos et al., 2017). In par-
ticular, uncertainty-based approaches have been shown to
be capable of detecting highly unreliable outcomes
which are worth checking manually (He et al., 2020;
Hendrycks and Gimpel, 2016).
The success of Human-in-the-Loop classification
approaches does not only depend on a model’s ini-
tial performance (e.g. F1-Score). An uncertainty-
based semi-automated text classification approach re-
quires accurate uncertainty estimations able to in-
dicate misclassifications. Estimating reliable uncer-
tainty scores in classification models is difficult, es-
pecially using Neural Networks (Hernández-Lobato
and Adams, 2015). The question arises whether sim-
pler models provide more accurate uncertainty esti-
mations, which lead to higher F1-Scores when a cer-
tain number of the most uncertain instances are de-
cided by a human rather than an automatic classifier.
Furthermore, Human-in-the-Loop machine learning
pipelines require rapid interaction cycles to e.g. re-
train the model from time to time when additional
human feedback is available (Amershi et al., 2014).
Since the applicability of strong classifiers, i.e. BERT,
is very limited on time dependent tasks and on weak
computational infrastructure, we aim for a time effi-
cient use of computational resources to enable rapid
Human-in-the-Loop interactions. This is especially important as classifiers benefit from being frequently re-trained when additional labeled data instances are available (Arnt
and Zilberstein, 2003; Haering et al., 2021).
In this paper, we empirically examine the quality
of predicted probabilities and the macro F1-Score of
nine commonly used and lightweight machine learn-
ing text classification models when a certain amount
of the data is decided by humans instead of a ma-
chine. We perform several experiments on four pub-
licly available benchmark datasets in the domain of
text classification. As feature representations, we
use semantically meaningful SBERT encodings (Reimers and Gurevych, 2019), which have been shown to be ef-
ficiently computable while outperforming other re-
cent pre-trained language models such as the Univer-
sal Sentence Encoder (Cer et al., 2018) or averaged
GloVe embeddings (Pennington et al., 2014). To en-
sure rapid interaction cycles, we additionally compare
the time needed to perform training and inference on
a weak computational infrastructure. We focus on the
following research questions:
RQ1: How accurately do different lightweight classifiers estimate predicted probabilities?
RQ2: Which lightweight classifier can capture
the highest proportion of misclassifications via
uncertainty-sampling, which after removal leads
to the highest macro F1-Score?
RQ3: How many of the most uncertain classifi-
cation outcomes have to be manually annotated to
reach a certain level of macro F1-Score?
RQ4: How efficient are different classifiers re-
garding their training and inference time?
The remainder of the paper is structured as fol-
lows. Section 2 outlines the task of classification and
its extension to the Human-in-the-Loop paradigm.
Further, several classification models and techniques
to estimate the uncertainty of individual classifica-
tions are described. In Section 3 we outline our re-
search design and Section 4 reports our experimental
results. Section 5 discusses our findings and Section
6 states related work. Finally, in Section 7 we draw
our conclusions.
2 SEMI-AUTOMATIC TEXT
CLASSIFICATION
We first outline the task of text classification (Ras-
mussen and Williams, 2006) and afterwards introduce
the uncertainty-based semi-automatic classification of
text.
The objective of classification is to predict class labels $y \in Y \subseteq \mathbb{N}$ for new data instances $x \in X \subseteq \mathbb{R}^n$, e.g. text encodings, which are related according to an unknown conditional class probability $p(y = c|x)$. Classification models aim to learn a function of the form $f : X \rightarrow Y$ or $f : X \rightarrow p(Y|X)$ from a set of labeled training examples $D \subseteq X \times Y$. Given an instance $x$, a probability-based model $f$ reports the label which receives the highest conditional class probability $y^* = f(x) = \arg\max_c p(y = c|x)$ over all classes $c$. Since not all classifiers are able to report probabilities, fractions of majority votes or scaling techniques are carried out to transform classification outcomes, e.g. distance functions, into probability distributions (Platt et al., 1999).
A common method to assess the uncertainty of
classifiers is by calculating Shannon’s Entropy (Shan-
non, 2001) of the conditional class probabilities, that
is:
$$H(x) = -\sum_{c} p(y = c|x)\, \log_2 p(y = c|x) \qquad (1)$$
Shannon’s Entropy estimates uncertainty as a lack
of confidence in all class outcomes. The most uncer-
tain instance $u$ can be identified as $u = \arg\max_x H(x)$. A prediction $f(x)$ maximizes $H(x)$ when all class outcomes are equally probable, e.g. $p(0|x) = p(1|x) = 0.5$ in a binary classification task, and minimizes $H(x)$ when either $p(0|x)$ or $p(1|x)$ is equal to 1. Sampling a subset of the most uncertain data instances is commonly referred to as uncertainty sampling (Lewis and Gale, 1994).
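As an illustration, a minimal NumPy sketch of this sampling step (the array probs is a stand-in for the conditional class probabilities reported by a classifier, one row per instance; the function names are hypothetical):

```python
import numpy as np

def entropy(probs, eps=1e-12):
    # Shannon's Entropy (Eq. 1) per instance; probs has shape (n_instances, n_classes).
    return -np.sum(probs * np.log2(probs + eps), axis=1)

def uncertainty_sampling(probs, fraction=0.1):
    # Indices of the `fraction` most uncertain instances, highest entropy first.
    h = entropy(probs)
    k = int(np.ceil(fraction * len(h)))
    return np.argsort(-h)[:k]
```

Only the returned indices would be routed to a human annotator; all remaining instances keep their automatically predicted labels.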
Manually annotating text is a typical labeling task,
where humans are asked to manually infer labels for
some data instances. Since manual labeling is cost
intensive and time-consuming, it makes sense to let
humans only observe instances where a model pro-
vides unreliable and probably wrong outcomes. Hu-
man efforts should be focused on the most uncertain
predictions to maximize the efficiency of their partic-
ipation (Hendrycks and Gimpel, 2016). Especially,
since uncertainty inherent in data instances cannot be
explained by classifiers causing unreliable model be-
haviour (Kendall and Gal, 2017). In order to spend
human effort in a rewarding and efficient way, a classi-
fier has to provide a decent ranking of misclassifica-
tions in regard to the reported uncertainty scores.
2.1 Machine Learning Classifiers
Several classification approaches are successfully
used to classify text documents (Lai and Tsai, 2004;
Liu and Chen, 2017; Stanik et al., 2019). However,
previous work mostly focuses on pure automatic ap-
proaches and does not cover the objective of semi-
automated classification. It remains unclear which
model is most efficient when humans are involved in
the classification process while saving time and com-
putational costs. In our experiments, we consider the
following lightweight machine learning models for
classification (Bishop, 2006; Hastie et al., 2009) and
outline how to obtain conditional class probabilities
for the assessment of uncertainties.
These are: (1) a Decision Tree (DT) which esti-
mates its conditional class probabilities by reporting
pre-calculated fractions of correct class outcomes of
each leaf node during training (Neville et al., 2003).
(2) A Random Forest (RF) which reports conditional
class probability as the fraction of trees voting for a
certain class outcome. Further, we consider (3) a k-
Nearest Neighbour (kNN) classifier where new doc-
uments are classified according to the k most sim-
ilar documents of the training dataset. A majority
vote is carried out to determine the final class out-
come. Analogously to a Random Forest, we consider
the fraction of votes as the conditional class probabil-
ity. Naive Bayes classifiers are a family of conditional
probability models which use Bayes rule to infer con-
ditional class probabilities. Since SBERT encodings
consist of continuous and also negative attributes, we
apply (4) Gaussian Naive Bayes (GNB), a variation
which makes the assumption that attributes of the fea-
ture vector are distributed according to a normal dis-
tribution. (5) A Support Vector Machine (SVM)
classifies data by searching an optimal linear hyper-
plane which separates features with a maximal mar-
gin. The classification rule is based on which side of
the hyperplane a data point occurs. In this paper, we
apply Platt scaling (Platt et al., 1999) to obtain condi-
tional probabilities from SVM outcomes. (6) Logistic
Regression (LR) is a commonly used classifier which
is capable of additionally predicting conditional class
probabilities. A Logistic Regression model uses a sig-
moid function to squeeze the output of a linear pre-
dictor function between 0 and 1 to represent class
probabilities. (7) A Multilayer Perceptron (MLP)
is a Neural Network-based classifier consisting of lay-
ers of interconnected computational units performing
summation and thresholding. Similar to Logistic Re-
gression, the class activation scores are normalized to
obtain pseudo class probabilities.
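For illustration, the deterministic classifiers above all expose conditional class probabilities through scikit-learn's predict_proba interface; setting probability=True on the SVM enables Platt scaling (Platt et al., 1999). This is a sketch under stated assumptions, with random arrays standing in for SBERT features and labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)  # stand-ins for SBERT features
X_test = rng.normal(size=(50, 768))

classifiers = {
    "DT": DecisionTreeClassifier(),    # leaf-node class fractions
    "RF": RandomForestClassifier(),    # fraction of tree votes
    "kNN": KNeighborsClassifier(),     # fraction of neighbour votes
    "GNB": GaussianNB(),               # Gaussian class likelihoods
    "SVM": SVC(probability=True),      # Platt scaling
    "LR": LogisticRegression(),        # sigmoid outputs
    "MLP": MLPClassifier(),            # softmax over output activations
}

probas = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    probas[name] = clf.predict_proba(X_test)  # p(y = c | x) for each test instance
```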
Further, we consider a Bayesian approach to en-
able rich uncertainty interpretations (Gal and Ghahra-
mani, 2016; Siddhant and Lipton, 2018). A Bayesian
classifier replaces the model's weights $\omega$ with distributions, i.e. a Gaussian prior $\omega \sim \mathcal{N}(0, 1)$. Since the posterior probability $p(\omega|X, Y)$ cannot be evaluated analytically, several approximation techniques are used in practice (Blundell et al., 2015; Gal and Ghahramani, 2016). Sample-based approximations aim to fit the posterior $p(\omega|X, Y)$ with a simple-to-compute distribution $q^*(\omega)$. The conditional class probability can then be approximated by averaging $T$ Monte Carlo samples over possible weights. In this paper, we consider (8) a Bayesian variation of the Multilayer Perceptron (B-MLP). Bayesian models are of particular interest since they also capture uncertainty inherent in the model's parameters (Kendall and Gal, 2017), while conventional deterministic classifiers only assess uncertainties inherent in the data.
A holistic uncertainty assessment of Bayesian classi-
fiers can be carried out by calculating Shannon’s En-
tropy (Eq. 1) on the mean conditional class proba-
bilities obtained by averaging the results of multiple
model runs.
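A minimal Keras sketch of this dropout-based approximation (the hidden layer sizes mirror the [768, 500, 500, C] structure used in Section 3.3; the dropout rate is an assumption): dropout stays active at prediction time via training=True, and T stochastic forward passes are averaged before Shannon's Entropy (Eq. 1) is evaluated on the mean probabilities.

```python
import numpy as np
import tensorflow as tf

def build_mlp(n_features=768, n_classes=2, p_drop=0.5):
    # Two hidden layers with dropout; the softmax output gives p(y = c | x).
    inputs = tf.keras.Input(shape=(n_features,))
    x = tf.keras.layers.Dense(500, activation="relu")(inputs)
    x = tf.keras.layers.Dropout(p_drop)(x)
    x = tf.keras.layers.Dense(500, activation="relu")(x)
    x = tf.keras.layers.Dropout(p_drop)(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

def mc_dropout_proba(model, X, T=100):
    # Keep dropout active (training=True) and average T Monte Carlo forward passes.
    samples = np.stack([model(X, training=True).numpy() for _ in range(T)])
    return samples.mean(axis=0)

def predictive_entropy(mean_probs):
    # Shannon's Entropy (Eq. 1) of the mean conditional class probabilities.
    return -np.sum(mean_probs * np.log2(mean_probs + 1e-12), axis=1)
```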
3 BENCHMARK DESIGN
To answer our research questions, we first assess the
quality of predicted probabilities provided by the clas-
sifiers outlined in Section 2.1. We measure the Brier
score (Brier et al., 1950) of each classifier applied
to each dataset. The Brier score is a proper scoring
rule to measure the accuracy of predicted probabili-
ties. It is calculated as the squared error of the pre-
dicted probabilities and true class outcomes, that is:
$$BS = |Y|^{-1} \sum_{y \in Y} \sum_{c \in C} \left( p(y = c|x) - I(\hat{y} = c) \right)^2 \qquad (2)$$

where $I(\hat{y} = c) = 1$ if the true class of $x$, represented by $\hat{y}$, is equal to $c$, and 0 otherwise. The lower the Brier score, the better calibrated the conditional class probabilities are. Calibrated class probabilities are desired to reliably assess the true probability of predictions, leading to a more accurate quantification of predictive uncertainties. Second, we compare the macro F1-Score
of these classifiers when a certain amount of the most
uncertain data instances, in our case 0, 10, 20, and
30%, are removed from the test dataset. We use the
macro-average since some used datasets are highly
imbalanced and we aim to treat all classes equally
important. Third, we estimate the amount of man-
ual effort a human has to spend, i.e. by correcting
classifier outcomes, in order to reach a specific target
macro F1-Score. We measure human efforts in terms
of instances a human has to decide manually. In our
experiments, we simulate human annotations by se-
lecting the ground truth label for each annotation re-
quest, a common approach when evaluating interac-
tive machine learning approaches (Siddhant and Lip-
ton, 2018). Since human annotations are known to be
noisy, we simulate three different human noise levels.
We assign a randomly selected class label with a prob-
ability of 0, 5, or 10% respectively to each annotation
request instead of the ground truth label. The macro
F1-Score of a combined human-automatic classifier
is calculated based on the unified sets of manually
corrected and automatically inferred labels. Fourth,
we measure the training and inference time to assess
the computational efficiency of different text classi-
fiers. All experiments are run on an Intel® Xeon® Gold 5115 CPU @ 2.40GHz using 1 core and 4 GB of memory. All reported measurements are the mean over five stratified cross-validation folds with a 50% training-test split. In the following, we use the shortened term
”F1-Score” to refer to the macro F1-Score.
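To make this protocol concrete, the following sketch (hypothetical array names; probs, y_pred and y_true stand for the predicted probabilities, automatic labels and ground truth of a test set) computes a multi-class Brier score consistent with Eq. 2 and the macro F1-Score of the combined human-machine classifier when a given fraction of the most uncertain predictions is replaced by simulated, possibly noisy, human annotations:

```python
import numpy as np
from sklearn.metrics import f1_score

def brier_score(probs, y_true):
    # Eq. 2: mean squared error between predicted probabilities and one-hot true labels.
    one_hot = np.eye(probs.shape[1])[y_true]
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1))

def hitl_macro_f1(probs, y_pred, y_true, fraction, noise=0.0, seed=0):
    # Macro F1 after a (noisy) human corrects the `fraction` most uncertain predictions.
    rng = np.random.default_rng(seed)
    n_classes = probs.shape[1]
    h = -np.sum(probs * np.log2(probs + 1e-12), axis=1)    # entropy-based uncertainty (Eq. 1)
    k = int(np.ceil(fraction * len(h)))
    idx = np.argsort(-h)[:k]                                # most uncertain instances
    human = y_true[idx].copy()                              # simulated annotator starts from ground truth
    flips = rng.random(k) < noise                           # noise level of 0, 5 or 10%
    human[flips] = rng.integers(0, n_classes, flips.sum())  # replace with a randomly selected class
    combined = y_pred.copy()
    combined[idx] = human                                   # unify manual and automatic labels
    return f1_score(y_true, combined, average="macro")
```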
3.1 Datasets
We consider four different publicly available real-
world datasets covering heterogeneous classification
tasks in our experiments. Key statistics of the datasets
are summarized in Table 1.
Table 1: Statistics of the datasets including size, number of
classes and its distribution as well as the mean number and
standard deviation of words per text instance.
Dataset       Size    |C|   Class Distribution                   #Words (µ ± σ)
IMDB          50000   2     25000:25000                          234 ± 173
App Store     5752    3     3472:1286:994                        24 ± 29
Reuters       8614    8     3930:2319:527:499:458:425:290:166    117 ± 129
Hate Speech   24783   2     19190:5593                           15 ± 7
First, we use the IMDB dataset (Maas et al.,
2011), a commonly used benchmark for sentiment
analysis. The dataset consists of highly polarized
film reviews, which are either labeled as positive or
negative. Second, we consider a corpus of app re-
views from the domain of participatory requirements
engineering. The App Store dataset (Maalej et al.,
2016) contains user reviews, which are manually la-
beled as feature request, bug report or praise. Third,
we take a dataset collected from the Reuters financial
newswire service (Lewis et al., 2004). Documents are
labeled regarding their topic. In our experiments, we
use a subset of the 8 most frequent topics with unam-
biguous labels. Lastly, we consider the Hate Speech
dataset (Davidson et al., 2017) which comes with the
task of identifying toxic tweets (hate speech or offen-
sive language). For each dataset, we apply a stratified
split of 50% for training and the remaining for testing.
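The stratified 50/50 split can be reproduced with scikit-learn; the lists below are stand-ins for a loaded dataset:

```python
from sklearn.model_selection import train_test_split

texts = ["great movie", "terrible plot", "loved it", "boring"] * 10   # stand-in documents
labels = [1, 0, 1, 0] * 10                                            # stand-in class labels

# 50% training, 50% testing, preserving each dataset's class distribution (stratified split).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0)
```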
3.2 Document Features
Text documents consist of sequences of characters
and have to be transformed to a vector space before
passing them to machine learning models. As the fea-
ture representation for text documents, we consider
Sentence-BERT (SBERT) (Reimers and Gurevych,
2019) encodings. SBERT is a modification of the pre-
trained Bidirectional Encoder Representations from
Transformers (BERT) (Devlin et al., 2018) model and
provides semantically meaningful encodings for unla-
beled text documents without the need for domain-spe-
cific pre-training and fine-tuning. Studies show that
SBERT encodings outperform out-of-the-box BERT
encodings in several text classification tasks (Reimers
and Gurevych, 2019). Furthermore, SBERT encod-
ings are resource efficient to compute.
SBERT, like other BERT variations, encodes a document $d$ as an $n$-dimensional vector of continuous attributes $x = (a_1, \ldots, a_n)$. We employ the pre-trained bert-base-nli-mean-tokens model (https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens), which computes
encodings of length n = 768. Since BERT uses sub-
word tokenization, BERT encodings are limited to
512 tokens, which are around 300 to 400 words.
Therefore, we use the mean SBERT encoding of each
individual sentence for the IMDB dataset to avoid
truncation. In a preliminary investigation, we found
that mean SBERT encodings have a positive effect
only on the F1-Score of the IMDB classifiers.
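A minimal sketch of computing these features with the sentence-transformers library; the naive sentence splitting for long IMDB reviews is an assumption about the exact preprocessing:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")  # 768-dimensional encodings

def encode_documents(texts):
    # Short documents (App Store, Reuters, Hate Speech) are encoded directly.
    return model.encode(texts, show_progress_bar=False)

def encode_long_document(text):
    # Long IMDB reviews: encode each sentence and average to avoid the 512-token truncation.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return np.mean(model.encode(sentences, show_progress_bar=False), axis=0)
```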
3.3 Classifier Implementations
For the majority of classifiers, we rely on the default
implementation provided by the Scikit-learn library (https://scikit-learn.org/stable/index.html), since these are commonly used for machine learning
experiments. For the Random Forest classifier, we
use T = 100 Decision Trees and set k = 25 for the
kNN classifier. The structure of the MLP takes the
shape [768, 500, 500, C]. We do not perform hyper-
parameter tuning. Since Scikit-learn does not offer a
Bayesian-MLP, we employ Tensorflow (https://www.tensorflow.org/) version 2.4.1 for the implementation (B-MLP*). We approximate
the posterior using a dropout variational distribution
(Gal and Ghahramani, 2016) and apply T = 100 for-
ward passes. Since an identical recreation of Scikit-
learn’s MLP in Tensorflow is difficult, we addition-
ally develop a (9) conventional non-Bayesian-MLP
(MLP*) to compare the impact of Bayesian mod-
elling. In comparison to Scikit-learn’s MLP imple-
mentation, our MLP model applies dropout similarly
to the Bayesian MLP, but only during training. Fur-
ther, we select 10% of the training data as valida-
tion data for all MLP implementations to enable early
stopping. The source code of all models, parameters and experiments is publicly available at https://github.com/jsandersen/MRTviaHIL.
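For reference, a sketch of the non-default scikit-learn settings stated above (all other parameters are assumed to remain at their defaults):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

rf = RandomForestClassifier(n_estimators=100)         # T = 100 decision trees
knn = KNeighborsClassifier(n_neighbors=25)            # k = 25
mlp = MLPClassifier(hidden_layer_sizes=(500, 500),    # 768 -> 500 -> 500 -> C
                    early_stopping=True,
                    validation_fraction=0.1)          # 10% of the training data for early stopping
```

The Tensorflow-based MLP* and B-MLP* follow the Keras sketch given in Section 2.1, with dropout kept active at prediction time only for the Bayesian variant.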
4 RESULTS
In this section, we present the results of our experi-
ments and answer the four research questions.
4.1 Quality of Predicted Probabilities
(RQ1)
The Brier scores of our experiments covering nine
classifiers and four datasets are shown in Table 2. The
lower the Brier score, the more accurate the predicted
conditional class probabilities. A Brier score of 0 in-
dicates a perfectly accurate classifier, whereas a score
of 1 indicates a highly inaccurate one.
Table 2: Brier scores of different classifiers and datasets
measuring the accuracy of the predicted conditional class
probabilities. The lower the Brier score, the better calibrated the conditional class probabilities are.
Classifier IMDB App Store Reuters Hate Speech AVG
DT 0.4259 0.5443 0.4869 0.4659 0.4806
RF 0.2059 0.2629 0.2247 0.2221 0.2289
kNN 0.2076 0.2629 0.1794 0.2302 0.2200
GNB 0.3174 0.4615 0.3919 0.4962 0.4168
SVM 0.1460 0.2435 0.1043 0.1934 0.1718
LR 0.1464 0.2398 0.0954 0.1919 0.1684
MLP 0.1592 0.2098 0.1029 0.2045 0.1691
MLP* 0.1542 0.2060 0.0939 0.1879 0.1605
B-MLP* 0.1513 0.2036 0.0920 0.1819 0.1572
The table reveals huge differences between the
classifiers regarding their quality of predicted prob-
abilities. A DT and GNB provide the worst calibrated
probabilities with an average Brier score of 0.48 and
0.42 respectively. RF and kNN reach nearly equally calibrated probabilities, with average scores of about 0.22.
SVM, LR and MLP as well as its variations receive
the best Brier scores. LR followed by SVM ob-
tains the best scores on the IMDB dataset, whereas
MLP* and B-MLP* receive the best scores on the
App Store, Reuters and Hate-Speech Dataset. Over-
all, a Bayesian MLP (B-MLP*) followed by a dropout-based MLP (MLP*) provide the most accurate proba-
bilities with an average Brier score of 0.16.
4.2 Classifier Performance under
Stepwise Removal of Uncertain
Instances (RQ2)
Table 3 lists the F1-Scores of the classifiers applied
to each of the datasets. The columns represent the
F1-Scores which are obtained when a certain num-
ber (0, 10, 20 and 30%) of the most uncertain in-
stances are removed from the test set. Each cell addi-
tionally states the relative improvement of F1-Score
in relation to the previous removal ratio. For exam-
ple, an SVM on the IMDB dataset reaches an F1-Score
of 90.24% on the entire test dataset. If 10% of the
most uncertain instances are removed, the F1-Score
increases to 93.60% which is a relative improvement
of 3.72%.
Our experiment shows that on the whole test
dataset, i.e. using a removal ratio of 0%, a DT and
GNB provide the worst F1-Scores, followed by the
kNN and RF classifiers. LR, SVM and the MLP
reach the highest initial F1-Scores. Scikit-learn’s
MLP implementation provides a worse performance
compared to our Tensorflow implementation. In our
setting, Bayesian modelling (B-MLP*) shows no im-
provement in F1-Score compared to a deterministic
MLP. Overall, SVM and LR classifiers perform best
on the IMDB and Reuters datasets, whereas a deter-
ministic MLP with dropout performs best on the App
Store and Hate Speech datasets.
When a certain number of the most uncertain in-
stances are removed from the test dataset, the F1-
Score generally increases. Only the uncertainty es-
timates of a Decision Tree classifier are unable to de-
tect misclassifications, since the F1-Score does not improve when removing highly uncertain instances.
Further, the relative F1-Score improvements decrease
with larger removal ratios indicating a decreasing hu-
man efficiency when large amounts of removed data
are passed to human annotators. Overall, classifiers
with high initial F1-Scores also reach the best F1-
Scores after removing uncertain instances from the
test dataset. Only the initially better performing LR
gets outperformed by the MLP* when removing un-
certain instances on the Reuters dataset.
4.3 Semi-automated Classification
Performance (RQ3)
Table 4 shows how much of the most uncertain in-
stances from the unseen test set have to be classified
Table 3: F1-Scores of different classifiers when a certain number of the most uncertain predictions were removed from the
test dataset.
                     IMDB                               App Store                          Reuters                            Hate Speech
Classifier    0%      10%     20%     30%       0%      10%     20%     30%       0%      10%     20%     30%       0%      10%     20%     30%
DT            78.70   78.75   78.77   78.77     64.80   65.02   64.94   65.00     58.11   58.30   58.15   58.30     66.99   66.95   67.03   67.05
  rel. impr.          +0.05   +0.03   -0.00             +0.33   -0.11   +0.09             +0.34   -0.27   +0.27             -0.01   +0.11   +0.02
RF            86.37   89.69   92.33   94.34     76.02   79.41   81.94   82.61     75.47   81.41   85.42   89.80     73.47   74.07   73.23   70.91
  rel. impr.          +3.84   +2.94   +2.18             +4.47   +3.18   +0.82             +7.87   +4.93   +5.13             +0.82   -1.14   -3.16
kNN           85.54   88.88   91.61   93.71     76.65   80.70   83.85   86.69     76.46   81.36   83.66   78.27     68.48   67.54   67.06   67.26
  rel. impr.          +3.91   +3.07   +2.30             +5.28   +3.90   +3.38             +6.41   +2.82   -6.43             -1.37   -0.72   +0.30
GNB           83.95   87.25   90.16   92.55     72.86   76.26   79.51   82.66     69.06   73.98   80.37   87.62     69.95   72.82   75.76   78.95
  rel. impr.          +3.93   +3.33   +2.65             +4.66   +4.26   +3.96             +7.12   +8.63   +9.03             +4.11   +4.03   +4.21
SVM           90.24   93.60   95.74   97.04     76.84   80.83   84.84   88.45     88.38   93.98   96.77   98.44     79.51   82.16   84.42   86.15
  rel. impr.          +3.72   +2.29   +1.35             +5.19   +4.96   +4.26             +6.34   +2.97   +1.73             +3.33   +2.76   +2.05
LR            90.19   93.53   95.71   97.02     78.89   83.33   87.59   91.19     88.75   94.98   97.56   98.42     80.17   83.54   86.44   88.42
  rel. impr.          +3.70   +2.33   +2.37             +5.62   +5.12   +4.11             +7.02   +2.72   +0.88             +4.20   +3.47   +2.29
MLP           89.55   92.96   95.32   96.85     80.09   84.65   88.55   92.10     87.77   94.32   97.57   98.64     80.06   83.53   86.54   88.82
  rel. impr.          +3.81   +2.53   +1.61             +5.69   +4.62   +4.00             +7.46   +3.44   +0.11             +4.33   +3.60   +2.64
MLP*          89.94   93.36   95.59   96.96     81.08   86.12   89.89   92.97     88.68   95.24   97.87   99.15     81.12   84.77   88.24   90.68
  rel. impr.          +3.80   +2.39   +1.43             +6.21   +4.38   +3.42             +7.39   +2.76   +1.31             +4.50   +4.10   +2.77
B-MLP*        89.95   93.39   95.66   97.08     80.99   86.02   89.74   92.97     88.76   95.14   98.09   99.24     81.12   84.80   87.99   90.42
  rel. impr.          +3.83   +2.43   +1.48             +6.21   +4.32   +3.61             +7.19   +3.10   +1.17             +4.54   +3.76   +2.76
Table 4: Performance of Human-in-the-Loop classifiers. Each cell shows what percentage of the most uncertain classification outcomes has to be manually annotated in order to reach a certain F1-Score given a specific noise level of the
human annotator. Unobtainable F1-Scores due to a high number of committed human errors are marked as ”-”.
0% human noise
                     IMDB                               App Store                          Reuters                            Hate Speech
Classifier    .93     .95     .97     .99       .89     .91     .93     .95       .93     .95     .97     .99       .89     .91     .93     .95
DT            67.18   76.55   85.94   95.12     68.46   74.03   80.16   86.01     83.49   87.76   92.76   97.42     66.58   72.76   78.56   84.68
RF            17.00   24.84   35.94   57.04     22.53   28.41   33.82   40.43     14.21   18.34   24.40   38.91     25.67   30.76   37.44   48.14
kNN           18.87   26.94   38.44   60.21     21.96   26.35   31.41   37.48     15.37   18.50   23.31   34.50     23.31   26.97   31.75   38.55
GNB           22.87   30.97   42.78   64.06     29.79   34.01   38.61   44.06     28.47   32.39   37.87   51.78     41.68   48.01   54.70   62.52
SVM           6.33    12.71   22.46   44.42     18.43   22.34   27.28   32.42     5.11    8.94    13.47   24.75     15.03   19.60   24.95   32.46
LR            6.60    12.88   22.62   43.95     15.21   19.24   23.56   29.22     3.90    7.24    12.19   23.91     14.17   18.56   23.72   30.88
MLP           8.00    14.60   24.18   46.25     13.49   17.68   21.62   27.00     5.32    8.31    12.91   22.31     13.83   18.24   23.55   31.25
MLP*          7.08    13.40   23.19   45.91     11.42   15.39   19.77   25.34     4.16    6.99    11.75   20.64     12.19   16.33   21.26   28.74
B-MLP*        7.02    13.17   22.77   44.16     11.67   15.68   20.06   25.56     4.23    7.43    11.52   19.80     12.23   16.28   21.06   27.95

5% human noise
Classifier    .93     .95     .97     .99       .89     .91     .93     .95       .93     .95     .97     .99       .89     .91     .93     .95
DT            75.81   86.38   97.79   -         7.75    84.07   90.30   96.53     -       -       -       -         73.77   80.46   87.39   94.54
RF            18.34   28.06   44.64   -         26.91   33.54   40.93   51.50     18.11   27.58   -       -         23.48   28.30   35.02   46.43
kNN           20.73   30.62   48.88   -         24.31   29.22   36.23   43.37     18.20   24.96   -       -         25.06   29.64   36.20   48.12
GNB           25.23   35.41   55.46   -         33.01   37.92   43.80   53.35     33.78   -       -       -         46.49   54.17   63.57   76.75
SVM           6.77    14.05   28.26   -         20.18   25.53   31.04   40.08     5.94    10.68   23.17   -         16.50   21.76   28.70   41.51
LR            7.06    14.18   27.76   -         16.49   20.96   26.44   33.95     4.27    8.80    19.13   -         15.41   20.34   27.28   38.63
MLP           8.55    16.23   29.50   -         14.52   19.15   23.84   31.48     6.11    10.47   18.53   -         15.03   20.34   27.33   39.24
MLP*          7.54    14.79   28.60   -         12.48   16.71   22.59   29.32     4.99    9.01    18.02   -         13.15   17.87   24.35   35.26
B-MLP*        7.51    14.56   27.40   -         12.77   17.21   22.68   29.76     4.97    8.61    16.32   -         13.14   17.88   24.09   34.36

10% human noise
Classifier    .93     .95     .97     .99       .89     .91     .93     .95       .93     .95     .97     .99       .89     .91     .93     .95
DT            88.38   -       -       -         87.58   95.03   -       -         -       -       -       -         83.17   91.45   99.62   -
RF            20.64   33.76   -       -         29.79   37.17   46.46   -         -       -       -       -         25.70   32.11   42.97   -
kNN           23.60   37.63   -       -         27.32   34.39   42.77   -         -       -       -       -         27.36   33.37   44.30   -
GNB           28.74   44.38   -       -         36.42   42.99   53.82   -         -       -       -       -         51.97   63.28   82.17   -
SVM           7.34    16.02   -       -         22.40   28.60   37.67   -         7.53    -       -       -         18.13   24.54   35.16   -
LR            7.51    16.00   -       -         18.18   23.37   30.48   -         5.32    13.49   -       -         16.71   22.67   32.15   -
MLP           9.17    18.06   -       -         16.46   21.96   30.16   -         7.22    14.60   -       -         16.20   22.43   32.46   -
MLP*          8.19    16.61   -       -         14.05   19.46   27.00   -         5.71    13.03   -       -         14.34   19.83   29.56   -
B-MLP*        8.10    16.50   -       -         14.49   19.52   26.47   47.47     5.99    12.03   -       -         14.39   19.78   28.04   -
manually in order to raise the semi-automatic classifi-
cation outcomes to a certain F1-Score. Each sub-table
represents a different human noise level as introduced
in Section 3. For example, on the IMDB dataset
12.71% of the most uncertain predictions have to be
manually corrected to improve the model’s F1-Score
(from initial 90.24%) to 95% using a SVM. The ta-
ble indicates that models with a high initial F1-Score
require less manual efforts to raise the F1-Score to a
certain target level. Overall, models with lower initial
F1-Scores rarely overtake better performing
classifiers in regard to the final F1-Score when human
annotators are in the loop.
Involving humans with higher noise levels re-
quires more manual efforts to reach a specific F1-
Score, which is straightforward, since more misclassi-
fications are committed. However, our results indicate
that Human-in-the-Loop text classification can reach
a higher F1-Score compared to its pure machine and
human parts on their own. For example, an LR classi-
fier with an initial F1-Score of 90.19% on the IMDB
dataset and a 10% noisy human can reach an F1-Score
of >95% (max. 96.77%) when >18.6% of the dataset
is classified manually.
Our results reveal that lightweight classifiers can
reach strong accuracies with a human in the loop
even if the annotator commits several errors. Us-
ing the best performing classifier, an F1-Score of
95% (+4.81), 91% (+9.92), 95% (+6.24), 91% (+9.88)
can be reached with a manual effort of 16.02, 19.46,
12.03, and 19.78% respectively considering a human
noise level of 10%. Compared to a 100% accurate hu-
man annotator, this is an increase in manual efforts of
24.37%, 26.45%, 61.91%, and 21.50% respectively.
Our results also demonstrate that top F1-Scores, e.g.
95-99%, are not reachable in all Human-in-the-Loop
settings. If the human annotations are too noisy, the
F1-Score starts to decrease after a certain amount of human assistance. As the workload grows, humans increasingly annotate predictions with lower uncertainty, and the accuracy decreases. This phenomenon oc-
curs because noisy humans incorrectly annotate in-
stances that the machine would have correctly decided
by itself, reducing the overall accuracy of the semi-
automated classification outcomes.
4.4 Runtime Comparison (RQ4)
The training time of the classifiers is illustrated in Fig-
ure 1. The x-axis lists the classifiers and the y-axis
represents the average training-time measured in sec-
onds in log-scale. The kNN classifier is not listed
since it is a memory-based learning algorithm that
requires no training. A GNB has the fastest train-
ing time, taking an average of 0.2 seconds for 25000
instances (IMDB). The LR has the second-shortest
training time at 3.7 seconds. A DT, MLP, and
RF perform much slower with around 64.28, 73.42,
and 91.14 seconds respectively. The dropout-based MLP* implementation took 138.66 seconds, which is nearly double the time of the MLP. MLP* and B-MLP*
require the same time for training as they share the
same training procedure. The SVM is the only clas-
sifier which shows an exponential growth in training
time in regard to the size of the training data ranging
from 12.81 seconds for the App Store (size 2876) to
27.18 minutes for the IMDB (size 25000) dataset.
Figure 2 depicts the time needed to perform in-
ference. The DT, RF, GNB, LR and the MLP classi-
fiers take less than one second to infer the labels for
25000 instances (IMDB). The MLP* implementation
is slightly slower with an average of 1.33 seconds on
the same dataset. The kNN classifier is much slower
with an inference time of 48.87 seconds (IMDB).
Figure 1: Total training time of the experiments.
Figure 2: Total inference time of the experiments.
The inference time of an SVM and kNN grows exponentially with the number of predicted texts.
The SVM needs 1.15 seconds for the App Store and
123.84 seconds for the IMDB dataset. The kNN clas-
sifier requires 0.91 seconds for the App Store and
48.87 seconds for the IMDB dataset. Sampling-based
Bayesian approximations require more time for infer-
ence since multiple forward passes have to be car-
ried out to approximate the conditional class probabil-
ity. By performing 100 forward passes, a Bayesian
MLP takes 176.41 seconds (IMDB).
5 DISCUSSION
Our results indicate that manually annotating parts
of the outcomes of lightweight text classifiers using
SBERT can lead to substantial improvements with a
manageable manual effort. A Human-in-the-Loop ap-
proach can increase the F1-Score to at least 95% on
all datasets by manually validating less than 28% of
the data. Our findings are especially important for do-
mains in which annotation tasks are still carried out
purely manually when applicable automatic classi-
fiers do not provide a required F1-Score out of the
box. While solving a classification task by hand can
be an alternative solution, many domains are con-
fronted with an overwhelming amount of data exceed-
ing human capabilities. Human-in-the-Loop classifi-
cation aims to overcome the accuracy limitations of
pure automatic classification with the cost of human
involvement. Human efforts are usually wasted when
used to perform tasks a cheap artificial model can
perform equally well. The effectiveness of Human-in-the-
Loop emerges by focusing human efforts on instances
where an automatic classifier mostly fails. Overall,
the applicability of Human-in-the-Loop text classifi-
cation depends on whether human efforts are afford-
able during a classifier's operational use and whether
more reliable classification outcomes are needed.
In our experiments, we observed large variations
between different classifiers regarding their suitabil-
ity for Human-in-the-Loop text classification. We
show that the uncertainty estimates of simple models such as Decision Trees, Gaussian Naive Bayes, Random Forest and k-Nearest Neighbour classifiers are far less accurate compared to a Logistic Regression model, Support Vector Machine or Multilayer Perceptron, limiting their suitability for uncertainty assessments. We also show that these simple models do not provide any advantages compared to a Logistic Regression model in Human-in-the-Loop classification settings, since they reach a lower F1-Score or incur much higher computational costs.
The Multilayer Perceptron and a Support Vec-
tor Machine have been shown to provide similar or even
stronger performance scores compared to Logistic
Regression, but require much more computational re-
sources. Although no classifier consistently outper-
forms the others in our experiments, a Multilayer Per-
ceptron with dropout reaches on average the high-
est performance across all datasets. Overall, classi-
fiers which reach a higher F1-Score purely automatically also require less effort to reach an even higher F1-Score when a human is placed in the loop. Further, our results indicate that Bayesian modelling, i.e. Monte Carlo Dropout (Gal and Ghahramani, 2016), slightly improves the quality of the uncertainty estimates, but does not have a great impact on the resulting F1-Score of Human-in-the-Loop classification using a small MLP and SBERT encodings as text features. Since classifiers with the highest F1-Scores also provide the best Brier scores, it is not necessary to use one classifier to estimate uncertainties and another classifier to provide the classification decision.
To enable rapid or even real-time Human-in-the-
Loop processing, a Logistic Regression model is the
fastest approach in inference and training while pro-
viding a decent initial as well as Human-in-the-Loop performance. It only requires < 4 seconds for training and inference on 25000 data instances. A Support Vector Machine is less applicable due to its comparably slow training time, and it does not scale well to large datasets. A dropout-based Multilayer Perceptron has been shown to provide on average a better performance, but comes with higher computational efforts
of a total of < 139 seconds for training and inference.
We also demonstrate that humans and machines
can work together to achieve even greater accuracy
than their individual parts. Highly uncertain instances
are most likely to be misclassified automatically, and
even noisy human annotators have the potential to
provide more accurate labels. By simulating different
kinds of human behaviour, we demonstrate the per-
formance of Human-in-the-Loop text classification
across multiple domains and human performances.
Practitioners in the loop have to judge their own behaviour to draw insights into how much effort is worth spending in the loop. Our study provides guide-
lines to support practitioners in choosing the most ef-
ficient classifier when strong classifiers are not appli-
cable because of high computational costs, and hu-
mans are willing to label some part of the classifica-
tion results.
SBERT-based classifiers clearly underperform
state-of-the-art text classifiers such as BERT. For ex-
ample, a fine-tuned BERT model has been shown to reach an
F1-Score of 93.46% (Sanh et al., 2019) on the IMDB
dataset. However, BERT requires huge computational
resources and takes multiple hours to days to be fine-
tuned on a CPU. In comparison, a Logistic Regres-
sion model employed on the same task using SBERT
encodings takes a few seconds on a CPU for training
and inference to reach an F1-Score of 90.2%, which
is a higher score than recent Word-Embedding based
approaches (He et al., 2020; Hendrycks and Gim-
pel, 2016). As shown by our results, manually an-
notating 12.70% of the data leads to an F1-Score of
95%, which outperforms BERT’s performance. Thus,
SBERT-based classifiers with a human in the loop are
an alternative or even a substitute for BERT if training and inference have to be carried out efficiently and human effort can be arranged.
This paper investigates Human-in-the-Loop clas-
sification with on-device training and inference. As an alternative to our approach, practitioners can also
train classifiers on more powerful machines if avail-
able and afterwards transfer the parameters to weak
edge devices in order to maintain applicability and
save computational costs. However, the inference
of state-of-the-art classifiers such as BERT is still
very slow on weak computational infrastructure e.g.
edge devices due to their high resource consump-
tion. With our research, we follow a more person-
alized approach, where practitioners are capable of reaching strong classification performance on weak infrastructure. In this way, we aim to support practitioners in rapidly extracting desired information from their textual data using classification on their own work stations.
6 RELATED WORK
The rising demand for interactive real-time process-
ing (Amershi et al., 2014; Dudley and Kristensson,
2018; Zanzotto, 2019) and resource efficient machine
learning (Al-Jarrah et al., 2015; Zhang et al., 2018)
raise the need for additional evaluation perspectives.
In contrast to traditional performance driven bench-
mark studies (Chauhan and Singh, 2018; Luu et al.,
2020; Stanik et al., 2019), we focus on the accuracy
and time efficiency of semi-automatic and lightweight
text classifiers.
Rattigan et al. (Rattigan et al., 2007) initially in-
vestigate the objective of maximizing the accuracy of
classifiers while limiting human efforts. On the one hand, related work in the domain of text classification focuses on approaches based on estimating thresholds
of conditional class probabilities to separate unreli-
able class outcomes similar to our work. Pavlopou-
los et al. (Pavlopoulos et al., 2017) suggest identi-
fying upper and lower class probability thresholds to
determine a fixed-size slice of data instances which
maximizes the accuracy of a model when manually
annotated. In contrast, our approach is based on un-
certainty estimates and does not require solving an
optimization task. He et al. (He et al., 2020) seek
to improve the quality of uncertainty estimates to en-
able a more efficient annotation process. However,
their approach is only applicable to Deep Neural Net-
works while ours can be applied to any classifier. On
the other side, training an additional reject function is
another common approach to delegate unreliable in-
stances to humans (Cortes et al., 2016; Geifman and
El-Yaniv, 2017). An abstain option can either be mod-
elled as an additional class outcome or achieved by training a separate classifier, leading to additional
computational costs and effort.
Manually annotating classifier outcomes can also
be considered as a special case of Algorithm-in-
the-Loop Decision Making (Green and Chen, 2019),
where humans rather than algorithms are making the
final classification decision. In contrast, our approach
seeks to only involve humans when the model is un-
able to provide reliable classification outcomes. An-
other closely related field to Human-in-the-Loop clas-
sification is Active Learning (Lewis and Gale, 1994).
Active learning seeks to minimize human efforts in
the creation of training data to reach highly accurate
classifiers. In Active Learning a machine actively
queries labels from human annotators to improve a
model’s learning behaviour. Similar to Human-in-the-
Loop classification, both approaches can utilize un-
certainty sampling to guide human involvement. In
contrast, Active Learning is applied during the train-
ing step of the initial model while our approach seeks
to further raise the accuracy of an already trained
model during its operational use. Human-in-the-Loop
classification aims to exceed the maximum achievable
accuracy (Baram et al., 2004) of a pre-trained classi-
fier with the cost of human participation during the
classification process.
7 CONCLUSION
In this paper, we conduct several experiments to iden-
tify the best performing and most time efficient semi-automatic
text classifiers using SBERT encodings. We investi-
gate the quality of uncertainty estimates as well as the
F1-Score of lightweight text classifiers, when a cer-
tain amount of the most uncertain classification out-
comes is manually validated and corrected. Further,
we assess the time needed to perform training and in-
ference to assess a model’s applicability on edge de-
vices as well as enabling rapid human interaction cy-
cles. Our study consists of nine different classification
models and four real-world text classification tasks.
Our results indicate that the initially best performing
automatic classifiers (without human involvement) re-
quire less manual effort to achieve a strong F1-Score
compared to initially weaker classifiers. We also show
that SBERT-based classifiers are time efficient and
only take seconds to a few minutes to be trained, en-
abling rapid interactive machine learning cycles. Our
research provides guidelines for semi-automatic text
classification approaches when conventional state-of-
the-art classifiers are not applicable due to time con-
straints. As further work, we plan to perform more
user experiments and investigate the acceptance of
using Human-in-the-Loop text classification in real-
world domains.
REFERENCES
Al-Jarrah, O. Y., Yoo, P. D., Muhaidat, S., Karagiannidis,
G. K., and Taha, K. (2015). Efficient machine learning
for big data: A review. Big Data Research, 2(3):87–
93.
Amershi, S., Cakmak, M., Knox, W. B., and Kulesza, T.
(2014). Power to the people: The role of humans in in-
teractive machine learning. Ai Magazine, 35(4):105–
120.
Arnt, A. and Zilberstein, S. (2003). Learning to per-
form moderation in online forums. In Proceedings
IEEE/WIC International Conference on Web Intelli-
gence (WI 2003), pages 637–641. IEEE.
Baram, Y., Yaniv, R. E., and Luz, K. (2004). Online choice
of active learning algorithms. Journal of Machine
Learning Research, 5(Mar):255–291.
Bishop, C. M. (2006). Pattern recognition and machine
learning. springer.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wier-
stra, D. (2015). Weight uncertainty in neural net-
work. In International Conference on Machine Learn-
ing, pages 1613–1622. PMLR.
Brier, G. W. et al. (1950). Verification of forecasts ex-
pressed in terms of probability. Monthly weather re-
view, 78(1):1–3.
Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John,
R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S.,
Tar, C., et al. (2018). Universal sentence encoder for
english. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing:
System Demonstrations, pages 169–174.
Chauhan, N. K. and Singh, K. (2018). A review on con-
ventional machine learning vs deep learning. In Inter-
national Conference on Computing, Power and Com-
munication Technologies (GUCON), pages 347–352.
IEEE.
Corazza, M., Menini, S., Cabrio, E., Tonelli, S., and Vil-
lata, S. (2020). A multilingual evaluation for online
hate speech detection. ACM Transactions on Internet
Technology (TOIT), 20(2):1–22.
Cortes, C., DeSalvo, G., and Mohri, M. (2016). Learning
with rejection. In International Conference on Algo-
rithmic Learning Theory, pages 67–82. Springer.
Davidson, T., Warmsley, D., Macy, M., and Weber, I.
(2017). Automated hate speech detection and the
problem of offensive language. In Proceedings of
the International AAAI Conference on Web and Social
Media, volume 11.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2018). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. arXiv preprint
arXiv:1810.04805.
Dudley, J. J. and Kristensson, P. O. (2018). A review of
user interface design for interactive machine learning.
ACM Transactions on Interactive Intelligent Systems
(TiiS), 8(2):1–37.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian
approximation: Representing model uncertainty in
deep learning. In international conference on machine
learning, pages 1050–1059. PMLR.
Geifman, Y. and El-Yaniv, R. (2017). Selective classifica-
tion for deep neural networks. In Proceedings of the
31st International Conference on Neural Information
Processing Systems, pages 4885–4894.
Green, B. and Chen, Y. (2019). Disparate interactions: An
algorithm-in-the-loop analysis of fairness in risk as-
sessments. In Proceedings of the Conference on Fair-
ness, Accountability, and Transparency, pages 90–99.
Haering, M., Andersen, J. S., Biemann, C., Loosen, W.,
Milde, B., Pietz, T., Stoecker, C., Wiedemann, G.,
Zukunft, O., and Maalej, W. (2021). Forum 4.0: An
open-source user comment analysis framework. In
Proceedings of the 16th Conference of the European
Chapter of the Association for Computational Lin-
guistics: System Demonstrations, pages 63–70.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The el-
ements of statistical learning: data mining, inference,
and prediction. Springer Science & Business Media.
He, J., Zhang, X., Lei, S., Chen, Z., Chen, F., Alhamadani,
A., Xiao, B., and Lu, C. (2020). Towards more
accurate uncertainty estimation in text classification.
In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 8362–8372.
Hendrycks, D. and Gimpel, K. (2016). A baseline for de-
tecting misclassified and out-of-distribution examples
in neural networks. arXiv preprint arXiv:1610.02136.
Hernández-Lobato, J. M. and Adams, R. (2015). Probabilis-
tic backpropagation for scalable learning of bayesian
neural networks. In International Conference on Ma-
chine Learning, pages 1861–1869. PMLR.
Holzinger, A. (2016). Interactive machine learning for
health informatics: when do we need the human-in-
the-loop? Brain Informatics, 3(2):119–131.
Karmakharm, T., Aletras, N., and Bontcheva, K. (2019).
Journalist-in-the-loop: Continuous learning as a ser-
vice for rumour analysis. In Proceedings of the 2019
Conference on Empirical Methods in Natural Lan-
guage Processing and the 9th International Joint Con-
ference on Natural Language Processing (EMNLP-
IJCNLP): System Demonstrations, pages 115–120.
Kendall, A. and Gal, Y. (2017). What uncertainties do
we need in bayesian deep learning for computer vi-
sion? In Proceedings of the 31st International Con-
ference on Neural Information Processing Systems,
pages 5580–5590.
Lai, C.-C. and Tsai, M.-C. (2004). An empirical perfor-
mance comparison of machine learning methods for
spam e-mail categorization. In Fourth International
Conference on Hybrid Intelligent Systems (HIS’04),
pages 44–48. IEEE.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learn-
ing. nature, 521(7553):436–444.
Lewis, D. D. and Gale, W. A. (1994). A sequential algo-
rithm for training text classifiers. In SIGIR’94, pages
3–12. Springer.
Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. (2004).
Rcv1: A new benchmark collection for text catego-
rization research. Journal of machine learning re-
search, 5(Apr):361–397.
Liu, Z. and Chen, H. (2017). A predictive performance
comparison of machine learning models for judicial
cases. In IEEE Symposium Series on Computational
Intelligence (SSCI), pages 1–6. IEEE.
Luu, S. T., Nguyen, H. P., Van Nguyen, K., and Nguyen,
N. L.-T. (2020). Comparison between traditional ma-
chine learning models and neural network models for
vietnamese hate speech detection. In International
Conference on Computing and Communication Tech-
nologies (RIVF), pages 1–6. IEEE.
Maalej, W., Kurtanović, Z., Nabil, H., and Stanik, C.
(2016). On the automatic classification of app reviews.
Requirements Engineering, 21(3):311–331.
Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y.,
and Potts, C. (2011). Learning word vectors for sen-
timent analysis. In Proceedings of the 49th annual
meeting of the association for computational linguis-
tics: Human language technologies, pages 142–150.
Neville, J., Jensen, D., Friedland, L., and Hay, M. (2003).
Learning relational probability trees. In Proceedings
of the ninth ACM SIGKDD international conference
on knowledge discovery and data mining, pages 625–
630.
Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I.
(2017). Deep learning for user comment moderation.
In Proceedings of the First Workshop on Abusive Lan-
guage Online, pages 25–35.
Pennington, J., Socher, R., and Manning, C. D. (2014).
Glove: Global vectors for word representation. In
Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP),
pages 1532–1543.
Platt, J. et al. (1999). Probabilistic outputs for support vec-
tor machines and comparisons to regularized likeli-
hood methods. Advances in large margin classifiers,
10(3):61–74.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian
Processes for Machine Learning. MIT Press.
Rattigan, M. J., Maier, M., and Jensen, D. (2007). Exploit-
ing network structure for active inference in collective
classification. In Seventh IEEE International Con-
ference on Data Mining Workshops (ICDMW 2007),
pages 429–434. IEEE.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-
tence embeddings using siamese bert-networks. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing. Associa-
tion for Computational Linguistics.
Sacha, D., Senaratne, H., Kwon, B. C., Ellis, G., and Keim,
D. A. (2015). The role of uncertainty, awareness, and
trust in visual analytics. IEEE transactions on visual-
ization and computer graphics, 22(1):240–249.
Sachan, D. S., Zaheer, M., and Salakhutdinov, R. (2019).
Revisiting lstm networks for semi-supervised text
classification via mixed objective function. In Pro-
ceedings of the AAAI Conference on Artificial Intelli-
gence, volume 33, pages 6940–6948.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019).
Distilbert, a distilled version of bert: smaller, faster,
cheaper and lighter. arXiv preprint arXiv:1910.01108.
Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O.
(2020). Green ai. Communications of the ACM,
63(12):54–63.
Shannon, C. E. (2001). A mathematical theory of commu-
nication. ACM SIGMOBILE mobile computing and
communications review, 5(1):3–55.
Siddhant, A. and Lipton, Z. C. (2018). Deep bayesian ac-
tive learning for natural language processing: Results
of a large-scale empirical study. In Proceedings of
the 2018 Conference on Empirical Methods in Natu-
ral Language Processing, pages 2904–2909.
Stanik, C., Haering, M., and Maalej, W. (2019). Classify-
ing multilingual user feedback using traditional ma-
chine learning and deep learning. In 27th Interna-
tional Requirements Engineering Conference Work-
shops (REW), pages 220–226. IEEE.
Strubell, E., Ganesh, A., and McCallum, A. (2019). En-
ergy and policy considerations for deep learning in
nlp. In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics, pages
3645–3650.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,
R. R., and Le, Q. V. (2019). Xlnet: Generalized au-
toregressive pretraining for language understanding.
Advances in neural information processing systems,
32.
Zanzotto, F. M. (2019). Human-in-the-loop artificial intel-
ligence. Journal of Artificial Intelligence Research,
64:243–252.
Zhang, Q., Yang, L. T., Chen, Z., and Li, P. (2018). A survey
on deep learning for big data. Information Fusion,
42:146–157.