Why Do We Need Domain-Experts for End-to-End Text Classification?
An Overview
Jakob Smedegaard Andersen
Department of Computer Science, Hamburg University of Applied Sciences, Germany
Keywords:
Text Classification, Human-in-the-Loop, Hybrid Intelligent Systems.
Abstract:
The aim of this study is to provide an overview of human-in-the-loop text classification. Automated text
classification faces several challenges that negatively affect its applicability in real-world domains. General
obstacles are a lack of labelled examples, limited held-out accuracy, missing user trust, run-time constraints,
low data quality and natural fuzziness. Human-in-the-loop is an emerging paradigm to continuously support
machine processing, i.e. text classification, with prior human knowledge, aiming to overcome the limitations of
purely artificial processing. In this survey, we review current challenges of pure automated text classifiers and
outline how a human-in-the-loop can overcome these obstacles. We focus on end-to-end text classification and
feedback from domain-experts, who do not possess technical knowledge about the algorithms used. Further,
we discuss common techniques to guide human attention and efforts within the text classification process.
1 INTRODUCTION
Involving domain experts in text classification can
bridge the gap between machine learning (ML) re-
search and real-world applications. While recent au-
tomated text classifiers have achieved great success
on many benchmarks (Devlin et al., 2019; Yang et al.,
2019), the applicability of automated text classifiers
in real-world environments is still limited and faces
many challenges. Even state-of-the-art text classi-
fiers, such as BERT (Devlin et al., 2019), are not able
to consistently reach desirable classification results
on arbitrary datasets. Automated text classifiers gen-
erally lack reliability, explainability, interpretability
and human trust.
Semi-automated approaches have lately gained
in prominence (Keim et al., 2008; Amershi et al.,
2014; Holzinger, 2016), in which human background
knowledge, abilities, and expertise are tightly coupled
with ML models. Allowing humans to interact with
ML models aims to overcome the obstacles of purely
artificial approaches and ultimately increases the
applicability of machine-assisted decision-making.
The term human-in-the-loop (HiL) (Holzinger, 2016)
emerged to describe a semi-automated process char-
acterized by the continuous support of machine pro-
cessing by human feedback. HiL problem-solving
aims to achieve what neither a human nor a machine
can achieve on their own. In this work, we survey HiL
for text classifiers with a focus on systems for non-ML
experts, where domain experts support the machine pro-
cessing with their domain-specific prior knowledge
and expertise. We analyse the main types of HiL
implementations for text classification and highlight
how these solve various challenges related to pure au-
tomated approaches.
Text classification is a widespread research chal-
lenge with high practical demands. It describes the
process of assigning predefined class labels to natu-
ral language texts. Since large-scale manual labelling
of text documents is a tedious, time-consuming and
expensive task, there is a high demand for automa-
tion. In order to increase the applicability of auto-
mated classifiers, the question arises of how domain-
experts and artificial classifiers can work together ef-
ficiently. In particular, explicit uncertainty informa-
tion (Der Kiureghian and Ditlevsen, 2009) and ex-
planations (Adadi and Berrada, 2018) have been shown
to provide valuable insights into automated decision-
making that help to spend human effort effectively.
To the best of our knowledge, there is no survey focusing
on overcoming the limitations of automated text clas-
sification via domain-experts. Wu et al. (Wu et al.,
2022) provide a general survey of human-in-the-loop
in conjunction with a range of ML tasks. Wang et al.
(Wang et al., 2021) survey how several natural lan-
guage processing (NLP) tasks can benefit from hu-
man feedback.
The remainder of the paper is structured as fol-
lows: Section 2 introduces the task of automated text
classification and outlines current challenges in its
application. Then, Section 3 defines the human-in-
the-loop approach and introduces common types
of human feedback. Section 4 looks at approaches
to efficiently bring humans into the classification loop,
and Section 5 reviews generic applications of the HiL
paradigm to support text classification. Finally, Sec-
tion 6 discusses open challenges and Section 7 con-
cludes the paper.
2 AUTOMATED TEXT
CLASSIFICATION
Text classification is about assigning text documents
to predefined classes (Sebastiani, 2002). It has be-
come a major research topic, as large amounts
of textual data are produced daily in many real-world
applications. However, fully automated classification
algorithms remain imperfect and have several limita-
tions that negatively impact their applicability in real-
world domains. In the following, we highlight general
limitations of purely automated text classifiers.
Lack of Knowledge. State-of-the-art classifiers are
usually deep neural networks (Minaee et al., 2021)
consisting of millions of parameters. Such classifiers
require a lot of training data to efficiently model the
prediction function. However, labelled data are typi-
cally scarce and represent a significant bottleneck in
classification. Human labelling is time-consuming,
labour-intensive, and costly, especially when domain-
experts are needed. Therefore, the limited availability
of training data is a serious obstacle to the application
of automated text classifiers.
Further, it is generally assumed that the samples a
model faces during deployment come from the same
distribution as the training data set. However, the dis-
tribution of new in-domain data is generally unknown,
and the underlying distribution is prone to shift over
time (Ovadia et al., 2019). This makes it difficult
to maintain a representative and meaningful training
dataset, which is essential for training a well perform-
ing classifier.
Lack of Performance and Reliability. A primary
goal of text classification is to achieve the highest
possible accuracy. However, classification algorithms
are inherently uncertain and misclassifications must
be expected (Der Kiureghian and Ditlevsen, 2009).
The exact relationship between class labels and text
inputs remains unknown and can only be approxi-
mated by classifiers. Misclassifications usually occur
because of missing training data, inappropriate selec-
tion of the classification algorithm, input noise, lim-
ited number of processable features (Gao et al., 2021),
overfitting (Roelofs et al., 2019) or because unknown
classes in the training data are mistakenly perceived
as different labels (Zhao et al., 2021). Even when a
large amount of training data is available, the most ad-
vanced text classifiers rarely achieve 100% accuracy
on the test dataset and even more rarely on unseen
data. Typically, measured held-out accuracies over-
estimate the real performance of classifiers on real-
world data (Ribeiro et al., 2020). Given a labelled
text corpus, the performance of a classifier usually
converges to a maximum achievable accuracy that the
model itself cannot exceed. Benchmarks indicate that
an accuracy of around 90% on well-scoped tasks can
be expected (Devlin et al., 2019; Yang et al., 2019).
Lack of Transparency. Classification algorithms,
especially deep learning approaches, are considered
“black-boxes” for humans because they do not pro-
vide comprehensible insights into their decision-
making. Practitioners are confronted with classifica-
tion results without being told why and how these pre-
dictions were made. Without any human-readable ex-
planations, practitioners can hardly be convinced of
a classification result; they generally do not trust
artificial classification results if they cannot under-
stand why and how these decisions were made.
Mechanisms for transparency aim to increase
the trustworthiness, conformity and ultimately the ap-
plicability of text classifiers in practice (Adadi and
Berrada, 2018). However, it is not enough to just ex-
plain a classifier's internal behaviour; humans must
also understand it. Explainability can only be
achieved through the interaction between humans and
ML models (Adadi and Berrada, 2018).
Computational Complexity. The increasing acces-
sibility of powerful computational resources is
paving the way for more complex classification al-
gorithms that continue to push the boundaries of the
state-of-the-art. Recent classifiers consist of more
than 100 million trainable parameters (Devlin et al.,
2019; Yang et al., 2019). Such complex models re-
quire a lot of computing time and resources, which
precludes their use in production environments and on
large datasets, excluding many practitioners from
their application. Long training or inference times
also negatively impact the user experience when hu-
mans are involved in the classification loop, as long
waiting times occur between interactions. A common
workaround is to use simple linear classification mod-
els instead, such as FastText (Joulin et al., 2017) or
less complex neural networks (Kim, 2014), which re-
duce training, testing, and inference time but compro-
mise accuracy.
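To illustrate such a lightweight alternative, the following sketch trains a linear bag-of-words baseline; it assumes the scikit-learn library, and the example texts and labels are hypothetical placeholders rather than part of any benchmark:

# Minimal sketch of a lightweight linear text classifier (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in practice these come from the application domain.
texts = ["the app crashes on startup", "great update, very smooth", "login fails after the update"]
labels = ["bug", "praise", "bug"]

# TF-IDF features plus a linear model: orders of magnitude cheaper to train and
# run than a 100-million-parameter transformer, at the cost of some accuracy.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["the new version keeps crashing"]))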
Data Quality. Data quality can be viewed as the de-
gree to which a data set is fit for a particular purpose
(Gudivada et al., 2017), i.e. a given analytical task. Miss-
ing, biased or insufficient data are a common source
of misclassifications, affecting the overall accuracy.
Raw text is usually noisy, inconsistent, heterogeneous
and has to be initially cleaned to reduce quality is-
sues. Furthermore, classifiers are easily disrupted by
an imbalanced data distributions, i.e. some classes
occur much more often than others, causing a bias to-
wards selecting the majority classes (Sun et al., 2009).
In general, it is assumed that the more labelled data is
available, the better a model can be trained. However,
it has been shown that equally high accuracies can
be achieved with few but high-quality data instances
(Lewis and Gale, 1994). Overall, learning powerful
classifiers requires good quality data.
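One common counter-measure to class imbalance, sketched below under the assumption that scikit-learn is used (the label distribution is purely hypothetical), is to re-weight classes inversely to their frequency so that minority classes contribute more to the training loss:

# Sketch: balanced class weights against imbalanced label distributions (assumes scikit-learn).
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced corpus: 90 "other" documents vs. 10 "hate" documents.
y = np.array(["other"] * 90 + ["hate"] * 10)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes, weights)))  # the minority class receives a larger weight

# The weights can then be passed to many classifiers, e.g.
# LogisticRegression(class_weight=dict(zip(classes, weights))) or class_weight="balanced".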
Fuzzy Classification Objective. Text is naturally
fuzzy, and its interpretation is highly subjective.
Boundaries between different classification objectives
are commonly fluid and cannot be sufficiently decided
by black and white thinking. Ambiguous borderline
cases will arise that cannot be decided objectively.
Studies show that even experts disagree on the detec-
tion of hate speech (Waseem, 2016). It may happen
that some texts appear as valid examples for several
or all classes. In general, it is assumed that if domain
experts cannot agree on a certain class membership,
an algorithm will not be able to do better (Boguslav
and Cohen, 2017).
3 FROM AUTOMATED TO HiL
TEXT CLASSIFICATION
This section defines and scopes the HiL approach, and
outlines common types of human feedback to support
the collaboration between humans and text classifiers.
3.1 Scope and Definition
The phrase human-in-the-loop (HiL) (Holzinger,
2016) describes a computational ML paradigm and
field of research characterized by the adaptation of
machine processing by human skills, background
knowledge and expertise. HiL systems aim to facil-
itate problem-solving with the cost of human involve-
ment. At the time of the survey, there is no gener-
ally accepted definition for HiL. Many attempts have
been made covering different aspects and use-cases
of human-machine collaboration. Fails and Olsen
Jr (Fails and Olsen Jr, 2003) were the first to use
the term interactive Machine Learning (iML) to de-
scribe a continuous train-feedback-correct loop for
interactively training a model. In their framework,
humans continuously provide additional training data
to a model until an acceptable level of accuracy is
reached. This approach is also called active learning
(Settles, 1995). Amershi et al. (Amershi et al., 2014)
argue for the importance of an extended user-centric
perspective in iML, focusing on human factors and
the rapid and incremental nature of interaction cycles.
They see iML as an opportunity for domain-experts to
incorporate their knowledge directly into ML models.
Another definition of HiL is provided by Holzinger
(Holzinger, 2016). They use the term HiL to describe
a concept which looks for algorithms which inter-
act with agents and can optimize their learning be-
haviour through this interaction. This definition fo-
cuses mainly on the machine centred aspects of HiL,
where models actively ask for feedback to support
their behaviour during the learning phase. In gen-
eral, human-machine cooperation can be both user-
and machine-centred. In a machine-centred approach,
a model asks the human directly for information. In
a human-centred approach, humans select informa-
tion themselves and make it available to classi-
fiers. The latter viewpoint is outlined by Dudley and
Kristensson (Dudley and Kristensson, 2018), who
define iML as “a co-adaptive process, driven by the
user, but inherently dynamic in nature as the model
and user evolve together during training”. This defi-
nition focuses on the user and illuminates the process
of knowledge generation that occurs during the pro-
gressive interaction between humans and the model.
Humans gain insight and knowledge about their data
by observing its structure and model results, while
machines learn from human feedback (Sacha et al.,
2014). The knowledge acquired can help to further
improve the quality of subsequent feedback. HiL does
not just focus on training. It has also been shown
that involving domain-experts during inference can
increase the accuracy of text classifiers (Kivlichan
et al., 2021; Andersen and Maalej, 2022). Endert et al.
(Endert et al., 2014) take a step further and advocate
a “Human-is-the-Loop” methodology in exploratory
settings to highlight the importance of seamlessly in-
tegrating human capabilities in the process of knowl-
edge discovery. The application of HiL is not limited
to model development. HiL can also be applied in de-
ployment to further refine a model in the field. For the
purposes of this survey, we define HiL as “a generic
semi-automated process in which models and humans
interact and learn from each other to improve the out-
comes or applicability of ML algorithms”.
3.2 Human Feedback
Users of HiL systems should not require a deep under-
standing of the model they are interacting with (Fails
and Olsen Jr, 2003). Therefore, interactions have
to concentrate on the exchange of domain-specific
knowledge. In the following, we discuss different
types of human feedback to support end-to-end text
classification, where no manual text processing or
feature engineering is performed.
Feedback from domain-experts within text classi-
fication is mostly limited to labelling existing text docu-
ments to be added to the training data. The classifier
is then re-trained on the extended training dataset and
possibly improved (Lewis and Gale, 1994). Bernard
et al. (Bernard et al., 2018) distinguish between two
labelling scenarios to support the training process of a
classifier. In pre-labelling, training data is collected
to build the first batch of training data, which is used
to initially train a classifier. In incremental learning,
a model is re-trained when additional training data is
available. In this case, humans continuously provide
new labelled data to refine an already trained model.
Re-training a model is important to strengthen its ro-
bustness and prevent it from deteriorating over time
(Ovadia et al., 2019). Human labels can also be ob-
tained by letting humans agree or disagree with artifi-
cially derived labels (Andersen et al., 2021). Domain-
experts can also provide new text instances to support
a classification model. Textual feedback can be used
to reduce blind spots or misconceptions, such as pro-
viding missing evidence that is not included in the
current training data (Attenberg et al., 2011). Here,
users are asked not only to provide labels, but also to
provide new or modified text examples.
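The two labelling scenarios can be illustrated with a small sketch; it assumes a scikit-learn model that supports incremental updates via partial_fit, and all texts, labels and class names are hypothetical:

# Sketch: pre-labelling followed by incremental learning from human label feedback
# (assumes scikit-learn; the HashingVectorizer is stateless and thus stream-friendly).
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

classes = ["bug", "praise"]
vectorizer = HashingVectorizer(n_features=2**16)
model = SGDClassifier()

# Pre-labelling: an initial batch labelled by domain-experts trains the first model.
seed_texts = ["app crashes on startup", "love the new design"]
seed_labels = ["bug", "praise"]
model.partial_fit(vectorizer.transform(seed_texts), seed_labels, classes=classes)

# Incremental learning: experts agree with or correct predicted labels, and the
# corrected batch refines the already trained model.
new_texts = ["login fails after the update"]
print(model.predict(vectorizer.transform(new_texts)))  # artificially derived label
corrected_labels = ["bug"]                             # human confirmation or correction
model.partial_fit(vectorizer.transform(new_texts), corrected_labels)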
4 ENABLERS
HiL aims for fast, efficient, continuous and beneficial
interactions between ML models and humans. While
artificial decision-making is comparatively cheap and
fast, human involvement is usually the bottleneck of
HiL systems. To keep human involvement efficient
and sparse, it is desirable to obtain high-quality feed-
back while avoiding redundant and unnecessary in-
teractions. In the following, we discuss three general
techniques that enable and support the exchange of
high-quality feedback. These are predictive uncer-
tainties, explanations and visualizations.
4.1 Predictive Uncertainties
Classification algorithms are inherently imperfect due
to their probabilistic nature. Artificial predictions
are corrupted by uncertainties which emerge dur-
ing the classification process (Der Kiureghian and
Ditlevsen, 2009). Misconceptions, corruptions, ambi-
guities, noise, a lack of evidence, limited representa-
tiveness, conflicting evidence within the training data,
or out-of-distribution inputs might cause highly unre-
liable predictions which are probably wrong. Unfor-
tunately, automated classifiers are incapable of rec-
ognizing when they fail to provide reliable outcomes.
Thus, it is difficult for humans to reason about the re-
liability and trustworthiness of predictions. In the worst
case, a prediction is considered as correct even though
it is not. Quantifying predictive uncertainties could
help to handle these difficulties and is a first step to-
wards more accountable and transparent predictions.
Therefore, classifiers are required to additionally re-
port uncertainty scores alongside the usual class out-
come when a certain level of safety is needed. While
uncertainty can arise and be passed on in any part of
the ML pipeline and in human interaction with them
(Sacha et al., 2015), recent research focuses on the
automatic estimation of classification uncertainty in
individual classification results (Blundell et al., 2015;
Gal and Ghahramani, 2016).
Uncertainty is generally considered as a lack of
confidence in a prediction (Li et al., 2012). It can also
be seen as an indicator of unpredictability, indicating
that instances contain much information the model
might need (Lewis and Gale, 1994). Being aware of
uncertainties helps to take special care of unreliable
predictions and reduces the risk of trusting incorrect
model behaviour, i.e., misclassifications (Hendrycks
and Gimpel, 2017; Andersen and Maalej, 2022; An-
dersen and Zukunft, 2022). Uncertainty also facili-
tates the detection of out-of-distribution examples to
which classifiers typically do not generalize well (Hu
and Khan, 2021; Hendrycks and Gimpel, 2017). In-
teractions guided by uncertainty seek to spend human
efforts most efficiently by focusing feedback on ma-
chine misconceptions and information needs.
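As a minimal illustration of such an uncertainty score, the following sketch (plain NumPy; the softmax outputs are hypothetical and would in practice come from, e.g., several MC-dropout forward passes (Gal and Ghahramani, 2016)) averages sampled class probabilities and reports the predictive entropy:

# Sketch: predictive entropy as an uncertainty score, given several stochastic
# softmax outputs per document (e.g. from Monte Carlo dropout forward passes).
import numpy as np

def predictive_entropy(softmax_samples):
    """softmax_samples: array of shape (n_forward_passes, n_classes) for one document."""
    mean_probs = softmax_samples.mean(axis=0)               # averaged class distribution
    return float(-np.sum(mean_probs * np.log(mean_probs + 1e-12)))

# Hypothetical outputs of three stochastic forward passes for two documents.
confident = np.array([[0.97, 0.03], [0.95, 0.05], [0.96, 0.04]])
ambiguous = np.array([[0.55, 0.45], [0.40, 0.60], [0.62, 0.38]])

print(predictive_entropy(confident))   # low entropy: prediction can likely be trusted
print(predictive_entropy(ambiguous))   # high entropy: candidate for human review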
4.2 Explanations
State-of-the-art classifiers are more complex and less
interpretable than ever. Especially deep learning ap-
proaches are considered black-boxes since humans
cannot reason about their internal decision-making.
The lack of transparency makes it challenging to un-
derstand how and why decisions were made. It remains
unknown what the model has learned. Understanding
the rationale for decisions would increase the confi-
dence, trust, and applicability of artificial decisions,
especially since users are usually not willing to ap-
ply artificial models when they do not trust them.
Explainable ML (Adadi and Berrada, 2018) aims
to open the black-box of classifiers and ensure that
humans can understand and justify why certain class
results were delivered. Explanations enable us to dis-
cover what a model has learned and how to further
improve it (Adadi and Berrada, 2018). Making arti-
ficial reasoning explicit helps humans to make their
decision-making more efficient. Generally, expla-
nations can enhance a classifier's robustness, user
trust and knowledge transferability, as well as prevent
faulty behaviour, weak points, undesired biases, un-
fairness, and discrimination (Arrieta et al., 2020; Con-
falonieri et al., 2021). Explanations are also used to
provide better cognitive support and enhance the col-
laboration between humans and models.
In order to explain text classifiers, human-com-
prehensible interpretations of classification results are
needed. Explanations of classifiers are usually mod-
elled as an additional output alongside the final class
predictions. For deep learning based classification,
typically local or introspective explanations (Con-
falonieri et al., 2021) are provided to explain input-
output pairs based on a subset of input features that
justify the classification result. For text classification,
these are the words of the input text that contribute
most to a particular class outcome.
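A simple, model-agnostic way to obtain such word-level attributions is occlusion: remove one word at a time and measure how much the predicted class probability drops. The sketch below assumes only that the classifier exposes a predict_proba interface; the function name and example usage are hypothetical:

# Sketch: local explanation by word occlusion. Words whose removal lowers the
# predicted class probability the most are taken as the most influential.
def occlusion_attribution(text, predict_proba, class_index):
    words = text.split()
    base_prob = predict_proba([text])[0][class_index]
    scores = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])        # drop the i-th word
        drop = base_prob - predict_proba([reduced])[0][class_index]
        scores.append((words[i], drop))
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical usage with any fitted pipeline `model` exposing predict_proba:
# occlusion_attribution("the app keeps crashing after login",
#                       model.predict_proba, class_index=0)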
4.3 Visualizations
Visual perception is one of the most important skills
that enables humans to discover and understand local
patterns and relationships in visual representations of
problem statements (Tropmann-Frick and Andersen,
2019). Information visualization is about mapping
data in a visual context so that it is easier for humans
to understand and draw insights from it. Visual inter-
faces are essential for HiL since they enable domain-
experts to cooperate with ML models without requir-
ing any additional programming (Fails and Olsen Jr,
2003). Just displaying the labels predicted by artifi-
cial classifiers greatly improves human accuracy and
speed in manual labelling (Desmond et al., 2021).
A common visualization technique for exploring
large textual data is to embed their high-dimensional
feature vectors, i.e. semantically meaningful text repre-
sentations, into a typically two-dimensional vector-
space using dimensionality reduction techniques (Be-
nato et al., 2020). The reduced vectors can then easily
be visualized via a scatter-plot.
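A minimal sketch of this pipeline, assuming scikit-learn and matplotlib and using TF-IDF vectors as a simple stand-in for richer semantic embeddings, could look as follows:

# Sketch: projecting high-dimensional text representations into 2D for a scatter plot
# (assumes scikit-learn and matplotlib; the documents are hypothetical placeholders).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

texts = [f"bug report {i}: the app crashes" for i in range(20)] + \
        [f"user praise {i}: great update" for i in range(20)]

X = TfidfVectorizer().fit_transform(texts).toarray()

# Reduce to two dimensions; any embedding (e.g. BERT sentence vectors) could be used instead.
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
plt.title("2D projection of document representations")
plt.show()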
5 GENERIC HiL FRAMEWORKS
In this section, we present three common HiL imple-
mentations that aim to overcome some major limita-
tions in the application of automatic text classifiers.
5.1 Training Data Acquisition
While obtaining raw text instances usually is not a
problem nowadays, the lack of labelled examples is
the bottleneck of text classification. Generally, la-
belling must be done manually, which requires a lot of
human labour. Reducing the effort required to manu-
ally label a sufficient training dataset is an important
task in text classification. Approaches are needed to
efficiently provide knowledge to classifiers.
Active Learning (AL). (Lewis and Gale, 1994; Set-
tles, 1995) describes an incremental process in which
a classifier accumulates knowledge by soliciting feed-
back from human annotators for the purpose of train-
ing. An actively trained model continuously improves
its learning behaviour by querying human knowledge
until the model reaches the desired accuracy. In the
simplest case, a human is prompted over several iter-
ations to specify the correct class labels for selected
data instances. In each iteration, the potential
training instances are ranked according to their uncer-
tainty. Human annotators are asked to manually la-
bel instances that are believed to have the greatest im-
pact on the model’s learning behaviour. Then, the la-
belled examples are added to the training dataset and
the model is re-trained. AL has proven to be very suc-
cessful in various text classification tasks (Lewis and
Gale, 1994). A general survey of AL is provided by
Settles (Settles, 1995).
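A minimal pool-based variant of this loop with least-confidence sampling might look as follows (a sketch assuming scikit-learn; the oracle callback stands in for the human annotator, and all names are illustrative):

# Sketch: pool-based active learning with least-confidence (uncertainty) sampling.
# The oracle callback represents the human annotator (assumes scikit-learn).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def active_learning(texts, seed_indices, oracle, n_rounds=5, batch_size=10):
    X = TfidfVectorizer().fit_transform(texts)
    labels = {i: oracle(texts[i]) for i in seed_indices}        # initial seed labels
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        known = sorted(labels)
        model.fit(X[known], [labels[i] for i in known])          # re-train on all labels
        pool = [i for i in range(len(texts)) if i not in labels]
        if not pool:
            break
        # Least confidence: query the instances whose top class probability is lowest.
        confidence = model.predict_proba(X[pool]).max(axis=1)
        queried = [pool[j] for j in np.argsort(confidence)[:batch_size]]
        for i in queried:                                        # ask the human annotator
            labels[i] = oracle(texts[i])
    return model, labels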
5.2 Moderation of Classification
Outcomes
Highly accurate classifiers are required to adequately
automate information retrieval. However, during
training, the accuracy of text classifiers may con-
verge to a level that does not meet the requirements
of an application domain. Faulty and unreliable pre-
dictions are likely to occur, which might stay unno-
ticed. To improve the accuracy or prevent classifi-
cation mistakes of an already trained classification
model, humans can also be involved during inference,
i.e. after model training.
Classifier Moderation (CM). (Kivlichan et al., 2021;
Andersen and Maalej, 2022) aims at increasing the
applicability and reliability of an already trained
model. Trained and deployed classifiers generally
perform well, with only a small fraction of the data
responsible for incorrect and unreliable model be-
haviour. CM seeks to maintain a superior level of ac-
curacy by involving humans to prevent unreliable pre-
dictions as the model is used in practice (Karmakharm
et al., 2019; Kivlichan et al., 2021; Andersen and
Maalej, 2022). Humans are responsible for manually
checking highly unreliable, i.e. uncertain, and fuzzy
instances and correcting their labelling accordingly.
If the model is certain of its prediction, no human in-
volvement is required. Although not all misclassifi-
cations can be prevented (Attenberg et al., 2011), CM
has the potential to lead to much better decision out-
comes at the expense of human labour (Zhang et al.,
2019; Andersen and Zukunft, 2022).
Since it might be impractical to let humans check
all unreliable outcomes, Andersen and Maalej (An-
dersen and Maalej, 2022) suggest a saturation-based
stop-criterion for CM. They aim to maximize the
overall classification accuracy while spending human
labour highly efficiently. Pavlopoulos et al. (Pavlopou-
los et al., 2017) suggest an approach which aims to
maximize the performance of a classifier with respect
to a given moderation effort, e.g. 10% of the data.
Geifman and El-Yaniv (Geifman and El-Yaniv,
2017) propose an approach to guarantee a certain risk
level. Their approach is based on selective classifi-
cation, where classifiers reject predictions which then
have to be made by a human.
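The core routing logic of CM can be sketched as follows; this is a simplified illustration in which the confidence threshold, the predict_proba interface, and the ask_expert callback are all assumptions rather than a prescribed implementation:

# Sketch: uncertainty-based moderation of classifier outputs. Confident predictions
# are accepted automatically; uncertain ones are escalated to a domain-expert.
def moderate(texts, predict_proba, ask_expert, threshold=0.9):
    decisions = []
    probabilities = predict_proba(texts)              # shape: (n_documents, n_classes)
    for text, probs in zip(texts, probabilities):
        if probs.max() >= threshold:
            decisions.append((text, int(probs.argmax()), "automatic"))
        else:
            decisions.append((text, ask_expert(text), "human review"))
    return decisions

# Hypothetical usage: moderate(new_documents, model.predict_proba,
#                              ask_expert=lambda t: expert_label(t), threshold=0.9)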
5.3 Interactive Labelling and Data
Exploration
Interactive Labelling (IL). (Knaeble et al., 2020) is a
user-centred variation of the AL process. Like AL, IL
aims to reduce the number of labelled examples to ad-
equately train a model. In contrast, humans are tasked
with selecting the instances to be labelled. IL is based
on the assumption that humans can select represen-
tative data instances more efficiently than automatic
query strategies. For example, uncertainty sampling
strategies are prone to sample outliers that contribute
little or nothing to the learning behaviour of the models
(Lewis and Gale, 1994) and tend to rely on overconfident
model estimates (Guo et al., 2017).
To enable human-centred data sampling, the la-
bel query problem is reformulated into a visual an-
alytics problem (Keim et al., 2008). Users are pro-
vided with adaptable visual-interactive interfaces to
strategically select and label data instances. IL draws
strength from extensive use of human expertise, back-
ground knowledge and visual perception. A key ad-
vantage over AL is that users generate additional and
expanded knowledge and insights about their data
through the exploratory nature of IL (Sacha et al.,
2014). The accumulated knowledge can then further
support the labelling process. As with AL, the model
is interactively re-trained until a desired accuracy is
achieved. Preliminary research shows that IL can come
close to or even complement AL in terms of achieved
accuracy (Bernard et al., 2018).
6 OPEN CHALLENGES
Previous studies demonstrate the usefulness of HiL
text classification compared to a pure automatic anal-
ysis (Lewis and Gale, 1994; Kivlichan et al., 2021;
Andersen and Maalej, 2022). However, the cost of
using human labour is usually very high, whether in
terms of money or time. Practitioners have to decide
whether a pure automated approach is applicable and
can solve a task appropriately, or whether a human
in the loop is actually required and affordable. Fur-
ther, human annotations should be taken with a grain
of salt, as they can also be wrong. When human an-
notations are unreliable, biased or too noisy, this neg-
atively impacts the interaction between humans and
machines (Andersen and Zukunft, 2022).
HiL approaches place special time requirements
on the underlying classification model. Very short
waiting times and few interruptions between human
interactions are required to maintain a good user experience.
The need for fast interactions and model updates often
makes it necessary to trade off speed against accuracy
(Amershi et al., 2014). Reducing iterative feedback
latency is critical for HiL systems.
Also, uncertainty estimates and explanations are not per-
fect and remain an active field of research. In partic-
ular, uncertainty estimation is challenging, especially
using deep neural networks, since they do not provide
an inherent indicator of uncertainty (Gal and Ghahra-
mani, 2016). Inadequate measurements can inadver-
tently mislead humans into making a false assump-
tion or blindly trusting artificial decision-making, e.g.
in the case of unknown unknowns (Attenberg et al., 2011).
7 CONCLUSION
Human-in-the-loop describes a collaborative process
for improving the results and applicability of ML pro-
cedures through human feedback. We emphasize the
need to involve domain experts in the text classifica-
tion process in order to increase or even enable the ap-
plicability of machine-assisted text classification. We
survey the current challenges in pure automated text
classification and outline techniques to efficiently in-
volve humans in the classification process. This in-
cludes the importance of uncertainty-based interac-
tions to effectively guide humans in providing feedback,
building trust and focusing attention through expla-
nations, and incorporating models into visual ana-
lytics environments. Additionally, we shed light on cur-
rent human-in-the-loop implementations covering ac-
tive learning, classifier moderation and interactive la-
belling.
REFERENCES
Adadi, A. and Berrada, M. (2018). Peeking inside the
black-box: a survey on explainable artificial intelli-
gence (xai). IEEE access, 6:52138–52160.
Amershi, S., Cakmak, M., Knox, W. B., and Kulesza, T.
(2014). Power to the people: The role of humans in in-
teractive machine learning. Ai Magazine, 35(4):105–
120.
Andersen, J. S. and Maalej, W. (2022). Efficient,
uncertainty-based moderation of neural networks text
classifiers. In Findings of the Association for Compu-
tational Linguistics: ACL 2022, pages 1536–1546.
Andersen, J. S. and Zukunft, O. (2022). Towards more
reliable text classification on edge devices via a
human-in-the-loop. In International Conference on
Agents and Artificial Intelligence 2022, pages 636–
646. SciTePress.
Andersen, J. S., Zukunft, O., and Maalej, W. (2021). Rem:
Efficient semi-automated real-time moderation of on-
line forums. In Proceedings of the 59th Annual Meet-
ing of the Association for Computational Linguistics
and the 11th International Joint Conference on Nat-
ural Language Processing: System Demonstrations,
pages 142–149.
Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot,
A., Tabik, S., Barbado, A., García, S., Gil-López, S.,
Molina, D., Benjamins, R., et al. (2020). Explainable
artificial intelligence (xai): Concepts, taxonomies, op-
portunities and challenges toward responsible ai. In-
formation fusion, 58:82–115.
Attenberg, J. M., Ipeirotis, P. G., and Provost, F. (2011).
Beat the machine: Challenging workers to find the un-
known unknowns. In Workshops at the Twenty-Fifth
AAAI Conference on Artificial Intelligence.
Benato, B. C., Gomes, J. F., Telea, A. C., and Falcão,
A. X. (2020). Semi-automatic data annotation guided
by feature space projection. Pattern Recognition,
109:107612.
Bernard, J., Zeppelzauer, M., Sedlmair, M., and Aigner, W.
(2018). Vial: A unified process for visual interactive
labeling. Vis. Comput., 34(9):1189–1207.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra,
D. (2015). Weight uncertainty in neural network. In
International conference on machine learning, pages
1613–1622. PMLR.
Boguslav, M. and Cohen, K. B. (2017). Inter-annotator
agreement and the upper limit on machine perfor-
mance: Evidence from biomedical natural language
processing. Studies in health technology and infor-
matics, 245:298–302.
Confalonieri, R., Coba, L., Wagner, B., and Besold, T. R.
(2021). A historical perspective of explainable arti-
ficial intelligence. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery, 11(1):e1391.
Der Kiureghian, A. and Ditlevsen, O. (2009). Aleatory
or epistemic? does it matter? Structural safety,
31(2):105–112.
Desmond, M., Muller, M., Ashktorab, Z., Dugan, C.,
Duesterwald, E., Brimijoin, K., Finegan-Dollak, C.,
Brachman, M., Sharma, A., Joshi, N. N., et al. (2021).
Increasing the speed and accuracy of data labeling
through an ai assisted interface. In 26th International
Conference on Intelligent User Interfaces, pages 392–
401.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). Bert: Pre-training of deep bidirectional trans-
formers for language understanding. In Proceed-
ings of the 2019 Conference of the North American
Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, volume 1,
pages 4171–4186.
Dudley, J. J. and Kristensson, P. O. (2018). A review of
user interface design for interactive machine learning.
ACM Transactions on Interactive Intelligent Systems
(TiiS), 8(2):1–37.
Endert, A., Hossain, M. S., Ramakrishnan, N., North, C.,
Fiaux, P., and Andrews, C. (2014). The human is the
loop: new directions for visual analytics. Journal of
intelligent information systems, 43(3):411–435.
Fails, J. A. and Olsen Jr, D. R. (2003). Interactive machine
learning. In Proceedings of the 8th international con-
ference on Intelligent user interfaces, pages 39–45.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian
approximation: Representing model uncertainty in
deep learning. In international conference on machine
learning, pages 1050–1059. PMLR.
Gao, S., Alawad, M., Young, M. T., Gounley, J., Schaef-
ferkoetter, N., Yoon, H. J., Wu, X.-C., Durbin, E. B.,
Doherty, J., Stroup, A., et al. (2021). Limitations of
transformers on clinical text classification. IEEE jour-
nal of biomedical and health informatics, 25(9):3596–
3607.
Geifman, Y. and El-Yaniv, R. (2017). Selective classifica-
tion for deep neural networks. Advances in neural in-
formation processing systems, 30.
Gudivada, V., Apon, A., and Ding, J. (2017). Data quality
considerations for big data and machine learning: Go-
ing beyond data cleaning and transformations. Inter-
national Journal on Advances in Software, 10(1):1–
20.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017).
On calibration of modern neural networks. In Interna-
tional conference on machine learning, pages 1321–
1330. PMLR.
Hendrycks, D. and Gimpel, K. (2017). A baseline for de-
tecting misclassified and out-of-distribution examples
in neural networks. ICLR.
Holzinger, A. (2016). Interactive machine learning for
health informatics: when do we need the human-in-
the-loop? Brain Informatics, 3(2):119–131.
Hu, Y. and Khan, L. (2021). Uncertainty-aware reli-
able text classification. In Proceedings of the 27th
ACM SIGKDD Conference on Knowledge Discovery
& Data Mining, pages 628–636.
Joulin, A., Grave, É., Bojanowski, P., and Mikolov, T.
(2017). Bag of tricks for efficient text classification.
In Proceedings of the 15th Conference of the Euro-
pean Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, pages 427–431.
Karmakharm, T., Aletras, N., and Bontcheva, K. (2019).
Journalist-in-the-loop: Continuous learning as a ser-
vice for rumour analysis. In Proceedings of the 2019
Conference on Empirical Methods in Natural Lan-
guage Processing and the 9th International Joint Con-
ference on Natural Language Processing (EMNLP-
IJCNLP): System Demonstrations, pages 115–120.
Keim, D., Andrienko, G., Fekete, J.-D., Görg, C., Kohlham-
mer, J., and Melançon, G. (2008). Visual analytics:
Definition, process, and challenges. In Information
visualization, pages 154–175. Springer.
Kim, Y. (2014). Convolutional neural networks for sentence
classification. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Process-
ing (EMNLP), pages 1746–1751, Doha, Qatar. Asso-
ciation for Computational Linguistics.
Kivlichan, I., Lin, Z., Liu, J., and Vasserman, L. (2021).
Measuring and improving model-moderator collabo-
ration using uncertainty estimation. In Proceedings
of the 5th Workshop on Online Abuse and Harms
(WOAH 2021), pages 36–53.
Knaeble, M., Nadj, M., and Maedche, A. (2020). Oracle or
teacher? a systematic overview of research on inter-
active labeling for machine learning. In Wirtschaftsin-
formatik (Zentrale Tracks), pages 2–16.
Lewis, D. D. and Gale, W. A. (1994). A sequential algo-
rithm for training text classifiers. In SIGIR’94, pages
3–12. Springer.
Li, Y., Chen, J., and Feng, L. (2012). Dealing with un-
certainty: A survey of theories and practices. IEEE
Transactions on Knowledge and Data Engineering,
25(11):2463–2482.
Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N.,
Chenaghlu, M., and Gao, J. (2021). Deep learning–
based text classification: A comprehensive review.
ACM Computing Surveys (CSUR), 54(3):1–40.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D.,
Nowozin, S., Dillon, J., Lakshminarayanan, B., and
Snoek, J. (2019). Can you trust your model’s uncer-
tainty? evaluating predictive uncertainty under dataset
shift. Advances in neural information processing sys-
tems, 32.
Pavlopoulos, J., Malakasiotis, P., and Androutsopoulos, I.
(2017). Deep learning for user comment moderation.
ACL 2017, page 25.
Ribeiro, M. T., Wu, T., Guestrin, C., and Singh, S. (2020).
Beyond accuracy: Behavioral testing of nlp models
with checklist. In Proceedings of the 58th Annual
Meeting of the Association for Computational Lin-
guistics, pages 4902–4912.
Roelofs, R., Shankar, V., Recht, B., Fridovich-Keil, S.,
Hardt, M., Miller, J., and Schmidt, L. (2019). A meta-
analysis of overfitting in machine learning. Advances
in Neural Information Processing Systems, 32.
Sacha, D., Senaratne, H., Kwon, B. C., Ellis, G., and Keim,
D. A. (2015). The role of uncertainty, awareness, and
trust in visual analytics. IEEE transactions on visual-
ization and computer graphics, 22(1):240–249.
Sacha, D., Stoffel, A., Stoffel, F., Kwon, B. C., Ellis, G., and
Keim, D. A. (2014). Knowledge generation model for
visual analytics. IEEE transactions on visualization
and computer graphics, 20(12):1604–1613.
Sebastiani, F. (2002). Machine learning in automated
text categorization. ACM computing surveys (CSUR),
34(1):1–47.
Settles, B. (1995). Active learning literature survey. Sci-
ence, 10(3):237–304.
Sun, Y., Wong, A. K., and Kamel, M. S. (2009). Classifica-
tion of imbalanced data: A review. International jour-
nal of pattern recognition and artificial intelligence,
23(04):687–719.
Tropmann-Frick, M. and Andersen, J. S. (2019). To-
wards visual data science-an exploration. In Interna-
tional Conference on Human Interaction and Emerg-
ing Technologies, pages 371–377. Springer.
Wang, Z. J., Choi, D., Xu, S., and Yang, D. (2021). Putting
humans in the natural language processing loop: A
survey. In Proceedings of the First Workshop on
Bridging Human–Computer Interaction and Natural
Language Processing, pages 47–52.
Waseem, Z. (2016). Are you a racist or am i seeing things?
annotator influence on hate speech detection on twit-
ter. In Proceedings of the first workshop on NLP and
computational social science, pages 138–142.
Wu, X., Xiao, L., Sun, Y., Zhang, J., Ma, T., and He, L.
(2022). A survey of human-in-the-loop for machine
learning. Future Generation Computer Systems.
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov,
R. R., and Le, Q. V. (2019). Xlnet: Generalized au-
toregressive pretraining for language understanding.
Advances in neural information processing systems,
32.
Zhang, X., Chen, F., Lu, C.-T., and Ramakrishnan, N.
(2019). Mitigating uncertainty in document classifi-
cation. In Proceedings of the 2019 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, volume 1, pages 3126–3136.
Zhao, P., Zhang, Y.-J., and Zhou, Z.-H. (2021). Exploratory
machine learning with unknown unknowns. In Pro-
ceedings of the AAAI Conference on Artificial Intelli-
gence, volume 35, pages 10999–11006.