Machine Learning Models for Automatic Labeling:

A Systematic Literature Review

Teodor Fredriksson, Jan Bosch and Helena Holmström Olsson

Department of Computer Science and Engineering, Division of Software Engineering,

Chalmers University of Technology, Gothenburg, Sweden

Department of Computer Science and Media Technology, Malmö University, Malmö, Sweden

Keywords:

Semi-supervised Learning, Active Machine Learning, Automatic Labeling.

Abstract:

Automatic labeling is a type of classification problem. Classification has been studied with the help of statistical methods for a long time. With the arrival of more powerful central processing units (CPUs) and graphics processing units (GPUs), interest in machine learning has grown rapidly, and both statistical learning algorithms and deep neural networks (DNNs) can now be used to solve classification tasks. Classification is a supervised machine learning problem, and a large body of methodology exists for performing it. However, data in industrial applications is rarely fully labeled, which is why good methodology is needed to obtain error-free labels. The purpose of this paper is to examine the current literature on how to perform labeling using ML. We compare the models in terms of popularity and the datatypes they are applied to. We performed a systematic literature review of empirical studies on machine learning for labeling and identified 43 primary studies relevant to our search. From these we were able to determine the most common machine learning models for labeling. Lack of labeled instances is a major problem for industry, as supervised learning is the most widely used form of learning. Obtaining labels is costly in terms of labor and money. Based on our findings in this review, we present alternative ways of labeling data for use in supervised learning tasks.

1 INTRODUCTION

In software-intensive companies in the online and embedded systems domains, huge sets of data are processed and labeled manually by one or several employees (AzatiSoftware, 2019). Manual labeling makes it easy to maintain the quality of the data, but the task is tedious and time-consuming, and the human effort involved makes it prohibitively expensive.

Data labeling is a way of annotating data depending on its content (AzatiSoftware, 2019). The labels each data entry receives are decided after information about the entry has been processed.

Artificial intelligence is now applied to a growing range of tasks, and automatic labeling is no exception.

Machine learning algorithms learn in three different ways: reinforcement learning, supervised learning and unsupervised learning. None of these three types of learning can fully solve the labeling problem.

https://orcid.org/0000-0003-2854-722X

Two disciplines in machine learning are designed specifically for data that is either unlabeled or contains only a small set of labeled instances: active learning and semi-supervised learning. In the remainder of this paper we systematically map the research that has been conducted on semi-supervised and active learning techniques and how they are applied to labeling. We are particularly interested in how these methods can be applied in different industrial scenarios, and we therefore categorize the research areas in which they appear. We present this work in the hope that it will inspire practitioners in industry to use active and semi-supervised learning techniques for labeling tasks.

The contribution of this paper is threefold. First, we provide an overview of the available approaches for (semi-)automatic labeling of data for machine learning based on a systematic literature review. Second, we present the data types that are typically subject to (semi-)automatic labeling and which data types require additional research. Finally, we identify the open research questions that need to be addressed by the research community.

Fredriksson, T., Bosch, J. and Olsson, H.
Machine Learning Models for Automatic Labeling: A Systematic Literature Review.
DOI: 10.5220/0009972705520561
In Proceedings of the 15th International Conference on Software Technologies (ICSOFT 2020), pages 552-561
ISBN: 978-989-758-443-5
© 2020 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

The remainder of this paper is organized as fol-

lows. In the next section, we provide the background

and an overview of techniques and approaches for au-

tomatic labeling. In section 3, we provide a concise

description of the problem that we seek to address in

this paper followed by an overview of our research

method in section 4. We present the results of the

systematic literature review in section 5 and discuss

these in section 6. Finally, we end the paper with an

overview of open research questions in section 7 and

a conclusion in section 8.

2 BACKGROUND

As mentioned above, most machine learning paradigms are either supervised or unsupervised, that is, we either have access to labels or we do not. Most algorithms used in industry are supervised, but most data does not have labels, so additional effort is needed to produce good labels. Furthermore, it is reasonable to assume that a small subset of each large company dataset is labeled. This is where semi-supervised learning is applicable, together with parts of the active learning framework.

2.1 Semi-supervised Learning

Semi-supervised learning is a set of machine learning

algorithms that can be used if most of the instances are

unlabeled, but a small subset of them have labels. In

technical terms, we have access to a set of data points

that can be divided into two disjoint subsets, one con-

taining the labeled instances and the other contain-

ing the unlabeled instances. The objective of semi-

supervised classiﬁcation is to train a classiﬁer on both

unlabeled and labeled data so that it is better than a

supervised classiﬁer trained only on the labeled data.

The main areas of semi-supervised learning, as presented in (Zhu and Goldberg, 2009), are:

1. (Generative) Mixture Models and EM Algorithm.

2. Co-training and Multi-view Learning.

3. Graph-based Semi-supervised Learning.

4. Semi-supervised Support Vector Machines

(S3VM).
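To make the graph-based idea (item 3) concrete, the following is a minimal sketch, assuming a binary problem and a small hand-made similarity graph. Labeled nodes keep their labels, and each unlabeled node repeatedly takes the weighted average score of its neighbours. The function name `propagate_labels`, the 0.5 starting score and the toy chain graph are illustrative assumptions and are not taken from the reviewed papers.

```python
def propagate_labels(weights, labels, n_iter=100):
    """Iterative label propagation on a weighted graph.

    weights: symmetric n x n list of non-negative edge weights.
    labels:  list holding 0 or 1 for labeled nodes and None for unlabeled ones.
    Returns one score per node in [0, 1] (estimated membership in class 1).
    """
    n = len(labels)
    # Unlabeled nodes start at 0.5, labeled nodes at their known label.
    scores = [0.5 if y is None else float(y) for y in labels]
    for _ in range(n_iter):
        new_scores = list(scores)
        for i in range(n):
            if labels[i] is not None:
                continue  # labeled nodes stay clamped to their label
            total = sum(weights[i])
            if total > 0:
                # Weighted average of the neighbours' current scores.
                new_scores[i] = sum(
                    w * s for w, s in zip(weights[i], scores)
                ) / total
        scores = new_scores
    return scores


# Toy chain graph 0 - 1 - 2 - 3: node 0 is labeled 0, node 3 is labeled 1.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = propagate_labels(chain, [0, None, None, 1])
predicted = [int(s >= 0.5) for s in scores]
```

In this toy graph the two unlabeled nodes end up with the label of the labeled node they are closest to, which is the smoothness assumption that graph-based methods rely on.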

2.2 Active Learning

Historically, machine learning algorithms fit a model to the data that is currently labeled; we refer to these as “passive” learning models. Active learning systems, on the other hand, create new models as they learn iteratively. Similar to how a scientist plans several experiments to help reach a conclusion about a hypothesis, an active learning method uses query strategies to select the most informative examples to be labeled by an oracle.

In some cases, e.g. when the model does not require a large number of labels, an active learning system might not be the best choice. It is better suited to situations where there is a very large set of unlabeled examples and a large amount of data must be labeled to train the system.

If active learning is appropriate for the problem, we need to specify in what way we want to query the examples (Settles, 2012). The three most common scenarios are:

1. Query synthesis

2. Stream-based selective sampling

3. Pool-based sampling
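The pool-based scenario (item 3) can be sketched as a short loop. This is a minimal illustration under stated assumptions: the data is one-dimensional and separable, the model is a simple threshold, and the "oracle" is a hypothetical function standing in for a human annotator. None of the names below come from the reviewed papers.

```python
def fit_threshold(labeled):
    """Fit a 1-D threshold classifier: the midpoint between the largest
    class-0 point and the smallest class-1 point seen so far."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (max(zeros) + min(ones)) / 2


def pool_based_active_learning(pool, seed, oracle, budget):
    """Query the oracle up to `budget` times, each time for the pool point
    the current model is least certain about (closest to the threshold)."""
    pool, labeled = list(pool), list(seed)
    for _ in range(min(budget, len(pool))):
        threshold = fit_threshold(labeled)
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # the oracle supplies the label
    return fit_threshold(labeled), labeled


# Hypothetical setup: the true class boundary (0.65) is known only to the oracle.
def toy_oracle(x):
    return int(x >= 0.65)


unlabeled_pool = [i / 10 for i in range(1, 10)]
seed_labels = [(0.0, 0), (1.0, 1)]  # one labeled example per class
threshold, labeled = pool_based_active_learning(
    unlabeled_pool, seed_labels, toy_oracle, budget=4
)
```

In this toy run, four oracle queries are enough to place the threshold between 0.6 and 0.7, while five of the nine pool points are never sent to the oracle.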

As presented in (Settles, 2012), areas of active learn-

ing queries include:

1. Uncertainty Sampling.

2. Query by Committee/Disagreement (QBC/QBD).

3. Expected Error/Variance Reduction.
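As an illustration of query by committee (item 2), the sketch below selects the instance on which a small committee of classifiers disagrees most, using vote entropy as the disagreement measure. The three threshold classifiers and the candidate pool are hypothetical stand-ins chosen for brevity.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's votes: 0 when all members agree,
    larger when the votes are spread over several classes."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def query_by_committee(pool, committee):
    """Return the pool instance with the highest vote entropy."""
    return max(
        pool,
        key=lambda x: vote_entropy([member(x) for member in committee]),
    )


# Hypothetical committee: three 1-D threshold classifiers that disagree
# about where the class boundary lies.
committee = [
    lambda x: int(x >= 0.3),
    lambda x: int(x >= 0.5),
    lambda x: int(x >= 0.7),
]
candidate_pool = [0.1, 0.4, 0.9]
query = query_by_committee(candidate_pool, committee)
```

Here the committee agrees on 0.1 and 0.9 but splits on 0.4, so 0.4 is the instance that would be sent to the oracle.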

3 PROBLEM STATEMENT

Data labeling is an essential pre-processing step for supervised machine learning, since supervised learning depends on the presence of labels.

According to reports, up to 80% of the time that companies spend on their machine learning projects is allocated to tasks such as cleaning, pre-processing and labeling data (CloudFactory.com, 2019), valuable time that could be spent on other tasks. For example, an ML system that is trained to recognize different animal species in a picture needs training data consisting of images that already have labels. Another example is autonomous vehicles such as self-driving cars. These cars are not yet safe enough to be deployed in traffic. To become safer, they need to be able to distinguish between different objects in their path. Therefore, we need to train the AI of the car using images in which the key features are labeled.


3.1 How does Data Labeling Work?

ML systems use large datasets for training in order to develop models that can learn patterns. This training data must be labeled or annotated based on the most essential features so that the model can organize the data in the best possible way.

It is essential to use labels that are informative and independent in order to create a high-quality algorithm. A well-labeled dataset provides an ML model with the empirical evidence needed to evaluate its accuracy, after which the model can be refined.

A “quality algorithm” is an algorithm that has both high “accuracy” and high “quality”, where “accuracy” refers to how good the predicted labels are and “quality” refers to how consistent the dataset is.

Errors in the data labeling lower the quality of the training data and therefore the performance of any model used for prediction. To avoid these errors, several organizations choose to implement human-in-the-loop (HITL) approaches, which keep humans involved in the training and testing of the models through the deployment phase. HITL is studied within Interactive Machine Learning (iML) (Holzinger, 2016).

3.2 Methods for Labeling

Companies can acquire labels for their data in several ways. Popular choices are:

• Crowd-sourcing: Allows companies to perform labeling more quickly by dividing the labeling task among a large number of people rather than relying on a single employee (CloudFactory.com, 2019).

• Contractors: Companies temporarily employ outside freelancers for labeling (CloudFactory.com, 2019).

• Managed Teams: Companies give the labeling task to a group that is trained specifically for labeling. This team is usually managed by a third-party organization (CloudFactory.com, 2019).

• In-house Staff: The company assigns the labeling to its current employees (CloudFactory.com, 2019).

There is no single optimal way of labeling data, and companies have to decide for themselves how their labeling should be done. When selecting a data labeling method, the main factors are the following:

1. Financial costs.

2. The size of the dataset.

3. The knowledge of the staff.

4. The objective of the ML model that needs labels.

The team performing the labeling must have excellent knowledge of the industry and its servers. They also need to be flexible, since labeling and machine learning are ever-changing processes that evolve quickly as more data comes in.

4 RESEARCH METHOD

This section presents the research method used in the study, namely a systematic literature review (SLR). A systematic literature review seeks to identify, analyse, and interpret all relevant research (i.e., primary studies) on the topic of interest (Keele et al., 2007). In this study the topic of interest is data labeling in machine learning, and the goal of our SLR is to identify and analyze literature in this research area. We followed the procedure for conducting systematic literature reviews described in (Keele et al., 2007). The procedure can be summarized as follows:

1. Deﬁnition of research questions.

2. Identiﬁcation of search terms and conducting

search.

3. Screening of papers on the basis of inclusion and

exclusion.

4. Data extraction and mapping.

In the rest of this section we outline this procedure.

4.1 Deﬁnition of Research Questions

The purpose of this study was to establish what research has been carried out on automatic labeling of data from different fields using different machine learning methods. Thus, the main objectives of this literature review are:

• Examine previous research on the subject of auto-

matic labeling.

• Explore the possibility of contributing with new

research within the area.

We deﬁne a number of research questions and their

motivations below.

RQ1. In what research fields can we apply active and semi-supervised learning? This RQ seeks to identify the different research fields that exploit active and semi-supervised learning.

RQ2. What kind of machine learning algorithms are used? This RQ seeks to identify which types of active learning and semi-supervised learning paradigms can be used.

RQ3. What is the popularity of data types among the different methods? This RQ seeks to identify, for each method, how many papers studied a specific datatype.

4.2 Identiﬁcation of Studies

A keyword-based database search was used to source relevant studies.

In this study, the main search string, which was constructed iteratively, consisted of the two keywords “active machine learning” OR “semi-supervised learning”. We first performed pilot searches with other keywords, such as “automatic labeling”, but these returned too wide a range of methods that were hard to categorize. We therefore changed the keywords to active learning and semi-supervised learning methods, as these were easier to categorize.

We then improved the search string in the following way. For active learning we searched for “active machine learning” + “category of active learning”. If we omitted “machine” from the string, we obtained results related to education. Similarly, for semi-supervised learning we searched for “semi-supervised learning” + “category of semi-supervised learning”. The categories are listed in Table 2. Some methods that are being researched but are not included in this study are:

• Constrained clustering (Basu et al., 2008), (Brefeld et al., 2006).

• Semi-supervised regression (Cortes and Mohri, 2007), (Sindhwani et al., 2005), (Zhou and Li, 2005).

• Model and feature selection using unlabeled data (Kääriäinen, 2005), (Madani et al., 2005), (Schuurmans and Southey, 2002), (Li and Guan, 2008).

• Label sampling such as multi-instance learning, multi-task learning and deep learning (Rosset et al., 2005), (Zhou and Xu, 2007), (Liu et al., 2008), (Ranzato and Szummer, 2008), (Weston et al., 2012).

We did not directly include any of these in the search

string as we could not ﬁnd any relevant papers con-

taining any industrial application.
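The search-string construction described in this subsection can be sketched as follows. The category names are taken from Table 2, leaving out the two cluster-based entries that were not searched for. The exact query syntax entered into the search engine is not given in the text, so joining two quoted phrases with a space is an assumption, as is the helper name `build_search_strings`.

```python
# Categories from Table 2, without "Cluster Based" and "Cluster-then-label".
ACTIVE_LEARNING_CATEGORIES = [
    "uncertainty sampling",
    "query by committee",
    "query by disagreement",
    "expected model change",
    "expected error reduction",
    "density-weighted methods",
    "variance reduction",
]
SEMI_SUPERVISED_CATEGORIES = [
    "semi-supervised SVM",
    "co-training",
    "mixture models",
    "EM algorithm",
    "multi-view learning",
    "graph-based",
]


def build_search_strings(prefix, categories):
    """Combine a fixed keyword with each category as two quoted phrases.
    The quoting and the space separator are assumed, not from the paper."""
    return [f'"{prefix}" "{category}"' for category in categories]


active_queries = build_search_strings(
    "active machine learning", ACTIVE_LEARNING_CATEGORIES
)
semi_supervised_queries = build_search_strings(
    "semi-supervised learning", SEMI_SUPERVISED_CATEGORIES
)
```

Keeping "machine" in the active learning prefix mirrors the observation above that the shorter phrase returns results about education.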

The search string was applied to Google Scholar (https://scholar.google.com). Since the search terms are very general, we expected a large number of relevant articles from the search and therefore deemed it sufficient to use only Google Scholar. A second reason for using only Google Scholar is that it is perceived as an unbiased source according to (Wohlin, 2014). Furthermore, we did not restrict the search to any time period: the rise of machine learning computation began around the year 2000, and papers published between 1980 and 1999 should mostly contain theoretical research that we deem unnecessary for the purpose of our study.

The search strings were applied in December 2019 to the selected electronic database to retrieve articles that include the keywords in their titles, abstracts and introductions. To keep the number of papers manageable, retrieval stopped once the abstracts and introductions of the returned articles became less relevant. In the end, approximately 300 articles were retrieved for further screening against the inclusion and exclusion criteria.

4.3 Study Selection: Inclusion and

Exclusion Criteria

All retrieved studies were examined for inclusion

and exclusion based on pre-established criteria. The

exclusion and inclusion criteria considered in our

study are presented below:

Inclusion Criteria.

• Papers that include AL/SSL techniques for labeling unlabeled and/or partially labeled data from industry.

• Papers that compare several AL/SSL techniques with each other.

• Papers that include a hybrid of AL and SSL.

• Papers that compare AL/SSL techniques with other non-AL/SSL methods.

• Papers that have a title that describes the application.

Exclusion Criteria.

• Papers concerning theoretical proofs of AL/SSL

methods.

• Papers concerning simulation studies.

4.4 Data Extraction and Analysis

Data extraction involved the collection of information

related to the RQs of the study. For each paper we

identiﬁed the research ﬁeld, what kind of datatype it

was and what method the paper focused on.

4.5 Threats to Validity

Although we did not include deep learning in our

search string some papers might include deep learn-

ing because active or semi-supervised learning was


applied to a deep neural network. Some of the papers

will contain theoretical properties as well as empirical

evaluation of the models. There is no way of telling

whether the data sets used in the papers have been

tampered with to ﬁt the models better.

5 RESULTS

In this section we interpret the results that we gathered based on the research questions in the previous section.

5.1 RQ1: In What Research Fields Can

We Apply Active and

Semi-supervised Learning?

Table 1 shows how we categorize the different types of data. Going from left to right, the first column contains the name of each category, the second column shows which datatypes belong to that category, the third column lists the research areas covered in each category and the last column references the papers that were used for each category.

5.2 RQ2: What Kind of Machine

Learning Algorithms Are used?

In this subsection we present the main active and

semi-supervised machine learning approaches based

on textbooks (Settles, 2012), (Zhu and Goldberg,

2009).

Table 2 shows a summary of the popular machine

learning methods for labeling (Settles, 2012), (Zhu

and Goldberg, 2009). In the left column we see the ac-

tive learning methods (Settles, 2012) and in the right

column we see the semi-supervised learning methods

(Zhu and Goldberg, 2009).

Table 3 shows how we have categorized the semi-supervised and active learning methods. In the left column we see the name of each category and in the right column we see which method(s) belong to each category. We did not include cluster-based active learning and cluster-then-label semi-supervised learning in the search string, as we did not find any papers relevant to industry.

Figure 1 shows an overview of how many papers focused on each of the active learning and semi-supervised learning methods. The horizontal axis shows the methods and the vertical axis shows the number of papers that focused on each method. The most popular category is co-training and multi-view learning with a total of eleven papers (Yan and Naphade, 2005), (Morsillo et al., 2009), (Zhang and Zheng, 2017), (Di and Crawford, 2011), (Guan et al., 2007), (Rigutini et al., 2005), (Guo and Xiao, 2012), (Cui et al., 2011), (Yu et al., 2010b), (Wu et al., 2019), (Jing et al., 2017). Second place is shared by graph-based semi-supervised learning (Tang et al., 2009), (Tang et al., 2011), (Tang et al., 2008), (Abbasi et al., 2015), (Zhao et al., 2015), (Liu and Kirchhoff, 2013), (Zeng et al., 2013), (Stikic et al., 2009), (Chen et al., 2008) and uncertainty sampling (Liu et al., 2016), (Rajan et al., 2008), (Minakawa et al., 2013), (Yu et al., 2010a), (Zhu et al., 2009), (Zhu et al., 2008), (Zhang and Chen, 2002), (Varadarajan et al., 2009), (Kim et al., 2006), with nine papers each. Mixture models follow with four papers (Colares et al., 2013), (Shi et al., 2010), (Huang and Hasegawa-Johnson, 2009), (Nigam et al., 2006).

Co-training and multi-view methods correspond to 25.00% of all the methods. Graph-based methods and uncertainty sampling correspond to 20.45% each. Mixture models land in fourth place with 9.09%.
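The reported shares can be checked with a short calculation. The paper counts are read from the citation lists in this subsection (eleven, nine, nine and four). The denominator of 44 is not stated in the text; it is an assumption, inferred here because it is the value that reproduces all of the reported percentages.

```python
# Number of papers per method, counted from the citations in this subsection.
paper_counts = {
    "co-training and multi-view": 11,
    "graph-based": 9,
    "uncertainty sampling": 9,
    "mixture models": 4,
}

# Assumption: inferred from the reported percentages, not stated in the paper.
TOTAL_METHOD_ASSIGNMENTS = 44


def share(count, total=TOTAL_METHOD_ASSIGNMENTS):
    """Percentage of all method assignments, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {method: share(count) for method, count in paper_counts.items()}
```

With this denominator the four shares come out as 25.0, 20.45, 20.45 and 9.09 percent, matching the figures above.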

5.3 RQ3: What is the Popularity of

Datatypes among the Different

Methods

Here we only present graphs for the most popular methods: graph-based methods, co-training and multi-view learning, mixture models and uncertainty sampling. The rest are omitted due to an insufficient amount of data.

The first plot from the left of Figure 2 illustrates that for multidimensional inputs we found four relevant papers (Tang et al., 2009), (Tang et al., 2011), (Tang et al., 2008), (Abbasi et al., 2015), for sequential inputs we found four relevant papers (Zhao et al., 2015), (Liu and Kirchhoff, 2013), (Zeng et al., 2013), (Stikic et al., 2009) and for univariate inputs we found one relevant paper (Chen et al., 2008).

The second plot from the left of Figure 2 illustrates that for multidimensional inputs we found four relevant papers (Yan and Naphade, 2005), (Morsillo et al., 2009), (Zhang and Zheng, 2017), (Di and Crawford, 2011). For sequential inputs we found seven relevant papers (Guan et al., 2007), (Rigutini et al., 2005), (Guo and Xiao, 2012), (Cui et al., 2011), (Yu et al., 2010b), (Wu et al., 2019), (Jing et al., 2017).

The third plot from the left of ﬁgure 2 illustrates

that for multidimensional inputs we found one paper

of interest (Colares et al., 2013), for sequential inputs

we found three papers of interest (Shi et al., 2010),

(Huang and Hasegawa-Johnson, 2009), (Nigam et al.,

2006) and for univariate inputs we found no paper of

interest.

The fourth plot from the left of Figure 2 illustrates that for multidimensional inputs we found three papers of interest (Liu et al., 2016), (Rajan et al., 2008), (Minakawa et al., 2013), for sequential inputs we found six relevant papers (Yu et al., 2010a), (Zhu et al., 2009), (Zhu et al., 2008), (Zhang and Chen, 2002), (Varadarajan et al., 2009), (Kim et al., 2006) and for univariate inputs we did not find any relevant papers.

From Figure 1 we can confirm that the most popular methods are based on co-training and multi-view learning, graph-based methods, mixture models and uncertainty sampling. Semi-supervised methods are clearly more popular than active learning methods, three to one. Uncertainty sampling, however, includes many ways of measuring uncertainty, so one could argue that it should be divided into sub-categories.
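To illustrate why uncertainty sampling could be split into sub-categories, the sketch below implements three common uncertainty measures described in (Settles, 2012): least confidence, margin and entropy. The function names and the example probability vectors are illustrative assumptions.

```python
import math


def least_confidence(probs):
    """1 minus the probability of the most likely class (higher = more uncertain)."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely classes (smaller = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def entropy(probs):
    """Shannon entropy of the predicted distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Two hypothetical predicted class distributions for two unlabeled instances.
instance_a = [0.50, 0.49, 0.01]
instance_b = [0.40, 0.30, 0.30]

measures = {
    "least_confidence": (least_confidence(instance_a), least_confidence(instance_b)),
    "margin": (margin(instance_a), margin(instance_b)),
    "entropy": (entropy(instance_a), entropy(instance_b)),
}
```

The measures do not always agree: by margin, `instance_a` is the more uncertain of the two, while by least confidence and entropy it is `instance_b`. A query strategy built on one measure may therefore select different instances than one built on another.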

Figure 1: Overview showing the distribution of each

method over the papers studied in this article.

Figure 2: Overview showing the distribution of datatypes

over methods based on uncertainty sampling.

6 DISCUSSION

In the results section we identified which semi-supervised and active learning methods are popular. It is important to highlight that most of the papers in this study are not based entirely on the methods examined but are some kind of hybrid with other types of learning.

In the background we presented several issues concerning missing labels in company data and the expense of fixing them. From the papers in this review, we found that the common factor they all shared was that labels were missing and were hard to obtain because of financial and labor costs.

None of the articles compared different semi-supervised learning algorithms with each other, so we cannot compare the accuracy of each method. This is because each method relies on its own distinct assumptions in order to work properly (Zhu and Goldberg, 2009). It is therefore to be expected that methods with different assumptions will not work on the same data. Thus, a comparison of semi-supervised algorithms is not necessary.

Active learning methods were compared thoroughly in the papers. The best query strategy could not be identified, but the empirical evidence suggests that every active learning approach outperformed random sampling. This indicates that active learning is more effective than choosing the instances to be labeled at random.

7 OPEN RESEARCH QUESTIONS

The semi-supervised methods described in this paper are the most basic ones and are taken from (Zhu and Goldberg, 2009), and the active learning methods are taken from (Settles, 2012). Most papers studied are based on active and semi-supervised learning algorithms. Some of the methods in the papers resemble these basic methods and some do not, which is why it is hard to compare every method with each other and why a lot of research is devoted simply to improving old methods.

Table 1: Categories for each application.

Category                 Datatype                    Area
Multidimensional inputs  Image, Video                Image classification/segmentation
                                                     Image retrieval
                                                     Detection in videos
                                                     Monocular 3D human pose estimation
                                                     Microalgae classification
Sequential inputs        Time Series, Signals, Text  Text classification, segmentation
                                                     Word-sense disambiguation
                                                     Signal processing
                                                     Spoken language understanding/Speech recognition
                                                     Word segmentation
                                                     Phonetic classification
                                                     Information extraction/retrieval
Univariate inputs        One-dimensional             Real time traffic classification
                                                     Webpage classification
                                                     Network intrusion detection

Table 2: Summary of all methods for “active” and “semi-supervised” learning.

Active Learning           Semi-supervised Learning
Uncertainty Sampling      Semi-supervised SVM
Query by Committee        Co-training
Query by Disagreement     Mixture Models
Expected Model Change     Cluster-then-label
Expected Error Reduction  EM algorithm
Density-Weighted methods  Multi-view learning
Variance Reduction        Graph-Based
Cluster Based

Table 3: Classification of semi-supervised and active learning methods.

Category         Sub-category  Methods
Semi-supervised  co-multi      Co-training and multi-view learning.
                 mix           Mixture models.
                 SVM           Support Vector Machines.
                 graph         Graph Based method.
Active           den           Density weighted methods.
                 query         Query-by methods, e.g. QBC and QBD.
                 red           Expected error or variance reduction.
                 unc           Uncertainty sampling based methods.

There are many open research questions. For example, as society becomes more data-driven, we need to know how to efficiently incorporate large or infinite amounts of data into our labeling algorithms (Settles, 2012). Researchers wish to create semi-supervised learning algorithms that perform better than supervised learning by selecting the best semi-supervised parameters and assumptions for the model (Settles, 2012). Ideally, semi-supervised learning should be usable with all types of data from different areas. To make semi-supervised learning work on all these different datatypes, we need to define new assumptions for the models and their parameters (Settles, 2012). An interesting field of study is the combination of active learning and semi-supervised learning. Active learning is first used to determine which instances to label. These manually labeled instances are then used for the semi-supervised part of the model (Weston et al., 2012). For more applications see (Hakkani-Tur et al., 2011), (Tur et al., 2005), (Zhu et al., 2003), (Leng et al., 2013).
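The two-stage combination of active and semi-supervised learning can be sketched as follows. This is an illustration under stated assumptions rather than the method of any cited paper: the data is one-dimensional, the seed set contains at least one example of each class, the oracle is a hypothetical function, and the semi-supervised stage uses self-training with a nearest-neighbour rule, which was chosen for brevity and is not one of the four categories listed in Section 2.1.

```python
def nearest_label(x, labeled):
    """Label of the labeled point closest to x."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]


def uncertainty(x, labeled):
    """Higher when x is about equally far from the nearest class-0 and
    class-1 points. Requires both classes to be present in `labeled`."""
    d0 = min(abs(x - p) for p, y in labeled if y == 0)
    d1 = min(abs(x - p) for p, y in labeled if y == 1)
    return -abs(d0 - d1)


def active_then_semi_supervised(pool, seed, oracle, budget):
    pool, labeled = list(pool), list(seed)
    # Stage 1 (active learning): ask the oracle about the most uncertain points.
    for _ in range(min(budget, len(pool))):
        query = max(pool, key=lambda x: uncertainty(x, labeled))
        pool.remove(query)
        labeled.append((query, oracle(query)))
    # Stage 2 (self-training): pseudo-label the remaining points, starting
    # with those closest to data that already carries a label.
    pseudo_labeled = []
    while pool:
        x = min(pool, key=lambda c: min(abs(c - p) for p, _ in labeled))
        pool.remove(x)
        y = nearest_label(x, labeled)
        labeled.append((x, y))
        pseudo_labeled.append((x, y))
    return labeled, pseudo_labeled


# Hypothetical oracle: the true boundary (0.5) is known only to the annotator.
def toy_oracle(x):
    return int(x >= 0.5)


points = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labeled, pseudo_labeled = active_then_semi_supervised(
    points, [(0.0, 0), (1.0, 1)], toy_oracle, budget=2
)
```

In this toy run only two of the eight pool points are sent to the oracle, and the remaining six receive pseudo-labels from the self-training stage.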

In future research we would like to explore the fol-

lowing:

• How can we combine active learning with semi-supervised deep learning models? Semi-supervised learning relies heavily on data assumptions. Deep learning, however, does not rely on the structure of the data.

• How can we train automatic labeling algorithms with additional infrastructure, e.g. test lab equipment?

• How do we use time as a mechanism for automatic labeling? When predicting an outcome, how do we use the actual outcome that becomes available after some time?

• How sensitive are learning algorithms to noise and low-quality data, and what mitigation strategies exist?

8 CONCLUSION

The goal of this study is to provide a structured overview of machine learning methods used for labeling unlabeled data and to identify the open research challenges associated with automatic labeling. The basis of this problem comes from industry rather than academia. Companies have vast amounts of data that are not useful for supervised learning tasks, as these require labeled data. Since more than 95% of the deployments of artificial intelligence in industry, based on our observations, are concerned with supervised learning, having labels is crucial for companies, and different strategies for obtaining labels have been adopted. These include obtaining labels through crowdsourcing, hiring individual contractors or educating their own staff so that they can do the labeling manually. All of these approaches involve large financial and labor costs that the companies wish to reduce.

It proves to be difficult to find a fully automatic approach to labeling, as most approaches need human intervention of some sort. Human intervention in machine learning is discussed in interactive machine learning. Active learning is a branch of machine learning in which we are allowed to pose queries in order to choose which instances should be labeled and included in the training set. Semi-supervised learning is a branch of machine learning in which we use a small set of labeled instances to try to achieve better results than supervised learning algorithms.

Both active and semi-supervised machine learning algorithms can be used to solve problems in which we have an insufficient amount of labeled data, but they do this in different ways. Based on our analysis we can say that semi-supervised and active learning methods are well developed for labeling and, of the two, semi-supervised learning seems to be more developed. Unlike active learning, semi-supervised learning does not require any human intervention and is therefore more “automatic” and requires less effort from humans. Furthermore, we see great potential in using the methods presented in this article for industrial applications and in contributing new ideas, especially for univariate data, since the current research on this datatype is lacking. A particularly interesting research topic is to combine active learning with semi-supervised learning. For example, one could use active learning to pose queries in order to find the optimal instances to label for inclusion in the training data, and then use semi-supervised learning for the task at hand.

The contribution of this paper is threefold. First,

we provide an overview of the available approaches

for (semi-)automatic labeling of data for machine

learning based on a systematic literature review. Sec-

ond, we present the data types that are typically sub-

ject to (semi-)automatic labeling and which data types

require additional research. Finally, we identify the

open research questions that need to be addressed by

the research community.

ACKNOWLEDGMENT

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

REFERENCES

Abbasi, M., Rabiee, H. R., and Gagné, C. (2015). Monocular 3d human pose estimation with a semi-supervised graph-based method. In 2015 International Conference on 3D Vision, pages 518–526. IEEE.

AzatiSoftware (2019). AzatiSoftware Automated Data Labeling with Machine Learning. https://azati.ai/automated-data-labeling-with-machine-learning.

Basu, S., Davidson, I., and Wagstaff, K. (2008). Constrained clustering: Advances in algorithms, theory, and applications. CRC Press.

Brefeld, U., Gärtner, T., Scheffer, T., and Wrobel, S. (2006). Efficient co-regularised least squares regression. In Proceedings of the 23rd international conference on Machine learning, pages 137–144.

Chen, C., Gong, Y., and Tian, Y. (2008). Semi-supervised learning methods for network intrusion detection. In 2008 IEEE international conference on systems, man and cybernetics, pages 2603–2608. IEEE.

CloudFactory.com (2019). The Ultimate Guide to Data Labeling for Machine Learning. https://www.cloudfactory.com/data-labeling-guide.

Colares, R. G., Machado, P., de Faria, M., Detoni, A., Tavano, V., et al. (2013). Microalgae classification using semi-supervised and active learning based on gaussian mixture models. Journal of the Brazilian Computer Society, 19(4):411–422.

Cortes, C. and Mohri, M. (2007). On transductive regression. In Advances in Neural Information Processing Systems, pages 305–312.

Cui, X., Huang, J., and Chien, J.-T. (2011). Multi-view and multi-objective semi-supervised learning for large vocabulary continuous speech recognition. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4668–4671. IEEE.

Di, W. and Crawford, M. M. (2011). View generation for multiview maximum disagreement based active learning for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 50(5):1942–1954.

Guan, D., Yuan, W., Lee, Y.-K., Gavrilov, A., and Lee, S. (2007). Activity recognition based on semi-supervised learning. In 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2007), pages 469–475. IEEE.

Guo, Y. and Xiao, M. (2012). Cross language text classiﬁca-

tion via subspace co-regularized multi-view learning.

arXiv preprint arXiv:1206.6481.

Hakkani-Tur, D. Z., Schapire, R. E., and Tur, G. (2011).

Combining active and semi-supervised learning for

spoken language understanding. US Patent 8,010,357.

Holzinger, A. (2016). Interactive machine learning for

health informatics: when do we need the human-in-

the-loop? Brain Informatics, 3(2):119–131.

Huang, J.-T. and Hasegawa-Johnson, M. (2009). On semi-

supervised learning of gaussian mixture models for

phonetic classiﬁcation. In Proceedings of the NAACL

HLT 2009 Workshop on Semi-Supervised Learning for

Natural Language Processing, pages 75–83. Associa-

tion for Computational Linguistics.

Machine Learning Models for Automatic Labeling: A Systematic Literature Review

559

Jing, X.-Y., Wu, F., Dong, X., Shan, S., and Chen, S.

(2017). Semi-supervised multi-view correlation fea-

ture learning with application to webpage classiﬁca-

tion. In Thirty-First AAAI Conference on Artiﬁcial In-

telligence.

ari

ainen, M. (2005). Generalization error bounds using

unlabeled data. In International Conference on Com-

putational Learning Theory, pages 127–142. Springer.

Keele, S. et al. (2007). Guidelines for performing system-

atic literature reviews in software engineering. Tech-

nical report, Technical report, Ver. 2.3 EBSE Techni-

cal Report. EBSE.

Kim, S., Song, Y., Kim, K., Cha, J.-W., and Lee, G. G.

(2006). Mmr-based active machine learning for bio

named entity recognition. In Proceedings of the Hu-

man Language Technology Conference of the NAACL,

Companion Volume: Short Papers, pages 69–72.

Leng, Y., Xu, X., and Qi, G. (2013). Combining active

learning and semi-supervised learning to construct

svm classiﬁer. Knowledge-Based Systems, 44:121–

131.

Li, Y. and Guan, C. (2008). Joint feature re-extraction and

classiﬁcation using an iterative semi-supervised sup-

port vector machine algorithm. Machine Learning,

71(1):33–53.

Liu, P., Zhang, H., and Eom, K. B. (2016). Active deep

learning for classiﬁcation of hyperspectral images.

IEEE Journal of Selected Topics in Applied Earth Ob-

servations and Remote Sensing, 10(2):712–724.

Liu, Q., Liao, X., and Carin, L. (2008). Semi-supervised

multitask learning. In Advances in Neural Information

Processing Systems, pages 937–944.

Liu, Y. and Kirchhoff, K. (2013). Graph-based semi-

supervised learning for phone and segment classiﬁca-

tion. In INTERSPEECH, pages 1840–1843.

Madani, O., Pennock, D. M., and Flake, G. W. (2005). Co-

validation: Using model disagreement on unlabeled

data to validate classiﬁcation algorithms. In Advances

in neural information processing systems, pages 873–

880.

Minakawa, M., Raytchev, B., Tamaki, T., and Kaneda, K.

(2013). Image sequence recognition with active learn-

ing using uncertainty sampling. In The 2013 Interna-

tional Joint Conference on Neural Networks (IJCNN),

pages 1–6. IEEE.

Morsillo, N., Pal, C., and Nelson, R. (2009). Semi-

supervised learning of visual classiﬁers from web im-

ages and text. In Twenty-First International Joint Con-

ference on Artiﬁcial Intelligence.

Nigam, K., McCallum, A., and Mitchell, T. (2006).

Semi-supervised text classiﬁcation using em. Semi-

Supervised Learning, pages 33–56.

Rajan, S., Ghosh, J., and Crawford, M. M. (2008). An active

learning approach to hyperspectral data classiﬁcation.

IEEE Transactions on Geoscience and Remote Sens-

ing, 46(4):1231–1242.

Ranzato, M. and Szummer, M. (2008). Semi-supervised

learning of compact document representations with

deep networks. In Proceedings of the 25th interna-

tional conference on Machine learning, pages 792–

799.

Rigutini, L., Maggini, M., and Liu, B. (2005). An em based

training algorithm for cross-language text categoriza-

tion. In The 2005 IEEE/WIC/ACM International Con-

ference on Web Intelligence (WI’05), pages 529–535.

IEEE.

Rosset, S., Zhu, J., Zou, H., and Hastie, T. J. (2005). A

method for inferring label sampling mechanisms in

semi-supervised learning. In Advances in neural in-

formation processing systems, pages 1161–1168.

Schuurmans, D. and Southey, F. (2002). Metric-based meth-

ods for adaptive model selection and regularization.

Machine Learning, 48(1-3):51–84.

Settles, B. (2012). Active learning, volume 6 of synthesis

lectures on artiﬁcial intelligence and machine learn-

ing. Morgan & Claypool.

Shi, L., Mihalcea, R., and Tian, M. (2010). Cross lan-

guage text classiﬁcation by model translation and

semi-supervised learning. In Proceedings of the 2010

Conference on Empirical Methods in Natural Lan-

guage Processing, pages 1057–1067. Association for

Computational Linguistics.

Sindhwani, V., Niyogi, P., and Belkin, M. (2005). A co-

regularization approach to semi-supervised learning

with multiple views. In Proceedings of ICML work-

shop on learning with multiple views, volume 2005,

pages 74–79. Citeseer.

Stikic, M., Larlus, D., and Schiele, B. (2009). Multi-graph

based semi-supervised learning for activity recogni-

tion. In 2009 International Symposium on Wearable

Computers, pages 85–92. IEEE.

Tang, J., Hong, R., Yan, S., Chua, T.-S., Qi, G.-J., and Jain,

R. (2011). Image annotation by k nn-sparse graph-

based label propagation over noisily tagged web im-

ages. ACM Transactions on Intelligent Systems and

Technology (TIST), 2(2):1–15.

Tang, J., Li, H., Qi, G.-J., and Chua, T.-S. (2008). Integrated

graph-based semi-supervised multiple/single instance

learning framework for image annotation. In Pro-

ceedings of the 16th ACM international conference on

Multimedia, pages 631–634.

Tang, J., Li, H., Qi, G.-J., and Chua, T.-S. (2009). Image

annotation by graph-based inference with integrated

multiple/single instance representations. IEEE Trans-

actions on Multimedia, 12(2):131–141.

Tur, G., Hakkani-T

ur, D., and Schapire, R. E. (2005). Com-

bining active and semi-supervised learning for spo-

ken language understanding. Speech Communication,

45(2):171–186.

Varadarajan, B., Yu, D., Deng, L., and Acero, A. (2009).

Maximizing global entropy reduction for active learn-

ing in speech recognition. In 2009 IEEE International

Conference on Acoustics, Speech and Signal Process-

ing, pages 4721–4724. IEEE.

Weston, J., Ratle, F., Mobahi, H., and Collobert, R. (2012).

Deep learning via semi-supervised embedding. In

Neural networks: Tricks of the trade, pages 639–655.

Springer.

ICSOFT 2020 - 15th International Conference on Software Technologies

560

Wohlin, C. (2014). Guidelines for snowballing in system-

atic literature studies and a replication in software en-

gineering. In Proceedings of the 18th international

conference on evaluation and assessment in software

engineering, pages 1–10.

Wu, F., Jing, X.-Y., Zhou, J., Ji, Y., Lan, C., Huang, Q., and

Wang, R. (2019). Semi-supervised multi-view indi-

vidual and sharable feature learning for webpage clas-

siﬁcation. In The World Wide Web Conference, pages

3349–3355.

Yan, R. and Naphade, M. (2005). Semi-supervised cross

feature learning for semantic concept detection in

videos. In 2005 IEEE Computer Society Confer-

ence on Computer Vision and Pattern Recognition

(CVPR’05), volume 1, pages 657–663. IEEE.

Yu, D., Varadarajan, B., Deng, L., and Acero, A.

(2010a). Active learning and semi-supervised learn-

ing for speech recognition: A uniﬁed framework using

the global entropy reduction maximization criterion.

Computer Speech & Language, 24(3):433–444.

Yu, Z., Su, L., Li, L., Zhao, Q., Mao, C., and Guo, J.

(2010b). Question classiﬁcation based on co-training

style semi-supervised learning. Pattern Recognition

Letters, 31(13):1975–1980.

Zeng, X., Wong, D. F., Chao, L. S., and Trancoso, I. (2013).

Graph-based semi-supervised model for joint chinese

word segmentation and part-of-speech tagging. In

Proceedings of the 51st Annual Meeting of the Associ-

ation for Computational Linguistics (Volume 1: Long

Papers), pages 770–779.

Zhang, C. and Chen, T. (2002). An active learning frame-

work for content-based information retrieval. IEEE

transactions on multimedia, 4(2):260–268.

Zhang, C. and Zheng, W.-S. (2017). Semi-supervised multi-

view discrete hashing for fast image search. IEEE

Transactions on Image Processing, 26(6):2604–2617.

Zhao, M., Chow, T. W., Zhang, Z., and Li, B. (2015). Au-

tomatic image annotation via compact graph based

semi-supervised learning. Knowledge-Based Systems,

76:148–165.

Zhou, Z.-H. and Li, M. (2005). Semi-supervised regression

with co-training. In IJCAI, volume 5, pages 908–913.

Zhou, Z.-H. and Xu, J.-M. (2007). On the relation between

multi-instance learning and semi-supervised learning.

In Proceedings of the 24th international conference on

Machine learning, pages 1167–1174.

Zhu, J., Wang, H., Tsou, B. K., and Ma, M. (2009). Ac-

tive learning with sampling by uncertainty and den-

sity for data annotations. IEEE Transactions on audio,

speech, and language processing, 18(6):1323–1331.

Zhu, J., Wang, H., Yao, T., and Tsou, B. K. (2008). Ac-

tive learning with sampling by uncertainty and den-

sity for word sense disambiguation and text classiﬁca-

tion. In Proceedings of the 22nd International Con-

ference on Computational Linguistics (Coling 2008),

pages 1137–1144.

Zhu, X. and Goldberg, A. B. (2009). Introduction to semi-

supervised learning. Synthesis lectures on artiﬁcial

intelligence and machine learning, 3(1):1–130.

Zhu, X., Lafferty, J., and Ghahramani, Z. (2003). Combin-

ing active learning and semi-supervised learning using

gaussian ﬁelds and harmonic functions. In ICML 2003

workshop on the continuum from labeled to unlabeled

data in machine learning and data mining, volume 3.

Machine Learning Models for Automatic Labeling: A Systematic Literature Review

561