A Systematic Mapping on the Use of Data Mining for the Face-to-Face

School Dropout Problem

Leandro Rondado de Sousa

, Veronica Oliveira de Carvalho

1 a

Bruno Elias Penteado

and Frank Jos

e Affonso

Universidade Estadual Paulista (Unesp), Instituto de Geoci

encias e Ci

encias Exatas, Rio Claro, Brazil

Universidade de S

ao Paulo (USP), Instituto de Ci

encias Matem

aticas e de Computac¸

ao, S

ao Carlos, Brazil

Keywords:

School Dropout, Face-to-Face, Data Mining, Systematic Mapping.

Abstract:

Dropout is a critical problem that affects institutions worldwide. Data mining is an analytical solution that has

been used to deal with it. Typically, data mining follows a structured process containing the following general

steps: data collection, pre-processing, pattern extraction, post-processing (validation). Until know, it is not

known how data mining has been used to address the dropout problem in face-to-face education considering

all steps of the process. For that, a Systematic Literature Mapping was conducted to identify and analyze

the primary studies available in the literature to address some research questions. The aim was to provide

an overview of the aspects related to data mining steps in the presented context, without going into details

about speciﬁc techniques, but about the solutions themselves (for example, imbalanced techniques, instead of

SMOTE). 118 papers were selected considering a period of 10 years (01/01/2010 to 31/12/2020).

1 INTRODUCTION

Dropout is a critical problem that affects institutions

worldwide. Many works have been done to under-

stand the factors that lead students to quit their stud-

ies. There is no consensus on the deﬁnition of dropout

(Manh

aes et al., 2014; M

arquez-Vera et al., 2016), but

in this paper it is considered as the students who in-

terrupt the course for any reason (course transfer, reg-

istration locking, etc.) and do not ﬁnish their studies

with their cohorts.

According to (Delen, 2011) there are two ap-

proaches that can be used to deal with the dropout

problem: survey-based and data-driven (analytic). In

the survey-based, theoretical models, such as the one

developed by Tinto (Tinto, 1993), are developed. In

the data-driven, institutional data are analyzed by an-

alytical methods. Data mining is one of those ana-

lytical solutions, as seen in (Gustian and Hundayani,

2017; Pertiwi et al., 2017; Pereira and Zambrano,

2017). As stated by (Plotnikova et al., 2020), data

mining projects generally follow a structured pro-

cess or methodology, such as KDD, CRISP-DM or

SEMMA. We are considering here the following main

steps that normally appear in these processes: data

https://orcid.org/0000-0003-1741-1618

collection, pre-processing, pattern extraction, post-

processing (validation). From now on, data mining

will be understood as a process containing these main

steps.

Although there are many studies that use data

mining to analyze the dropout problem, it is not

known how data mining has been used to address

the dropout problem in face-to-face education con-

sidering all steps of the process. Therefore, a Sys-

tematic Literature Mapping (SLM) (Kitchenham and

Charters, 2007) was done to identify and analyze the

primary studies available in the literature to address

some research questions. The aim was to provide

an overview of the aspects related to data mining

steps in the presented context, without going into de-

tails about speciﬁc techniques, but about the solutions

themselves (for example, imbalanced techniques, in-

stead of SMOTE).

It is important to mention that some secondary

studies were found, although only one similar to our

research. In (Agrusti et al., 2019) the authors also

present a systematic review on dropout through data

mining. However, they covered only one aspect of the

data mining process, the pattern extraction (speciﬁ-

cally techniques, algorithms and tools). Our goal is

to cover all steps of the process. In other words, the

authors did not capture all the aspects that interest us.

Rondado de Sousa, L., Oliveira de Carvalho, V., Penteado, B. and Affonso, F.

A Systematic Mapping on the Use of Data Mining for the Face-to-Face School Dropout Problem.

DOI: 10.5220/0010476300360047

In Proceedings of the 13th International Conference on Computer Supported Education (CSEDU 2021) - Volume 1, pages 36-47

ISBN: 978-989-758-502-9

 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reser ved

This work is structured as follows: Section 2 de-

scribes the SLM protocol. Section 3 presents the re-

sults, as well as the analyzes and discussions for each

research question. An overview of the studies is also

presented. Finally, Section 4 concludes the paper and

discusses some gaps that can be explored.

2 PROTOCOL

A Systematic Literature Mapping (SLM) (Kitchen-

ham and Charters, 2007) is a process in which a set of

studies, available in the literature, is analyzed based

on a research question. The aim is to provide an

overview of the state of the art through the presenta-

tion and discussion of the results considering the ana-

lyzes carried out in studies identiﬁed as relevant. For

that, a protocol is elaborated, which contains the fol-

lowing steps: (a) formulation of one or more research

questions (Section 2.1); (b) identiﬁcation of the pri-

mary studies to be considered (for that purpose, the

studies must be extracted and analyzed) (Section 2.2);

summary and discussion of the results (Section 3).

2.1 Research Questions

The aim of this SLM was to retrieve and analyze pri-

mary studies that use data mining in the dropout con-

text to understand how the process occurs, from data

collection to the validation of the extracted patterns.

This study only addressed the face-to-face dropout

problem, since all institutions have data on the tra-

jectories of their students in their academic systems,

which can be analyzed and explored. Therefore, the

questions below were formulated.

RQ1. What levels of education are explored? This

question aims to identify the levels of education (uni-

versity, high school, etc.) data mining has been ap-

plied. It is important to know if there are researches

focused on the different levels of education, since the

problem exists in all of them and the data mining pro-

cess can be used in these different contexts in order to

better understand the problem.

RQ2. Considering the samples (datasets) used,

how big are they and how are they generated? This

question aims to identify the samples sizes and the cut

that is made in the data to obtain the sample (by year,

by course, etc.). It is important to know the samples

sizes as this impacts on the extracted model in relation

to generalization and overﬁtting (Tan et al., 2018). In

addition, if the analyzes are being made to understand

the students of a certain course, a speciﬁc area (engi-

neering, for example), etc., and in what period (an-

nual, half-yearly, etc.).

RQ3. What aspects (features, attributes) have

been used to model the dropout problem? This

question aims to identify the features that have been

used to induce the models. From this analysis it is

possible to know if there is any aspect not yet ex-

plored (academic, demographic, etc.) or even to di-

rect future explorations towards what, in general, is

used.

RQ4. What kind of pre-processing has been ap-

plied to the samples? As stated by (Romero et al.,

2014), pre-processing is one of the most important

steps. It affects all other subsequent steps. There-

fore, it is important to identify the techniques that

have been applied to prepare the samples for model

induction.

RQ5. What algorithm families have been used?

As many algorithms can be used to induce a model,

it is important to identify those that have been ex-

plored and whether they are all predictive or whether

there are solutions using descriptive tasks. However,

as many algorithms can appear, we intent to group

them by similarity, as done, for example, in Weka

named here as family.

RQ6. What measures have been used to validate

the extracted patterns? After model induction, it

is necessary to validate the extracted patterns. It is

important to identify the measures that have been used

in the post-processing step, since some measures, as

accuracy, are too general to validate the results.

2.2 Identiﬁcation of the Primary Studies

To identify the primary studies relevant to data ex-

traction, it is necessary to deﬁne the search string, the

databases for retrieving the papers, the inclusion and

exclusion criteria to select or not a paper as relevant

and the steps to make the selection.

Search String. The search string was formu-

lated with the aim of contemplating the issues of

“dropout” and “data mining”: “({desertion} OR

{attrition} OR {withdrawal} OR {withdraw} OR

{evasion} OR {dropout} OR {dropouts} OR {drop-

out} OR {drop-outs} OR {drop out} OR {drop outs})

AND ({student} OR {students} OR {school} OR

{academic} OR {education}) AND ({data mining}

OR {machine learning}) AND NOT ({distance} OR

{online} OR {on-line})”. The ﬁrst part addresses the

dropout problem, the second the school context, the

third data mining and the last the application of a ﬁlter

to exclude distance education, since we focus on face-

to-face learning. To formulate this string we evaluated

https://www.cs.waikato.ac.nz/ml/weka/.

A Systematic Mapping on the Use of Data Mining for the Face-to-Face School Dropout Problem

the words frequently used in several works, as well as

their synonyms, and then veriﬁed the works retrieved

with such string, in the databases used, to calibrate it.

Source Selection. The search string was applied

only to electronic databases, making the necessary

adjustments to the syntax of each one. The follow-

ing electronic databases were considered: Scopus

Compendex

, ISI Web of Science

, IEEE Xplore

ACM Digital Library

and ScienceDirect

. The string

was applied to titles, abstracts and keywords. The pe-

riod considered in the search was from 01/01/2010 to

31/12/2020 (10 years)

Inclusion and Exclusion Criteria. The purpose of

deﬁning these criteria is to identify the primary stud-

ies that provide direct evidence in relation to the re-

search questions. Therefore, the studies to be selected

for data extraction are those that do not match any

exclusion criteria. The following inclusion criterion

was considered: (i) the paper addresses the topic of

data mining in the face-to-face school dropout con-

text. The following exclusion criteria were consid-

ered: (i) the paper is out of scope: does not address

school dropout, face-to-face learning or data mining;

(ii) the paper does not contain an abstract; (iii) the pa-

per only contains an abstract; (iv) the paper is a copy

or a version of another paper already considered; (v)

the paper is not a primary study (such as editorial, po-

sition paper, keynote, opinion, tutorial, poster, panel,

book, technical report, etc.); (vi) the paper is a sec-

ondary study (review, survey, etc.); (vii) we were un-

able to access the full paper; (viii) the paper addresses

the use of tools and/or softwares and not the data min-

ing process itself; (ix) the paper is not written in En-

glish.

Selection Steps. To assist the process we used the

StArt

tool, since it automatically detects duplicate

papers, manages the entire process, maintaining a his-

tory on the number of included and excluded papers

by selection step and electronic database. Figure 1

shows the steps that were used to select the papers,

which are described in the ﬁgure itself. The values

initially obtained in the searches, as the values ob-

tained in each step, are also presented in the ﬁgure.

www.scopus.com.

www.engineeringvillage.com.

http://apps.webofknowledge.com/.

http://ieeexplore.ieee.org.

http://dl.acm.org.

http://www.sciencedirect.com.

The SLM was completed in October, and, at the be-

ginning of the year, the search string was executed again

in the respective databases and the papers of 2020 that had

not been identiﬁed until then were added and analyzed as

described below.

http://lapes.dc.ufscar.br/tools/start tool.

A total of 118 articles were selected. For the sake

of space, the set of selected papers is available at

https://bit.ly/dropout2021.

2.3 Data Extraction and Synthesis

Data extraction is concerned with collecting informa-

tion from selected articles in order to answer research

questions. Data extraction was performed through the

reading of the selected papers. However, it is also

important to extract and organize more general data

about the papers. The extraction forms were also built

with the StArt tool. The tool allows, after the extrac-

tion, to export all information to an electronic spread-

sheet in order to perform the data synthesis.

2.4 Threats to Validity

One of the advantages of making a SLM is to present

an overview of the state of the art through a method-

ological and not arbitrary process. However, even in

this case, it is possible that relevant papers end up

not being included. In this work we can mention two

threats. The ﬁrst is concerned to the number of syn-

onyms related to the word dropout. As mentioned be-

fore, there is no consensus on the deﬁnition of dropout

and, therefore, many other words can be used to ex-

press dropout depending on the context. However,

we consider that we used several of them. The sec-

ond is concerned to the word data mining. As we

want to focus in all steps of the process, we consid-

ered the word machine learning as a synonym for it,

as they are general terms that cover aspects such as

techniques (such as classiﬁcation), algorithms (such

as C4.5), etc. Therefore, we chose to use only gen-

eral terms and not speciﬁc ones (as neural network,

Bayesian), as done in (Agrusti et al., 2019).

3 RESULTS AND ANALYSIS

Before presenting the results regarding the research

questions, more general data about the papers are pre-

sented.

3.1 Overview of the Studies

This section presents some general aspects about the

118 selected papers.

Publication Year. As mentioned before, the period

considered was from 01/01/2010 to 31/12/2020 (10

years). We could observe an increase (Figure 2) in

the number of publications since 2018. 75.42% (89

CSEDU 2021 - 13th International Conference on Computer Supported Education

SCOPUS

N=288

COMPENDEX

N=190

WEB OF

SCIENCE

N=181

IEEE XPLORE

N=88

ACM

N=15

SCIENCE

DIRECT

N=56

Step 1: identification

and organization of

articles returned

from digital libraries.

N=818

Step 2: automatic

removal of

duplicate articles

using the START

tool.

N=377

Step 3: review of titles and abstracts to

apply the inclusion and exclusion

criteria. Articles that met the exclusion

criteria have been removed. Articles

that met the inclusion criteria were

selected for the next step. If in doubt,

the article was included and evaluated

in the next step.

N=229

Step 4: review of complete

articles. Articles that met

the exclusion criteria have

been removed. Articles that

met the inclusion criteria

were maintained.

N=118

Figure 1: Selection steps.

studies)

of the papers were published from this year.

2019 (28.81%, 34) is the year with most publications,

followed by 2020 (27.12%, 32), 2018 (19.49%, 23),

2016 (7.63%, 9), 2017 (5.08%, 6), 2015 (4.24%, 5),

2010 (2.54%, 3), 2013 and 2014 (1.69%, 2 each),

2011 and 2012 (0.85%, 1 each). It can be noticed that

the interest in the presented context has been growing

year by year.

2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020

Figure 2: Number of publications per year.

Paper Type. Most papers were found in conferences

(55.08%, 65), followed by journals (44.07%, 52) and

book chapters (0.85%, 1).

Paper Country. An interesting fact that could be

observed is that dropout is a worldwide problem.

Researchers everywhere are struggling to understand

the problem. The following countries were iden-

tiﬁed, which are listed by the number of papers

(from highest to lowest values): USA (12.71%, 15),

Brazil (11.86%, 14), Colombia and Indonesia (8.47%,

10 each), India (7.63%, 9), Thailand (5.93%, 7),

Bangladesh (5.08%, 6), Ecuador (4.24%, 5), Hun-

gary and Spain (2.54%, 3 each), Taiwan, China, Italy,

Mexico, Peru, Malaysia, Croatia, Australia, Germany

and Philippines (1.69%, 2 each), Costa Rica, Por-

tugal, Czech Republic, Pakistan, Chile, Saudi Ara-

From now on we will always try to present the relative

and absolute values using “()” or “,”.

bia, Botswana, Latvia, Yemen, South Africa, Canada,

Bulgaria, Korea, United Arab Emirates, Oman and

Fiji (0.85%, 1 each). The country of the ﬁrst author

was considered.

Research Method. The papers were classiﬁed on

three categories

: “Comparative Analysis”, “Case

Study” and “Solution Proposal”. The Comparative

Analysis (56.78%, 67) includes papers that performed

comparative analyzes between techniques and/or al-

gorithms based on one or more datasets. The Case

Study (50%, 59) includes papers that presented an

analysis on a speciﬁc dataset using one or more tech-

niques and/or algorithms. The Solution Proposal

(7.63%, 9) includes papers that proposed a new so-

lution to the dropout problem through data mining.

It could be noticed that the tendency of the papers

was to perform an exploratory analysis on a speciﬁc

dataset considering different algorithms, bringing to-

gether the Comparative Analysis and Case Study cat-

egories.

3.2 RQ1: What Levels of Education Are

Explored?

The purpose of this question was to identify the lev-

els of education (university, high school, etc.) data

mining has been applied.

3.2.1 Results

It could be noticed that studies related to universities

were the majority with approximately 86.44% (102).

Sometimes a publication is counted more than once (for

example, it is classiﬁed in more than one category). Thus,

from now on, the sum of some percentages can be greater

than 100%.

A Systematic Mapping on the Use of Data Mining for the Face-to-Face School Dropout Problem

16.10% (19) were related with high school (countries:

USA, Mexico, Korea, India, Croatia, Hungary and

Brazil) (studies ID: 9, 10, 17, 18, 20, 26, 32, 41, 68,

72, 79, 94, 98, 102, 107, 111, 113, 115 and 116)

Considering the two countries with the highest num-

ber of studies, USA and Brazil, 66.67% (10) from the

USA studies are related with higher level and from

Brazil 85.71% (12).

3.2.2 Analysis and Discussion

Although dropout can occur at different levels of ed-

ucation, the level that stood out was the higher level.

The reasons for the preference for this level of edu-

cation are not reported, although studies mention that

dropout at higher level is a concern in several coun-

tries, including theoretical models of study in this

context, as addressed in (Perchinunno et al., 2019).

One of the reasons for the choice may be the fact

that the researchers use the databases of the institu-

tions where they are located. Another, as reported by

(Chen et al., 2018), is that at this level of education

more than 60% of the dropouts occur in the ﬁrst two

years. However, other levels are also important, as

reported by (Chung and Lee, 2019). Therefore, more

efforts could be made to better understand the dropout

problem at different levels.

3.3 RQ2. Considering the Samples

(Datasets) Used, How Big Are They

and How Are They Generated?

The purpose of this question was to identify the sam-

ples sizes and the cut that is made in the data to obtain

the sample (by year, by course, etc.).

3.3.1 Results

The size of the samples varied widely between works.

For a better understanding, the values were grouped in

ranges, as shown in Figure 3. It was noticed that al-

most 50% (47.46%, 56) of the works used small sam-

ples when compared to the educational context; in this

case, the sizes were less than or equal to 5,000. Con-

sidering this range (Figure 4) 60.70% (34) use sizes

less than 1,000 (28.81% (34) in relation to the total

(118)).

In relation to the strategies used to obtain the sam-

ples, it could be noticed that they followed a cer-

tain pattern: period of time (annual or half-yearly),

number of subjects and number of courses. Figure 5

presents the obtained patterns. It can be seen that the

From now on, the papers associated with the IDs can

be seen at https://bit.ly/dropout2021.

47.46% (56)

7.63% (9)

5.93% (7)

7.63% (9)

0.85% (1)

1.69% (2)

0.85% (1)

0% (0)

5.93% (7)

13.56% (16)

]1,-5,000]

]5,000-10,000]

]10,000-15,000]

]15,000-20,000]

]20,000-25,000]

]25,000-30,000]

]30,000-35,000]

]35,000-40,000]

]40,000-45,000]

]45,000-50,000]

>50,000

Not specified

Frequency

Ranges

Figure 3: Samples size grouped by ranges.

60.71%

(34)

12.50%

(7)

5.36%

(3)

8.93%

(5)

12.50%

(7)

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

]1-1,000] ]1,000-2,000] ]2,000-3,000] ]3,000-4,000] ]4,000-5,000]

Frequency

Ranges

Figure 4: Range ]1-5,000] (Figure 3) broken in ranges of

1,000.

samples include data from periods longer than two

years, as well as data from more than two courses.

For a better understanding of the results, since the in-

formation was obtained during the reading of the pa-

pers, a synthesis was generated to relate “period” x

“aspects related to the courses”. As seen in Figure 6

56.78% (67) of the studies have used samples cover-

ing two or more courses considering a period of more

than two years.

3.3.2 Analysis and Discussion

As mentioned before, it is important to know the sam-

ples sizes as this impacts on the extracted model in

relation to generalization and overﬁtting (Tan et al.,

2018). It was noticed that almost 50% (47.46%,

56) of the works used small samples when compared

to the educational context (being 28.81% (34) less

than 1,000). Therefore, in some contexts, it may be

difﬁcult to conclude about the results regarding the

dropout problem, i.e., the inferences may not be gen-

eralizable. This aspect was not considered during

the data mining process, regardless of the algorithms

CSEDU 2021 - 13th International Conference on Computer Supported Education

3.39%

(4)

35.59%

(42)

3.39%

(4)

20.34%

(24)

4.24%

(5)

1.69%

(2)

32.20%

(38)

15.25%

(18)

0.00%

5.00%

10.00%

15.00%

20.00%

25.00%

30.00%

35.00%

40.00%

Frequency

Figure 5: Samples patterns.

56.78% (67)

36.44%(43)

9.32%

(11)

2.54%(3)

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

Frequency

Figure 6: Samples patterns organized by “period” x “as-

pects related to the courses”.

used. In addition, it was noticed that the data used

to carry out the experiments are not available, which

hinders the reproducibility of the research (more on

reproducibility see (Tatman et al., 2018)). In general,

the data is a sample from a database of a speciﬁc in-

stitution.

The other aspect explored was whether the ana-

lyzes are being made to understand the students of

a certain course, a speciﬁc area (engineering, for ex-

ample), etc., and in what period (annual, half-yearly,

etc.). It could be seen that many studies focus on two

or more courses considering a period of more than

two years. Thus, there is a concern about diversiﬁca-

tion (several courses) considering a longer period of

time. This fact favors the problem of the size of the

samples, which in general is small.

Finally, it is worth mentioning that most studies

do not mention the moment when dropout is ana-

lyzed. Only 16.95% (20) of the works make this in-

dication, for example, (Delen, 2010), (Castro et al.,

2018), (Chai and Gibson, 2015), which indicate that

the dropout analysis is carried out at the end of each

semester.

3.4 RQ3. What Aspects (Features,

Attributes) Have Been Used to

Model the Dropout Problem?

The purpose of this question was to identify the fea-

tures that have been used to induce the models.

3.4.1 Results

Papers, in general, use different features; however, the

reason for including them is not essentially justiﬁed.

191 distinct features were accounted (including the la-

bel “Not speciﬁed”). Thus, to better understand the

set of distinct features found, the strategy of dividing

them into groups of variables was used, namely: de-

mographic, social, psychological, ﬁnancial and aca-

demic. The works of (Chai and Gibson, 2015), (Dhar-

mawan et al., 2018), (P

erez et al., 2018a), (P

erez

et al., 2018b), (Sorensen, 2018), (Guarin et al., 2015)

and (Delen, 2010) carry out similar strategies.

Figure 7 shows the distribution of the 191 fea-

tures across the groups. The two groups that stand

out are those related to academic and demographic

variables, which together represent more than 77.49%

(148) of the total. Table 1 shows the most representa-

tive features in each group, the frequency and percent-

age of occurrence. For the academic and demographic

groups, the features with a frequency greater than or

equal to 10 are presented; for the other groups (psy-

chological, social and ﬁnancial) greater than or equal

to 5. Note that a feature can appear in one or more

studies; thus column “Freq.” (frequency) presents

two pieces of information, X/Y, where X indicates

the number of occurrence of the listed features and Y

the number of occurrence of the features in the group.

“%” indicates the percentage regarding the listed fea-

tures (

). Finally, Table 2 presents for each of the

groups the papers IDs that contain, at least, one fea-

ture of the group. Observe that most papers use fea-

Figure 7: Distribution of the 191 distinct features across the

groups.

A Systematic Mapping on the Use of Data Mining for the Face-to-Face School Dropout Problem

Table 1: Most representative features of each group.

Feature

Group

Features Freq. %

Academic course (11.76%, 40), yield (7.94%, 27), year ticket and

GPA (6.47%, 22 each), course area (5%, 17), admission

note (4.41%, 15); conclusion year, credits per semester

and admission form (3.53%, 12 each) and native lan-

guage note (3.24%, 11)

190/340 55.88%

Demographic gender (19.26%, 68), age (9.35%, 33), has work (6.52%,

23), marital status (5.95%, 21), schooling of the fa-

ther (5.38%, 19), mother’s schooling (5.10%, 18), ad-

dress (4.53%, 16), mother has work and father has work

(4.25%, 15 each) and ethnicity (3.68%, 13)

241/353 68.27%

Psychological interest in studies (17.24%, 5), personality (17.24%, 5) 10/29 34.48%

Social relationship with friends (17.24%, 5) 5/29 17.24%

Finance familiar income (54.55%, 18), ﬁnancing type (15.15%, 5) 23/33 69.70%

Table 2: Studies by feature groups.

Feature

Group

Study ID Freq. %

Academic 2, 3, 5, 6, 7, 8, 10, 11, 12, 15, 16, 17, 20, 21, 22, 24, 25,

26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 40, 41,

42, 43, 44, 46, 47, 49, 50, 51, 53, 54, 55, 56, 57, 58, 59,

61, 62, 63, 64, 66, 67, 68, 69, 71, 75, 76, 78, 79, 80, 81,

83, 84, 86, 88, 89, 90, 91, 95, 99, 100, 101, 102, 103,

105, 107, 108, 109, 111, 112, 114, 115, 116, 118

85 72.03%

Demographic 1, 2, 6, 8, 9, 10, 11, 12, 15, 16, 17, 18, 20, 21, 24, 25,

26, 27, 28, 29, 30, 31, 32, 33, 35, 36, 37, 38, 40, 41, 42,

43, 44, 46, 48, 49, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61,

63, 64, 67, 68, 69, 71, 75, 76, 78, 79, 80, 81, 83, 84, 88,

89, 90, 91, 92, 96, 99, 100, 102, 103, 105, 108, 111, 112,

114, 115, 118

77 65.25%

Finance 2, 3, 10, 12, 24, 26, 27, 30, 31, 35, 36, 37, 42, 43, 46, 48,

56, 58, 60, 63, 71, 76, 79, 89, 92, 96, 102, 105

28 23.73%

Social 1, 6, 7, 28, 30, 60, 63, 71, 89, 92, 108, 111 12 10.17%

Psychological 1, 7, 28, 30, 41, 46, 48, 54, 71, 115 10 8.47%

Not speciﬁed 4, 9, 13, 14, 15, 17, 19, 22, 39, 45, 52, 65, 70, 72, 73, 74,

82, 85, 87, 93, 97, 98, 102, 104, 106, 113

26 22.03%

tures from the academic (72.03%) and demographic

(65.25%) groups.

3.4.2 Analysis and Discussion

As seen above, the two groups that stand out are those

related to academic and demographic features. The

use of these groups is justiﬁed due to the fact that the

institutions store these data in the student’s history;

however, the other groups represent additional data

that are usually collected through other mechanisms,

such as physical or digital forms. However, there is

a tendency to use variables from different groups in

order to verify their inﬂuence in the process. In ad-

dition, it is clear that despite the papers use a consid-

erable number of variables, there is no consensus on

their choice, indicating a gap to be explored regarding

their selection.

Another aspect observed is that few papers dis-

cuss and/or present the best predictive features, as

in the following works: (Adil et al., 2018) number

of family members, relationship with teachers, inter-

est in studies; (Delen, 2010) academic performance

(grades/concepts), ﬁnancial aid (scholarship); (Castro

et al., 2018) cite, in addition to age, variables selected

through the psychological test called BADyG, among

them, visual memory, speed of reasoning and ability

to complete sentences; (Chai and Gibson, 2015) aver-

CSEDU 2021 - 13th International Conference on Computer Supported Education

age grade in the course, study time, amount of credits

taken. Thus, even if the groups of academic and de-

mographic variables stand out, the predictive features

are not always within these groups. This lack of anal-

ysis of the best predictive features may be related, for

example, to data characteristics, types of samples, re-

search objective, etc. The fact is that there was no

consensus on the most predictive, i.e., it was not pos-

sible to identify the main predictive features and/or if

there is any relationship between them.

3.5 RQ4. What Kind of Pre-processing

Has Been Applied to the Samples?

The purpose of this question was to identify the tech-

niques that have been applied to prepare the samples

for model induction.

3.5.1 Results

Even though pre-processing is one of the most im-

portant steps, 42.37% (50) of the papers do not spec-

ify the type of pre-processing performed, although

most of them (57.63%, 68) mention that this step was

used. In these papers, the following pre-procesing

were done: missing values (20.34%, 24), data bal-

ance (19.49%, 23), attribute selection (16.10%, 19),

descriptive statistic (15.25%, 18), attribute reduction

and discretization (5.93%, 7 each), data normalization

and outlier (2.54%, 3 each), attribute creation (0.85%,

1). Table 3 presents for each of the pre-processing

techniques the papers IDs that contain, at least, one

solution regarding it. See (Tan et al., 2018) for details

on data pre-processing.

3.5.2 Analysis and Discussion

Although pre-processing affects all other subsequent

steps, the reason for the lack of information on this

step (42.37%) may be due to the characteristics of

the algorithms used, as explained in (Alexandropou-

los et al., 2019). The authors relate the algorithms to

the pre-processing techniques and indicate that some

algorithms already have implicit pre-processing steps.

It could be noticed that the step was considered so

trivial that its details were not presented or brieﬂy pre-

sented; however, the results may be compromised, es-

pecially in the context of dropout, since the problem

is inherently imbalanced (only 19.49% of the papers

comment on this aspect).

3.6 RQ5. What Algorithm Families

Have Been Used?

The purpose of this question was to identify the al-

gorithm families that have been explored and whether

they are all predictive or whether there are solutions

using descriptive tasks.

3.6.1 Results

68 distinct algorithms were accounted. Grouping

them by task we have: classiﬁcation (95.59%, 65),

clustering, association and sequential pattern (1.47%,

1 each). The task that stood out was classiﬁca-

tion. Grouping them by similarity, as previously men-

tioned, named here as family, we have the frequen-

cies shown in Figure 8. Table 4 shows the most rep-

resentative algorithm in each family, the frequency

and percentage of occurrence. Note that an algo-

rithm can appear in one or more studies; thus col-

umn “Freq.” (frequency) presents two pieces of in-

formation, X/Y, where X indicates the number of oc-

currence of the listed algorithm and Y the number of

occurrence of the algorithms in the family. “%” in-

dicates the percentage regarding the listed algorithm

(

). Finally, Table 5 presents for each of the families

the papers IDs that contain, at least, one algorithm of

the group. The families that stood out were decision

tree (69.49%, 82), ensemble (51.69%, 61) and regres-

sion (32.20%, 38) (as seen in Table 5).

23.53%

(16)

22.06% (15)

13.24%(9)

11.76%(8)

8.82%(6)

5.88%(4)

4.41%(3)

2.94%(2)

1.47%(1)

0.00% 5.00% 10.00% 15.00% 20.00% 25.00%

Decision Tree

Ensemble

Rule-Based

Neural Network/Deep Neural Networks

Regression

Bayesian

Instance-Based

Support Vector Machine

Clustering

Association

Discriminant Analysis

Sequential Pattern

Nature-Inspired

Figure 8: Frequency of algorithm families.

3.6.2 Analysis and Discussion

Although different algorithms were used, there was

no mention of the reasons for the choice. It was ob-

served that the choice was made arbitrarily, seeking

to diversify the exploration in order to measure the

results (see Section 3.1 about “Comparative Analy-

A Systematic Mapping on the Use of Data Mining for the Face-to-Face School Dropout Problem

Table 3: Studies by pre-processing techniques.

Technique Study ID Freq. %

Missing

Values

2, 5, 10, 16, 18, 21, 22, 23, 24, 29, 30, 32, 37, 38, 41, 45,

54, 58, 67, 68, 99, 100, 103, 108

24 20.34%

Data Balance 9, 18, 23, 28, 30, 31, 33, 38, 41, 47, 49, 50, 60, 61, 62,

63, 64, 79, 81, 85, 99, 107, 111

23 19.49%

Attribute

Selection

1, 2, 18, 21, 28, 30, 35, 41, 49, 53, 59, 63, 77, 80, 84, 88,

89, 93, 99, 105, 110, 112

19 16.10%

Descriptive

Statistic

2, 5, 7, 10, 11, 18, 19, 23, 24, 29, 37, 41, 42, 45, 46, 48,

50, 56

18 15.25%

Attribute

Reduction

2, 4, 30, 33, 58, 69, 103 7 5.93%

Discretization 10, 24, 29, 30, 32, 40, 109 7 5.93%

Data Nor-

malization

1, 2, 113 3 2.54%

Outlier 1, 58, 93 3 2.54%

Attribute

Creation

1 1 0.85%

Not speciﬁed 3, 6, 12, 13, 14, 15, 17, 20, 25, 26, 27, 34, 36, 39, 43, 44,

51, 52, 55, 57, 65, 66, 70, 71, 73, 74, 75, 76, 78, 82, 83,

86, 87, 90, 91, 92, 94, 95, 96, 97, 98, 101, 102, 104, 106,

114, 115, 116, 117, 118

50 42.37%

Table 4: Most representative algorithm of each family.

Family Algorithm Freq. %

Decision Tree J48/C4.5 41/118 34.75%

Ensemble Random Forest 40/86 46.51%

Regression Logistic Regression 36/42 85.71%

Bayesian Naive Bayes 31/37 83.78%

Neural Network/Deep Neural Network MLP 22/36 61.11%

Support Vector Machine SVM 29/30 96.67%

Rule-Based OneR 9/28 32.14%

Instance-Based KNN 20/22 90.91%

Clustering K-means 7/7 100%

Association Apriori 5/5 100%

Discriminant Analysis Linear Discriminant Analysis (LDA) 3/3 100%

Sequential Pattern PreﬁxSpan 1/1 100%

Nature-Inspired Bacterial Foraging Optimization (BFO) 1/1 100%

sis” and “Case Study”). The decision tree and regres-

sion families may have stood out due to the fact that

the algorithms belonging to them are interpretable

(white box) (Burkart and Huber, 2020), since it is

possible not only to generate a predictive model, but

also to understand the model generated. Therefore,

thinking nowadays about Explainable Artiﬁcial In-

telligence (XAI) and white box models, other fami-

lies could be explored, such as associative classiﬁers

(Padillo et al., 2020).

3.7 RQ6. What Measures Have Been

Used to Validate the Extracted

Patterns?

The purpose of this question was to identify the mea-

sures that have been used in the post-processing step

to validate the extracted patterns.

3.7.1 Results

The measures identiﬁed in the papers were accu-

racy (81.36%, 96), precision (37.29%, 44), recall

CSEDU 2021 - 13th International Conference on Computer Supported Education

Table 5: Studies by algorithm families.

Algorithm

Family

Study ID Freq. %

Decision Tree 1, 2, 4, 5, 6, 7, 8, 9, 10, 13, 14, 16, 17, 19, 21, 23, 24, 25, 26, 27,

28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,

45, 46, 48, 49, 52, 56, 57, 58, 59, 60, 62, 63, 64, 66, 68, 69, 74,

76, 77, 78, 79, 81, 83, 84, 86, 87, 88, 89, 90, 91, 92, 97, 99, 100,

101, 103, 105, 106, 108, 109, 110, 111, 112, 117, 118

82 69.49%

Ensemble 2, 3, 5, 6, 8, 11, 12, 13, 16, 17, 22, 23, 24, 29, 32, 37, 43, 46, 49,

51, 54, 55, 57, 58, 59, 61, 62, 65, 66, 67, 68, 69, 71, 72, 73, 75,

80, 82, 83, 84, 85, 87, 88, 89, 90, 91, 93, 94, 95, 96, 98, 99, 101,

102, 105, 108, 109, 112, 113, 114, 116

61 51.69%

Regression 2, 3, 6, 8, 12, 13, 15, 16, 17, 18, 19, 22, 29, 37, 44, 46, 47, 49,

51, 53, 54, 56, 64, 68, 69, 70, 83, 89, 90, 93, 94, 96, 101, 106,

107, 110, 112, 117

38 32.20%

Bayesian 2, 4, 6, 8, 12, 13, 15, 16, 21, 22, 23, 24, 26, 31, 32, 33, 34, 37,

43, 57, 58, 72, 77, 83, 84, 87, 89, 90, 91, 95, 97, 99, 100, 105,

115, 118

36 30.51%

Neural Net-

work/Deep

Neural Net-

work

2, 3, 11, 24, 29, 32, 33, 34, 37, 43, 44, 46, 50, 52, 53, 55, 56, 59,

60, 68, 69, 78, 81, 83, 84, 87, 89, 90, 101, 106, 110, 111, 112

33 27.97%

Support Vec-

tor Machine

1, 3, 12, 15, 17, 18, 22, 26, 32, 34, 37, 46, 56, 58, 59, 62, 65, 67,

68, 69, 87, 89, 90, 91, 93, 94, 95, 105, 110, 112

30 25.42%

Instance-

Based

1, 2, 6, 12, 16, 26, 32, 33, 56, 58, 59, 62, 68, 87, 89, 95, 96, 99,

101, 106, 113

21 17.80%

Rule-Based 23, 26, 30, 32, 38, 41, 43, 63, 77, 79, 84, 103, 115 13 11.01%

Clustering 14, 27, 29, 35, 45, 47, 104 7 5.93%

Association 12, 35, 45, 52, 104 5 4.24%

Discriminant

Analysis

57, 67, 106 3 2.54%

Nature-

Inspired

62 1 0.85%

Sequential

Pattern

20 1 0.85%

(33.90%, 40)

, f-measure (27.97%, 33), area under

curve (AUC) (25.42%, 30, including ROC curve),

true positive (20.34%, 24), true negative (16.10%,

19) false positive and sensitivity (10.17%, 12 each),

false negative (8.47%, 10), speciﬁcity (7.63%, 9),

kappa (5.93%, 7), absolute average error and geo-

metric mean (4.24%, 5 each), conﬁdence and gini

(3.39%, 4 each), root mean square error (2.54%, 3),

relative absolute error (1.69%, 2) and support, un-

weighted average recall (UAR), permutation decrease

importance (0.85%, 1 each).

3.7.2 Analysis and Discussion

As the classiﬁcation task was the one that stood out,

measures referring to the confusion matrix were fre-

Recall and sensitivity are the same measures. However,

we preferred to use both to keep the way it was cited.

quently used. Accuracy was the most prevalent mea-

sure. However, as mentioned in (Fern

andez et al.,

2018, p.47-49), accuracy is not an adequate measure

to be applied when unbalanced data are used, which

usually occurs when working in the dropout domain.

Thus, the results do not necessarily express clearly the

validity of the obtained model, as it can correctly pre-

dict the examples of the majority class (dropout) and

incorrectly those of the minority class (non-dropout).

However, it is important to mention that 74 (77.08%)

of the 96 studies that used accuracy also applied other

assessment measures. Finally, it was observed that,

as in the algorithm families, the reason for the choice

was not mentioned.

A Systematic Mapping on the Use of Data Mining for the Face-to-Face School Dropout Problem

4 CONCLUDING REMARKS

This work presented a SLM to identify and ana-

lyze the primary studies available in the literature

to address some research questions on the use of

data mining for the face-to-face school dropout prob-

lem. The period considered was from 01/01/2010 to

12/31/2020 (10 years). In general, it was observed

that: (i) the academic community has shown inter-

est in the subject, approaching it more strongly since

2018 and, in particular, involving the level of higher

education; (ii) the subject has been addressed in sev-

eral countries, as it is a global problem; (iii) the use of

data mining is more focused on exploratory analysis

(“comparative analysis” and “case study‘”) on spe-

ciﬁc datasets (samples); (iv) the samples are gener-

ally small and cover two or more courses consider-

ing a period of more than two years; (v) many fea-

tures are considered in the selection of the samples,

with emphasis on the group of academic and demo-

graphic variables; (vi) most studies adhere to the clas-

siﬁcation task, with families of decision tree, ensem-

ble and regression algorithms being used frequently;

(vii) several pre-processing techniques and validation

measures (post-processing) were used.

Considering the SLM it is noted that some gaps

can be explored in order to use data mining, in the

presented context, in a broader way. As mentioned

before, as samples are, in general, small it may be

difﬁcult to conclude about the results regarding the

dropout problem, i.e., the inferences may not be gen-

eralizable. Thus, larger samples could be considered,

since the amount of data that make up an educational

system is generally high. This also makes it possible

to apply other families of algorithms, such as those of

deep neural networks. However, in this case, it would

be of interest to use XAI to make the results inter-

pretable, since in some contexts only prediction is not

enough. Another solution would be the application

of other inherently interpretable families (white box),

such as that of associative classiﬁers.

Still regarding the samples, it would be interest-

ing if the researchers made their datasets (samples)

available, in order to allow the reproducibility of the

experiments. In this case, it would be possible to

build a “ global ” dataset, making it feasible to obtain

an overview of evasion in several countries. A more

general analysis of the algorithms in relation to the

dropout problem would also be possible, as well as

the adaptation and/or proposal of speciﬁc solutions to

the problem. In addition, it is clear that despite the pa-

pers use a considerable number of variables, there is

no consensus on their choice. It is necessary to carry

out studies that try to identify which features are most

relevant or whether one group of variables has more

weight than another in the presented context.

Regarding pre-processing, the step was consid-

ered so trivial that its details were not presented or

brieﬂy presented. However, the dropout problem is

inherently imbalanced. Thus, it is interesting that

studies evaluating this issue of imbalance be car-

ried out. Associated with this, in relation to post-

processing, it is necessary that more appropriate val-

idation measures be used, and not just general ones

such as accuracy.

Finally, it is important to mention that if interested

readers want some additional information on any of

the presented aspects, they can consult the details

in the eletronic spreadsheet available at https://bit.ly/

msl evasao2020. The spreadsheet contains a tab re-

garding each question/aspect discussed here. For ex-

ample, if the reader wants to know the techniques

used to balance the samples in the pre-processing

step, he/she only needs to consult the tab related to

the topic. The aim of this SLM was to provide an

overview of the aspects related to data mining, with-

out going into details about the speciﬁc techniques,

but about the solution itself (for example, imbalanced

techniques, instead of SMOTE).

REFERENCES

Adil, M., Tahir, F., and Maqsood, S. (2018). Predictive anal-

ysis for student retention by using neuro-fuzzy algo-

rithm. In 10th Computer Science and Electronic En-

gineering (CEEC), pages 41–45.

Agrusti, F., Bonavolonta, G., and Mezzini, M. (2019). Uni-

versity dropout prediction through educational data

mining techniques: A systematic review. Journal of

e-learning and knowledge society, 15:161–182.

Alexandropoulos, S.-A. N., Kotsiantis, S. B., and Vra-

hatis, M. N. (2019). Data preprocessing in predic-

tive data mining. The Knowledge Engineering Review,

34(e1):1–33.

Burkart, N. and Huber, M. F. (2020). A survey on the ex-

plainability of supervised machine learning. CoRR,

abs/2011.07876.

Castro, R. L. F., Espitia, P. E., and Montilla, A. F. (2018).

Applying CRISP-DM in a KDD process for the analy-

sis of student attrition. Communications in Computer

and Information Science, 885:386–401.

Chai, K. E. K. and Gibson, D. (2015). Predicting the risk of

attrition for undergraduate students with time based

modelling. In Proceedings of the 12th International

Conference on Cognition and Exploratory Learning

in the Digital Age (CELDA), pages 109–116.

Chen, Y., Johri, A., and Rangwala, H. (2018). Running

out of STEM: A comparative study across STEM ma-

jors of college students at-risk of dropping out early.

CSEDU 2021 - 13th International Conference on Computer Supported Education

In ACM International Conference Proceeding Series,

pages 270–279.

Chung, J. Y. and Lee, S. (2019). Dropout early warning sys-

tems for high school students using machine learning.

Children and Youth Services Review, 96(C):346–353.

Delen, D. (2010). A comparative analysis of machine learn-

ing techniques for student retention management. De-

cision Support Systems, 49(4):498–506.

Delen, D. (2011). Predicting student attrition with data min-

ing methods. Journal of College Student Retention:

Research, Theory & Practice, 13(1):17–35.

Dharmawan, T., Ginardi, H., and Munif, A. (2018). Dropout

detection using non-academic data. In 4th Interna-

tional Conference on Science and Technology, pages

1–4.

Fern

andez, A., Garc

ıa, S., Galar, M., Prati, R. C.,

Krawczyk, B., and Herrera, F. (2018). Learning from

Imbalanced Data Sets. Springer.

Guarin, C. E. L., Guzman, E. L., and Gonzalez, F. A.

(2015). A model to predict low academic perfor-

mance at a speciﬁc enrollment using data mining. Re-

vista Iberoamericana de Tecnologias del Aprendizaje,

10(3):119–125.

Gustian, D. and Hundayani, R. D. (2017). Combination of

AHP method with C4.5 in the level classiﬁcation level

out students. In International Conference on Comput-

ing, Engineering, and Design (ICCED), page 6p.

Kitchenham, B. and Charters, S. (2007). Guidelines for per-

forming systematic literature reviews in software en-

gineering. Technical Report EBSE 2007-001, Keele

University and Durham University Joint Report.

Manh

aes, L. M. B., Cruz, S. M. S., and Zimbr

ao, G. (2014).

WAVE: An architecture for predicting dropout in un-

dergraduate courses using EDM. In Proceedings of

the 29th Annual ACM Symposium on Applied Com-

puting (SAC), pages 243–247.

arquez-Vera, C., Cano, A., Romero, C., Noaman, A.

Y. M., Mousa Fardoun, H., and Ventura, S. (2016).

Early dropout prediction using data mining: A case

study with high school students. Expert Systems: The

Journal of Knowledge Engineering, 33(1):107–124.

Padillo, F., Luna, J. M., and Ventura, S. (2020). LAC: Li-

brary for associative classiﬁcation. Knowledge-Based

Systems, 193:105432.

Perchinunno, P., Bilancia, M., and Vitale, D. (2019). A sta-

tistical analysis of factors affecting higher education

dropouts. Social Indicators Research, pages 1–22.

Pereira, R. T. and Zambrano, J. C. (2017). Application of

decision trees for detection of student dropout proﬁles.

In 16th IEEE International Conference on Machine

Learning and Applications (ICMLA), pages 528–531.

erez, B., Castellanos, C., and Correal, D. (2018a). Apply-

ing data mining techniques to predict student dropout:

A case study. In IEEE 1st Colombian Conference on

Applications in Computational Intelligence, pages 1–

erez, B., Castellanos, C., and Correal, D. (2018b). Pre-

dicting student drop-out rates using data mining tech-

niques: A case study. Communications in Computer

and Information Science, 833:111–125.

Pertiwi, A. G., Widyaningtyas, T., and Pujianto, U. (2017).

Classiﬁcation of province based on dropout rate us-

ing C4.5 algorithm. In International Conference on

Sustainable Information Engineering and Technology

(SIET), pages 410–413.

Plotnikova, V., Dumas, M., and Milani, F. (2020). Adap-

tations of data mining methodologies: A systematic

literature review. PeerJ Computer Science, 6:e267.

Romero, C., Romero, J. R., and Ventura, S. (2014). A sur-

vey on pre-processing educational data. Studies in

Computational Intelligence, 524:29–64.

Sorensen, L. (2018). Big Data in educational adminis-

tration: An application for predicting school dropout

risk. Educational Administration Quarterly.

Tan, P.-N., Steinbach, M., Karpatne, A., and Kumar, V.

(2018). Introduction to Data Mining. Pearson, 2nd

edition.

Tatman, R., VanderPlas, J., and Dane, S. (2018). A prac-

tical taxonomy of reproducibility for machine learn-

ing research. In Reproducibility in Machine Learning

Workshop at ICML 2018, page 5p.

Tinto, V. (1993). Leaving College: Rethinking the Causes

and Cures of Student Attrition. University of Chicago

Press.

A Systematic Mapping on the Use of Data Mining for the Face-to-Face School Dropout Problem