Fairness of In-session Dropout Prediction
Nathalie Rzepka¹, Katharina Simbeck¹, Hans-Georg Müller² and Niels Pinkwart³
¹Hochschule für Technik und Wirtschaft, Treskowallee 8, Berlin, Germany
²Department of German Studies, University of Potsdam, Potsdam, Germany
³Department of Computer Science, Main University, Berlin, Germany
Keywords: Fairness, Dropout Prediction, Algorithmic Bias.
Abstract: The increasing use of machine learning models in education is accompanied by some concerns about their
fairness. While most research on the fairness of machine learning models in education focuses on
discrimination by gender or race, other variables such as parental educational background or home literacy
environment are known to impact children's literacy skills too. This paper, therefore, evaluates three different
implementations of in-session dropout prediction models used in a learning platform to accompany German
school classes with respect to their fairness based on four different fairness measures. We evaluate the models
for discrimination of gender, migration background, parental education, and home literacy environment.
While predictive parity and equal opportunity are rarely above the defined threshold, predictive equality and
slicing analysis indicate that model quality is slightly better for boys, users with higher parental education,
users with less than ten books, and users with a migrant background. Furthermore, our analysis of the temporal
prediction shows that with increasing accuracy of the model, the fairness decreases. In conclusion, we see that
the fairness of a model depends on 1) the fairness measure, 2) the evaluated demographic group and 3) the
data with which the model is trained.
1 INTRODUCTION
Algorithmic decision-making processes are playing
an increasing role in all areas of life. In addition to
education, AI models are used in marketing,
healthcare, human resources, and economics.
However, the assumption that all decisions made by
algorithms are fair and objective has already been
refuted in numerous studies. One example of algorithmic bias is a computer-calculated risk assessment score used to predict the likelihood that criminals will reoffend, which has been shown to discriminate against Black defendants, mislabeling them as high risk almost twice as often as white defendants (Angwin et al. 2016). Experiments testing the performance of different face recognition algorithms have shown both commercial and non-trainable algorithms to be racially and sexually biased, with lower hit accuracies for certain cohorts (females, Black people, ages 18-30) than for the remaining cohorts within their demographics (Klare et al. 2012). A study
using the Google Translate API to translate sentences
from gender-neutral languages into English has
shown the program’s tendency to use male defaults,
especially in sentences related to male dominated
fields such as STEM jobs (Prates et al. 2020).
The growing use of machine learning models in
educational contexts therefore goes hand in hand with
concerns about their fairness. Many studies have already evaluated different applications and found discriminatory tendencies, for example (Gardner et al. 2019; Hu and Rangwala 2020; Riazy and Simbeck 2019). In their review, Baker et al. (2021)
criticize the fact that most studies examine
discrimination based on gender and race - other
characteristics are considered much less frequently.
However, especially in the case of discrimination by
models in education, other factors are crucial. This is
because many studies have already shown that
educational success is strongly linked to the social
background of the family (Carroll et al. 2019; Lee and
Burkam 2007; Steinlen and Piske 2013). Bias in the
educational context can become relevant, for
example, when implementing dropout prediction
models. Dropout prediction models in massive open
online courses (MOOCs) or higher education have
already been created extensively in many studies
(Tasnim et al. 2019; Okubo et al. 2017; Stapel et al.
2016; Sun et al. 2019; Wang et al. 2017; Xing and Du
2019).
In our setting, we consider different
implementations of a temporal in-session dropout
prediction model that has been created in the context
of an online learning platform for German spelling
and grammar skills. The platform is mainly used in
secondary school lessons and offers various exercises
in all orthographic domains. Since the temporal in-
session dropout prediction model we studied is
applied in school lessons, it differs from MOOCs in
some respects: the homework assigned there is
obligatory for students and the time frame is also not
individual, but set by the teachers. Dropouts in MOOCs also differ from dropouts on our platform, as in the German school system students cannot drop out of an individual course, but only repeat the entire grade level. We therefore consider the in-session dropout, i.e. the early termination of a session without finishing the assignments. From a didactic point of view, it is preferable that learners complete a set of exercises
in one training session. Dropping out early, e.g. due
to frustration, stops the learning process. Despite the
differences to MOOCs, we believe it is useful to
investigate platforms used in the school context to
find out more about the integration of online learning
in the classroom.
To evaluate the models in terms of their fairness,
we consider not only classical variables such as
gender and migration background, but also parental
educational background and home literacy
environment (HLE). For this purpose, we proceed as
follows: first, we will summarize the theoretical
foundations of algorithmic bias and fairness and of
dropout prediction. Then, we will describe the study
setting, including the data set, the in-session dropout
prediction models, and the different groups for which the model is examined for discrimination. The selected
fairness measures will then be calculated per group
and model and interpreted.
2 RELATED WORK
2.1 Algorithmic Bias
In education, machine learning-based models are used, for example, in dropout or at-risk predictions, adaptive learning environments which give personalized feedback and correction, automated scoring systems, or the identification of struggling students (Kizilcec and Lee 2020). As interventions, feedback, or scoring have a huge impact on students' educational paths, these applications and models should be evaluated with regard to fairness.
Algorithmic fairness, however, is discussed not only
in educational contexts but in almost all aspects of our
lives (Hajian et al. 2016; O'Neil 2016).
When talking about algorithmic fairness, the
term algorithmic bias is often used as well. However,
the terms are not synonymous: in the Merriam
Webster dictionary bias is defined as "a tendency to
believe that some people, ideas, etc., are better than
others that usually results in treating some people
unfairly" (Merriam-Webster Dictionary 2021a). Bias
can occur in different stages of machine learning, and
thus can lead to unfair models. Fairness is defined as
"the quality or state of being fair; especially fair or
impartial treatment; lack of favoritism toward one
side or another." (Merriam-Webster Dictionary
2021b). We therefore start with the description of
causes and origins of bias. Later on we introduce
disadvantaged groups and describe different fairness
metrics.
2.1.1 Causes and Origins of Algorithmic
Bias
There are different attempts to define causes or to
locate the sources of biases in machine learning
models (Pessach and Shmueli 2020). Pessach and
Shmueli, for example, differentiate between four causes. The first one derives from bias in the dataset, which is replicated by the machine learning models. The second one stems from missing data or data selection biases, which result in unrepresentative datasets. Further, there can be proxy attributes, i.e. non-sensitive attributes that derive from sensitive attributes. Lastly, if the goal is to minimize the overall aggregated prediction error, a model could benefit the majority group over minorities (Pessach and Shmueli 2020). Mitchell et
al. categorize two components of biased data,
statistical bias and societal bias (Mitchell et al. 2021).
Statistical bias arises from a mismatch between the training sample and reality, e.g., when the dataset
is not representative. Societal bias, on the other hand,
reflects the world as it is and replicates pre-existing
discrimination in reality (Mitchell et al. 2021).
2.1.2 Disadvantaged Groups
There are several groups that can be discriminated
against by machine learning model implementations.
Some group characteristics are protected by law, such
as gender, race, ethnicity, sexual orientation, religion,
or disability (Baker and Hawn 2021). However,
discrimination goes beyond that. Most research on
educational practices focuses on fairness regarding
gender, race, and nationality (Baker and Hawn 2021).
Research on gender-related discrimination thus defines only two gender groups (male and female), without considering non-binary or transgender people.
Baker and Hawn (2021) argue that there is a huge
research gap on bias in education considering other
disadvantaged groups, such as urbanicity,
socioeconomic status, native language, disabilities,
speed of learning or parental education background.
2.2 Definitions and Measures for
Fairness
There are various measures of algorithmic fairness,
which are comprehensively described in (Kizilcec
and Lee 2020; Mitchell et al. 2021; Verma and Rubin
2018). There is no ground truth in measuring fairness
and different measures should be chosen based on the
context, as they come with different advantages or
disadvantages (Pessach and Shmueli 2020). Not all fairness criteria can be satisfied simultaneously (Pessach and Shmueli 2020). Most of the measures
are statistical and rely on the confusion matrix: true
positive (TP), false positive (FP), false negative (FN),
true negative (TN). In their work, Verma and Rubin
(2018) group fairness metrics into three categories:
Definitions based on predicted outcome (1) focus
solely on the predictions for different demographic
distributions and do not consider the actual outcome.
Definitions based on predicted and actual outcomes
(2), on the other hand, consider both. Definitions
based on predicted probabilities and actual outcome
(3) use the predicted probability score instead of the
binary prediction outcome. Kizilcec and Lee (2020)
define three statistical notions of fairness:
independence, separation, and sufficiency.
Independence is satisfied if the protected and
unprotected groups have the same opportunity for a
predicted outcome. Separation and Sufficiency, on
the other hand, go further and do not only consider
prediction outcome but also actual outcome.
Separation requires therefore that an algorithm’s
prediction is correct and incorrect at similar rates for
different groups (Kizilcec and Lee 2020). Sufficiency
is satisfied if the proportions of correctly predicted
forecasts are equal across subgroups (Baker and
Hawn 2021).
To validate a machine learning model for
fairness, accuracy metrics are used to evaluate how
much the effectiveness of the predictive model differs
between the respective subgroups. Once the
difference in accuracy of a predictive model between two groups exceeds a threshold, the algorithm is
considered discriminatory because demographic
parity is no longer ensured (Riazy and Simbeck
2019).
In the following, we describe four notions of
fairness that will be used in the later model
evaluation. Note, however, that there are many more
measures of fairness extensively described in other
work, for example in (Verma and Rubin 2018). We
define $S$ as the protected variable, $Y$ as the attribute to be predicted, and $R$ as the prediction outcome.
One definition based on predicted and actual
outcomes is predictive parity (PP), which is
satisfied if both subgroups have equal positive predictive values (Verma and Rubin 2018). The positive predictive value is defined by
$PPV = \frac{TP}{TP + FP}$ (1)
and often referred to as precision. Formally,
predictive parity is defined as:
$P(Y=1 \mid R=1, S=s_1) = P(Y=1 \mid R=1, S=s_2)$ (2)
Predictive Equality (PE) is satisfied if both subgroups have equal false positive rates (Verma and Rubin 2018). The false positive rate is defined by
$FPR = \frac{FP}{FP + TN}$ (3)
Respectively, predictive equality is defined as:
$P(R=1 \mid Y=0, S=s_1) = P(R=1 \mid Y=0, S=s_2)$ (4)
Equal Opportunity (EO) is satisfied if both groups have equal false negative rates (Verma and Rubin 2018). The false negative rate is defined by
$FNR = \frac{FN}{TP + FN}$ (5)
Subsequently, equal opportunity is defined by:
$P(R=0 \mid Y=1, S=s_1) = P(R=0 \mid Y=1, S=s_2)$ (6)
Slicing Analysis (SA) demands equal AUROC across subgroups. The receiver operating characteristic curve plots the TPR against the FPR at different classification thresholds; from that, the area under the curve (AUROC) can be calculated by integration.
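To make these notions concrete, the following sketch (not taken from our implementation) computes the per-group quantities underlying the four measures, namely precision (PPV), FPR, FNR, and AUROC, from binary labels, binary predictions, and prediction scores using scikit-learn; all function and variable names are illustrative.

```python
# Minimal sketch: per-group quantities behind the four fairness notions.
# y_true and y_pred are 0/1 arrays, y_score holds predicted probabilities;
# it is assumed that each group contains both outcome classes.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def group_rates(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "PPV": tp / (tp + fp),                  # precision, Eq. (1), predictive parity
        "FPR": fp / (fp + tn),                  # Eq. (3), predictive equality
        "FNR": fn / (tp + fn),                  # Eq. (5), equal opportunity
        "AUC": roc_auc_score(y_true, y_score),  # AUROC, slicing analysis
    }

# A fairness check then compares these quantities between the protected and
# the unprotected subgroup, e.g. for a boolean mask `protected`:
def absolute_gaps(y_true, y_pred, y_score, protected):
    a = group_rates(y_true[protected], y_pred[protected], y_score[protected])
    b = group_rates(y_true[~protected], y_pred[~protected], y_score[~protected])
    return {k: abs(a[k] - b[k]) for k in a}
```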
2.3 Dropout Prediction
Predicting students’ dropout is one of the major
research interests in educational data mining (Dalipi
et al. 2018; Luan and Tsai 2021). Many investigations
focus on student dropout in MOOCs and higher
education (Dalipi et al. 2018; Prenkaj et al. 2020).
The high dropout rates can either be student-related, MOOC-related, or both (Dalipi et al. 2018). Student-
related reasons for dropping out are especially lack of
motivation, but also lack of time or insufficient
background knowledge and skills (Dalipi et al. 2018).
Course design or lack of interactivity, as well as
hidden costs, are MOOC-related reasons for dropping
out (Dalipi et al. 2018). In contrast to MOOCs, dropout in higher education is defined differently, for example by learning behavior, by passing a course, or by earning a certificate (Sun et al. 2019).
The data for dropout prediction models is
mostly student engagement data, clickstream data
(Dalipi et al. 2018), or student behavior data (Jin
2020). Other features can describe student demographics, assignment grades, or social network analysis (Sun et al. 2019). Liang et al. (2016) define
the data flow in dropout prediction models in eight
stages. First, there is the user raw behavior log
containing raw data. After that, the data is cleaned and
pre-processed which results in a user table and action
table. Then, feature engineering is performed and, if
necessary, the predictors are labeled. The data is
subsequently split into training and test sets and a
binary classification model is tuned. As the last stage of the data flow, the predictive model is calculated.
Prenkaj et al. (2020) differentiate between two cases of dropout prediction: plain dropout formulation and recurrent dropout formulation. While the plain dropout formulation is independent in time, the recurrent dropout formulation uses information from previous phases/weeks to predict the dropout status of a student. Xing and Du (2019) define similar
categories which they call fixed-term and temporal
dropout prediction. Temporal models are modeled for
each week separately and use data only until the
current week. The advantage of this is that
developments during the course are considered and
interventions can be made each week. In temporal
dropout prediction, course activity features change
within each week whereas profile data or course data
features are static (Hagedoorn and Spanakis 2017).
The prediction strategy can be classified into
three categories: analytics examination, classic
learning methods, and deep learning (Prenkaj et al.
2020). Analytic examination describes the use of
basic statistics while classic learning methods include
traditional machine learning models. The most
commonly used machine learning algorithms in
student dropout prediction are logistic regression,
decision tree classifier, and support vector machines
(Dalipi et al. 2018). However, various models have
been implemented to tackle different purposes of
dropout prediction. Sun et al. (2019), for example,
used a temporal model based on a recurrent neural
network (RNN). This is advantageous because there
is no need for feature engineering as the clickstream
log data can be directly fed into the model. To perform a temporal prediction, Xing and Du's model is also built using a deep learning algorithm, which calculates dropout probability rates to prioritize interventions for at-risk students (Xing
and Du 2019). Wang et al. (2017) proposed a
combination of a convolutional neural network and
recurrent neural network for a dropout prediction
model to be able to skip the manual feature selection
process. Other researchers use AdaBoost (Hagedoorn
and Spanakis 2017), random forest (Del Bonifro et al.
2020), or survival analysis (Chen et al. 2018).
In their review, Shahiri et al. (2015) discussed important attributes for predicting student performance. One of the most frequent attributes is the cumulative grade point average (CGPA), which has been shown to be the most significant input variable in a coefficient correlation analysis (Shahiri et al. 2015). Internal assessments, for example in quizzes or assignments, are another valuable attribute for predicting student performance. Student demographic attributes include gender, age, family background, and disability and are also often used (Shahiri et al. 2015). Further attributes are extra-curricular activities, high school background, or the social interaction network (Shahiri et al. 2015).
3 RESEARCH QUESTIONS &
STUDY SETTING
Although there are many articles on dropout
prediction models, this paper explores some aspects
that have not been studied before. We investigate the
discriminatory potential of in-session dropout
prediction models, not classical MOOC dropout
predictions. Furthermore, we consider different
fairness metrics and compare different ML
implementations. We also consider rarely studied
demographic features, such as HLE and parental
education. Our research questions are therefore as
follows:
RQ1: How fair are in-session dropout prediction
models considering different fairness measures?
RQ2: What is the potential for discrimination by in-
session dropout prediction models for different
demographic groups?
RQ3: How do different ML implementations of
prediction models differ with respect to their
discrimination potential?
Our data is obtained from the platform
orthografietrainer.net, an online learning platform for
the acquisition of spelling skills of the German
language. The in-session dropout prediction model
was trained with learning process data from this
platform and predicts whether a user will end a
session early or not. A session is considered exited
early, if the session is left without completing the
assigned set of exercises.
The goal is to be able to intervene during the
processing of an assignment to support the user in
successfully completing the session. The evaluation
of the model in terms of fairness will look at different
demographic groups. To find out about the users’
demographic characteristics, a survey is carried out.
The prediction model is then applied to the different
groups and the results are examined with different
fairness metrics. In the following, the study is
described by the data set, the prediction model, and
the fairness evaluation method.
3.1 Data Set
The online learning platform orthografietrainer.net
offers exercises for spelling and grammar skills, for
example for capitalization, separate and compound
spelling, or comma placement. The target group
ranges from fifth grade to graduating classes, as well
as users from adult education or students at
university. The platform is mainly used by teachers to
assign homework to students, which can then be
solved on the platform. The users receive automated
corrections, and the teachers can then view an
evaluation. Due to the Covid-19 pandemic, access numbers have risen sharply, which indicates that the platform was used in distance learning formats.
The data set consists of 181,792 sessions from 52,032 users and all assignments were performed between the 1st of March and the 31st of April 2020.
To measure fairness, we use both variables that derive from the registration process, such as gender, and variables obtained in a survey that could be answered voluntarily by users of the platform. The
survey collects data on people's social background,
the importance of school grades, interests and
enjoyment of German lessons. It is automatically
displayed to each user three months after registration
on the platform. A total of 2,749 people took part in the survey from March to June 2020.
A peculiarity of the platform is the structure of the
exercises: If the teacher assigns an exercise to a class,
this task is displayed to the students as pending.
Exercises consist of 10 sentences devoted to a
specific orthographic area, for example,
capitalization. If a mistake is made while working on
the task, new variations of the exercise sentences are
added, and the task expands dynamically. Before a
session is finished successfully, all previously wrong
sentences are displayed at the end of the session
again. As a result, exercises can consist of 60 or more
sentences if many mistakes are made. Consequently, a session consists of at least ten sentences if the assignment is finished without mistakes. It can also be finished successfully with more than ten sentences if each previously wrong sentence is answered correctly later and the variations of this sentence are answered correctly too. Figure 1 shows the count of sentences and their sentence number. Some sessions have been exited before the tenth sentence and are thus unsuccessful. There is also a drop after the tenth sentence, which shows all the sessions that are successfully ended without any mistakes.
Figure 1: Count of Sentences and Sentence Numbers.
3.2 In-session Dropout Prediction
Model
As stated above, the platform orthografietrainer.net is
mostly used in blended-classroom scenarios, for
example by assigning homework on the platform.
Traditional dropout prediction models cannot be applied in this case, as the setting is different: instead of a self-contained course, individual homework assignments and exercises are carried out on online platforms that accompany school lessons. A course dropout as in MOOCs or in higher education courses is therefore not directly transferable. To deal with this
scenario, an in-session dropout prediction model is
applied. This is a temporal prediction model that
predicts for each exercise sentence of an assignment
whether a user will leave the session early or not.
Instead of a course dropout, an early exit of the
session is thus predicted. For this purpose, different machine learning models have already been tested in previous studies to obtain a termination prediction within a session (Rzepka et al. 2022). The temporal dropout prediction model includes assignment and user features which are either obtained directly from the platform's log data or calculated in the feature engineering process. The model uses, on the one hand, demographic attributes such as gender, class level, or user group and, on the other hand, assignment features such as the count of correctly processed tasks, the field of grammar, the difficulty of the sentences, or the count of pending tasks. To construct a temporal dropout prediction, matrices are defined which include only the tasks processed up to the current sentence position. Consequently, the prediction is re-run after each sentence is processed and improves as the number of sentences increases.
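As an illustration of this construction, the following sketch builds such a cumulative feature matrix for a single sentence position from a per-sentence log table; the column names (session_id, sentence_pos, correct, difficulty, gender, class_level, user_group) are hypothetical stand-ins and do not reflect the platform's actual schema.

```python
# Sketch only: cumulative feature matrix x_i for one sentence position i,
# built from a hypothetical per-sentence log table.
import pandas as pd

def build_temporal_matrix(log: pd.DataFrame, position: int) -> pd.DataFrame:
    """Aggregate, per session, only the tasks processed up to `position`."""
    seen = log[log["sentence_pos"] <= position]
    dynamic = seen.groupby("session_id").agg(
        n_correct=("correct", "sum"),           # count of correctly processed tasks
        n_processed=("sentence_pos", "count"),  # tasks processed so far
        mean_difficulty=("difficulty", "mean"), # difficulty of the sentences seen
    )
    # Static attributes (e.g. gender, class level, user group) do not change
    # within a session and are joined once per session.
    static = log.groupby("session_id")[["gender", "class_level", "user_group"]].first()
    return dynamic.join(static)
```

Re-running the prediction after each processed sentence then corresponds to calling such a function for increasing values of the sentence position.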
As a result, the Deep Learning model (DL) showed the highest accuracy of up to 87%. This is followed by the Decision Tree (DT) with a maximum value of up to 85%. Furthermore, k-nearest neighbor (KNN) and logistic regression (LogReg) were tested but showed lower accuracies (Figure 2). The F1-score shows the best results for the deep learning model, followed by the decision tree classifier. Lower scores are calculated for KNN and logistic regression. All models improve strongly during the first ten sentences and flatten out afterwards. For the subsequent calculations regarding fairness, the best models (DL and DT) and one of the weaker models (KNN) are considered.
Figure 2: Accuracy and F1-Score per Sentence and Model (DTE=Decision Tree Classifier, KNN=k-Nearest Neighbor, logreg=Logistic Regression, DL=Deep Learning).
3.3 Fairness Evaluation
Our aim is to evaluate the in-session dropout
prediction model regarding fairness. For our analysis,
we therefore examine the performance of the
predictive model on four variables: first spoken language, parents' education, number of books in the household, and gender.
The first language attribute describes the language
the user has learned first at home (mother tongue) and
is an indicator for migration background. Spelling
skills and children’s literacy as a whole have been
found to be linked to migration background and level
of education of parents (Carroll et al. 2019; Lee and
Burkam 2007; Steinlen and Piske 2013). In households where German is not the first language, children tend to have poorer results in assessments of language skills (Steinlen and Piske 2013). We split the data into students whose first language was German and students with any other first language.
Regarding the survey question about the parents' educational background, the users could answer 0, 1, or 2. In our study, we distinguish between users with at least one parent with a high school diploma and users whose parents have no high school diploma, and thus grouped the answers 1 and 2.
The third variable is the number of books in the
household, which is often part of questionnaires
measuring cultural capital (Noble and Davies 2009)
and is used here as an indicator of the home literacy environment (HLE). Several studies linked the HLE
to children’s literacy skills (Carroll et al. 2019;
Griffin and Morrison 1997; Sénéchal and LeFevre
2002). For the number of books in the household, four answers were possible in the survey: less than ten books, more than 10 and less than 50 books, more than 50 and less than 100, and more than 100. We split the users into two groups, one having less than 10 books in their household and the other having more than 100. For this attribute, we deliberately consider only the edge cases, as we have seen that the differences are otherwise too small and the number of books in the household is only estimated by the students.
The last attribute describes the user’s gender,
which can be male or female. This variable is not
obtained by the survey, but during the registration
process. Moreover, the attribute is part of the training
data as well, while the other three attributes are not.
After splitting the user groups according to the
variables as described above, we join them with the
learning process data of the respective users. To have
temporal predictions, we build matrices for each
sentence position. This results in 60 matrices
containing the information for the current sentence
and all previous ones. A matrix can thus be defined as
$x_i$, where $i$ describes the sentence position. We then predict the early exit of the sessions with the three pretrained models: the decision tree classifier, the deep learning model, and the k-nearest neighbor model. As these are temporal dropout prediction models, we obtain up to 60 predictions for each session and each model.
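The following sketch outlines this per-position evaluation loop under simplified assumptions: models maps model names to fitted scikit-learn-style estimators, labels holds the early-exit ground truth per session, and build_temporal_matrix is the helper sketched in Section 3.2; none of these names stem from the actual implementation.

```python
# Sketch only: per-position predictions with the three pretrained classifiers.
# Features in x_i are assumed to be numerically encoded for the estimators.
def predict_per_position(log, labels, models, max_position=60):
    results = {}  # (model_name, position) -> (y_true, y_pred, y_score)
    for position in range(1, max_position + 1):
        x_i = build_temporal_matrix(log, position)   # features up to sentence i
        y_i = labels.loc[x_i.index]                  # early-exit label per session
        for name, model in models.items():           # e.g. "DTE", "DL", "KNN"
            y_pred = model.predict(x_i)
            y_score = model.predict_proba(x_i)[:, 1] # probability of early exit
            results[(name, position)] = (y_i, y_pred, y_score)
    return results
```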
To evaluate model performance, we use the metrics Predictive Parity (PP), Equal Opportunity (EO), Predictive Equality (PE), and Slicing Analysis (SA), which are described in Section 2. It should be noted that PP and SA are interpreted differently than EO and PE. As PP and SA rely on precision and AUC, higher values are better. EO and PE, on the other hand, rely on FNR and FPR, and therefore lower values are better. To be able to compare results, we specify the directions in which the fairness measures are calculated so that in all cases, a positive outcome is to the benefit of the advantaged group and a negative outcome is to the benefit of the protected group.
The protected groups are male (gender), other
than German (first language), no high school diploma
(parental education), less than 10 books (books in the
household). PP and SA are calculated by
$\text{not protected} - \text{protected}$ (7)
EO and PE, on the other hand, are calculated the opposite way:
$\text{protected} - \text{not protected}$ (8)
It cannot be expected that there are no differences in model quality between groups, so it is important to define thresholds that delineate fair from unfair. Different thresholds were defined in previous studies, such as 0,04 for equal opportunity by (Chouldechova 2017), 0,05 for predictive parity and 0,02 for slicing analysis by (Riazy and Simbeck 2019), and 0,01 as well as 0,03 for slicing analysis by (Gardner et al. 2019). We set the thresholds to $Ths_{lower} = |0,03|$ (lower threshold) and $Ths_{higher} = |0,05|$ (higher threshold).
As the temporal dropout prediction model calculates predictions for each of the up to 60 sentences, we also have up to 60 results per model and fairness measure. To be able to interpret the results more easily, we calculate the mean over each block of ten sentences, leading to six results per model and fairness measure.
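As a concrete illustration, the following sketch computes the signed gaps per sentence position according to (7) and (8) and averages them over the ten-sentence blocks used in Section 4; group_rates refers to the helper sketched in Section 2.2, the thresholds correspond to the values defined above, and all other names are illustrative.

```python
# Sketch only: signed fairness gaps (Eqs. 7 and 8) and ten-sentence means.
# group_rates is the per-group helper from the earlier sketch in Section 2.2.
import numpy as np

THS_LOWER, THS_HIGHER = 0.03, 0.05  # the |0,03| and |0,05| thresholds from above

def signed_gaps(y_true, y_pred, y_score, protected):
    prot = group_rates(y_true[protected], y_pred[protected], y_score[protected])
    rest = group_rates(y_true[~protected], y_pred[~protected], y_score[~protected])
    return {
        "PP": rest["PPV"] - prot["PPV"],  # not protected - protected, Eq. (7)
        "SA": rest["AUC"] - prot["AUC"],  # not protected - protected, Eq. (7)
        "EO": prot["FNR"] - rest["FNR"],  # protected - not protected, Eq. (8)
        "PE": prot["FPR"] - rest["FPR"],  # protected - not protected, Eq. (8)
    }

def blockwise_means(gaps_by_position):
    """Average per-position gaps over the six ten-sentence blocks of Section 4."""
    blocks = [(2, 9), (10, 19), (20, 29), (30, 39), (40, 49), (50, 60)]
    out = {}
    for lo, hi in blocks:
        in_block = [gaps_by_position[p] for p in gaps_by_position if lo <= p <= hi]
        out[(lo, hi)] = {m: float(np.mean([g[m] for g in in_block]))
                         for m in ("EO", "PE", "PP", "SA")}
    return out

def exceeds_threshold(gap, threshold=THS_LOWER):
    """A measure is flagged once its absolute gap exceeds the chosen threshold."""
    return abs(gap) > threshold
```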
4 RESULTS
In the following, we will present the results regarding
the model fairness for first language vs. second
language learners, users with high/low parental
education, users with high/low number of books in
the household, and gender. The attribute first
language (Table 1) is fair according to the metrics
predictive parity, equal opportunity, and slicing
analysis. All three measures remain below the
thresholds of $Ths_{lower} = |0,03|$ and $Ths_{higher} = |0,05|$
for each model. Predictive equality, however, shows
values lower than -0,03 and, for the decision tree and KNN models, even lower than -0,05. As the protected group is defined as users whose first language is not German, the model quality is slightly better for learners with German as a second language, as fewer false-positive dropout predictions are found for them.
Parents' education (Table 2) is similar regarding predictive parity and equal opportunity, as they remain below the threshold. Predictive equality and slicing analysis, on the other hand, show disparities above the threshold of 0,05 in the last ten sentences. As the protected group is defined as users whose parents have no high school diploma, the model quality is slightly higher for users with at least one parent with higher education, as fewer false positives are encountered for them. Moreover, in the deep learning model, inequalities start earlier, already as of the 20th sentence.
Table 1: Results of the metrics Predictive Parity (PP), Equal Opportunity (EO), Predictive Equality (PE), and Slicing Analysis (SA) for the attribute first language. Models: DL=Deep Learning, DTE=Decision Tree Classifier, KNN=k-Nearest Neighbor.

             EO                        PE                        PP                        SA
Sentence     DL      DTE     KNN       DL      DTE     KNN       DL      DTE     KNN       DL      DTE     KNN
2 to 9       -0,032  -0,028  -0,028    -0,042  -0,041  -0,041    0,010   0,013   0,013     0,002   -0,007  -0,007
10 to 19     -0,024  -0,025  -0,025    -0,050  -0,058  -0,058    0,004   0,003   0,003     -0,001  -0,017  -0,017
20 to 29     -0,024  -0,026  -0,026    -0,034  -0,057  -0,057    0,006   0,005   0,005     0,003   -0,016  -0,016
30 to 39     -0,022  -0,016  -0,016    -0,036  -0,030  -0,030    0,007   0,009   0,009     0,007   -0,007  -0,007
40 to 49     -0,019  -0,014  -0,014    -0,044  -0,046  -0,046    0,008   0,008   0,008     0,004   -0,016  -0,016
50 to 60     -0,016  -0,014  -0,014    -0,019  -0,041  -0,046    0,014   0,014   0,014     0,006   -0,013  -0,016
Table 2: Results of the metrics Predictive Parity (PP), Equal Opportunity (EO), Predictive Equality (PE), and Slicing Analysis (SA) for the attribute parents' education. Models: DL=Deep Learning, DTE=Decision Tree Classifier, KNN=k-Nearest Neighbor.

             EO                        PE                        PP                        SA
Sentence     DL      DTE     KNN       DL      DTE     KNN       DL      DTE     KNN       DL      DTE     KNN
2 to 9       -0,020  -0,022  -0,022    -0,002  -0,007  -0,007    0,002   0,002   0,002     0,009   0,008   0,008
10 to 19     -0,013  -0,022  -0,022    -0,029  -0,030  -0,030    -0,001  -0,001  -0,001    0,007   -0,004  -0,004
20 to 29     -0,005  -0,010  -0,010    0,030   0,018   0,018     0,004   0,004   0,004     0,005   0,014   0,014
30 to 39     -0,005  -0,004  -0,004    0,047   -0,010  -0,010    0,007   0,004   0,004     0,004   -0,003  -0,003
40 to 49     -0,004  -0,005  -0,005    0,044   -0,009  -0,009    0,009   0,006   0,006     -0,007  -0,002  -0,002
50 to 60     -0,026  -0,021  -0,021    0,133   0,111   0,111     0,016   0,015   0,015     0,058   0,066   0,066
Table 3: Results of the metrics Predictive Parity (PP), Equal Opportunity (EO), Predictive Equality (PE), and Slicing Analysis (SA) for the attribute number of books in household. Models: DL=Deep Learning, DTE=Decision Tree Classifier, KNN=k-Nearest Neighbor.

             EO                        PE                        PP                        SA
Sentence     DL      DTE     KNN       DL      DTE     KNN       DL      DTE     KNN       DL      DTE     KNN
2 to 9       -0,065  -0,051  -0,051    -0,102  -0,122  -0,122    0,011   0,010   0,010     -0,008  -0,035  -0,035
10 to 19     -0,046  -0,050  -0,050    -0,133  -0,147  -0,147    0,006   0,002   0,002     -0,005  -0,049  -0,049
20 to 29     -0,042  -0,044  -0,044    -0,079  -0,107  -0,107    0,012   0,010   0,001     -0,003  -0,032  -0,032
30 to 39     -0,034  -0,029  -0,029    -0,067  -0,103  -0,103    0,010   0,009   0,009     0,003   -0,037  -0,037
40 to 49     -0,031  -0,017  -0,017    -0,163  -0,119  -0,119    0,001   0,007   0,007     -0,041  -0,051  -0,051
50 to 60     -0,045  -0,043  -0,043    -0,213  -0,090  -0,097    0,006   0,022   0,021     -0,076  -0,024  -0,027
Table 4: Results of the metrics Predictive Parity (PP), Equal Opportunity (EO), Predictive Equality (PE), and Slicing Analysis (SA) for the attribute gender. Models: DL=Deep Learning, DTE=Decision Tree Classifier, KNN=k-Nearest Neighbor.

             EO                        PE                        PP                        SA
Sentence     DL      DTE     KNN       DL      DTE     KNN       DL      DTE     KNN       DL      DTE     KNN
2 to 9       -0,022  -0,029  -0,029    -0,036  -0,049  -0,049    0,008   0,008   0,008     -0,001  -0,010  -0,010
10 to 19     -0,011  -0,026  -0,026    0,003   -0,018  -0,018    0,009   0,007   0,007     0,007   0,004   0,005
20 to 29     -0,009  -0,021  -0,021    0,017   -0,011  -0,011    0,009   0,007   0,007     0,007   0,005   0,005
30 to 39     -0,003  -0,009  -0,009    0,008   -0,006  -0,005    0,006   0,006   0,006     0,009   0,001   0,002
40 to 49     -0,006  -0,006  -0,005    -0,079  -0,109  -0,109    -0,001  -0,003  -0,003    -0,003  -0,052  -0,052
50 to 60     0,001   0,003   0,003     -0,149  -0,155  -0,160    -0,003  -0,003  -0,003    -0,031  -0,079  -0,081
For the number of books in the household (Table 3), we only consider users with less than ten books and more than 100 books. Here, predictive parity shows no inequalities. Equal opportunity, predictive equality, and slicing analysis show values less than -0,03 and -0,05, which suggests that the model quality is better for users with less than ten books. Specifically, for learners from households with many books, more false positives and false negatives are encountered, as well as a lower AUC value.
Predictive parity and equal opportunity again show only values below the threshold for the gender attribute (Table 4). Predictive equality and slicing analysis, in contrast, are lower than -0,05 for the last 20 sentences. This means that the models are of higher quality for boys.
When distinguishing between the different
machine learning models, we see few differences.
Most of the time, either all three values are above the
threshold or they are not.
5 CONCLUSION AND OUTLOOK
In our study, we evaluated an in-session dropout
prediction model regarding fairness for four different
variables which are known to have impact on
children’s literacy skills. We explored the following
research questions:
RQ1: How fair are in-session dropout prediction
models considering different fairness measures?
RQ2: What is the potential for discrimination for
different demographic groups?
RQ3: How do different ML implementations of
prediction models differ with respect to their
discrimination potential?
The results with regard to research questions 1 and 2 are mixed. The fairness measure of predictive parity never exceeds the threshold. This means that the probability for a predicted early termination to truly terminate the session early is equal for protected and unprotected groups. The same holds true for equal opportunity, except for the HLE attribute. The probability that a session termination is incorrectly predicted to be a successful session is higher for users with more than 100 books in the household. If the model outcome leads to interventions, these users would not get the intervention they need, as the model would predict them to successfully finish the session.
In contrast, predictive equality is above the threshold for all attributes. This means that a successfully finished session is incorrectly predicted as an early termination. In an educational setting, this leads to interventions for users who would not need them. Depending on the intervention, this can hinder users from achieving high levels of learning, for example, if tasks are adjusted based on the prediction. Slicing analysis is also often above the threshold.
The results have shown that the fairness of a
model is assessed differently by various definitions.
It is a matter of context which definitions should be
used, and which are more important than others. In an educational setting, users who do not receive help and interventions although they need them are as badly off as users who do not reach high levels of learning although they would be able to accomplish it.
With regard to research question 3, we see that different implementations do not make much difference in the evaluation in terms of fairness.
However, interesting correlations emerge in the
temporal analysis. The longer the session, the more
data about the user is available, the better the
accuracy, but at the same time, the fairness decreases.
This may indicate that the additional data in longer
sessions can be used to improve the prediction, but at
the same time may also have a discriminatory
influence. This results in a trade-off between accuracy and fairness. This is particularly evident for the
variables parental education background and gender.
Another interesting finding is that discrimination
is higher for variables that are not balanced (such as
number of books in the household). The attribute
gender, on the other hand, is fairly well balanced and
also has the lowest disparities.
Overall, our work shows that in-session prediction models can be discriminatory. However, this largely depends on three factors: first, different metrics produce different results; second, different demographic subgroups can be found in user groups, which can be affected by discrimination to different degrees; and third, the training data (in our case, the number of sentences) has an influence, and fairness decreases with higher model accuracy. In our study, the ML implementation did not affect the fairness of the model.
Our study comes with several limitations which need to be considered in further interpretations. First, the survey providing three of the four attributes is answered voluntarily by users. This results in a selection bias, as we only investigate data of users who were willing and motivated enough to answer the survey. Secondly, the variable gender was part of the model's training process while the other attributes were not. Last, the attributes parental education and number of books in the household are grouped into two categories, although more than two answers were possible in the survey.
Our research has shown that temporal dropout prediction, even in in-session scenarios, is at risk of discriminating against different groups. We see three factors that affect the fairness of the model: 1) different fairness metrics, 2) demographic groups, and 3) different training data. We therefore suggest always evaluating predictive models using several measures and placing the results into context. Furthermore, our analysis has shown that it is important not only to evaluate discrimination with regard to gender or migration background but to extend the examination to variables that are known to have an impact on the educational path, such as parental education or HLE.
Further research should consider ways to address the
disparities through pre-, in-, or post-processing methods.
Our work looks at the evaluation of an online platform specifically for teaching German. Nevertheless, our research can be transferred to other subjects. Platforms that are used in a school context are, like Orthografietrainer.net, used in particular for assigning homework. However, MOOC dropout predictions cannot simply be transferred to this setting, and the use of our approach is recommended. Again, it is important to look at multiple measures, as improving one definition of fairness can lead to a worsening of another.
ACKNOWLEDGEMENTS
This research was funded by the Federal Ministry of
Education and Research of Germany in the
framework “Digitalisierung im Bildungsbereich”
(project number 01JD1812A).
REFERENCES
Angwin, J.; Larson, J.; Mattu, S.; Kirchner, L. (2016):
Machine bias: There’s software used across the country
to predict future criminals. And it’s biased against
blacks. In ProPublica 2016. Available online at
www.propublica.org/article/machine-bias-risk-
assessments-in-criminal-sentencing.
Baker, R. S.; Hawn, A. (2021): Algorithmic Bias in
Education. In Int J Artif Intell Educ, pp. 1–41. DOI:
10.1007/s40593-021-00285-9.
Carroll, J. M.; Holliman, A. J.; Weir, F.; Baroody, A. E.
(2019): Literacy interest, home literacy environment
and emergent literacy skills in preschoolers. In Journal
of Research in Reading 42 (1), pp. 150–161. DOI:
10.1111/1467-9817.12255.
Chen, Y.; Johri, A.; Rangwala, H. (2018): Running out of
STEM. In Proceedings of the 8th International
Conference on Learning Analytics and Knowledge.
LAK '18: ACM, pp. 270–279.
Chouldechova, A. (2017): Fair Prediction with Disparate
Impact: A Study of Bias in Recidivism Prediction
Instruments. In Big Data 5 (2), pp. 153–163.
Dalipi, F.; Imran, A. S.; Kastrati, Z. (2018): MOOC
Dropout Prediction Using Machine Learning
Techniques: Review and Research Challenges. In IEEE
Global Engineering Education Conference.
Del Bonifro, F.; Gabbrielli, M.; Lisanti, G.; Zingaro, S. P.
(2020): Student Dropout Prediction. In Artificial
Intelligence in Education. 21st International
Conference, AIED 2020, pp. 129–140.
Gardner, J.; Brooks, C.; Baker, R. (2019): Evaluating the
Fairness of Predictive Student Models Through Slicing
Analysis. In Proceedings of the 9th International
Conference on Learning Analytics & Knowledge.
ACM.
Griffin, E. A.; Morrison, F. J. (1997): The Unique
Contribution of Home Literacy Environment to
Differences in Early Literacy Skills 1. In Early Child
Development and Care 127 (1), pp. 233–243. DOI:
10.1080/0300443971270119.
Hagedoorn, T. R.; Spanakis, G. (2017): Massive Open
Online Courses Temporal Profiling for Dropout
Fairness of In-session Dropout Prediction
325
Prediction. In 2017 IEEE 29th International
Conference on Tools with Artificial Intelligence
(ICTAI), pp. 231–238.
Hajian, S.; Bonchi, F.; Castillo, C. (2016): Algorithmic
Bias. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and
Data Mining. KDD '16: ACM, pp. 2125–2126.
Hu, Q.; Rangwala, H. (2020): Towards Fair Educational
Data Mining: A Case Study on Detecting At-Risk
Students. In International Educational Data Mining
Society. Available online at https://eric.ed.gov/?id=
ed608050.
Jin, C. (2020): MOOC student dropout prediction model
based on learning behavior features and parameter
optimization. In Interactive Learning Environments,
pp. 1–19. DOI: 10.1080/10494820.2020.1802300.
Kizilcec, R. F.; Lee, H. (2020): Algorithmic Fairness in
Education. Forthcoming in W. Holmes & K. Porayska-
Pomsta (Eds.), Ethics in Artificial Intelligence in
Education, Taylor & Francis.
Klare, B. F.; Burge, M. J.; Klontz, J. C.; Vorder Bruegge,
R. W.; Jain, A. K. (2012): Face Recognition
Performance: Role of Demographic Information. In
IEEE Trans.Inform.Forensic Secur. 7 (6), pp. 1789–
1801. DOI: 10.1109/TIFS.2012.2214212.
Lee, V. E.; Burkam, D. T. (2007): Inequality at the starting
gate. Social background differences in achievement as
children begin school. 3. print. Washington, DC:
Economic Policy Institute.
Liang, J.; Yang, J.; Wu, Y.; Li, C.; Zheng, L. (2016): Big
Data Application in Education: Dropout Prediction in
Edx MOOCs. In 2016 IEEE Second International
Conference on Multimedia Big Data (BigMM), pp.
440–443.
Luan, H.; Tsai, C. (2021): A Review of Using Machine
Learning Approaches for Precision Education. In
Educational Technology & Society 24 (1), pp. 250–266.
Available online at https://www.jstor.org/stable/
26977871.
Merriam-Webster Dictionary (2021a): Bias. Available
online at https://www.merriam-webster.com/
dictionary/bias.
Merriam-Webster Dictionary (2021b): Fairness. Available
online at https://www.merriam-webster.com/
dictionary/fairness.
Mitchell, S.; Potash, E.; Barocas, S.; D'Amour, A.; Lum, K.
(2021): Algorithmic Fairness: Choices, Assumptions,
and Definitions. In Annu. Rev. Stat. Appl. 8 (1), pp.
141–163.
Tasnim, N.; Paul, M. K.; Sattar, A. H. M. S. (2019):
Identification of Drop Out Students Using Educational
Data Mining. In 2019 International Conference on
Electrical, Computer and Communication Engineering
(ECCE), pp. 1–5.
Noble, J.; Davies, P. (2009): Cultural capital as an
explanation of variation in participation in higher
education. In British Journal of Sociology of Education
30 (5), pp. 591–605.
Okubo, F.; Yamashita, T.; Shimada, A.; Ogata, H. (2017):
A neural network approach for students' performance
prediction. In Proceedings of the Seventh International
Learning Analytics & Knowledge Conference. ACM.
O'Neil, C. (2016): Weapons of Math Destruction. How big
data increases inequality and threatens democracy. New
York: Crown/Archetype (ProQuest Ebook Central).
Pessach, D.; Shmueli, E. (2020): Algorithmic Fairness.
arXiv preprint arXiv:2001.09784.
Prates, M. O. R.; Avelar, P. H.; Lamb, L. C. (2020):
Assessing gender bias in machine translation: a case
study with Google Translate. In Neural Comput &
Applic 32 (10), pp. 6363–6381. DOI: 10.1007/s00521-
019-04144-6.
Prenkaj, B.; Velardi, P.; Stilo, G.; Distante, D.; Faralli, S.
(2020): A Survey of Machine Learning Approaches for
Student Dropout Prediction in Online Courses. In ACM
Comput. Surv. 53 (3), pp. 1–34. DOI:
10.1145/3388792.
Riazy, S.; Simbeck, K. (2019): Predictive Algorithms in
Learning Analytics and their Fairness. In DELFI 2019.
DOI: 10.18420/delfi2019_305.
Rzepka, N.; Simbeck, K.; Müller, H.-G.; Pinkwart, N.:
Keep It Up: In-Session Dropout Prediction to Support
Blended Classroom Scenarios. In Proceedings of the
14th International Conference on Computer Supported
Education (CSEDU 2022), 2022 (forthcoming).
Sénéchal, M.; LeFevre, J. (2002): Parental involvement in
the development of children's reading skill: a five-year
longitudinal study. In Child development 73 (2), pp.
445–460. DOI: 10.1111/1467-8624.00417.
Shahiri, A. M.; Husain, W.; Rashid, N. A. (2015): A
Review on Predicting Student's Performance Using
Data Mining Techniques. In Procedia Computer
Science 72, pp. 414–422.
Stapel, M.; Zheng, Z.; Pinkwart, N. (2016): An Ensemble
Method to Predict Student Performance in an Online
Math Learning Environment. In International
Educational Data Mining Society.
Steinlen, A. K.; Piske, T. (2013): Academic achievement of
children with and without migration backgrounds in an
immersion primary school: A pilot study. In Zeitschrift
für Anglistik und Amerikanistik 61 (3), pp. 215–244.
DOI: 10.1515/zaa-2013-0303.
Sun, D.; Mao, Y.; Du, J.; Xu, P.; Zheng, Q.; Sun, H. (2019):
Deep Learning for Dropout Prediction in MOOCs. In
2019 Eighth International Conference on Educational
Innovation through Technology (EITT), pp. 87–90.
Suresh, H.; Guttag, J. V. (2019): A Framework for
Understanding Sources of Harm throughout the
Machine Learning Life Cycle. In Equity and Access in
Algorithms, Mechanisms, and Optimization, (pp. 1-9).
Verma, S.; Rubin, J. (2018): Fairness definitions explained.
In Proceedings of the International Workshop on
Software Fairness. ACM.
Wang, W.; Yu, H.; Miao, C. (2017): Deep Model for
Dropout Prediction in MOOCs. In Proceedings of the
2nd International Conference on Crowd Science and
Engineering. ACM,
pp. 26–32.
Xing, W.; Du, D. (2019): Dropout Prediction in MOOCs:
Using Deep Learning for Personalized Intervention. In
Journal of Educational Computing Research 57 (3), pp.
547–570. DOI: 10.1177/0735633118757015.