Assess Performance Prediction Systems: Beyond Precision Indicators
Amal Ben Soussia, Chahrazed Labba, Azim Roussanaly and Anne Boyer
Université de Lorraine, LORIA, France
Keywords:
Earliness, Stability, Indicators, Learning Analytics, Machine Learning, K-12 Learners.
Abstract:
The high failure rate is a major concern in distance online education. In recent years, Performance Prediction Systems (PPS) based on different analytical methods have been proposed to predict at-risk of failure learners. One of the main studied characteristics of these systems is their ability to provide accurate early predictions. However, these systems are usually assessed using a set of evaluation measures (e.g., accuracy, precision) that do not reflect the precocity, continuity and evolution of the predictions over time. In this paper, we propose to enrich the existing indicators with time-dependent ones, including earliness and stability. Further, we use the Harmonic Mean to illustrate the trade-off between the earliness of the predictions and the accuracy. To validate the relevance of our indicators, we used them to compare four different PPS for predicting at-risk of failure learners. These systems are applied on real data of K-12 learners enrolled in an online physics-chemistry module.
1 INTRODUCTION
The use of distance online education has grown over the last few years, and has expanded further with the recent COVID-19 pandemic. It provides an effective way to maintain the continuity of the learning process by allowing access to school from anywhere at any time.
However, the major concern of this learning mode is the high failure rate among its learners. To address this issue, Performance Prediction Systems (PPS) based on Machine Learning (ML) models have been proposed (Wang et al., 2017; Iqbal et al., 2017; Zhang et al., 2017; Chui et al., 2020). The main objective of this type of system is to predict at-risk of failure learners accurately and as early as possible using a real-time data flow.
The data generated by online education platforms
are available progressively over time, as they are
highly dependent on the time at which learners inter-
act with the educational content. Thus, these data are
called time series, as they consist of a set of sequences
of events that occur over time. The time reference can
be a timestamp or any other finite-grain time interval (e.g., a week). To assess the effectiveness of PPS in fulfilling their objectives, common indicators are used. On the one hand, performance measures assess the time and space complexity of the systems with respect to the data used. On the other hand, ML indicators such as precision, recall and the F1 measure qualify the ability of the system to predict correctly. Despite the diversity of existing evaluation indicators, none of them is dedicated to assessing the precocity, continuity and evolution of predictions over time. However, time is an important dimension that needs to be considered when assessing a PPS, as both learning and prediction evolve.
In this paper, we focus on assessing PPS based on
time-series classifiers. To overcome the limitations
of existing common indicators, we focus on provid-
ing new time-dependent ones that can be used to as-
sess these systems over time. Two indicators, earliness and stability, are proposed. The temporal stability is the ability of a PPS to provide long sequences of correct predictions over the prediction times, whereas the earliness indicates the first time the PPS correctly predicts a given class label. A PPS is said to be effective if it predicts accurately and as early as possible. To this end, we use the Harmonic Mean (HM) to study the trade-off between the accuracy and earliness indicators, since they are inversely proportional.
To validate the relevance of these indicators, we
applied them to assess four different PPS for pre-
dicting at-risk of failure learners. These systems
use real data of K-12 learners enrolled in a physics-chemistry module within a French distance learning
center, the CNED (Centre National d'Enseignement à Distance).
To summarize, our contribution is twofold: 1) new time-dependent indicators, namely earliness and stability; and 2) a real case study to support the use of these indicators.
The rest of the paper is organized as follows: Section 2 presents the related work and discusses our contribution with respect to the state of the art. Section 3 introduces the problem formalization as well as the definitions of the proposed indicators. Section 4 describes our context and the evaluated PPS. Section 5 presents the conducted experiments and the results. Section 6 discusses the threats to validity. Section 7 concludes and introduces the perspectives of this work.
2 RELATED WORK
ML-based education systems, and especially the detection of learners at risk of failure and dropout, have gained momentum in recent years.
Static ML precision indicators are the most widely used to evaluate the performance of these systems (Hu et al., 2014). (Bañeres et al., 2020) proposed a model based on students' grades to predict the likelihood of failing a course, and evaluated the performance of the model using the accuracy metric. The
main goal of (Lee and Chung, 2019) was to improve
the performance of a dropout early warning system.
For this aim, the trained classifiers, including Ran-
dom Forest and boosted Decision Tree, were evalu-
ated with both the Receiver Operating Characteristic
(ROC) and Precision-Recall (PR) curves. Based on an
ensemble model using a combination of relevant ML
algorithms, (Karalar et al., 2021) aimed to identify students at risk of academic failure during the pandemic. To predict students with academic risks more accurately, the authors relied on the specificity measure to evaluate the performance of the ensemble method. The goal of (Adnan et al., 2021) was to identify the best model for analyzing the problems faced by at-risk learners enrolled in an online university. The performance of the various trained ML algorithms was evaluated using the accuracy, precision, recall, support and F-score metrics; Random Forest was the model with the best results.
Predicting at-risk learners at the earliest is one of
the main topics in the Learning Analytics (LA) field.
(Hlosta et al., 2017) introduced a novel approach,
based on the importance of the first assessment, for
the early identification of at-risk learners. The key
idea of this approach is that the learning patterns can
be extracted from the behavior of learners who have
submitted their assessment earlier. For the earliest
possible identification of students who are at-risk of
dropout during a course, (Adnan et al., 2021) divided
the course into 6 periods and then trained and tested
the performance of ML algorithms at different per-
centages of the course length. Results showed that
at 20% of the course length, the RF model was pro-
ducing promising results with 79% average precision
score. At 60% of the course length, the performance
of RF improved significantly. (Figueroa-Ca
˜
nas and
Sancho-Vinuesa, 2020) present a study for a simple
and interpretable procedure to identify dropout-prone
and fail-prone students before the halfway point of the
semester. The results showed that the main factor to
the final exam performance is continued learning ac-
quired during at least the first half of the course. The
work conducted within the Open University (OU) by
(Wolff et al., 2014) has proven that the first assess-
ment is a good predictor of a student’s final outcome.
To summarize, the existing research works rely
mainly on ML precision indicators to evaluate the per-
formance of PPS. Although the results given by these indicators are important for an overall assessment of ML projects, their use alone is not sufficient for systems that evolve over time. In fact, the common indicators do not consider the importance of the temporal evolution of the predictions. When dealing with
a time-continuous process, such as learning, the reg-
ular tracking of prediction results reveals the need for
other time-dependent indicators. Thus, in this work,
we consider earliness and stability indicators to pro-
vide a deeper evaluation of the PPS. Further, we pro-
pose to use the HM measure to establish a compro-
mise between both time-dependent and precision in-
dicators.
3 TIME-DEPENDENT
INDICATORS
In this section, we formally present the problem of
time series classification (Section 3.1) as well as the
new proposed indicators, including earliness (Sec-
tion 3.2) and stability (Section 3.3).
3.1 Problem Formalization
The objective is to predict the class of each student as early and as accurately as possible.
Assume $Y = \{C_1, C_2, \dots, C_m\}$ is the set of predefined class labels, determined using an existing training dataset. Let $S = (S_1, S_2, \dots, S_k)$ be the set of students in the test dataset $D_{test}$ and $T = \{t_1, t_2, \dots, t_c\}$ be the set of prediction times.

Figure 1: An example of regular tracking of predictions.
Each student $S_p \in S$, at a prediction time $t_i \in T$, is represented by a vector $X_{t_i} = \{f_1, f_2, \dots, f_n, C_j\}$, where $f_k \in \mathbb{R}$ is the $k^{th}$ feature and $C_j \in Y$ is the $j^{th}$ class label. The main objective is to determine, at each prediction time $t_i$, the right class for each student. As shown in Fig. 1, at each $t_i \in T$, each learner $S_p \in S$ is predicted to belong to a class $C_j \in Y$.
Fig. 1 presents an example of regular tracking of the predictions for a set of students $S = (S_1, S_2, S_3)$ over a time interval $T = \{t_1, t_2, t_3, t_4, t_5\}$. At each prediction time, each student is assigned a single class label in $Y = \{C_1, C_2, C_3\}$.
The quality of PPS based on classifiers can be measured by different indicators, but none of them determines their effectiveness along the temporal dimension. The regular tracking of prediction results motivates the identification of new time-dependent indicators, namely earliness and stability. Further, we need an additional measure that captures the trade-off between these time-dependent indicators and the accuracy, since the assignment of a class label should be both as early and as accurate as possible.
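To make the notation concrete, the following minimal Python sketch shows one way a regular tracking log can be represented: for each student, a sequence of (true label, predicted label) pairs, one per prediction time. The student names and labels are hypothetical and do not reproduce the content of Fig. 1.

```python
# Hypothetical weekly tracking log: for each student S_p, one pair
# (true label l_i, predicted label p_i) per prediction time t_i.
# The labels play the role of C_1, C_2, C_3; the values are illustrative.
tracking_log = {
    "S1": [("C1", "C2"), ("C1", "C1"), ("C1", "C1"), ("C2", "C1"), ("C2", "C2")],
    "S2": [("C1", "C1"), ("C1", "C1"), ("C1", "C1"), ("C1", "C1"), ("C1", "C1")],
    "S3": [("C3", "C3"), ("C3", "C3"), ("C2", "C3"), ("C2", "C2"), ("C2", "C2")],
}

# The pair observed for student S_p at prediction time t_i (1-based week i):
week, student = 3, "S2"
true_label, predicted_label = tracking_log[student][week - 1]
print(true_label, predicted_label)  # -> C1 C1
```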
3.2 Earliness
The objective of PPS has always been the early pre-
diction of the less performing learners. Early predic-
tion is commonly defined as the adequate and relevant
time allowing both an accurate prediction of learners
performances and effective tutor interventions with
at-risk learners (Bañeres et al., 2020).
Indeed, learners’ behavior is not stable and may
vary continuously over time, therefore the perfor-
mance of the model may change and evolve from one
prediction time to another. For these reasons, the ear-
liness measurement is significant in the evaluation of
an educational PPS as both learning and prediction
are time-evolving.
We propose Algorithm 1 to compute the earliness by class label. It takes as input the list of students ($S$), the set of class labels ($Y$) and the test data ($D_{test}$), and provides as output the earliness measures by class label ($E_Y$). The algorithm starts by iterating over the class labels $C_j \in Y$ (Line 1). For each $C_j$, it initializes two variables, $Early_{tot}$ and $count_S$, which respectively accumulate the sum of the first correct prediction times and the number of students who were assigned at least once to the class $C_j$ (Lines 2-3). Then, the algorithm iterates over the set of students $S_k \in S$ (Line 4). It first verifies whether $S_k$ has been assigned at least once to the class $C_j$ in question (Line 5). If so, the algorithm searches for the first time $S_k$ has been labeled as $C_j$ (Line 6) as well as the first time $S_k$ is correctly predicted in $C_j$ (Line 7). The first correct prediction time is then calculated (Line 8) and both $Early_{tot}$ and $count_S$ are updated (Lines 9-10). The earliness $Early_{C_j}$ corresponding to the class label $C_j$ is determined at Line 13. Then, before iterating on the next class label, the measured earliness for $C_j$ is saved in $E_Y$ (Line 14).
As an example of the earliness calculation, we refer to Fig. 1. The set $S$ is composed of three students $S_1$, $S_2$ and $S_3$. At each $t_i$, a student has a true class label ($l_i$) and a predicted label ($p_i$). Assume $t_i$ represents a week and that we need to calculate the earliness with respect to class $C_1$. The first correctly predicted label for each of the three students is represented by the green box. The student $S_3$ is not considered, since he/she has never been labeled as $C_1$. By applying Algorithm 1, the earliness measures for $S_1$ and $S_2$ are equal to 2 and 1, respectively. Thus, the earliness for class label $C_1$ is 1.5 (3/2). In other words, the system is able to correctly predict the class label $C_1$ on average 1.5 weeks after a student is first labeled as $C_1$.
For a PPS, the earlier the predictions are correct,
the better the system is.
However, while improving the earliness indicator outcomes, the results of ML precision indicators, including the accuracy, have to remain high enough to provide stakeholders with predictions that are as accurate as possible. For this aim, we propose to follow, along with the earliness indicator, the Harmonic Mean (HM), which is a measure of central tendency used when an average ratio is needed. The HM highlights the inverse proportionality between two variables.
Algorithm 1: Earliness Algorithm.
Require: S, Y, D_test
Ensure: E_Y
1:  for each C_j in Y do
2:    Early_tot <- 0
3:    count_S <- 0
4:    for each S_k in S do
5:      if assigned(S_k, C_j) == True then
6:        t_0 <- First_Labeled(S_k, C_j)
7:        t_1 <- First_Predicted(S_k, C_j)
8:        Early_{S_k, C_j} <- t_1 - t_0 + 1
9:        Early_tot <- Early_tot + Early_{S_k, C_j}
10:       count_S <- count_S + 1
11:     end if
12:   end for
13:   Early_{C_j} <- Early_tot / count_S
14:   E_Y <- put(C_j, Early_{C_j})
15: end for
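A minimal Python sketch of Algorithm 1 follows, assuming the tracking log is stored as per-student lists of (true label, predicted label) pairs indexed by week (starting at week 1). The helper names are illustrative; students assigned to a class but never correctly predicted in it are skipped here, an assumption Algorithm 1 leaves implicit.

```python
def earliness_by_class(tracking_log, class_labels):
    """For each class C_j, average over students of
    (first week correctly predicted in C_j) - (first week labeled C_j) + 1."""
    earliness = {}
    for c in class_labels:                                   # Line 1
        early_tot, count_s = 0, 0                            # Lines 2-3
        for weeks in tracking_log.values():                  # Line 4
            labeled = [i for i, (true, _) in enumerate(weeks, 1) if true == c]
            if not labeled:                                  # Line 5
                continue
            correct = [i for i, (true, pred) in enumerate(weeks, 1)
                       if true == c and pred == c]
            if not correct:   # assumption: skip students never predicted in C_j
                continue
            t0, t1 = labeled[0], correct[0]                  # Lines 6-7
            early_tot += t1 - t0 + 1                         # Lines 8-9
            count_s += 1                                     # Line 10
        if count_s:
            earliness[c] = early_tot / count_s               # Lines 13-14
    return earliness

# Toy usage: S1 is first labeled C1 in week 1 but first predicted correctly in
# week 2 (earliness 2); S2 is correct from week 1 (earliness 1) -> average 1.5.
log = {"S1": [("C1", "C2"), ("C1", "C1")], "S2": [("C1", "C1"), ("C1", "C1")]}
print(earliness_by_class(log, ["C1"]))  # {'C1': 1.5}
```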
The HM measure has already been used in (Schäfer and Leser, 2020) to investigate the relation between accuracy and earliness for the early classification of electronic signals. Like them, we use the HM measure, but this time to determine the relationship between earliness and accuracy in a completely different domain. To the best of our knowledge, the HM has never been used in the education domain to express the ability of a PPS to provide accurate predictions at the earliest. The HM measure is defined as follows:

$$HM = \frac{2 \times (1 - earliness) \times accuracy}{(1 - earliness) + accuracy} \qquad (1)$$
The higher the HM, the better the system is at providing accurate early predictions.
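To make the trade-off concrete, here is a minimal sketch of Equation 1, with earliness and accuracy both expressed as fractions in [0, 1]; the input values below are illustrative.

```python
def harmonic_mean(earliness, accuracy):
    """Equation 1: (1 - earliness) rewards early predictions (lower earliness
    is better), accuracy rewards correct ones; both are fractions in [0, 1]."""
    denominator = (1 - earliness) + accuracy
    if denominator == 0:
        return 0.0
    return 2 * (1 - earliness) * accuracy / denominator

# A system predicting early (earliness 10%) and accurately (accuracy 90%)
print(harmonic_mean(earliness=0.10, accuracy=0.90))  # 0.9
```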
3.3 Stability
Evaluating the performance of PPS based on the earli-
ness and stability indicators is of high interest to show
the evolution of predictions.
Given the changes in learners' behaviors throughout the learning period, the model could have a different performance at each prediction time, which makes the system unstable.
In the state of the art, stability is usually related to small changes in the system output when the training set is changed (Philipp et al., 2018). However, in our context, we are more interested in temporal stability, which refers to the capacity of a model to give the correct output over time when trained on the same dataset (Teinemaa et al., 2018). The temporal stability
characterizes the ability of the system to maintain the
same performance throughout the prediction times.
In the frame of this work, we define the temporal
stability as the average of the longest sequences of
successive correct predictions over time. The stability
is calculated using Equation 2:

$$Stability = \frac{\sum_{p=1}^{k} |h(S_p)|}{|D|} \qquad (2)$$

where $h : S \rightarrow T^n$ is a function that associates to each student in $D \subseteq D_{test}$ the longest sequence of successive correct predictions. This equation can be used to calculate the stability for a single class label or for the whole $D_{test}$ over a given time interval. As an example of the stability calculation, we refer to Fig. 1, where, for each student, the longest correct prediction sequence is marked by a red line.
The overall stability value up to prediction time $t_3$ is calculated as follows:

$$Stability = \frac{2 + 3}{3} = 1.66 \qquad (3)$$

whereas the stability value at prediction time $t_5$ is calculated as follows:

$$Stability = \frac{2 + 5 + 2}{3} = 3 \qquad (4)$$
Thus, the evaluated PPS shows an increasing stability over time, which allows it to be qualified as a stable system.
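The following is a minimal sketch of Equation 2, assuming for each student a boolean sequence marking whether the prediction at each week was correct. The sequences below are hypothetical, but their longest runs (2, 5 and 2) reproduce the worked example for $t_5$.

```python
def longest_correct_run(correct):
    """Length of the longest sequence of successive correct predictions."""
    best = run = 0
    for ok in correct:
        run = run + 1 if ok else 0
        best = max(best, run)
    return best

def stability(correct_by_student):
    """Equation 2: average over students of |h(S_p)|, the longest correct run."""
    runs = [longest_correct_run(seq) for seq in correct_by_student.values()]
    return sum(runs) / len(runs)

# Hypothetical correctness sequences with longest runs 2, 5 and 2:
correct = {
    "S1": [True, True, False, True, False],
    "S2": [True, True, True, True, True],
    "S3": [False, True, True, False, True],
}
print(stability(correct))  # (2 + 5 + 2) / 3 = 3.0
```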
4 PROOF OF CONCEPT:
COMPARISON OF FOUR PPS
To prove the effectiveness of our temporal indicators,
we used them to compare four existing PPS. The same
dataset has been used for the four systems. This sec-
tion presents the case study and introduces the evalu-
ated PPS.
4.1 Context and Dataset Description
Our case study concerns the K-12 learners enrolled in the physics-chemistry module during the 2017-2018 school year within the largest French center for distance education, the CNED (https://www.cned.fr/). CNED offers multiple
fully distance courses to a large number of physically
dispersed learners. In addition to the heterogeneity
of learners, learning is also quite specific as the reg-
istration remains open during the school year. Sub-
sequently, the start activity date $t_0$ could differ from one learner to another.
The objective is to track students' performance on a weekly basis to identify those at risk of failure. Thus, each prediction time $t_i \in T$ corresponds to a week.
Our dataset is composed of learning traces of 647
learners who followed the physics-chemistry module
for 37 weeks.
Learners are classified into three classes based on their obtained grade average, $Y = \{C_1, C_2, C_3\}$:
- Success ($C_1$): the grade average is strictly above 12.
- Medium risk of failure ($C_2$): the grade average is between 8 and 12.
- High risk of failure ($C_3$): the grade average is strictly below 8.
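As a minimal sketch, this class assignment rule can be written as follows; grade averages are assumed to be on the French 0-20 scale, and the handling of averages of exactly 8 or 12 (both mapped to $C_2$) is an assumption based on the "between 8 and 12" wording.

```python
def assign_class(grade_average):
    """Map a grade average (assumed 0-20 scale) to the study's class labels."""
    if grade_average > 12:
        return "C1"   # success: strictly above 12
    if grade_average >= 8:
        return "C2"   # medium risk: between 8 and 12 (boundaries assumed inclusive)
    return "C3"       # high risk: strictly below 8

print(assign_class(14.5), assign_class(10.0), assign_class(6.0))  # C1 C2 C3
```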
Each week, a student is represented by a set of
learning features and a class label. Based on previous work (Ben Soussia et al., 2021), these learning features include performance, engagement, regularity and reactivity.
The systems are compared using accuracy and stability over the entire learning period. Earliness, however, is evaluated over the first 12 weeks. The choice of this period is not arbitrary: according to existing work on earliness (Figueroa-Cañas and Sancho-Vinuesa, 2020), earliness is always targeted at the first weeks of a course.
4.2 Overview of the Evaluated PPS
To provide an example of application of the proposed time-dependent indicators, we used them to compare four different PPS ($PPS_1$, $PPS_2$, $PPS_3$, $PPS_4$). The first two systems ($PPS_1$, $PPS_2$) are based on the Random Forest (RF) model, while the last two ($PPS_3$, $PPS_4$) use an Artificial Neural Network (ANN) model.
$PPS_1$ and $PPS_3$ use all the learning features, including performance, engagement, regularity and reactivity, in addition to demographic data, to make weekly predictions, while $PPS_2$ and $PPS_4$ use only the engagement features to identify students at risk of failure. An additional difference between the systems is the way the class label is assigned over time. For $PPS_1$ and $PPS_2$, the predictions are made with respect to the learner's final class at the end of the year: each learner belongs to a single class over the year and the model tries to predict that final class as early as possible. For $PPS_3$ and $PPS_4$, the class label is dynamic and may change based on the student's performance. For example, a student may be in the success class for 3 successive weeks, but in the 4th week he/she may be assigned to a different class label due to fluctuations in performance. The model must therefore capture these changes from one week to the next in order to correctly predict the student's class label.
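As an illustration only (not the authors' actual pipeline), the following sketch shows how weekly predictions of this kind could be produced with a Random Forest: at each week, a model is trained on the data observed so far and predicts the current class of every student. Feature shapes and labels are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_students, n_weeks, n_features = 100, 12, 8

# Hypothetical weekly feature matrices and class labels (0=C1, 1=C2, 2=C3)
X = rng.normal(size=(n_weeks, n_students, n_features))
y = rng.integers(0, 3, size=(n_weeks, n_students))

weekly_predictions = {}
for week in range(1, n_weeks):
    # Train on every (student, week) observation seen before the current week,
    # then predict each student's class label for the current week.
    X_train = X[:week].reshape(-1, n_features)
    y_train = y[:week].reshape(-1)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    weekly_predictions[week] = model.predict(X[week])
```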
Figure 2: $PPS_1$ vs $PPS_2$ vs $PPS_3$ vs $PPS_4$ in terms of accuracy.
Fig. 2 presents the accuracy of the four systems $PPS_1$, $PPS_2$, $PPS_3$ and $PPS_4$ over the whole learning period. As shown, $PPS_1$ and $PPS_3$, which use all the features, perform much better in terms of weekly accuracy than $PPS_2$ and $PPS_4$, which use only the engagement features. $PPS_4$ has almost 91% accuracy at $t_1$; from week 1 to 6, its accuracy decreases, then it increases again starting from week 7 to reach a value of about 72% by the end. The accuracy of $PPS_2$ shows little variation, remains almost stable over the weeks and reaches a maximum value of about 73%. The accuracy of these systems ($PPS_2$, $PPS_4$) is not poor and shows that they perform reasonably well in general.

However, the precision indicators do not reflect the earliness of the systems with respect to all the classes, particularly the high and medium risk ones.

Thus, in Section 5, the four systems are assessed beyond the use of precision indicators, using the time-dependent indicators defined in Section 3.
5 EXPERIMENTAL RESULTS
In this section, we interpret the earliness, HM and stability results for the four systems $PPS_1$, $PPS_2$, $PPS_3$ and $PPS_4$.
5.1 Earliness and HM Results
Earliness is relevant especially for detecting high and medium risk students (classes $C_2$ and $C_3$). Thus, we present the results of earliness and HM by class label (see Table 1 and Table 2).
Comparing $PPS_1$ and $PPS_2$: for both systems, the dominant class is $C_1$ (success), which is predicted respectively at 9.5% (about 1.14 weeks) and 8.83% (about 1 week) of the fixed prediction time interval (12 weeks). The earliness and accuracy values are slightly different.
Table 1: Earliness and HM measurements - $PPS_1$ vs $PPS_2$.

          PPS_1                              PPS_2
          Earliness  Accuracy  HM            Earliness  Accuracy  HM
C_1       9.5%       92.63%    90.95%        8.83%      91%       91.29%
C_2       26.6%      9%        16.03%        0%         0%        0%
C_3       44.1%      46%       50.46%        0%         0%        0%

Table 2: Earliness and HM measurements - $PPS_3$ vs $PPS_4$.

          PPS_3                              PPS_4
          Earliness  Accuracy  HM            Earliness  Accuracy  HM
C_1       10.75%     78.94%    83.83%        37.85%     21.62%    25%
C_2       17.66%     55.75%    66.52%        0%         0%        0%
C_3       9.33%      100%      95.12%        8.5%       92.37%    90.65%
By referring to the HM measure, $PPS_2$ is better at predicting the learners of the success class at the earliest. However, $PPS_2$ is worse when it comes to detecting classes $C_2$ and $C_3$: over the first 12 weeks, the system has 0% HM for both $C_2$ and $C_3$.
Comparing $PPS_3$ and $PPS_4$: for both systems, the dominant class at the beginning of the school year is $C_3$, which is predicted respectively at 9.33% (about 1.11 weeks) and 8.5% (about 1 week). The earliness values are slightly different, but $PPS_3$ has higher accuracy and higher HM measures. Conventionally, for a good model, the accuracy increases with time. However, $PPS_4$ predicted $C_1$ later and less accurately than $PPS_3$. Further, despite predicting $C_3$ accurately and quite early, $PPS_4$ never predicted learners belonging to $C_2$ during the first 12 weeks. According to the HM measures, $PPS_3$ outperforms $PPS_4$ in predicting all three classes accurately and at the earliest.
To summarize, for the four systems, the dominant class is always predicted at the earliest with respect to the rest of the class labels. If we interpret, for example, the earliness rate for the class label $C_2$ for $PPS_2$ and $PPS_4$, one could think that the system was able to predict the class at the earliest: theoretically, 0% is the best earliness value when it comes with an accuracy of 100%. However, this is not the case here, since the accuracy and the HM are equal to zero. Thus, a 0% earliness means nothing unless we also investigate the accuracy and, consequently, the HM measure. These results prove the importance of feature selection in predicting students at risk of failure at the earliest. $PPS_2$ and $PPS_4$, which use only engagement features, are the best examples, as one or two of the medium and high risk classes are never detected. These results also prove that evaluating a PPS based on either accuracy or earliness alone is not pertinent enough to conclude on the performance of a classifier. The trade-off between both indicators, through the measurement of HM, gives a more precise insight into the PPS performance.
The earliness and HM can also be used to evaluate PPS adopting different classification approaches, such as $PPS_1$ and $PPS_3$. The first one performs the predictions with respect to the final class at the end of the school year, while the second one performs the predictions at a time $t$ with respect to the class label at time $t+1$. As shown in Tables 1 and 2, the HM measures show that $PPS_1$ is less early and less accurate than $PPS_3$ in predicting $C_2$ and $C_3$, knowing that they use the same test data. Determining which system is better than the other is beyond the scope of this paper, but we can conclude that the prediction approach can have an impact on the earliness of the system.
5.2 Stability Results
Stability is complementary to HM. The Table. 3
presents the stability measures for the four systems
over the first 12 weeks. The stability per class is more
pertinent for PPS
1
and PPS
2
than for PPS
3
and PPS
4
.
This can be explained by the fact that the former pre-
dict with respect to the final class labels while the lat-
ter predict with respect to the t+1 class labels. For
this reason, we consider to follow also the predictions
stability on the entire test dataset D
test
. The results
of D
test
stability are more coherent with the HM ones
and the overall stability is more pertinent as it reflects
the true stability of the PPS. As shown in Table 3, un-
til week 12, PPS
1
is more stable than PPS
2
in terms
of class stability and overall stability. Indeed, PPS
1
succeeded in predicting the class labels correctly and
build longer sequences of correct predictions. While,
when it comes to PPS
3
and PPS
4
, the overall stabil-
ity is more relevant, and it shows that PPS
3
is more
stable.
Unlike earliness, the stability of a PPS is more interesting when it is tracked throughout the learning period. Fig. 3 shows the evolution of the $D_{test}$ stability of the four systems throughout the school year.
Figure 3: Comparing $PPS_1$, $PPS_2$, $PPS_3$ and $PPS_4$ in terms of temporal stability.
The stability of $PPS_1$ increases slightly over time: it is about 70% and 76% at the first and last prediction weeks, respectively. The stability of $PPS_2$, in contrast, decreases over the weeks: it starts at about 64% and ends at 59%.

Both systems $PPS_3$ and $PPS_4$ start with high stability values (about 100% and 92%, respectively). This can be explained by the dominance of the class $C_3$, since at the beginning all learners are assigned to this class by default. However, these high values decrease rapidly over the following weeks. For $PPS_3$, around week 4, the system starts to correctly assign each learner to the suitable class among $C_1$, $C_2$ and $C_3$. Then, from week 5, the stability of $PPS_3$ increases continuously and reaches about 96% at the last prediction time. For $PPS_4$, the shape of the curve is identical to that of $PPS_3$ with a downward shift: until week 13, the overall stability decreases, then, from week 14, it starts to increase again to reach about 57%. We thus notice either a partial or a total drop in stability for the systems $PPS_2$, $PPS_3$ and $PPS_4$. Although the stability of $PPS_1$ never decreases over time, $PPS_3$ remains the most stable.
Table 3: Stability measurements.

          PPS_1     PPS_2     PPS_3     PPS_4
C_1       92.26%    89.12%    58.79%    18.24%
C_2       11.36%    0%        35.85%    0%
C_3       27.56%    0%        47.70%    39.86%
D_test    72.72%    65.12%    88.10%    48.48%
Stability and accuracy are related; however, from the accuracy curves of $PPS_2$ and $PPS_4$ alone, we cannot conclude on the effectiveness of the systems in maintaining correct prediction sequences over time.
6 THREATS TO VALIDITY
The current work presents some limitations that we
tried to mitigate when possible: i) the earliness algo-
rithm returns a single value which corresponds to the
mean of the first correct predictions. However, a high
associated HM is partially reliable as the accuracy of
the system may decrease after the identified earliness.
We intend to consider several earliness points with
respect to the one returned by our algorithm. The
objective is to study the variations of the HM mea-
sures between these earliness points. ii) in this work,
we have adopted a weekly-basis prediction approach.
However, we believe that the proposed indicators can also be used to define the appropriate temporal granularity that provides better prediction results. iii) in
the frame of this work, we only considered classifica-
tion problems. To prove the generic use of our indica-
tors, we aim to apply them on other systems that use
regression-based analytical models.
7 CONCLUSION AND
PERSPECTIVES
In this paper, we introduced time-dependent indicators, namely earliness and stability, to assess the PPS used in online distance education. Further, a trade-off between earliness and accuracy is important to assess the ability of a PPS to predict learners at risk of failure; the HM measure serves to illustrate this balance. The new indicators, along with the accuracy, are used to compare four different PPS. The experimental results show that the accuracy alone is not sufficient to evaluate a PPS in a context of time-series data, that the HM measure is relevant for identifying the earliest and most accurate system, and that the stability is more pertinent when the whole test dataset is considered. Further, a system is considered better when its stability increases over time.
As perspectives for this work, we intend to im-
prove the proposed earliness algorithm so that it re-
turns a set of different earliness values. Such a re-
sult will enable us to study more deeply the behavior
of the PPS. In addition, our next goal is to study the
trade-off between the stability and earliness indicators
and conclude on the relation between both of them.
REFERENCES
Adnan, M., Habib, A., Ashraf, J., Mussadiq, S., Raza, A. A., Abid, M., Bashir, M., and Khan, S. U. (2021). Predicting at-risk students at different percentages of course length for early intervention using machine learning models. IEEE Access, 9:7519–7539.
Bañeres, D., Rodríguez, M. E., Guerrero-Roldán, A. E., and Karadeniz, A. (2020). An early warning system to detect at-risk students in online higher education. Applied Sciences, 10(13):4427.
Ben Soussia, A., Roussanaly, A., and Boyer, A. (2021).
An in-depth methodology to predict at-risk learners.
In European Conference on Technology Enhanced
Learning, pages 193–206. Springer.
Chui, K. T., Fung, D. C. L., Lytras, M. D., and Lam, T. M.
(2020). Predicting at-risk university students in a vir-
tual learning environment via a machine learning al-
gorithm. Computers in Human Behavior, 107:105584.
Figueroa-Cañas, J. and Sancho-Vinuesa, T. (2020). Early prediction of dropout and final exam performance in an online statistics course. IEEE Revista Iberoamericana de Tecnologias del Aprendizaje, 15(2):86–94.
Hlosta, M., Zdrahal, Z., and Zendulka, J. (2017).
Ouroboros: early identification of at-risk students
without models based on legacy data. In Proceed-
ings of the seventh international learning analytics &
knowledge conference, pages 6–15.
Hu, Y.-H., Lo, C.-L., and Shih, S.-P. (2014). Develop-
ing early warning systems to predict students’ online
learning performance. Computers in Human Behav-
ior, 36:469–478.
Iqbal, Z., Qadir, J., Mian, A. N., and Kamiran, F. (2017).
Machine learning based student grade prediction: A
case study. arXiv preprint arXiv:1708.08744.
Karalar, H., Kapucu, C., and Gürüler, H. (2021). Predicting students at risk of academic failure using ensemble model during pandemic in a distance learning system. International Journal of Educational Technology in Higher Education, 18(1):1–18.
Lee, S. and Chung, J. Y. (2019). The machine learning-
based dropout early warning system for improving the
performance of dropout prediction. Applied Sciences,
9(15):3093.
Philipp, M., Rusch, T., Hornik, K., and Strobl, C. (2018).
Measuring the stability of results from supervised
statistical learning. Journal of Computational and
Graphical Statistics, 27(4):685–700.
Schäfer, P. and Leser, U. (2020). TEASER: early and accurate time series classification. Data Mining and Knowledge Discovery, 34(5):1336–1362.
Teinemaa, I., Dumas, M., Leontjeva, A., and Maggi, F. M.
(2018). Temporal stability in predictive process mon-
itoring. Data Mining and Knowledge Discovery,
32(5):1306–1338.
Wang, W., Yu, H., and Miao, C. (2017). Deep model for
dropout prediction in moocs. In Proceedings of the
2nd international conference on crowd science and
engineering, pages 26–32.
Wolff, A., Zdrahal, Z., Herrmannova, D., Kuzilek, J., and
Hlosta, M. (2014). Developing predictive models for
early detection of at-risk students on distance learning
modules.
Zhang, W., Huang, X., Wang, S., Shu, J., Liu, H., and Chen,
H. (2017). Student performance prediction via on-
line learning behavior analytics. In 2017 International
Symposium on Educational Technology (ISET), pages
153–157. IEEE.