Clustering Techniques to Identify Low-engagement Student Levels
Kamalesh Palani, Paul Stynes
a
and Pramod Pathak
b
School of Computing, National College of Ireland, Dublin, Ireland
Keywords: Online Learning, Virtual Learning Environment, Data Mining, Unsupervised Clustering, Gaussian Mixture,
Hierarchical, K-prototype.
Abstract: Dropout and failure rates are a major challenge with online learning. Virtual Learning Environments (VLE)
as used in universities have difficulty in monitoring student engagement during the courses with increased
rates of students dropping out. The aim of this research is to develop a data-driven clustering model aimed
at identifying low student engagement during the early stages of the course cycle. This approach, is used to
demonstrate how cluster analysis can be used to group the students who are having similar online behaviour
patterns in the VLEs. A freely accessible Open University Learning Analytics (OULA) dataset that consists
of more than 30,000 students and 7 courses is used to build clustering model based on a set of unique
features, extracted from the student’s engagement platform and academic performance. This research has
been carried out using three unsupervised clustering algorithms, namely Gaussian Mixture, Hierarchical and
K-prototype. Models efficiency is measured using a clustering evaluation metric to find the best fit model.
Results demonstrate that the K-Prototype model clustered the low-engagement students more accurately
than the other proposed models and generated highly partitioned clusters. This research can be used to help
instructors monitor student online engagement and provide additional supports to reduce the dropout rate.
a
https://orcid.org/0000-0002-4725-5698
b
https://orcid.org/0000-0001-5631-2298
1 INTRODUCTION
The increase in online learning in higher education
has led to increases in educational data. Aljohani,
Fayoumi and Hassan (2019) indicates that
educational data from the VLEs provide
opportunities to analyse the student’s behaviour
patterns, and to increase the performance of teaching
and learning behaviour. Student dropout rates and
withdrawal from the course are major challenges
with the VLEs. Hussain et al., (2018) emphasise that
the student login data is the main source for the
instructor to monitor the student’s online
engagement and provide high quality education. It is
difficult for instructors in the online platform to
monitor and access all the individual student data in
order to determine the student engagement level in
their courses. The student drop out prediction is an
ongoing challenge in the online learning platforms
which needs to be addressed so that both the student
and the online educational institution will benefit
(Chui et al., 2020).
Current research uses machine learning algorithms
to build the dropout prediction model where labelled
data is used to train the model. Hassan et al., (2019)
propose that to predict the at-risk students, individual
student engagement pattern has to be identified from
the VLEs along with academic performance to derive
the valuable insights from the data. Since the
educational data continues to increase, the diversity of
the data changes (based on the research question), and
as such there is no standard way to monitor the
students online based on their individual interaction in
the VLEs.
This research proposes a data-driven clustering
algorithm using a freely accessible OULA
(https://analyse.kmi.open.ac.uk/) dataset, to identify
low-engagement students in early stages of the
course cycle based on the individual student’s
behaviour and academic performance in the VLEs.
The aim of this research is to investigate to what
extent the unsupervised clustering algorithm can be
used to identify low-engagement students during the
early stages of the course cycle in from the VLEs.
248
Palani, K., Stynes, P. and Pathak, P.
Clustering Techniques to Identify Low-engagement Student Levels.
DOI: 10.5220/0010456802480257
In Proceedings of the 13th International Conference on Computer Supported Education (CSEDU 2021) - Volume 2, pages 248-257
ISBN: 978-989-758-502-9
Copyright
c
2021 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
This study designs a clustering model and
implements unsupervised clustering models to
identify at risk students. A clustering evaluation
metric is used to measure the better-defined cluster
and separation between the clusters.
The key contribution of this paper is to help the
online instructor to track the student’s online
activities and build the students profile. This helps to
predict the future outcomes of the student
performance which can be used to alter teaching
content and also helps to optimize the learning
environment in the VLEs.
This paper describes related work with a focus
on low-engagement student prediction and clustering
methods in the VLEs in section 2. Section 3
describes the OULA dataset and the methodology
used in the paper. Section 4 presents the
implementation of the Clustering algorithms.
Section 5 provides the evaluation of the model.
Section 6 describes the conclusion and future work.
2 RELATED WORK
The Literature review for this research has been
written from the peer reviewed papers published
during 2010 to 2020 on the student engagement and
dropout prediction in the VLEs. Section 2.1
discusses the uses of online education system.
Section 2.2 discusses the challenges in predicting the
student’s dropout rate. Section 2.3 discusses student
engagement and learning behaviours in VLEs.
Section 2.4 provides an overview of machine
learning techniques used in dropout predictions.
Section 2.5 provides an overview of clustering in
VLEs. Section 2.6 discusses the research gap.
2.1 Study of Technology-enhanced
Learning Platform
Web based learning platforms have shown rapid
growth in higher educational institutions in many
forms such as Virtual Learning Environment, E-
Learning, Massive Open Online Courses (MOOCs)
and Modular Object-Oriented Dynamic Learning
Environments such as Moodle. This section
discusses how the VLEs are used in educational
institutions in addition to the challenges.
Corsatea and Walker (2015) has stated that most
of the VLEs in the higher educational institutions are
used as a data container to upload the study
materials. The teacher does not utilize all the tools in
the VLEs such as blogs, chat forms, and tracking of
student’s engagement in the VLEs. Students loose
motivation due to the absence of one to one
interaction in the VLEs and difficulty in finding
course materials. This affects the students’
performance. Hussain et al., (2018) has used the
VLEs log to overcome the challenges of motivation
and engagement faced by the learners. Educational
logs of the individual students can be used to analyse
the student’s engagement behaviour in the VLEs.
The instructor can monitor the students using the
logs stored in the VLE. However, it is not possible
to analyse individual student logs for all courses due
to the limited number of instructors in higher
educational institutions. Furthermore, (Hussain et
al., 2018) suggests that an automated intelligent
system is required to process or extract information
from student’s logs. This information can be used by
the instructor to profile the students and understand
the student’s engagement in the VLEs in a
meaningful way. Agnihotri et al., (2015) analysed
student’s login data from an online assessment
platform tool called “connect”. Connect contains the
number of times the students logged in to the course
for the entire course duration. The student logs were
used for student profiling and monitoring, however a
limitation in this research is the choice limited
factors when profiling the students.
From the current research it is clear that the
student’s log in the VLEs can be used to predict
their behaviour by monitoring and profiling the
students engagements in the VLEs courses. In the
next section the reason for student drop out in the
VLEs will be explored.
2.2 Study of Student Drop out in
Virtual Learning Environment
One of the major challenges faced by higher
educational institutions is students drop out and
failure rates. Low-engagement students who are
enrolled on the course may not complete the course.
Dalipi, Imran and Kastrati, (2018) have reviewed
the student dropout prediction and their challenges.
Their recommendations are to tackle student related
factors such as the lack of motivation, lack of time
and insufficient knowledge for the courses. In
addition they recommend to address VLE related
factors such as course design, hidden cost and lack
of interactivity or monitoring in VLEs. In order to
build the effective prediction model, students’
clickstreams data, academic performance and social
engagement features or variables have to be
considered.
Yi et al., (2018) have used non-cognitive skills
such as sleep hour, usage of smart phones,
Clustering Techniques to Identify Low-engagement Student Levels
249
consumption of energy drinks, and the number of
visits to doctor in order to predict the drop out
students. The main limitation of this research is that
the data collected is course specific and not
generalized to other courses. In addition the data
used to train the model is small.
Liang et al., (2016) used data from the edX
platform to build the predictive model. Data is
extracted from the edX platform which contains the
enrolment, user and course feature data. The
classification model was built to classify the
students. For the user feature, this research has used
the data from the student interaction with the video
and the clicks the students has made for each course
in order to build the model. This approach has not
been carried out in VLEs and the students
interacting with the video are not properly recorded.
Therefore, in this research the trained model has data
loss which is a major drawback.
Overall to predict the students drop out in VLEs
feature selection from the VLEs log and the size of
the dataset are the important factors that have to be
considered in building the model. In addition, the
growing educational data in the institution provides
opportunities to improve the student performance
and optimise the learning environment (Hassan et
al., 2019).
2.3 Understanding of Student
Engagement in VLEs
Student engagement in the VLE is the effort that the
student spends on interacting with the VLE. The
student engagement metric in the prediction of
student drop out is an important factor because lack
of interaction in the VLEs will usually affect student
engagement. Due to the absence of face to face
meetings in web-based systems it is difficult to
measure student engagement in VLEs such as
attendance, interaction of the students in the courses
and grades. There are no standard approaches to
understand student behaviour in VLEs due to the
challenges in measuring student engagement.
Waheed et al., (2020) uses student’s engagement
as a key factor to predict the student academic
performance in the VLEs and develops a deep
learning prediction model using a binary
classification dataset that describes whether a
student will pass or fail at the end of the course. The
VLEs log clickstream is taken as an important factor
in predicting the student performance. However, the
model was built on the assumption that the student’s
behaviour during the course is treated as equal. The
absence of individual student’s behaviour pattern is
not considered in this research. Boroujeni and
Dillenbourg, (2019) have tried different approaches
to analyse the individual learning processes from the
VLEs. In their research video, assessment details are
extracted from the student’s interaction logs on a
weekly basis in order to analyse the individual
student behaviour. A limitation in this research is the
fixed-study pattern which was used to train the
model and the students who change their study
pattern are given less importance.
Understating the individual students learning
behaviour in the VLEs is an important metric that
has to be included while training the model so that
the accuracy of predicting the low engagement
students in the VLEs can be increased (Corrigan and
Smeaton, 2017). In the next subsection different
machine learning (ML) and clustering techniques
that are used to build the student drop out prediction
model in the VLEs is discussed.
2.4 Machine Learning Techniques
Used in Predicting
Low-engagement Students
Chui et al., (2020) used support vector machine
(RTV-SVM) to predict low-engagement students
and marginal students in the VLEs. However, in this
work the students who are dropping out of the
course cannot be identified in real time. They can
only be identified after the completion of the course
when the drop out students are identified. Macarini
et al., (2019) has tried to predict the at-risk students
during the early stages of the course cycle using a
Moodle dataset. Four classification models were
built namely AdaBoost, Decision Tree, Random
Forest and Naive Bayes. The dataset which was
transformed on a weekly basis and the “Area Under
Curve” (AUC) was used to evaluate the model. A
limitation of this research was that the dataset used
to train the model is small and oversampling
techniques such as SMOTE are used to balance the
data. The performance of the model changes every
time the model is trained. A drop out predicting
system developed by (Hassan et al., 2019) used
Deep learning models such as Long Short-Term
Memory (LSTM) and Artificial Neural Network to
build the model using smart data which was
transformed into week-wise clickstream data. The
authors (Hassan et al., 2019) mention that deep
learning models perform better than the traditional
machine learning models with better accuracy in
predicting the at-risk students. They also suggest
that sequence to sequence approach on student’s
interaction pattern can be built into the model for
CSEDU 2021 - 13th International Conference on Computer Supported Education
250
better accuracy. However, a limitation of research
(Hassan et al., 2019) is that students engagement
pattern in their courses is not considered. Corrigan
and Smeaton (2017) have used Recurrent Neural
Networks (RNN) with student interaction pattern to
predict how well the students will perform in their
VLEs courses. However, a limitation is that 2,879
students are used to train the model, and to include
any new courses, one year of data has to be
collected, and after that the model has to be trained.
2.5 An Understanding of Clustering in
VLEs
Agnihotri et al., (2015) used data-driven clustering
methods to identify the high and low achiever in
online courses. The K-means clustering algorithm is
used to group students based on the login behaviour
of the students and the number of attempts to clear
the course. Data aggregations used in this research
are not properly processed. There are lot of null
values in training the model and less factors are used
to build the model. Preidys and Sakalauskas, (2010)
extracted huge data from the BlackBoard Vista
distance learning platform to analyse the learners
study pattern. Three clusters were identified from
the dataset namely Important, Unimportant and
Average importance using K-means clustering.
There are several outliers in the dataset and the same
is used to build the model. The above mentioned
challenges have been resolved in (Navarro &
Moreno-Ger, 2018). In this research a huge dataset
with no outliers has been used on an education
dataset to determine which clustering algorithm
performs better in predicting the low learners in the
VLEs. Seven clustering models have been used in
this work and to benchmark the performance
different evaluation metrics like Dunn Index,
Silhouette score and Davies-Bouldin score have
been compared to identify which algorithm performs
better. However, a specific limitation in this research
is that missing data in instances in the factors are
removed, which may contain useful information and
provide additional insights. 44% of the data is
cleaned from the original data.
2.6 Research Gaps
Current research studies indicate that there is no
standard way to predict low-engagement students in
the VLEs. The size of the dataset is a major
limitation where most of the studies have used the
student’s data which is less than 1000 in order to
train the model.
Therefore, the aim of this research is to
implement the clustering model on the OULA
dataset and to identify low-engagement students in
the VLEs. All interaction patterns of the individual
students in the VLEs such as academic performance
and student information will be used and converted
into smart data to predict the low-engagement
students in the early stages of the course cycle.
Clustering models like Gaussian mixture, K-
prototype and Hierarchical clustering are used with
different parameters and compared with the
evaluation metric (Navarro Moreno-Ger, 2018) to
evaluate which model performs better. Overall, this
research will be helpful to both the instructor and the
students in the online learning environments for
profiling and tracking of students. The teaching
content can be altered in VLEs by knowing the
students behaviour.
3 RESEARCH METHODOLOGY
To extract meaningful insights from the complex
data, the Knowledge Discovery in Database (KDD)
methodology is used in this research. The steps
followed are the data selection and understanding,
data pre-processing and transformation, modelling
and evaluation.
3.1 Data Selection and Understanding
The dataset has been extracted from the Open
University Learning Analytics (OULA) Dataset
which is one of the distance learning universities in
the United Kingdom (UK). This dataset is unique
from the other educational data because it contains
the student’s demographic data along with the
student’s interactions in the VLEs which is
clickstream. There are 32,593 students in this dataset
for 22 different courses for the period 2013 and
2014. The dataset is publicly available and contains
the student’s anonymized information. The dataset
follows ethical and privacy requirements of the
Open University. There are 7 different CSV files
which contain different information related to
student’s demographic, assessment scores and the
student’s interaction with the VLEs.
Raw data is transformed to aggregated data with
newly created attributes from different files of the
data. Three different type of category are extracted
from the dataset namely learning behaviour, student
course performance and the demographic details of
students.
Clustering Techniques to Identify Low-engagement Student Levels
251
3.2 Data Pre-processing and
Transformation
The raw data is transformed into actionable
aggregated data because it cannot be directly used as
input into the clustering model. All the pre-
processing and transformation steps are performed
in Python Jupyter Notebook using pandas library.
First, data exploration is carried out to check the
distribution of the data, finding missing values and
checking outliers in the data. Both univariate and
bivariate analysis has been carried out and outlier
and missing values are filtered from the dataset. In
the second step data transformation like encoding
the categorical variables and standardization of the
data is performed. In the third step new variables are
created for each student namely the overall studied
credits, total score, average clicks week wise, and
attempted weight for each course. To improve the
clustering model performance one hot-encoding is
done on the categorical column before giving as an
input to train the model. In the last step, columns
that are not contributing to the low student
engagement prediction are dropped before
implementing the model. A detailed description of
aggregated data preparation and processing is
explained is section 4.1.
3.3 Modelling
The aggregated and transformed data is given as an
input to the clustering model. Three clustering
models are implemented on the above transformed
smart dataset namely K-Prototype, Gaussian
Mixture and Hierarchical. Identifying the optimal
number of clusters in the dataset is done using the
Gap Statistics (MacEdo et al., 2019). The dataset is
used to train the K-Prototype model. The K-
Prototype is the combination of K-means and K-
mode clustering technique. The aggregated dataset
contains both numeric and categorical variables
therefore this specific type of clustering model is
chosen (Wang et al., 2016). Hierarchical clustering
is used as this analysis is based on finding similar
student’s interaction behaviour in VLEs.
Hierarchical clustering merges the clusters based on
the similarity and also both top down and bottom up
approaches can be tested (De Morais, Araújo &
Costa, 2015). The Gaussian Mixture clustering
model is chosen because it is a probabilistic model
and the approach will not complete until all the data
points are converged in different clusters and also it
uses a soft clustering approach.
3.4 Evaluation
The clustering evaluation Metrics, Silhouette
Coefficient, Davies-Bouldin index and Calinski-
Harabasz will be used to evaluate the model
performance. These metrics can show if the clusters
are well separated and are not overlapping. The
Silhouette coefficient metric calculates the mean
distance between the data points to find the better-
defined clusters, the clustering configuration is
appropriate if it has a high value (range -1 to 1).
The higher the Calinski-Harabasz index the better
the clusters are defined in the model. Finally, the
Davies-Bouldin index is used to check the similarity
between the clusters and the lower the index value
the better is the clustering.
4 IMPLEMENTATION
In this section implementation of the Fuzzy C-means
model (MacEdo et al., 2019), proposed clustering
models and preparation of aggregated data is
discussed along with the technical specifications.
4.1 Aggregated Data Preparation and
Pre-processing
In order to predict the low-engagement students, the
raw OULA dataset is transformed to aggregated data
by processing all the data from the files into a single
table. The aim of this research is to use the three
important attributes Learning Behaviour,
Performance and Demographic details of the
students as an input to the clustering models.
Therefore, data transformation has been conducted
in the cleaned dataset to derive the above-mentioned
attributes. Firstly, to derive the learning behaviour
student’s clickstream data has been processed to
week wise for 20 different activity from the VLEs
namely URLs, Homepage, Forums, Quiz,
Questionnaires, Folders, etc. Each week wise
aggregation of clicks has been added to the previous
week student click stream behaviour. Secondly, for
student’s performance, the average score the
students has attained in all the assignments before
the final exam has been added into a new column in
the dataset. Also, adjusted mark and attempted
weights are calculated based on the assessments
score and total credits. Finally, for Demographic
attributes, one-hot encoding is done on the
categorical columns. Prior to running the model,
data was normalized and scaled down to fixed range
(0 to 1). This normalization of the data improves the
CSEDU 2021 - 13th International Conference on Computer Supported Education
252
performance of the model, due to the fact that all the
clustering models use Euclidean distance to find the
distance between the closest points to the near
clusters. Overall, after performing the above steps
actionable aggregated data has been prepared and
the same is given as an input to train the clustering
models.
4.2 Implementation of Clustering
Models
All the clustering models are implemented in Python
3.7 using Jupyter Notebook and Scikit-learn
libraries. The number of clusters for the clustering
models is identified by using Gap Statistics
(MacEdo et al., 2019) on the aggregated data. Gap-
stat library has been imported from python and used
by the range of values from 0 to 11 for K by fitting
the model and including all the indexes. The point of
reflection of the curve was found at 3 which means
for the dataset the number of clusters can be used is
3, to run the clustering models. Therefore, all the
models were executed with 3 clusters to group the
students based on the individual behaviours in the
VLE.
4.2.1 Fuzzy C-means Clustering
The Fuzzy C-means clustering model has been
implemented by defining three clusters. MAX_ITER
parameter has been set to 20 to limit the model from
running an infinite loop. Also, m parameter value is
given greater than 1 to avoid the model to run as K-
nearest neighbours. After, passing the parameters,
the model is fitted and cluster labels are stored in a
separate variable. A scatter plot was used to
visualize the clusters in order to find the dispersion
of the data and the clusters.
4.2.2 Hierarchical Clustering Model
Agglomerative clustering has been imported from
the “sklearn.cluster” library in python in order to
perform hierarchical clustering on the normalized
data. The output result of the clusters labels are used
to identify the students who have low-engagement
by plotting the scatter plot using matplotlib library in
python and setting the parameter of x-axis to the
score attribute and the y-axis to the sum of clicks
attribute in the dataset. Additionally, to check the
performance of the model the clusters labels, metric
and normalized data are used to find how well the
clusters are separated between the datapoints using
the evaluation metric.
4.2.3 Gaussian Mixture Clustering
The Gaussian mixture clustering model is imported
from the sklearn.mixture library in python and the
created function runs the model using the defined
parameters. After setting the parameters, the model
is built using the fit method and the output of the
methods is the cluster labels. Using the clusters
labels both the scatter plot and evaluation metrics
are performed to find the performance of the model.
4.2.4 K-prototype Clustering
In this clustering model both categorical and
numerical data have been given as input to the
model as the K-prototype algorithm works well with
mixed data. For numerical data this model uses the
Euclidean distance to cluster the data points and for
categorical data it uses the similarity between the
data points to group into clusters. Ten iterations are
carried out and for each iteration centroids and
clusters are redefined and the best iteration is chosen
based on the less variance between the clusters.
After setting the parameters the model was fitted
into the fit method by defining the categorical
variable separately. The output of the method
showed that in the eighth iteration less variance has
been achieved and the clusters labels are plotted in
the seaborn library in python to find whether the
clusters are well separated.
5 EVALUATION
This section discusses the results and performance of
the clustering models. In experiment 1, choosing the
number of clusters for the aggregated data is
discussed and experiment 2, 3, 4 compares the
clustering models in order to identify the best model
which has less overlap of data points between the
clusters.
5.1 Experiment 1: Gap Statistics
Statistical testing methods are used to find the
optimal numbers of clusters in a dataset. Gap
statistics is used as the metric to find the clusters as
used in MacEdo et al., (2019). The gap statistics
identifies the elbow point at 3. Therefore, 3 optimal
clusters are used in the clustering models to cluster
the students using the aggregated dataset.
Clustering Techniques to Identify Low-engagement Student Levels
253
5.2 Experiment 2: Fuzzy C-Means vs
Gaussian Mixture
In this experiment Fuzzy C-means and the Gaussian
Mixture models were built and their results are
compared. Model performance is compared using
evaluation metrics. Table 1 shows the metric result
of the Fuzzy C-means and Gaussian Mixture
clustering model.
Table 1: Fuzzy c-means vs Gaussian Mixture.
Model
Silhouette
Coefficient
Calinski-
Harabasz Index
Davies-
Bouldin
Index
Fuzzy c-
means
0.38 4731 0.94
Gaussian
Mixture
0.51 3152 0.67
Results demonstrate that the proposed Gaussian
mixture model outperformed the fuzzy c-means
model. The Silhouette score of the Gaussian model
shows a 13% increase and the Davis score is 27% less
when compared to the fuzzy c-means model.
However, Calinski index metric which explains how
well the data points are separated from other clusters
shows a less result for the gaussian model. The scatter
plot of Gaussian mixture model showed that the data
are overlapped in cluster 1 and 2. Therefore, to reduce
the overlapping of the data points in the cluster, the
model that outperformed in this experiment, namely
Gaussian Mixture is compared with the Hierarchical
Clustering model in experiment 3.
5.3 Experiment 3: Gaussian Mixture vs
Hierarchical
The hierarchical clustering model was compared
with the gaussian mixture model. The results show
that hierarchical model performed better than the
gaussian in the calinski-harbaz and the davis-bouldin
index.
Table 2: Gaussian Mixture vs Hierarchical.
Model
Silhouette
Coefficient
Calinski-
Harabasz Index
Davies-
Bouldin
Index
Gaussian
Mixture
0.51 3152 0.67
Hierarchical 0.52 4552 0.52
Table 2 shows the performance comparison of
the model. From the hierarchical scatter plot, it was
evident that the hierarchical clustering model has an
overlapping of datapoints between clusters 1 and
cluster 2. Therefore, to reduce the overlapping of
datapoints between the clusters, the K-Prototype
clustering model is used in the next experiment and
compared with the hierarchical model.
5.4 Experiment 4: Hierarchical vs
K-prototype
The K-Prototype clustering model was implemented
in this experiment and used a different notion of
distance in order to calculate the distance between
the clusters. 10 iterations were used to find the best
separation of clusters and centroids. The K-
Prototype model produced the best result in iteration
2. Table 3 shows the performance of the models.
Table 3: Hierarchical vs K-Prototype.
Model
Silhouette
Coefficient
Calinski -
Harabasz Index
Davies-
Bouldin
Index
Hierarchical 0.52 4552 0.52
K-Prototype 0.75 17847 0.28
The K-Prototype clustering algorithm shows
better results when compared to all the experiments
and the Davies-Bouldin index is lower (closer to 0)
which indicates that the groupings of the students is
better partitioned. The Silhouette and Calinski-
Harabasz value is higher when compared to the
Hierarchical clustering model which shows that the
clusters are better defined in the k-prototype model.
The scatter plot in Figure 1 shows that the clusters 1
and 2 are well partitioned and separated between the
data points and the overlapping of clusters is
reduced in this model compared to the models
implemented in experiments 2 and 3.
Figure 1: K-Prototype model scatter plot.
CSEDU 2021 - 13th International Conference on Computer Supported Education
254
5.5 Discussion
Results show that the k- prototype clustering model
produced a better partition of clusters compared to
the other models. The reason behind the
performance improvement of k-prototype is that this
model is designed to work on both categorical and
numerical attributes in the dataset. In addition, the
distance between the data points to group the
clusters is measured using two metrics. For numeric
values Euclidean distance is used and for categorical
values the similarity between the points is used. In
the other models categorical data is converted into
numeric data using one-hot encoding which reduces
the model’s performance. Figure 2 shows that, the
silhouette coefficient score for the k-prototype
model is 0.75 and for fuzzy c-means which is 38.
There is significant increase in the separation of the
data points between the clusters in the k-prototype
model. The hierarchical and gaussian mixture
models also performed less compared to k-prototype.
Figure 2: Comparison of Evaluation Metric for all models.
The Calinski -Harabasz score is used to find the
variance of the data points between the clusters. If
the value of the score is higher then the cluster is
dense and well separated. The Calinski-Harabasz
score for the k-prototype is 17847 which is higher
when compared to the other models. The Davies-
Bouldin score is calculated for the scaled data. The
lesser the value of the Davies-Bouldin score the
better the separation of the clusters. For k-prototype
model the score is 0.28. Table 4 shows the cluster
labels that is observed in the clustering result for the
k-prototype model. It shows that cluster 0, represent
the class of low-engaging at-risk students with low
interaction in the VLE and low scores in the
modules. Cluster 1 contains the marginal students
who are also at risk with medium engagement in the
VLE and where they attained low scores. Finally,
cluster 2 represent the distinction students who have
attained high scores in assignments with high
interaction in the VLEs.
Table 4: K-prototype clustering result.
Cluster Class
Cluster 0
Low-engagement
students
Cluste
r
1Mar
g
inal students
Cluste
r
2 Distinction students
In summary, the k-prototype model shows less
overlapping of clusters compared to other model and
identifies the at-risk students with high performance.
The generalisability of this research is limited as it is
only using one dataset, though it is a large dataset
from a very established open university. Further
research studies are needed to establish the
generalisation of this model.
6 CONCLUSION AND FUTURE
WORK
Identifying low-engagement students in an online
learning environment is important because it allows
the instructor to monitor the student’s behaviour.
The OULA dataset from one of the largest distance
learning university in the UK is collected and
formatted to actionable aggregated data in a form
suitable for input to the clustering model. Then
fuzzy c-means model and multiple clustering models
have been applied on the data to identify the optimal
model at identifying low-engagement students.
The results of the experiment showed that K-
Prototype clustering algorithm is the most appropriate
algorithm in identifying low-engagement students in
the VLEs compared to Fuzzy C-means, Gaussian
Mixture and Hierarchical models showing the
Silhouette score of 0.75 which indicates the clusters
are better partitioned and Davies-Bouldin score of
0.28 which indicates less variance between the
cluster. The results show that the clickstream
Clustering Techniques to Identify Low-engagement Student Levels
255
behaviour of the students in VLE and academic
success are the key factors that have an impact in
identifying the low-engagement students.
Future work will extend this research further by
exploring individual student’s day to day activity to
get detailed understanding of student’s behaviours in
VLEs. Also, behavioural change of the students
between the courses may be analysed for examining
student’s behaviour. Mining the student’s textual
data from the feedback forms using Natural
Language processing from the VLEs can also be an
important factor in identifying the student
performance. Additionally, use of date attributes like
assignments submission date and student’s week
wise interactivity in VLE can be used to build the
model using time series which can result in
monitoring the students daily or in weekly
frequency. Future work is also needed to test the
model in other online teaching contexts.
Finally, this research work will be helpful for
educational institutions, learning analytics and future
researchers in choosing the important attribute to
identifying the low-engagement students in the online
learning environment and to figure-out how to pick
the best performing clustering algorithm based on the
clustering analysis in educational dataset.
REFERENCES
Agnihotri, L. et al. (2015). ‘Mining Login Data For
Actionable Student Insight’, Proceedings of the 8th
International Conference on Educational Data Mining
(EDM), pp. 472–475.
Aljohani, N. R. et al. (2019) ‘Predicting at-risk students
using clickstream data in the virtual learning
environment’, Sustainability (Switzerland), 11(24), pp.
1–12.
Boroujeni, M. S. and Dillenbourg, P. (2019) ‘Discovery
and temporal analysis of MOOC study patterns’,
Journal of Learning Analytics, 6(1), pp. 16–33. doi:
10.18608/jla.2019.61.2.
Chui, K. T. et al. (2020) ‘Predicting at-risk university
students in a virtual learning environment via a
machine learning algorithm’, Computers in Human
Behavior. Elsevier, 107(December 2017), p. 105584.
doi: 10.1016/j.chb.2018.06.032.
Corrigan, O. and Smeaton, A. F. (2017) ‘A course
agnostic approach to predicting student success from
vle log data using recurrent neural networks’, Lecture
Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics), 10474 LNCS, pp. 545–548.
doi: 10.1007/978-3-319-66610-5_59.
Corsatea, B. M. and Walker, S. (2015) Opportunities for
Moodle data and learning intelligence in Virtual
Environments, 2015 IEEE International Conference on
Evolving and Adaptive Intelligent Systems, EAIS
2015. IEEE. doi: 10.1109/EAIS.2015.7368776.
Dalipi, F., Imran, A. S. and Kastrati, Z. (2018) ‘MOOC
dropout prediction using machine learning techniques:
Review and research challenges’, IEEE Global
Engineering Education Conference, EDUCON. IEEE,
2018-April, pp. 1007–1014. doi:
10.1109/EDUCON.2018.8363340.
Hassan, S. U. et al. (2019) ‘Virtual learning environment
to predict withdrawal by leveraging deep learning’,
International Journal of Intelligent Systems, 34(8), pp.
1935–1952. doi: 10.1002/int.22129.
Hussain, M. et al. (2018) ‘Student Engagement Predictions
in an e-Learning System and Their Impact on Student
Course Assessment Scores’, Computational
Intelligence and Neuroscience, 2018. doi:
10.1155/2018/6347186.
Kuzilek, J., Hlosta, M. and Zdrahal, Z. (2017) ‘Data
Descriptor: Open University Learning Analytics
dataset’, Scientific Data, 4, pp. 1–8. doi:
10.1038/sdata.2017.171.
Liang, J. et al. (2016) ‘Big data application in education:
Dropout prediction in edx MOOCs’, Proceedings -
2016 IEEE 2nd International Conference on
Multimedia Big Data, BigMM 2016. IEEE, pp. 440–
443. doi: 10.1109/BigMM.2016.70.
Macarini, L. A. B. et al. (2019) ‘Predicting students
success in blended learning-Evaluating different
interactions inside learning management systems’,
Applied Sciences (Switzerland), 9(24). doi:
10.3390/app9245523.
MacEdo, M. et al. (2019) ‘Investigation of college dropout
with the fuzzy c-means algorithm’, Proceedings -
IEEE 19th International Conference on Advanced
Learning Technologies, ICALT 2019. IEEE, pp. 187–
189. doi: 10.1109/ICALT.2019.00055.
De Morais, A. M., Araújo, J. M. F. R. and Costa, E. B.
(2015) ‘Monitoring student performance using data
clustering and predictive modelling’, Proceedings -
Frontiers in Education Conference, FIE, 2015-
Febru(February). doi: 10.1109/FIE.2014.7044401.
Navarro, Á. M. and Moreno-Ger, P. (2018) ‘Comparison
of Clustering Algorithms for Learning Analytics with
Educational Datasets’, International Journal of
Interactive Multimedia and Artificial Intelligence,
5(2), p. 9. doi: 10.9781/ijimai.2018.02.003.
Preidys, S. and Sakalauskas, L. (2010) ‘Analysis of
students’ study activities in virtual learning
environments using data mining methods’,
Technological and Economic Development of
Economy, 16(1), pp. 94–108. doi:
10.3846/tede.2010.06.
Waheed, H. et al. (2020) ‘Predicting academic
performance of students from VLE big data using deep
learning models’, Computers in Human Behavior.
Elsevier Ltd, 104(November 2018), p. 106189. doi:
10.1016/j.chb.2019.106189.
Wang, F. et al. (2016) ‘Empirical comparative analysis of
1-of-K coding and K-prototypes in categorical
CSEDU 2021 - 13th International Conference on Computer Supported Education
256
clustering’, CEUR Workshop Proceedings, 1751(c),
pp. 248–259.
Yi, J. C. et al. (2018) ‘Predictive analytics approach to
improve and sustain college students’ non-cognitive
skills and their educational outcome’, Sustainability
(Switzerland), 10(11). doi: 10.3390/su10114012.
Clustering Techniques to Identify Low-engagement Student Levels
257