Effectiveness of Comments on Self-reflection Sheet in Predicting

Student Performance

Rumiko Azuma

College of Commerce, Nihon University, Setagaya-ku, Tokyo, Japan

Keywords: Learning Analytics, Machine Learning, Text Mining, Educational Support, Reflection Sheet.

Abstract: In recent years, schools and universities have become more focused on how to help learners learn successfully, and instruction is increasingly expected to be designed in a way that takes into account the individual differences of learners. Accordingly, the purpose of this study is to predict, at an early stage in a course, which students are likely to fail, so that adequate support can be provided for them. We propose a new approach to identifying such students using free-response self-reflection sheets. This method builds a comment vector from the students' unrestricted comments and uses it to predict which students are likely to fail the course. We then conducted experiments to verify the effectiveness of this prediction. Compared with existing methods that predict potential failures from quiz scores and the students' subjective level of understanding, our proposed method improved the prediction performance. In addition, when cumulative data after several sessions were used to predict which students were likely to fail, the support vector machine (SVM) algorithm showed consistent prediction performance, and its prediction accuracy was higher than that of the other algorithms.

1 INTRODUCTION

Recently, Learning Analytics (LA), the study of estimating or predicting learners' performance through Learning Management Systems (LMS) or e-learning systems, has been actively pursued (Hirose, 2019a). The most common use of LA is to identify students who appear less likely to succeed academically and to enable targeted interventions to help them achieve better outcomes (Scapin, 2018). Providing feedback to learners based on LA results allows each learner to receive individually tailored learning support. LA offers promise for predicting and improving learner success and retention (Uhler et al., 2013).

On the other hand, a decline in the academic ability of university students has become a problem in Japan, and universities are expected to find students who cannot keep up with classes at an early stage and follow up with them. It is therefore important for educational institutions to predict students' actual level of understanding, or their grades, from learning data through LA.

In many Japanese universities, a class evaluation questionnaire is adopted as one method for measuring students' level of understanding. In the questionnaire, students evaluate their subjective level of understanding, their attitude towards the class, the lecturer, and the class contents. However, students' subjective evaluations often do not reflect their real level of understanding or learning conditions (Azuma, 2016). In addition, since the questionnaire is usually conducted in the final lesson, it serves more to improve future offerings of the class than to provide feedback to the students. For these reasons, to measure students' level of understanding and satisfaction during a course, lecturers need to conduct such surveys more frequently.

One method for lecturers to grasp students' level of understanding and satisfaction is the "minute paper" (Davis et al., 1983), a very short in-class writing activity (taking one minute or less to complete) in which a student writes a brief comment. It prompts students to reflect on the day's lesson and provides the lecturer with useful feedback.

In this study, we propose a method to predict students at risk of failing a course from their comments in self-reflection sheets like the minute paper. The purpose of this study is to improve students' learning behavior by predicting their grades and providing useful feedback to them. In

Azuma, R.
Effectiveness of Comments on Self-reflection Sheet in Predicting Student Performance.
DOI: 10.5220/0010197503940400
In Proceedings of the 10th International Conference on Operations Research and Enterprise Systems (ICORES 2021), pages 394-400
ISBN: 978-989-758-485-5
Copyright © 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

addition, lecturers could quickly identify students with a low level of understanding and provide personalized advice.

2 RELATED STUDIES

It is very important for both learners and lecturers to grasp learner performance, because feedback based on this information can help improve learning.

Hirose (2018, 2019a) analyzed accumulated weekly testing results to identify students who may fail the final examination, using item response theory. He also proposed a highly accurate method that predicts the risk of students failing courses and/or dropping out, using the learning check testing scores, the follow-up program testing success/failure counts, and the attendance rate (Hirose, 2019b). Although the subjects dealt with in his papers are limited to mathematics, he notes that this kind of system could easily be applied to other subjects.

On the other hand, there are several studies that predict student performance based on students' comments in lessons. Sorour et al. (2014, 2015) proposed a model to predict university students' grades from their comments written according to the PCN method. They report that the model clearly distinguishes the high-score group, but that prediction accuracy for the lower-score group is worse. The PCN method (Goda et al., 2011) categorizes student comments into three items: P (previous), C (current), and N (next). Item P covers learning activities in preparation for a lesson, such as review of the previous class. Item C covers understanding of the lesson and learning attitudes towards it. Item N covers the learning activity plan until the next lesson. Luo et al. (2015) discussed a method for predicting student grades based on the comments of item C using Word2Vec and an artificial neural network; their study demonstrated a correlation between self-evaluation sentences and academic performance. Niiya and Mine (2017) also verified the accuracy of a model that predicts junior high school students' scores from their comments based on the PCN method.

However, free-style comments such as the minute papers used in universities often place no limits on what is written, because this openness is effective for increasing student satisfaction. This differs from PCN-based comments, whose content is restricted. This study discusses a method that can predict students' final examination results (pass or fail) using reflection sheets based on the minute paper. An aim of this study is to quickly find students who may fail the course and give them feedback. We therefore consider a method that does not predict their score or grade, but rather the possibility that they will fail the final examination.

3 OVERVIEW OF STUDENTS'

DATA

In a previous study (Azuma, 2017), we tried to predict students' final exam scores by multiple regression analysis based on their understanding level, background knowledge level, quiz scores, and reflection sheets. Regarding the reflection sheets, the numbers of characters and technical terms extracted from them were used as quantitative independent variables. The final model had a coefficient of multiple determination (R² = 0.211, p < 0.001), and its predictive accuracy was low.

In this study, we used data from 193 university students collected over three years (2012 to 2014), consisting of the data from the previous study (Azuma, 2017) plus new data. We then discuss prediction models using machine learning algorithms.

This study's data were collected from 13 lessons of a "basic statistics" course over three years. The reflection sheet is a free-style comment sheet on which students can write freely about things they have learned, noticed, understood, or not understood, as well as questions, requests, etc. There are 2,501 reflection sheets containing 6,051 sentences. The mean number of characters per sheet is 87.35, with a maximum of 686 and a minimum of 6. This study uses the following data, excluding students' background knowledge level because of its very weak correlation with their scores:

- Score of the final examination
- Scores of quizzes
- Understanding level for a lesson (5-level evaluation based on the student's subjective view)
- Comments in reflection sheets (Japanese)

To predict failed students from these data, we classified the final examination scores into two categories, "Passed" and "Failed". Table 1 shows the correspondence between the categories and the score ranges: scores below 60 were classified into the Failed group, and the others into the Passed group. Table 2 shows the summary statistics of the examination scores. The correlation coefficient between the final score and each feature is shown in Table 3.
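The Passed/Failed split described above is a simple thresholding step. As a minimal sketch (the example scores below are hypothetical):

```python
# Minimal sketch of the Passed/Failed split: final examination
# scores below 60 fall into the Failed group, as in Table 1.
def label_score(score: int) -> str:
    """Map a final examination score (0-100) to its category."""
    return "Failed" if score < 60 else "Passed"

scores = [85, 59, 60, 42]                  # hypothetical example scores
labels = [label_score(s) for s in scores]
print(labels)                              # ['Passed', 'Failed', 'Passed', 'Failed']
```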


Table 1: The corresponding relation between the categories and the range of examination scores.

Category            Passed      Failed
Score               60-100      0-59
Grade               S, A, B, C  D
Number of students  129         64

Table 2: The summary statistics of examination scores.

Mean 63.63

Standard deviation 21.83

Max 100

Min 1

Rate of Failed students 0.33

Table 3: The correlation coefficient between the final score and each feature (p < 0.01).

Score of quizzes               0.35
Understanding level            0.24
Comments in reflection sheets:
  Num. of characters           0.31
  Num. of technical terms      0.33

4 PREDICTION METHODOLOGY FOR FAILED STUDENTS

4.1 Prediction Methodology using Four

Features

The previous study (Azuma, 2017) clarified that students with high scores tend to write more characters and technical terms in their reflection sheets, although these counts have only a weak positive correlation with the score. Therefore, we first selected the number of characters and the number of technical terms in the comments, the quiz scores, and the understanding level as prediction features. Each value was averaged over the 13 lessons.

4.1.1 Performance Measures of Failed

Student Prediction

The ML models used in the prediction are the following: Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), Generalized Linear Model (GLM), and Naive Bayes (NB).

We evaluated the performance of each machine learning (ML) model by 10-fold cross-validation using the four features. We randomly separated the dataset into training and test sets, taking the category balance into account so that it matches the original dataset: 80% of the data is randomly assigned to the training set and the remaining 20% to the test set.
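The class-balanced 80/20 split described above can be sketched with scikit-learn's stratified `train_test_split` (the paper does not name its toolkit, so this choice is an assumption; `X` and `y` below are random stand-ins for the four features and the Passed/Failed labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((193, 4))                   # 193 students x 4 averaged features
y = rng.choice(["Passed", "Failed"], size=193, p=[0.67, 0.33])  # synthetic labels

# stratify=y keeps the Passed/Failed ratio in both splits equal
# to that of the original dataset, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```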

Table 4: The confusion matrix.

                        Actual value
                        1     0
Predicted outcome   1   TP    FP
                    0   FN    TN

Table 5: The prediction results of "Failed" by the basic method.

ML Model  Precision  Recall  F-measure  Accuracy
DT        1.000      0.071   0.133      0.675
RF        0.500      0.285   0.333      0.650
SVM       0.400      0.285   0.333      0.600
LR        0.400      0.285   0.333      0.600
GLM       0.400      0.285   0.333      0.600
NB        0.571      0.571   0.571      0.700

The experimental results report the test accuracy and F-measure for each model. These values are defined using TP (True Positive), FP (False Positive), TN (True Negative), and FN (False Negative) in Table 4 as follows:

precision = TP / (TP + FP)                                      (1)

recall = TP / (TP + FN)                                         (2)

F-measure = 2 × (precision × recall) / (precision + recall)     (3)

Accuracy = (TP + TN) / (TP + TN + FP + FN)                      (4)

Recall tells us how confident we can be that all the instances with the positive target level have been found by the model. Precision tells us how confident we can be that an instance predicted to have the positive target level actually has it (Kelleher et al., 2015). In this study, "Failed" students are the positive target. The F-measure is the harmonic mean of precision and recall and offers a useful alternative to the simpler misclassification rate. It reaches its best value at 1 and its worst at 0.
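Equations (1)-(4) translate directly into code; a plain-Python sketch with hypothetical confusion-matrix counts:

```python
# Eqs. (1)-(4), with "Failed" as the positive class as in the paper.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    return 2 * p * r / (p + r)           # harmonic mean of precision and recall

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

tp, fp, tn, fn = 4, 3, 24, 9             # hypothetical confusion counts (n = 40)
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3), accuracy(tp, tn, fp, fn))
```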

4.1.2 Prediction Results using ML

The prediction results using the four features are shown in Table 5. We call this method the basic method.


Precision, recall, and F-measure in Table 5 are the values when "Failed" is the positive target. Although accuracy was over 60% in all models, all F-measure scores were low. NB achieved the highest accuracy and F-measure, at 70.0% and 57.1% respectively.

4.2 Prediction Methodology by

Quantification of Student

Comments

We quantitatively analyzed the student comments in the reflection sheets using KH Coder (Higuchi, 2004), software for quantitative content analysis and text mining that supports multilingual analysis. The number of target words extracted by KH Coder in this study was 123,046, and the average number of words per sheet was 43.9. In this section, we describe a method for treating student comments as features.

4.2.1 Creating a Comment Vector

We assigned labels to sentences in the student comments that contained specific words extracted by correspondence analysis. The labels are defined in Table 6.

"Positive Understanding" indicates whether a phrase expressing understanding is included in the student's comment. For example, if a comment contains the sentence "I understood today's lesson", it was labeled PU-. If, on the other hand, a technical term is included, as in "I understood about Bayes' theorem", the comment was labeled PU+.

Similarly, "Negative Understanding" marks a phrase expressing what the student could not understand. For example, a comment containing a technical term, such as "I did not understand the conditional probability well", was labeled NU+, and otherwise NU-. In total, seven labels, including "Negative Words" (NW), "Vagueness Words" (VW), and "Expressing Willingness" (EW), were assigned to the sentences. A sentence could carry multiple labels: for example, "I did not understand today's lesson because it was difficult" was labeled both NU- and NW. We then counted the sentences for each label and created, for each student, a 7-dimensional comment vector whose elements are the occurrence counts of each label. The correlation coefficients between each element of the comment vector and the number of characters or technical terms are shown in Table 7. To improve predictive accuracy, the comment vector and the number of words were added as features, in addition to the four features of the previous section.
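As a minimal sketch of the counting step (the actual labeling rules use correspondence-analysis word lists over Japanese text; the per-sentence label sets below are hypothetical), the 7-dimensional comment vector can be built per student as follows:

```python
from collections import Counter

# Label order fixed as in Table 6.
LABELS = ["PU+", "PU-", "NU+", "NU-", "NW", "VW", "EW"]

def comment_vector(sentence_labels):
    """sentence_labels: one set of labels per sentence; a sentence may
    carry several labels, e.g. {"NU-", "NW"} for "I did not understand
    today's lesson because it was difficult"."""
    counts = Counter(lab for labs in sentence_labels for lab in labs)
    return [counts[lab] for lab in LABELS]

# One student's labeled sentences (hypothetical):
labeled = [{"PU+"}, {"NU-", "NW"}, {"EW"}]
print(comment_vector(labeled))             # [1, 0, 0, 1, 1, 0, 1]
```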

Table 6: The kinds of labels associated with specific words.

Positive Understanding (specific words: "I understood xxxx", "I got xxxx")
  PU+: co-occurrence with technical terms
  PU-: no co-occurrence with technical terms
Negative Understanding (specific words: "I could not understand xxx", "I am not sure xxx")
  NU+: co-occurrence with technical terms
  NU-: no co-occurrence with technical terms
NW (Negative Words): difficult, many, poor, tough, anxiety, forget
VW (Vagueness Words): feel, to an extent, a little
EW (phrase for Expressing Willingness): I want to, I need to, I have to

Table 7: The correlation coefficient between each element of the comment vector and the number of characters or technical terms (p < 0.01).

                 PU+   PU-    NU+   NU-   NW    VW    EW
Characters       0.41  0.23   0.30  0.41  0.40  0.30  0.50
Technical terms  0.43  -0.01  0.19  0.25  0.12  0.02  0.33

4.2.2 Prediction Results using Comment

Vector

The prediction results with the dataset including the comment vectors are shown in Table 8. For all models except DT and NB, the F-measure and accuracy increased. In particular, the RF model achieved the highest F-measure, improving from 33.3% to 58.3% compared with the basic method. The F-measure of NB also improved slightly, from 57.1% to 57.8%, but its accuracy, the highest under the basic method, declined from 70.0% to 66.0%.

4.3 Prediction of Students at Risk of Failing the Examination

4.3.1 Category of Potential Students

Next, students with grades C or D, i.e., scores below 70, are regarded as the "Problem" group, and we calculated the prediction performance for "Problem" students. Table


9 displays the correspondence between the new categories and the score ranges.

Table 8: The prediction results of "Failed" by the comment vector + basic method.

ML Model  Precision  Recall  F-measure  Accuracy
DT        0.500      0.062   0.111      0.600
RF        0.875      0.437   0.583      0.750
SVM       0.545      0.375   0.444      0.625
LR        0.666      0.375   0.480      0.675
GLM       0.714      0.312   0.434      0.675
NB        0.500      0.687   0.578      0.660

Table 9: The corresponding relation between the new categories and the range of examination scores.

Category            No problem  Problem
Score               70-100      0-69
Grade               S, A, B     C, D
Number of students  88          105

Table 10: The prediction results of "Problem" students by the comment vector + basic method.

ML Model  Precision  Recall  F-measure  Accuracy
DT        0.615      1.000   0.761      0.625
RF        0.700      0.875   0.777      0.700
SVM       0.724      0.875   0.792      0.725
LR        0.720      0.750   0.734      0.675
GLM       0.714      0.833   0.769      0.700
NB        0.700      0.875   0.777      0.700

Table 11: The prediction results of "Problem" students using only the data of the reflection sheets.

ML Model  Precision  Recall  F-measure  Accuracy
DT        0.615      1.000   0.761      0.625
RF        0.666      0.833   0.740      0.650
SVM       0.656      0.875   0.750      0.650
LR        0.740      0.833   0.784      0.725
GLM       0.714      0.833   0.769      0.700
NB        0.700      0.875   0.777      0.700

4.3.2 Prediction Results by New Category

Students in the "Problem" category were predicted using the same features as in Section 4.2: the understanding level, quiz score, number of characters, number of technical terms, number of words, and comment vector. The prediction results are shown in Table 10.

The SVM model achieved the highest F-measure and accuracy, at 79.2% and 72.5%. In all models, the F-measure was over 70%. RF, which had the highest predictive accuracy in the previous section (the prediction of "Failed" students), had the second highest accuracy in this case.

4.3.3 Prediction Results using Only

Reflection Sheets

Since the purpose of this study is to predict failure using only the students' comments, we also checked the prediction performance for "Problem" students using only the reflection sheet data: the comment vector, the number of characters, the number of technical terms, and the number of words. Table 11 shows the results. In all models, the F-measure was more than 70%. The LR model performed best, with an F-measure of 78.4% and an accuracy of 72.5%, followed by NB with an F-measure of 77.7% and an accuracy of 70.0%. The scores of the DT model did not change.

We compared these results with those of the comment vector + basic method for predicting "Problem" students. As shown in Figure 1, for LR, GLM, and NB, the prediction performance using only the reflection sheet data was better than or equal to that of the comment vector + basic method. For the other three algorithms, adding the comment vector to the basic method yielded higher accuracy and F-measure. Therefore, depending on the machine learning algorithm, it is possible to predict potentially failing students from the reflection sheets alone.

4.4 Prediction Results using

Cumulative Data from Prior Weeks

Finally, we examined the performance of predicting potentially failing students using cumulative weekly data. Training data from all weeks (1-13) was used to construct the SVM and LR models, which had shown higher predictive performance with the comment vector. Each model was then evaluated on cumulative data up to each week as test data.
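This evaluation loop can be sketched as follows, assuming per-week feature matrices and using scikit-learn's `SVC` as a stand-in for the paper's SVM (data, shapes, and labels are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
weekly = rng.random((193, 13, 11))        # students x 13 weeks x 11 features
y = rng.integers(0, 2, size=193)          # 1 = "Problem" (synthetic labels)

# Train on the averages over all 13 weeks, as described in the text.
model = SVC().fit(weekly.mean(axis=1), y)

# Test on features averaged cumulatively over weeks 1..n.
accuracies = []
for n in range(1, 14):
    X_n = weekly[:, :n, :].mean(axis=1)   # average of weeks 1 to n
    accuracies.append(model.score(X_n, y))
```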

The results are shown in Figures 2 and 3. In both figures, the x-axis represents the cumulative test data from week 1 to week n; for example, "1-3" on the x-axis denotes the prediction result using the average of students' data from weeks 1 to 3. Figure 2 plots the results with the comment vector + basic method.


(a) Comparison of F-measure scores in the prediction results.

(b) Comparison of accuracies in the prediction results.

Figure 1: Comparison of the results of Tab. 10 with Tab. 11.

(a) F-measure using the comment vector + basic method.

(b) Accuracy using the comment vector + basic method.

Figure 2: The prediction results of "Problem" students using the comment vector + basic method in each week. The model used is constructed from data of all weeks (1-13).

Similarly, Figure 3 shows the test performance based on the reflection sheet data alone, including the comment vector.

As shown in Figures 2 and 3, SVM showed consistent prediction performance on both test sets. Unlike SVM, LR had a higher variance in its prediction results; for instance, with the comment vector + basic method, its F-measure ranged from 37.8% to 73.4%. As the number of weeks progressed, the prediction accuracy improved. This is to be expected, as the learning algorithm is given more data points, which better reflect the students' level of understanding of the concepts. It is worth highlighting that halfway into the semester, SVM already achieves at least 60% accuracy in identifying "Problem" students, which is very useful for instituting remedial help for these at-risk students in the second half of the semester.

(a) F-measure using only the data of reflection sheets.

(b) Accuracy using only the data of reflection sheets.

Figure 3: The prediction results of "Problem" students using only the data of reflection sheets in each week. The model used is constructed from data of all weeks (1-13).

5 CONCLUSIONS AND FUTURE

WORK

In this study, we proposed a method to predict students who may fail the examination using reflection sheets. In addition to the conventional features of the previous study, adding the comment vector


extracted from the reflection sheets improved the prediction performance. Moreover, prediction using only the reflection sheets did not significantly reduce the accuracy. We therefore believe the comment vector is an effective feature for predicting failing students.

Furthermore, we examined the prediction performance for potentially failing students using cumulative weekly data. Based on models constructed from all the data, prediction by the support vector machine (SVM) was relatively stable. Prediction with only the reflection sheet data showed lower accuracy than prediction that also included the basic method's features, but the F-measure, the predictive measure for "Problem" students, was around 70%. To predict "Problem" students with high accuracy at an early stage, improvements to the method are needed.

Another issue is whether this method can also be applied to predicting failed students in other courses. It is also necessary to examine, through data mining, comment vectors or factors (McKenzie et al., 2001) more strongly associated with academic performance than the labels defined in this study. Furthermore, we need to consider predicting student performance from English comments using our proposed method. We will investigate these issues in future work.

REFERENCES

Azuma, R., 2016. Analysis of the Students' Level of Understanding using Reflective Sheet. The 41st Conference of Japanese Society for Information and Systems in Education, pp. 263-264.

Azuma, R., 2017. Analysis of Relationship between Learner's Characteristics and Level of Understanding using Text-mining. The 42nd Annual Conference of Japanese Society for Information and Systems in Education, pp. 73-74.

Davis, B. G., Wood, L., Wilson, R. C., 1983. ABCs of Teaching with Excellence. Berkeley: University of California.

Goda, K., Mine, T., 2011. Analysis of Students' Learning Activities through Quantifying Time-Series Comments. International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, KES 2011, pp. 154-164.

Higuchi, K., 2004. Quantitative Analysis of Textual Data: Differentiation and Coordination of Two Approaches. Journal of Sociological Theory and Methods, 19(1), pp. 101-115. Japanese Association for Mathematical Sociology.

Hirose, H., 2018. Difference Between Successful and Failed Students Learned from Analytics of Weekly Learning Check Testing. Information Engineering Express, Vol. 4, No. 1, pp. 11-21. IIAI Publications.

Hirose, H., 2019a. Prediction of Success or Failure for Final Examination using Nearest Neighbor Method to the Trend of Weekly Online Testing. International Journal of Learning Technologies and Learning Environments, Vol. 2, No. 1, pp. 19-34.

Hirose, H., 2019b. Key Factor Not to Drop Out is to Attend the Lecture. Information Engineering Express, Vol. 5, No. 1, pp. 59-72. IIAI Publications.

Kelleher, J. D., Mac Namee, B., D'Arcy, A., 2015. Fundamentals of Machine Learning for Predictive Data Analytics. The MIT Press, Cambridge, Massachusetts.

Luo, J., Sorour, S. E., Goda, K., Mine, T., 2015. Predicting Student Grade based on Free-style Comments using Word2Vec and ANN by Considering Prediction Results Obtained in Consecutive Lessons. The 8th International Conference on Educational Data Mining (EDM). International Educational Data Mining Society.

McKenzie, K., Schweitzer, R., 2001. Who Succeeds at University? Factors Predicting Academic Performance in First Year Australian University Students. Higher Education Research and Development, 20(1), pp. 21-33.

Niiya, I., Mine, T., 2017. Comment Mining to Estimate Junior High-School Student Performance toward Improvement of Student Learning. IPSJ SIG Technical Report. The Information Processing Society of Japan.

Scapin, R., 2018. Learning Analytics: How to Use Students' Big Data to Improve Teaching. Article of Vitrine technologie-éducation. https://www.vteducation.org/en/articles.

Sorour, S. E., Mine, T., Goda, K., Hirokawa, S., 2014. Examining Students' Performance Based on Their Comments Data Using Machine Learning Technique. Joint Conference of Electrical and Electronics Engineers in Kyushu, p. 350.

Sorour, S. E., Mine, T., Goda, K., Hirokawa, S., 2015. A Predictive Model to Evaluate Student Performance. Journal of Information Processing, Volume 23, Issue 2, pp. 192-201. The Information Processing Society of Japan.

Uhler, D. B., Hurn, E. J., 2013. Using Learning Analytics to Predict (and Improve) Student Success: A Faculty Perspective. Journal of Interactive Online Learning, Volume 12, Number 1.
