Machine Learning-Based Prediction of the Course Assessment

Weimin Geng

, Qiuling Li

2,*

and Dian Zhang

Shanghai Urban Construction Vocational College, 2360 Jungong Road, Shanghai 201999, China

Clinbrain Co., Ltd., Shanghai Putian Information Industry Park B2 Building, Shanghai 200233, China

Keywords: Machine Learning, Water Supply and Drainage Engineering, Course Assessment Prediction, LightGBM,

Lasso Regression.

Abstract: In order to keep track of the students' learning status and make early warning, the model of predicting the

final course assessment was proposed based on machine learning. Take the course of water supply and

drainage engineering cost as an example, the students’ related information and the historical assessment data

(such as the teaching activities and the stage assessment scores etc.) collection and cleaning were carried out

firstly. Then the features were filtered out by Light Gradient Boosting Machine (LightGBM), and the

prediction model of the final score was built on the basis of Least Absolute Shrinkage and Selection Operator

(Lasso) regression. Phases 1 and 2 forecast were completed and the error statistics were analysed. The

predicted results at different stages of the semester help the teachers and students get the learning situation

and take timely adjustment measures.

1 INTRODUCTION

For the applied colleges, the assessment and

evaluation of the specialized courses usually adopt a

process-based and comprehensive mode and are

distributed throughout the semester. In recent years,

researchers have done various studies on teaching

assessment and evaluation. The processing-

assessment mechanism and reasonable curriculum

assessment method were introduced by Zhou et al.

(2023). Zhou and Liu (2023) studied the evaluation

framework building based on Context, Input, Process,

Product (CIPP). The evaluation index system of

online and offline blended curriculum was

constructed by Huang (2023). And Kou analysed the

problem of a course evaluation based on Outcome

Based Education (OBE) and proposed the

implementation scheme of the diversified course

evaluation system (Kou, 2023). With the

development of Artificial Intelligence (AI), the

researchers began to apply it in the course evaluation,

and AI has formed a discipline system with neural

networks, machine learning, and expert systems etc.

as core algorithms. Maestrales et al. (2021) trained

human raters and compiled a robust training set to

https://orcid.org/0009-0004-8404-5114

develop machine algorithmic models and cross-

validate the machine scores in chemistry and physics.

Gao et al. (2023) summarized the advantages of

machine learning in the fields of scoring strategies,

learning assessment and educational intervention. In

addition, Cao et al. (2023) studied the method of

student learning situation early warning.

During the semester, to know the learning

situation of students timely and take measures when

there is an abnormality are very important for the

teachers. The paper suggested the method to forecast

the total final grade of the course according to the

various assessments that have been completed during

the semester. The prediction helps the teachers to

keep track of the learning situation of each student in

a timely manner and make solutions for some

students who have difficulties in passing the final

assessment.

168

Geng, W., Li, Q., Zhang and D.

Machine Learning-Based Prediction of the Course Assessment.

DOI: 10.5220/0013593400004671

In Proceedings of the 7th International Conference on Environmental Science and Civil Engineering (ICESCE 2024), pages 168-172

ISBN: 978-989-758-764-1; ISSN: 3051-701X

2 BUILDING THE PREDICTION

MODEL OF THE COURSE

ASSESSMENT

The course assessment is an important teaching

evaluation and feedback. Taking the course of water

supply and drainage engineering cost as an example,

the process assessment includes writing articles,

course comprehensive assignments, course practice,

explanation and presentation, and quality behaviours

(such as attendance, engagement, teamwork) etc.

Therefore, based on the phased assessment results

during the semester, the prediction of the final course

score is possible. The related student’s information

items and the course assessment’s components are

listed in Table 1. And the prediction model is based

on the data in Table 1 except Score 4, with the Final

Score as the target variable and the rest of the

information as the characteristic variables.

Table 1: The related students’ information items and the course assessment’s components.

Numbe

Data Name Meaning Types of variables

1 Attribute 1 In high school, the student has learned the basic

knowledge of science or not.

Categorical variables

2 Attribute 2 Gende

Categorical variables

3 Activit

1 Credits of view online e-resource. Continuit

variables

4 Activity 2 Frequency of answering questions in class. Continuity variables

5 Score 1 Quality behaviours (such as attendance, course

artici

ation, teamwork etc.

)

Continuity variables

6 Score 2 Test of basic concept understanding. Continuity variables

7 Score 3 Inte

rated homework. Continuit

variables

8 Score 4 Ex

lanation and

resentation. Continuit

variables

9 Final Score Final course assessment. Continuity variables

Since the linear correlation between the input

features (number3-7 in Table 1) and the target

variables was very strong, the prediction model

adopted the combination of LightGBM and Lasso

regression. LightGBM is a type of decision tree, and

it could directly use the characteristics of categorical

features, calculate the importance of features. So, the

features that are meaningful to the target variables are

filtered out. Tree models belong to machine learning

algorithms, which are non-parametric supervised

learning methods. The decision-making process is

represented through a tree structure, where each node

of the tree represents a feature or attribute, and each

leaf node represents a category or numeric value, and

the decision-making rules of the data are presented

through the tree-like structure. LightGBM is an

algorithm that combines Gradient Boosting Decision

Tree and Random Forest to improve prediction

performance by building multiple weak learners

(decision trees) (Meng et al., 2016). Lasso is a linear

model and proposed by Tibshirani for estimation

based on Ridge Regression Theory in 1996

(Tibshirani, 1996). This method minimizes the

residual sum of squares subject to the sum of the

absolute value of the coefficients being less than a

constant. The building of the prediction model mainly

includes the following 3 steps.

2.1 Data Processing

The categorical variables were processed firstly.

Attribute 1 and Attribute 2 generally couldn’t be

directly input into the model, so they need to be

numerically encoded. The data were encoded

according to 0 and 1 respectively. Then the two

variables were converted into category type and

passed to the LightGBM model for screening

features. Secondly, for the continuous variables in the

characteristics, the test score range was 0~100, the

maximum value of activity 1 and activity 2 was not

clear, and the numerical dimensions were different.

So, they were mapped and scaled to the same interval

range. Use maximum-minimum normalization to

scale the data between the intervals [0,1].

𝑥



𝑥−𝑥



𝑥



−𝑥



(1)

where 𝑥



is the normalized data, and 𝑥



and

𝑥



represent the maximum and minimum values of

the data respectively (https://scikit-learn.org).

2.2 Feature Selection

LightGBM uses fewer feature fragments, allows for

faster model training, and has better generalization

capabilities (Meng et al., 2016). In addition,

LightGBM can directly use categorical features, and

its high efficiency is mainly reflected in the

Machine Learning-Based Prediction of the Course Assessment

169

processing of multi-sample and multi-features

(Ke et

al., 2017). LightGBM uses the characteristics of

categorical features, calculates the importance of

features, and filter out features that are meaningful to

the target variable.

2.3 Predicting the Final Score

The training samples were cross validated with 5

folds, and 85% of the data was selected as the training

set and 15% as the test set for each fold. Mean

Absolute Error (MAE), Coefficient of Determination

(R2, the closer the value is to 1, the better the model

is trained.), Root Mean Square Error (RMSE) and

Mean Absolute Percentage Error (MAPE) were

adopted to evaluate the model effect to ensure it was

balanced, and the average score of the model was

used as the final model evaluation result. The error

calculation formulas are as follows

(Zhou, 2016).

𝑀𝐴𝐸=

𝑛

|y



−𝑦









(2)

𝑅𝑀𝑆𝐸=



𝑛

y

−𝑦











(3)

𝑅



= 1−y

−𝑦













y

−𝑦















(4)

𝑀𝐴𝑃𝐸=

𝑛



−𝑦



𝑦



 100%







(5)

where 𝑦



is the measured value, y

 is the predicted

value, y

 is the measured mean, and n is the number

of samples.

3 CASE STUDY

Taking the course of water supply and drainage

engineering cost assessment as an example, and there

are a total of 79 students. The Final Score is the object

variable, and number1- number 7 are the feature

variables. Firstly, the features were screened.

Through the ranking, the importance of Attribute 1

and Attribute 2 to the target in the input features is

almost 0, so these two were not applied as input

features in the prediction model (see Figure 1).

Secondly, the correlation was analysed, and their

linear correlation to the target was calculated. The

correlation heat map is shown in Figure 2.

The prediction of the Final Score was carried out

by Lasso regression. After the Score 2 and Score 3

were obtained during the semester, the Final Score

was predicted respectively and named Phase 1 and

Phase 2 (shown in Figure 3). The coincidence of

Phase 2 prediction is higher than that of phase 1, and

it can also coincide well for some uneven cases.

Figure 4 is the scatter plots of the forecasting results,

and it indicates that the prediction is better when it

falls on the diagonal.

Figure 1: The feature importance ranking graph.

Figure 2: The correlation heat map.

The error statistics of the forecasting model is

listed in Table 2. In addition, the residuals of the

results were calculated. For Phase 1 period, the

residuals of the mean value, the standard deviation,

the minimum value, and the maximum value are -

0.2026, 3.1173, -8.6150, 4.4774 respectively. In

Phase 2, the above values are -0.0842, 1.0752, -

2.7227, 1.8601.

The prediction values of Phase 2 are better than

Phase 1 and could meet the demand to assist the

teaching and at the same time notify the students who

have possibilities not to pass the final course

assessment.

ICESCE 2024 - The International Conference on Environmental Science and Civil Engineering

170

Table 2: The error statistics.

Phase MAE R

RMSE MAPE

1 2.48 0.71 3.1 2.93

2 0.83 0.97 1.07 0.97

(a) Phase 1

(b) Phase 2

Figure 3: Comparison of the predicted and actual values.

(a) Phase 1 (b) Phase 2

Figure 4: The fitted scatter plot of the prediction results.

Machine Learning-Based Prediction of the Course Assessment

171

4 CONCLUSION

Based on LightGBM and Lasso regression

algorithms, the prediction model of the course final

assessment was built. The final scores were predicted

by filtering the features and learning the historical

data rules. The course assessment prediction values

can be obtained during the semester, and they remind

some students to adjust their learning status avoid

failing the final assessment. At the same time, it plays

a role in helping the teachers take timely adjustment

measures. Then in the future, with the increasing use

of smart classrooms, their statistics number can also

be included in the prediction model.

REFERENCES

Zhou, Y., Wu, J., Li, Z., Yu, J., Xia, L. 2023. Thinking on

Processing-assessment Mechanism and Reasonable

Curriculum Assessment Method. Higher Education in

Chemical Engineering, 40(1): 70-75.

Zhou, L., Liu, C. 2023 Research on the Evaluation of Labor

Education Courses in Colleges and Universities Based

on CIPP Evaluation Model. Western China Quality

Education, 9(15): 42-45,98.

Huang, R. 2023. Construction and Optimization of Online

and Offline Blended Curriculum Evaluation System.

Journal of Ningbo Polytechnic, 27(5): 102-108.

Kou, J. 2023. Research on Diversified Evaluation System

of "Higher Mathematics" Based on OBE. Innovative

Teaching, 9: 130-132.

Maestrales, S., Zhai, X., Touitou I., Baker, Q., Schneider,

B., Krajcik, J. 2021. Using Machine Learning to Score

Multi‑Dimensional Assessments of Chemistry and

Physics. Journal of Science Education and Technology,

30: 239–254.

Gao, S., Zhang, S., Meng, X., Ding, Y., Wang, J. 2023.

Application of Machine Learning in Science Education

Evaluation: Dimensions, Domains and Laws. Chinese

Journal of ICT in Education, 29(10): 83-92.

Cao, M., Ou, Y., Wu, D., Du, P. 2023. Research on Student

Learning Situation Early Warning Method Based on

Machine Learning. Modern Information Technology

7(19): 142-144,150.

Meng, Q., Ke, G., Wang, T., Wei, C., Ye, Q., Ma, Z., Liu,

T. 2016. A Communication-efficient Parallel

Algorithm for Decision Tree. 30th Conference on

Neural Information Processing Systems (NIPS 2016),

Barcelona, Spain.

Tibshirani, R. 1996. Regression Shrinkage and Selection

Via the Lasso. Journal Of the Royal Statistical Society:

Series B (Methodological), 58(1): 267-288.

https://scikit-learn.org.

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W.,

Ye, Q., Liu, T. 2017. LightGBM: A Highly Efficient

Gradient Boosting Decision Tree. 31st Conference on

Neural Information Processing Systems (NIPS 2017),

Long Beach, CA, USA.

Zhou, Z. 2016. Machine Learning. Tsinghua University

Press, Beijing. pp. 73-91.

ICESCE 2024 - The International Conference on Environmental Science and Civil Engineering

172