Large Language Models for Student Code Evaluation: Insights and
Accuracy
Alfonso Piscitelli^a, Mattia De Rosa^b, Vittorio Fuccella^c and Gennaro Costagliola^d
Department of Informatics, University of Salerno, Fisciano (SA), Italy
{apiscitelli, matderosa, vfuccella, gencos}@unisa.it
^a https://orcid.org/0000-0002-3567-1504, ^b https://orcid.org/0000-0002-6922-5529, ^c https://orcid.org/0000-0003-3244-3650, ^d https://orcid.org/0000-0003-3816-7765
Keywords:
Programming Education, Large Language Models, Automatic Code Evaluation.
Abstract:
The improved capabilities of Large Language Models (LLMs) enable their use in various fields, including
education, where teachers and students already use them to support teaching and learning.
In this study, we measure the accuracy of the LLMs gpt-3.5, gpt-4o, claude-sonnet-20241022, and llama3 in
correcting and evaluating students' programming assignments. Seven assignments, each carried out by 50
students, were assessed using three different prompting strategies for each of the LLMs, and the generated
grades were compared with the grades assigned by the teacher, who corrected the assignments manually
throughout the year.
The results show that llama3 and gpt-4o produced valid evaluations in only a small percentage of cases,
while gpt-3.5 and claude-sonnet-20241022 obtained promising results when given at least one example
evaluation.
1 INTRODUCTION
In recent years, Large Language Models (LLMs) have
demonstrated significant versatility in addressing a
broad range of tasks, extending their utility well be-
yond generating natural language text. Indeed, LLMs
have been employed to address a variety of tasks, in-
cluding classification (Alahmadi et al., 2024), pro-
gram analysis, and code generation across multiple
programming languages (Jiang et al., 2024; Dong
et al., 2024). As their capabilities increase and these tools become easier to use, they can also be used by educational staff to support teaching or to analyse assignments.
In this study, we analysed the capabilities of
LLMs in evaluating programming assignments (writ-
ten in the C language) carried out by 50 students,
to understand the extent to which these tools can be
used to support teachers in correcting assignments,
or at least as support tools in student self-assessment.
We automatically invoked the various LLMs used in
this study: gpt-3.5 (OpenAI, 2022), gpt-4o (OpenAI,
2024), claude-sonnet-20241022 (Anthropic, 2024),
and llama3 (Meta, 2024). The prompts required for the execution of the experiment are defined in a JSON file, structured so as to allow testing three different strategies: zero-shot (i.e. without providing any examples to the model), one-shot (i.e. with one example provided to the LLM), and few-shot (i.e. providing a few examples, five, to the model). An example consists of the pair (assignment, grade).
During the experiment, a dataset composed of 350
assignments carried out during a Programming and
Data Structures course was used. The assignments
covered a range of programming concepts, including
basic data types, control structures, and simple algo-
rithms.
The results showed that the LLMs did not perform well in the assessment tasks, with a significant gap compared to the human evaluation provided during the course. The difference between the evaluation generated by the LLMs and that of the correcting teacher is approximately 1.5 points out of 10; however, these models generally perform better with the one-shot prompting strategy, in particular claude-sonnet-20241022, which achieves an average difference of 1.17 (S.D. 1.25), with 60% of the grades differing by 1 point or less.
The remainder of the paper is structured as fol-
lows. Section 2 presents the related work, while Sec-
tion 3 describes the methodology, the research ques-
tions and the software developed to carry out this ex-
periment. Section 4 presents the results and answers
to the research questions, while Section 5 presents the
implications of the results and describes some ethical
concerns. Finally, Section 6 presents the conclusions
and next steps.
2 BACKGROUND
Large Language Models are models trained on very large amounts of data that can understand and generate text in different contexts. Communication with
the LLM takes place via instructions (which may be
simple or complex) that are commonly referred to as
prompts. Querying LLMs with appropriate prompts
is crucial to the quality of the expected response, as is
the number of parameters in the model itself (Brown
et al., 2020). An appropriate prompt generally con-
sists of some contextual information and a set of ex-
pected questions and answers (examples) that form
the knowledge base of the LLM. The following defini-
tions formalize key concepts related to the interaction
with large language models, including examples and
prompts, which are essential for effective communi-
cation and response generation.
Definition 1: Example. Let Q be the question to be asked of the LLM and A the answer. We define the pair (Q, A) as an example E. When the answer is obtained from the language model, it is referred to as A_obtained; conversely, when the answer is provided as training data, it is referred to as A_expected.
Definition 2: Prompt. A prompt P is defined as a
pair (C, E), where C represents the context instruc-
tions (also called system) while E represents a list of
examples (0 or more examples).
The size of the example set, denoted as |E|, sig-
nificantly influences LLM performance (Brown et al.,
2020). Using the number of examples, we can define
three distinct prompting strategies:
zero-shot, which involves sending prompts to the
LLM without providing any examples;
one-shot, which involves sending prompts to the
LLM with one example;
few-shot, which involves sending prompts to the LLM with a few examples (from fewer than ten up to about a hundred).
There are several strategies for designing the ex-
ample set. One approach is to define a list of ex-
pected results. Alternatively, counterexamples can
be included, as demonstrated in the chain-of-thought
method (Wei et al., 2024), which helps LLMs better
understand the task at hand. Another effective strat-
egy involves providing verification messages for the
generated response, particularly useful when the re-
sponse must adhere to a specific syntax that can be
validated by a parser or validator (Tabari et al., 2025).
In addition to presenting examples as a list of
(Q, A) pairs, (Wang et al., 2023) explored the use
of meta-languages, such as context-free grammars
(CFGs), to define examples. This approach leverages
the recursive nature of CFGs, where a finite set of
rules can generate an infinite number of valid exam-
ples by repeatedly applying these rules. For instance,
a simple grammar rule like:
S → aSb | ε
can produce strings such as 'ab', 'aabb', 'aaabbb', and
so on, without any predefined upper limit. This allows
for the representation of diverse and flexible exam-
ple structures in a compact manner, enabling the sys-
tematic generation of complex examples that would
otherwise be impractical to enumerate exhaustively.
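To make the recursion concrete, the following minimal Java sketch enumerates the first strings derivable from this rule; the class and method names are illustrative and not part of any system described in this paper.

public class GrammarExamples {
    // Applies S -> aSb exactly n times and then S -> epsilon,
    // yielding the string a^n b^n generated by the rule above.
    static String derive(int n) {
        return "a".repeat(n) + "b".repeat(n);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++) {
            System.out.println(derive(n)); // ab, aabb, aaabbb, aaaabbbb
        }
    }
}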
This type of prompting, called grammar-prompting, achieved promising results when working with domain-specific languages. LLMs have also proven to be par-
ticularly useful in the understanding and translation
of programming languages (Eniser et al., 2024).
LLMs have also found a place in education and
teaching. In (Piscitelli et al., 2024), an empirical
study analysed students’ interaction with a gpt-4o-
based bot when solving programming assignments,
finding that students generally tend to rely on the
LLM in requesting solutions (which is something to
be discouraged) but also that several students tend to
use it as a kind of online help (which is something to
be encouraged). Similar conclusions were reached in
(Liffiton et al., 2024), particularly regarding the over-reliance of students (and junior programmers) on LLMs. Over-
reliance on LLMs is also an issue analysed in (Prather
et al., 2023), highlighting how in certain cases the
code generated by LLMs makes it difficult to analyse
and fix errors at later stages. In any case, these types
of tools, with appropriate modifications, can be useful
in student learning, as highlighted in (Qi et al., 2023),
where the authors, through prompting, limited LLMs to providing only code examples and never the solution to the problem (Dong et al., 2024).
The use of LLMs for student assessment represents a further application of these models, which on the one hand can support teachers in their work and on the other hand can provide students with a self-assessment tool available throughout their studies.
In (Henkel et al., 2024), the use of gpt-4o to eval-
uate students’ answers to quizzes was analysed, and
the results showed comparable performance to human
evaluations. In the study by Chiang et al. (2024), on
the other hand, gpt-4o was again used to automati-
cally assess the assignments of around 1,000 students,
and the results were mixed. While the evaluations
obtained were acceptable, the model did not always
comply with the instructions for evaluating the assign-
ments. Moreover, in some cases, students were able to manipulate the LLM-based assistant to obtain a high score without actually doing the assignment.
To the best of our knowledge, studies specifically comparing several LLMs for the assessment of programming tasks have not been carried out. However,
some studies have investigated the use of LLMs to
carry out program analysis. In (Mahbub et al., 2024)
gpt-4o was used to detect inconsistencies between comments and code, while in (Chapman et al., 2024)
prompting techniques were used to perform static pro-
gram analysis and detect anomalies.
3 METHODOLOGY
In this section, we will describe the research method-
ology used for this study, the software architecture,
the data used for the experiment, and the metrics that
were collected during the execution of the software.
3.1 Research Questions
This project aims to analyse the ability of the most common LLMs to assess student assignments, generating a grade and a comment based on an evaluation grid provided to the LLM. This study, therefore, attempts to answer the following research questions:
RQ1 - What is the accuracy of LLMs in gener-
ating grades and comments for programming as-
signments?
RQ2 - Which LLM achieves the best performance
in terms of evaluation time for programming as-
signments?
RQ3 - What is the difference between LLM-
generated grades and those assigned by educators
for programming assignments?
3.2 The ClueLLM4Eval Platform
To achieve the objectives presented in Section 3.1, we implemented a software tool, ClueLLM4Eval, in Java, which allows the different LLMs to be queried with the different prompting strategies available in the system. The classes for querying the supported LLMs were implemented by adopting the Strategy pattern. In addition to the Strategy pattern, a workflow was implemented that reads from a JSON file the sequence of instructions to be sent to the LLM. The JSON file mirrors the structure of the 'conversation' with the model and has the following form:
[{
  "role": "system",
  "content": "You are a teacher who has to evaluate with a mark from 0 to 10 the following assignment."
}, {
  "role": "user",
  "content": "C code of the assignment"
}, {
  "role": "assistant",
  "content": "10; The work done is correct and all points have been completed."
}]
This structure, appropriately configured according
to the prompting strategy adopted, is then loaded at
runtime and sent to the LLM together with the actual
task to be done by the system.
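As an illustration of this design, the following Java sketch shows how a Strategy-style interface could decouple the JSON-driven conversation from the concrete model being queried; the interface, record, and class names are hypothetical and do not come from the actual ClueLLM4Eval code base.

import java.util.ArrayList;
import java.util.List;

// One message of the JSON conversation shown above.
record Message(String role, String content) {}

// Strategy interface: each concrete implementation wraps one LLM (gpt-3.5, gpt-4o, claude, llama3, ...).
interface LlmClient {
    String complete(List<Message> conversation);
}

class Evaluator {
    private final LlmClient client;

    Evaluator(LlmClient client) { this.client = client; }

    // Appends the student's C code as a user message to the prompt loaded from the JSON file
    // and delegates the call to the selected strategy.
    String evaluate(List<Message> promptFromJson, String assignmentCode) {
        List<Message> conversation = new ArrayList<>(promptFromJson);
        conversation.add(new Message("user", assignmentCode));
        return client.complete(conversation);
    }
}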
3.3 Prompting Strategies
During this experiment, three different prompting strategies were used, described below (a sketch of the corresponding message lists follows this list):
zero-shot: no example is given to the LLM, only the system prompt.
one-shot: one example of a well-done assignment (graded 10) is given to the LLM, together with the system prompt.
few-shot: two examples are given to the LLM, one well-done assignment (graded 10) and one incomplete assignment (graded 0), together with the system prompt.
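The following Java sketch, which redeclares the hypothetical Message record used in the Section 3.2 sketch, illustrates how the conversation could be assembled for the three strategies; the method and its contents are illustrative assumptions, not the actual ClueLLM4Eval code.

import java.util.ArrayList;
import java.util.List;

record Message(String role, String content) {}

class PromptBuilder {
    // examples holds (user question, assistant answer) pairs taken from the JSON file;
    // its size determines the strategy: 0 = zero-shot, 1 = one-shot, 2+ = few-shot.
    static List<Message> build(String systemPrompt, List<Message[]> examples) {
        List<Message> messages = new ArrayList<>();
        messages.add(new Message("system", systemPrompt));
        for (Message[] pair : examples) {
            messages.add(pair[0]); // user: example assignment code
            messages.add(pair[1]); // assistant: expected "GRADE;COMMENT" answer
        }
        return messages;
    }
}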
The initial system prompt included both the de-
scription of the task (in our case, evaluating student
assignments) and the evaluation grid that the sys-
tem had to consider when grading the various assign-
ments. Table 1 reports the text of the system prompt
used for the experiment.
The grid was designed to provide the evaluation criteria for assigning a score to each assignment, with scores ranging from a maximum of 10 to a minimum of 0. Table 3 describes the grid used for the evaluation: grade ten corresponds to an assignment that was correctly done and well documented through inline and block comments, while grade zero corresponds to an assignment that was not done, or was done incorrectly or incompletely.
Table 1: System prompt used in this study.
You are a teacher who has to evaluate with a mark from 0 to 10 the following assignment.
Describe the evaluation of the assignments using this format: GRADE;COMMENT.
0 totally wrong assignment.
10 totally correct assignment.
Grade must respect the following grid, and ALL criteria for a grade must be met to assign
that grade: To evaluate, use the following grid:
{{EVALUATION GRID}}
Before assigning a grade, evaluate each required function separately, noting its presence and
correctness. Then, based on these individual evaluations, assign an overall grade that matches
ALL criteria for that grade level. Provide a brief justification (less than 100 words) for the
assigned grade, referencing the specific criteria met or missed.
This is the assignment: {{ASSIGNMENT}}
Table 2: Aggregate results of LLMs in evaluation tasks, combining performance across all three prompting strategies.
LLM Valid Response Format Non-Compliant Grade Invalid Response
gpt-3.5 62.76% 28.57% 8.67%
gpt-4o 5.43% 29.71% 64.86%
claude-sonnet-20241022 47.36% 35.08% 17.56%
llama3 0% 0% 100%
Table 3: Evaluation grid used for the correction of the assignments and described in the prompt.
Grade Meaning
0 The assignment was not carried out
1 The assignment is incorrect and incomplete
2 The assignment is incorrect and incomplete, with some sketched parts
3 The assignment is incorrect and incomplete, with several sketched parts
4 The assignment is mainly incorrect and incomplete, with some parts almost completed
5 The assignment is partially incorrect and incomplete, with many parts almost completed
6 The assignment is partially correct and well designed, with several syntactic and semantic errors
7 The assignment is correct and well designed, with few semantic errors
8 The assignment is correct and well designed, but not well structured and poorly commented
9 The assignment is correct and well designed, but with several parts not commented
10 The assignment is entirely correct, well designed, and well commented
3.4 Datasets
For this experiment, we created a dataset consisting
of 7 assignments completed by 50 students each, re-
sulting in a total of 350 assignments. This dataset
consists of assignments carried out by real students attending the first year of the Bachelor of Science in Computer Science, in the Programming and Data Structures course, who voluntarily consented to the use of their assignments for this experiment. The assignments
covered the implementation of abstract data types, ar-
ray sorting algorithms, and data structures such as
lists, queues, stacks, and trees. The evaluation and
errors of these assignments are also known, as they
were manually checked and assessed by the course
assistant lecturer. The manual evaluations are used
during the results analysis phase as a comparison to
the evaluations provided by the LLM.
3.5 Response Classification
To assess the outputs produced by LLMs, a well-
defined taxonomic framework is essential to differen-
tiate between responses based on their conformity to
the required format and level of completeness. This
classification system partitions responses into valid,
partially valid, and unusable categories, facilitating a
structured evaluation of the models’ performance and
their suitability for educational applications.
1. Valid Responses
Definition. A Valid Response is an LLM response that adheres to the required format: a numerical grade followed by a text comment.
Example. 8;well-structured code with minor issues in commenting
2. Format Non-Compliant Responses
Definition. A Format Non-Compliant Response is an LLM response that contains both a grade and a comment but fails to comply with the exact required format. These responses are still interpretable, but require further parsing or manual correction to be converted into a compliant format.
Example. Grade: 8. Comments: Well-structured code with minor issues.
3. Incomplete or Empty Responses
Definition. An Incomplete Response is an LLM response that does not contain both required elements (grade and comment) or is entirely off-task.
Examples. The code looks good but could be improved. (missing grade) or I cannot process this assignment. (off-task response).
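As a rough illustration of this taxonomy, the following Java sketch sorts a raw reply into the three categories; the regular expressions are simplifying assumptions and not the exact rules used in the study.

import java.util.regex.Pattern;

enum ResponseClass { VALID, FORMAT_NON_COMPLIANT, INVALID }

class ResponseClassifier {
    // Valid: "<grade>;<comment>" with an integer grade between 0 and 10.
    private static final Pattern VALID =
            Pattern.compile("^\\s*(10|[0-9])\\s*;\\s*\\S.*", Pattern.DOTALL);
    // Crude check for a recoverable grade somewhere in the text (e.g. "Grade: 8" or "##Grade:8").
    private static final Pattern HAS_GRADE = Pattern.compile("\\b(10|[0-9])\\b");

    static ResponseClass classify(String reply) {
        if (reply == null || reply.isBlank()) return ResponseClass.INVALID;
        if (VALID.matcher(reply).matches()) return ResponseClass.VALID;
        if (HAS_GRADE.matcher(reply).find()) return ResponseClass.FORMAT_NON_COMPLIANT;
        return ResponseClass.INVALID;
    }
}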
3.6 LLMs Used
In this experiment, the following LLMs were used:
gpt-4o (OpenAI, 2024)
gpt-3.5 (OpenAI, 2022)
llama3 (Meta, 2024)
claude-sonnet-20241022 (Anthropic, 2024)
For all these LLMs we used the same prompts and set the temperature to 0.5. Temperature is a pa-
rameter used to control the randomness and creativity
of responses generated by a language model, gener-
ally between 0 and 2. It determines how the model
assigns probabilities to different possible outputs. A
higher temperature (e.g., 1.0 or above) results in more
diverse and unpredictable outputs, as the model gives
less priority to the most probable next word and ex-
plores a wider range of possibilities. Conversely, a
lower temperature (e.g., close to 0) makes the model
more focused and deterministic, favoring highly prob-
able outputs and reducing variability.
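As an example of how the temperature parameter is passed, the following Java sketch posts a chat request with temperature 0.5 to the public OpenAI REST endpoint; the request body follows the message structure of Section 3.2, and the sketch omits error handling and the provider-specific variations needed for the other models.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class ChatCompletionCall {
    // jsonBody example: {"model":"gpt-3.5-turbo","temperature":0.5,"messages":[...]}
    static String send(String apiKey, String jsonBody) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.openai.com/v1/chat/completions"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}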
3.7 Metrics
Each session of this experiment involved three prompting strategies, four different LLMs, and the dataset of assignments provided by the students. The following metrics were measured:
Correction Time: the time required by the LLM to evaluate the assignment given as input;
Assessment Difference: the difference between the grade generated by the LLMs and the grade given by the course assistant lecturer (a sketch of how these statistics can be computed follows this list);
Valid Responses: the number of valid responses generated by the LLMs;
Format Non-Compliant Grades: the number of format non-compliant responses;
Invalid Responses: the number of incomplete or empty responses.
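A minimal sketch of how the Assessment Difference statistics reported in Section 4.3 can be computed, assuming the LLM and teacher grades are stored as parallel lists of doubles on the 0-10 scale; note that in Table 6 the percentages are computed over all 350 assignments, including failed evaluations, so the denominator there would be the full assignment count.

import java.util.List;

class DifferenceStats {
    static void report(List<Double> llmGrades, List<Double> teacherGrades) {
        int n = llmGrades.size();
        double sum = 0, sumSq = 0;
        int equal = 0, withinOne = 0;
        for (int i = 0; i < n; i++) {
            double diff = Math.abs(llmGrades.get(i) - teacherGrades.get(i));
            sum += diff;
            sumSq += diff * diff;
            if (diff == 0) equal++;
            else if (diff <= 1) withinOne++; // corresponds to the 0 < Difference <= 1 column
        }
        double mean = sum / n;
        double sd = Math.sqrt(sumSq / n - mean * mean); // population standard deviation
        System.out.printf("mean=%.2f sd=%.2f equal=%.2f%% within1=%.2f%%%n",
                mean, sd, 100.0 * equal / n, 100.0 * withinOne / n);
    }
}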
3.8 Procedure
Before conducting the experiment, the dataset was prepared by organizing the assignments. Subsequently, we generated the JSON file containing the prompts, which incorporated the evaluation grid and the assignment description, as presented in Table 1. Throughout the experiment, the platform executed the prompts against all the LLMs described in Section 3.6. To respect the token limitations of each provider, the platform enforced a ten-second pause between executions.
At the end of the experiment, the collected data
were analysed and categorised into three groups:
valid responses, format non-compliant grades, and in-
valid responses. After enumerating the results and in-
corporating them into Table 2, a data cleansing pro-
cess was performed to refine the incorrectly formatted
grades, thereby generating valid responses from them.
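The cleansing step can be pictured as a small normaliser that pulls a grade out of a format non-compliant reply and rewrites it as GRADE;COMMENT; the pattern below is an illustrative assumption, not the exact rule used in the study.

import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GradeCleanser {
    private static final Pattern GRADE = Pattern.compile("(?i)grade\\s*[:=]?\\s*(10|[0-9])");

    // Rewrites e.g. "##Grade: 8. Well structured." as "8;Well structured.",
    // or returns empty if no grade can be recovered from the reply.
    static Optional<String> normalize(String reply) {
        Matcher m = GRADE.matcher(reply);
        if (!m.find()) return Optional.empty();
        String comment = reply.replace(m.group(0), "")
                              .replaceAll("^[\\s#.:;-]+", "")
                              .trim();
        return Optional.of(m.group(1) + ";" + comment);
    }
}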
4 RESULTS
This section describes the results obtained for the
three research questions guiding this study. The en-
tire experiment took approximately one day to run,
the majority of which was required by the llama3
model, as it was run locally on the researchers’ ma-
chines. The other models analysed in this study took
approximately one hour each to run for the analysis
of the dataset described in the previous section.
Based on the results obtained, the llama3 model did not generate any valid responses, as indicated in Table 2. As a result, the
discussion and comparison of the research questions
in the subsequent sections will exclude any references
to this model.
4.1 RQ1: Accuracy of LLMs
In this section of the study, we will attempt to an-
swer the question RQ1: What is the accuracy of
LLMs in generating grades and comments for pro-
gramming assignments? Table 2 shows the results
obtained from individual LLMs, aggregated for all
Table 4: Overall results of prompting strategies in evaluation tasks.
LLM Strategy Valid Response Format Non-Compliant Grade Invalid Responses
gpt-3.5 zero-shot 0.57% 80.57% 18.86%
gpt-3.5 one-shot 93.14% 3.43% 3.43%
gpt-3.5 few-shot 94.57% 1.71% 3.72%
gpt-4o zero-shot 1.14% 41.43% 57.43%
gpt-4o one-shot 2.57% 20.29% 77.14%
gpt-4o few-shot 12.57% 27.43% 60.00%
claude-sonnet-20241022 zero-shot 0.00% 61.53% 38.47%
claude-sonnet-20241022 one-shot 78.95% 10.53% 10.52%
claude-sonnet-20241022 few-shot 64.86% 32.43% 2.71%
Figure 1: Overall results of prompting strategies in evaluation tasks. The blue bars depict the valid responses generated, the yellow bars represent the grades produced in an incorrect format, and the red bars illustrate the invalid responses generated.
prompting strategies (which will be discussed in Sec-
tion 4.3). The prompt required the model to assess the assignment and provide the result in the format GRADE;COMMENT, where the first value was the numerical grade and the second was the comment.
The valid response column shows the percent-
age of answers that complied with the required for-
mat, while the Format Non-Compliant Grade column
shows the percentage of answers that contained both
the grade and the commentary but did not strictly
comply with the required format (many answers were
of the type GRADE:8, or ##Grade:8).
Table 2 shows that gpt-4o was able to provide a valid response (one that corresponded to the format described in the prompt) in approximately 5 percent of the cases. When including responses that contained both a grade and a comment but were non-compliant with the required format, the total percentage of usable responses increased to approximately 35 percent. The llama3 model did not
provide correct answers: in none of the analysed
prompt strategies did the model provide a numerical
evaluation, always limiting itself to a descriptive eval-
uation of the assigned exercise.
The other two models analysed, however, gpt-3.5
and claude-sonnet-20241022, gave better results. gpt-
3.5 was able to provide answers that respected the
required format in 62.76% of the evaluations, while
28.57% of its evaluations included errors in format-
ting but still contained both a score and a comment.
Claude-sonnet-20241022, on the other hand, pro-
vided formally correct answers in 47.36% of cases,
while about 35% of the time it produced answers with
a grade preceded by additional textual commentary,
which were non-compliant with the expected format.
This analysis pertains to formal correctness; the substantive correctness of the scores and comments is examined in Section 4.3.
Table 4 instead shows the results obtained from the various models, broken down by the prompting strategy used. The first result that emerges is that the zero-shot strategy, without examples, rarely produces a correctly formatted response, although the models still perform the assigned task and evaluate the work (80% of format non-compliant responses for gpt-3.5 and 61% for claude-sonnet-20241022). Providing even a single evaluation example increases the model's ability to produce an answer in the required format and reduces the number of null values, i.e. failed evaluations. The graph in Figure
1 makes it clear that as the number of examples in-
creases, the number of responses that comply with the
format increases, and errors generally decrease.
4.2 RQ2: Time Analysis
This section examines the second research question,
Which LLM achieves the best performance in terms of
evaluation time for programming assignments? The results in Table 5 exhibit varying performance across the different prompting strategies. Specifically, the assignment evaluation with gpt-3.5 took approximately 30 minutes, gpt-4o required over an hour, and claude-sonnet-20241022 required around two hours. Furthermore, the average time required to evaluate each assignment was 1.81 seconds for gpt-3.5, 3.73 seconds for gpt-4o, and 6.66 seconds for claude-sonnet-20241022. These
results suggest that the gpt-3.5 model provided the
fastest execution times among the three LLMs tested
in this study.
The time required for evaluating the assignment
does not appear to be significantly impacted by the
prompting strategies employed, except for the gpt-3.5
model. In the case of the gpt-3.5 model, the execution time was reduced from 1.99 seconds with zero-shot prompting to about 1.7 seconds when utilizing the one-shot and few-shot prompting strategies. This suggests that
the prompting strategies can affect the performance of
the gpt-3.5 model, leading to a noticeable decrease in
the execution time for the assignment evaluation.
The analysis of the prompting strategies indicates that sending examples to the large language models may also be beneficial for execution time, but a
more in-depth analysis is required. Specifically, for
the models gpt-3.5 and claude-sonnet-20241022, the
one-shot strategy demonstrated the fastest execution,
while no significant differences were observed for the
gpt-4o model. The low variability exhibited in the
Standard Deviation column suggests a high level of
consistency across the executions. Further investiga-
tion is needed to fully understand the impact of the
prompting strategies on the execution time of these
language models, as the current findings suggest po-
tential performance advantages, but more comprehen-
sive testing is required to draw definitive conclusions
(see Section 5).
4.3 RQ3: Comparison with Human
Assessment
This section investigates the third research question,
What is the difference between LLM-generated grades
and those assigned by educators for programming as-
signments? As reported in Table 6, the rating dis-
crepancies across the various large language models
and prompting strategies vary, with some models and
strategies showing substantial differences, while oth-
ers are more closely aligned with human assessments.
The smallest average difference, 1.17 points (S.D. 1.25), was achieved by the claude-sonnet-20241022 model using the one-shot prompting approach, followed by the same model's zero-shot strategy with a 1.38 (S.D. 1.28) point difference out of 10, and by the one-shot strategy of gpt-4o with a 1.58 (S.D. 1.21) point dif-
ference. However, other strategies and models exhibit
larger disparities, in some cases approaching 2 points,
Table 5: Time analysis on evaluation tasks performed by LLMs across the different prompting strategies.
LLM Strategy Total Time (s) Mean Time (s) S.D. (s)
gpt-3.5 all strategies 1901.40 1.81 1.81
gpt-3.5 zero-shot 695.86 1.99 6.89
gpt-3.5 one-shot 606.89 1.73 4.91
gpt-3.5 few-shot 598.66 1.71 4.01
gpt-4o all strategies 3916.23 3.73 1.63
gpt-4o zero-shot 1302.45 3.72 4.62
gpt-4o one-shot 1303.60 3.72 3.75
gpt-4o few-shot 1310.18 3.74 6.04
claude-sonnet-20241022 all strategies 6989.49 6.66 2.57
claude-sonnet-20241022 zero-shot 2305.21 6.59 1.26
claude-sonnet-20241022 one-shot 1777.85 5.08 3.02
claude-sonnet-20241022 few-shot 2922.67 8.35 2.04
which can be considered a significant deviation for an
evaluation.
The claude-sonnet-20241022 model demon-
strated the best performance in terms of the number
of evaluations that matched those of human raters,
and overall, this model had the highest rate of
evaluations that were quite close to the human
grades. The one-shot strategy attained 31.58% of
evaluations that were identical to the human ones
for program assignments, and 28.95% of evaluations
that differed from the human evaluation by between
0.5 and 1 point, for a total of 60.53% of grades that
differed from the human evaluation by 1 point or less. In general, the claude-sonnet-20241022 model
achieved good results across each strategy, followed
by the gpt-3.5 model, which obtained the best results
using the zero-shot strategy, with 38.86% of grades
differing from the human evaluation by 1 point or less.
The LLMs generally exhibit moderate discrepan-
cies in their evaluations compared to human assess-
ments, although certain models and prompting strate-
gies demonstrate larger deviations. However, claude-
sonnet-20241022 demonstrates a stronger capacity to
emulate human-like evaluations, particularly in the
one-shot strategy. Conversely, gpt-3.5 and gpt-4o ex-
hibit more substantial deviations, implying that their
calibration to mimic human evaluations could be en-
hanced, especially in the zero-shot and few-shot ap-
proaches.
5 DISCUSSION
The results presented so far show that using Large Language Models to generate automatic corrections of programming assignments does not yield results consistent with the assessment given by the teacher, deviating rather markedly from the intended assessment. This evidence encourages the pursuit of further prompting strategies to enhance the quality of the assessments and align them as closely as possible with the desired outcomes.
Although the activity of correcting 350 assignments took quite some time, reaching around two hours for claude-sonnet-20241022, this time is far less than the time a teacher would spend correcting 350 assignments: the reduction makes it possible to provide students with feedback quickly, with the undeniable benefits this has on learning. However, assessments must also be consistent and as close to the grading grid as possible; otherwise, there is a risk of creating barriers to student learning. The reliability issues highlighted show that LLMs alone are not usable in a self-assessment context, as the assessments may deviate widely from the grid with which the teacher would assess the work.
In many cases, the models analysed did not comply with the required format: Table 2 highlights, in the Format Non-Compliant Grade column, the grades produced in a different format. This phenomenon occurs mostly with the zero-shot strategy, which suggests that providing examples allows LLMs to better consider the required response format.
Table 6: Analysis of the rating differences between the ratings generated by the language models and those assigned by humans, based on different prompting strategies. This table shows the mean difference (Average Diff) and standard deviation (S.D.) between the ratings, the percentage of identical ratings (Equal Grade) and the percentage of differences of at most 1 point (0 < Difference ≤ 1), calculated over the total number of assignments, including those for which the LLM was unable to give an evaluation.
LLM Strategy Average Diff S.D. Equal Grade 0 < Difference ≤ 1
gpt-3.5 all strategies 1.86 1.40 8.86% 24.19%
gpt-3.5 zero-shot 1.72 1.41 10.29% 28.57%
gpt-3.5 one-shot 1.86 1.22 7.71% 21.14%
gpt-3.5 few-shot 1.98 1.52 8.57% 22.86%
gpt-4o all strategies 1.82 1.48 2.57% 11.90%
gpt-4o zero-shot 1.72 1.31 2.86% 14.57%
gpt-4o one-shot 1.58 1.21 2.29% 8.00%
gpt-4o few-shot 2.06 1.74 2.57% 13.14%
claude-sonnet-20241022 all strategies 1.41 1.28 18.42% 28.07%
claude-sonnet-20241022 zero-shot 1.38 1.28 12.82% 28.21%
claude-sonnet-20241022 one-shot 1.17 1.25 31.58% 28.95%
claude-sonnet-20241022 few-shot 1.67 1.28 10.81% 27.03%
In this direction, future work can analyse the addition of further examples, including ones with invalid formats, or even an agent that verifies the syntax of the response and alerts the model to errors to be corrected.
Moreover, the analysis of the comments generated
in natural language by the various LLMs indicated
that high variability was associated with difficulties
in comprehending the code under evaluation. When
the assessment diverged by more than 1.5 points from
the human judgment, the LLMs tended to penalize the
structure of the comments or excessively emphasize
secondary elements. Future research will involve a
more detailed examination of the models’ responses
during the evaluation process.
In general, the claude-sonnet-20241022 model obtained the best results, despite being slower than the gpt-3.5, gpt-4o, and llama3 models. llama3 in particular did not produce analysable results for this study: the same prompts used for the other models did not lead llama3 to generate acceptable answers in any case, but only comments on the code.
Again, as mentioned above, the system could bene-
fit from the use of a validator that requests consistent
responses from the LLM concerning the format and
task required.
The findings of this study suggest that prompt-
ing alone is inadequate for managing the complex-
ity of assignment evaluation and generating consis-
tent results. To address this limitation, future work
should explore new prompting strategies and develop
a model specifically tailored for this purpose. Ad-
ditionally, future research should consider incorpo-
rating other large language models beyond those in-
cluded in this study to further enhance the evaluation
and generation of consistent results.
Furthermore, the evaluation of an assignment is a
multifaceted task that involves more than simply as-
signing a grade. In this study, we asked the large lan-
guage models to provide the rationale behind the as-
signed grades. However, a preliminary analysis sug-
gests that the LLMs’ responses either merely reiterate
the comments from the evaluation rubric or offer an
extended analysis of the provided code, which may
not be particularly useful in an educational context.
Instead, students require constructive feedback that enables them to comprehend their errors and enhance their solutions, a consideration that will be further explored in future work (Hundhausen et al., 2013; Huang et al., 2018).
5.1 Ethical Concerns
In this study, assignments carried out by students of the 2023 academic year were used; none of the generated assessments was provided to the students. However, the dissemination of increasingly accurate LLMs also requires ethical consideration of how such a tool should be used by a teacher or, in the case of self-assessment, by a student. The teacher establishes a relationship with the learner and, when correcting assignments, may detect progressions or regressions that a model might not highlight or,
worse, might report incorrectly, unless it is adequately trained for this. An LLM alone could compromise this relationship, impacting the effectiveness of didactic personalisation.
However, these tools have undeniable potential as
valuable allies in personalising teaching and support-
ing student self-assessment. By leveraging techniques such as Retrieval-Augmented Generation (RAG), which combines the generative power of language models with the precision of retrieving relevant and up-to-date information from external sources (such as teacher-provided learning materials), these tools can provide students with quick, accurate, and personalised feedback on their level of preparation, improving learning outcomes and adaptability (Lewis et al., 2020).
6 CONCLUSIONS
In this study, we presented an experiment on the use
of Large Language Models for correcting and evalu-
ating student assignments from the Programming and
Data Structures course. In the course of the experi-
ment, 7 assignments carried out by 50 students, total-
ing 350 submissions, were evaluated using four dif-
ferent LLMs: gpt-4o, gpt-3.5, llama3, claude-sonnet-
20241022. The experiment aimed to measure the
ability of the LLMs to generate answers in the re-
quired format (GRADE;COMMENT ) and to measure
the distance between the evaluation generated by the
LLMs and the evaluation already in the possession of
the researchers provided by the teacher who carried
out the manual corrections. The results showed that:
RQ1: Evaluating LLM Accuracy on Task-
Based Assessments. For this type of task,
the LLMs showed difficulties in complying with
the format, particularly in the zero-shot strategy.
llama3 did not provide any evaluation but only
comments. In contrast, the other LLMs, in particular gpt-3.5 and claude-sonnet-20241022, provided valid evaluations in 94.57% and 78.95% of cases, for the few-shot and one-shot strategies, respectively.
RQ2: Timing Results of LLMs: Performance
Insights. For this type of task, the gpt-3.5 model
was generally faster (taking on average 1.81s per
evaluation, S.D. 1.81s), followed by gpt-4o (mean
3.73s, S.D. 1.63s) and claude-sonnet-20241022
(mean 6.66s, S.D. 2.57s). No significant differ-
ence in timing was observed between the various
prompting strategies.
RQ3: The Gap Between LLMs and Human
Evaluation. For this type of assignment, the
average distance between the LLMs’ assessment
and that previously provided by the teacher is
around 1.7 points out of 10, with claude-sonnet-
20241022 having the lowest distance (1.41, S.D.
1.28) and the greatest number of assessments no
more than one point out of 10 apart with the one-
shot strategy (60.53%). Analysing the comments
on the evaluations, it emerged that a high grade difference is often related to the LLM evaluators having a different interpretation of the assessment grid or to them perceiving errors that are not present in the work being evaluated. This variability sug-
gests, however, that these instruments need further
improvement before they can be used for these
tasks.
Future work will attempt to analyse the limita-
tions already presented in the previous section. We
will first try to analyse the reasons why the proposed
prompting strategies did not work with llama3. We
will also analyse the impact of other prompting strate-
gies, increasing the number of evaluation examples.
Moreover, Retrieval-Augmented Generation (RAG)
has been shown to be suitable in several fields and also
in education (Liu et al., 2024). We plan to analyse the
impact of RAG in the evaluation of assignments in
order to automatically obtain the best evaluation grid
for different types of assignments. In addition, other
scenarios will be analysed. Finally, we will integrate
additional LLMs, and assess the possibility of train-
ing an ad hoc model for this type of task.
ACKNOWLEDGEMENTS
This work was partially supported by grants
from the University of Salerno (grant number:
300392FRB22COSTA).
REFERENCES
Alahmadi, M. D., Alshangiti, M., and Alsubhi, J. (2024).
Scc-gpt: Source code classification based on genera-
tive pre-trained transformers. Mathematics, 12(13).
Anthropic (2024). https://www.anthropic.com/news/claude-3-5-sonnet - last access 28th October 2024.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan,
J., Dhariwal, P., Neelakantan, A., Shyam, P., Sas-
try, G., Askell, A., Agarwal, S., Herbert-Voss, A.,
Krueger, G., Henighan, T., Child, R., Ramesh, A.,
Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen,
M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark,
J., Berner, C., McCandlish, S., Radford, A., Sutskever,
I., and Amodei, D. (2020). Language models are few-
shot learners. In Proceedings of the 34th Interna-
tional Conference on Neural Information Processing
Systems, NIPS ’20, Red Hook, NY, USA. Curran As-
sociates Inc.
Chapman, P. J., Rubio-González, C., and Thakur, A. V.
(2024). Interleaving static analysis and llm prompt-
ing. In Proceedings of the 13th ACM SIGPLAN In-
ternational Workshop on the State Of the Art in Pro-
gram Analysis, SOAP 2024, page 9–17, New York,
NY, USA. Association for Computing Machinery.
Chiang, C.-H., Chen, W.-C., Kuan, C.-Y., Yang, C., and Lee, H.-y. (2024). Large language model as an assign-
ment evaluator: Insights, feedback, and challenges in
a 1000+ student course.
Dong, G., Yuan, H., Lu, K., Li, C., Xue, M., Liu, D., Wang,
W., Yuan, Z., Zhou, C., and Zhou, J. (2024). How
abilities in large language models are affected by su-
pervised fine-tuning data composition. In Ku, L.-W.,
Martins, A., and Srikumar, V., editors, Proceedings
of the 62nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 177–198, Bangkok, Thailand. Association for
Computational Linguistics.
Eniser, H. F., Zhang, H., David, C., Wang, M., Christakis,
M., Paulsen, B., Dodds, J., and Kroening, D. (2024).
Towards translating real-world code with llms: A
study of translating to rust.
Henkel, O., Hills, L., Boxer, A., Roberts, B., and Levo-
nian, Z. (2024). Can large language models make
the grade? an empirical study evaluating llms abil-
ity to mark short answer questions in k-12 education.
In Proceedings of the Eleventh ACM Conference on
Learning @ Scale, L@S ’24, page 300–304, New
York, NY, USA. Association for Computing Machin-
ery.
Huang, L., Zhao, H., Yang, K., Liu, Y., and Xiao, Z. (2018).
Learning outcomes-oriented feedback-response peda-
gogy in computer system course. In 2018 13th Inter-
national Conference on Computer Science & Educa-
tion (ICCSE), pages 1–4.
Hundhausen, C. D., Agrawal, A., and Agarwal, P. (2013).
Talking about code: Integrating pedagogical code re-
views into early computing courses. ACM Trans.
Comput. Educ., 13(3).
Jiang, X., Dong, Y., Wang, L., Fang, Z., Shang, Q., Li, G.,
Jin, Z., and Jiao, W. (2024). Self-planning code gener-
ation with large language models. ACM Trans. Softw.
Eng. Methodol., 33(7).
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela,
D. (2020). Retrieval-augmented generation for
knowledge-intensive nlp tasks. In Proceedings of the
34th International Conference on Neural Information
Processing Systems, NIPS ’20, Red Hook, NY, USA.
Curran Associates Inc.
Liffiton, M., Sheese, B. E., Savelka, J., and Denny, P.
(2024). Codehelp: Using large language models
with guardrails for scalable support in programming
classes. In Proceedings of the 23rd Koli Calling In-
ternational Conference on Computing Education Re-
search, Koli Calling ’23, New York, NY, USA. Asso-
ciation for Computing Machinery.
Liu, C., Hoang, L., Stolman, A., and Wu, B. (2024). Hita: A
rag-based educational platform that centers educators
in the instructional loop. In Olney, A. M., Chounta,
I.-A., Liu, Z., Santos, O. C., and Bittencourt, I. I., ed-
itors, Artificial Intelligence in Education, pages 405–
412, Cham. Springer Nature Switzerland.
Mahbub, T., Dghaym, D., Shankarnarayanan, A., Syed, T.,
Shapsough, S., and Zualkernan, I. (2024). Can gpt-4
aid in detecting ambiguities, inconsistencies, and in-
completeness in requirements analysis? a comprehen-
sive case study. IEEE Access, pages 1–1.
Meta (2024). https://ai.meta.com/blog/meta-llama-3/ - last access 28th October 2024.
OpenAI (2022). https://platform.openai.com/docs/models/gpt-3-5-turbo - last access 28th October 2024.
OpenAI (2024). https://platform.openai.com/docs/models/gpt-4o - last access 28th October 2024.
Piscitelli, A., Costagliola, G., De Rosa, M., and Fuccella,
V. (2024). Influence of large language models on programming assignments: a user study. In ICETC 2024.
ACM.
Prather, J., Reeves, B. N., Denny, P., Becker, B. A.,
Leinonen, J., Luxton-Reilly, A., Powell, G., Finnie-
Ansley, J., and Santos, E. A. (2023). “it’s weird that
it knows what i want”: Usability and interactions with
copilot for novice programmers. 31(1). New York, NY, USA: Association for Computing Machinery.
Qi, J. Z.-P. L., Hartmann, B., and Norouzi, J. D. N. (2023).
Conversational programming with llm-powered inter-
active support in an introductory computer science
course.
Tabari, P., Piscitelli, A., Costagliola, G., and Rosa, M. D.
(2025). Assessing the potential of an llm-powered sys-
tem for enhancing fhir resource validation. In Studies
in Health Technology and Informatics.
Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R. K.-W.,
and Lim, E.-P. (2023). Plan-and-solve prompting: Im-
proving zero-shot chain-of-thought reasoning by large
language models. In Rogers, A., Boyd-Graber, J., and
Okazaki, N., editors, Proceedings of the 61st Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 2609–2634,
Toronto, Canada. Association for Computational Lin-
guistics.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B.,
Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. (2024).
Chain-of-thought prompting elicits reasoning in large
language models. In Proceedings of the 36th Interna-
tional Conference on Neural Information Processing
Systems, NIPS ’22, Red Hook, NY, USA. Curran As-
sociates Inc.