Analysis of the Effectiveness of Large Language Models in Assessing
Argumentative Writing and Generating Feedback
Daisy Cristine Albuquerque da Silva, Carlos Eduardo de Mello and Ana Cristina Bicharra Garcia
Federal University of Rio de Janeiro (UNIRIO), PPGI, Av. Pasteur, 458, Urca, RJ, Brazil
Keywords:
Feedback Generation, Automated Feedback, Large Language Model, Feedback Effectiveness, GPT, LLM.
Abstract:
This study examines the use of Large Language Models (LLMs) like GPT-4 in the evaluation of argumen-
tative writing, particularly opinion articles authored by military school students. It explores the potential
of LLMs to provide instant, personalized feedback across different writing stages and assesses their effec-
tiveness compared to human evaluators. The study utilizes a detailed rubric to guide the LLM evaluation,
focusing on competencies from topic choice to bibliographical references. Initial findings suggest that GPT-4
can consistently evaluate technical and structural aspects of writing, offering reliable feedback, especially in
the References category. However, its conservative classification approach may underestimate article quality,
indicating a need for human oversight. The study also uncovers GPT-4’s challenges with nuanced and contex-
tual elements of opinion writing, evident from variability in precision and low recall in recognizing complete
works. These findings highlight the evolving role of LLMs as supplementary tools in education that require
integration with human judgment to enhance argumentative writing and critical thinking in academic settings.
1 INTRODUCTION
Argumentative writing stands out for its persuasive
expression on controversial topics, requiring authors
to logically support their claims with evidence and
counter potential objections (Blair, 2011), (Gage,
1987), (Ferretti and Lewis, 2018), (Kennedy, 1998).
Unlike narrative or descriptive writing, it demands a
strategic approach, balancing logical reasoning and
compelling narrative crafting. This writing style in-
volves anticipating and addressing opposing view-
points, which adds to its complexity.
Creating argumentative texts is a dual-level cog-
nitive challenge, combining the establishment of a
coherent argumentative structure with the manage-
ment of logical relations among discussion points
(Kleemola et al., 2022), (Ferretti and Graham, 2019).
The process is iterative, where content development
and structural coherence are dynamically interlinked
(Lovejoy, 2011). While tools like Sparks (Gero et al.,
2022) and CoAuthor (Lee et al., 2022) offer assis-
tance in specific writing stages, they often focus on
isolated aspects of the writing process. Current tech-
nologies, despite being helpful, generally address ei-
ther the planning, drafting, or revision phases, high-
lighting a need for more comprehensive solutions that
support writers in structuring and refining their ideas
across different phases and levels of abstraction (Al-
masri et al., 2019), (Wambsganss et al., 2021).
To address these gaps, we will undertake an effec-
tive evaluation of argumentative writing by employ-
ing Large Language Models (LLMs) such as GPT,
guided by a detailed rubric. This rubric will pro-
vide clear descriptions of the required competencies
at each stage of the article, ranging from the introduc-
tion to the final considerations.
This article investigates the integration of large language models (LLMs), such as GPT, in
enhancing argumentative writing. We use as a case
study the Mario Travasso Project, a research encour-
agement initiative, where opinion articles written by
military school students undergo a dual evaluation
process: one by instructors and another by an LLM.
In this process, authors receive constructive feedback
based on a set of meticulously defined criteria, tai-
lored to each stage of the development of an opinion
article.
At the core of the investigation are the following
research questions:
RQ1. What is the effectiveness of Large Lan-
guage Model (LLM) systems in evaluating argu-
mentative writings?
RQ2. What is the level of readability of the feed-
back provided by Large Language Model (LLM)?
RQ3. How does the feedback generated by Large
Language Model (LLM) evaluations contribute to
students’ learning in developing more effective ar-
gumentative writing skills?
To answer these research questions, the study was
structured into two distinct phases. Phase one in-
volved a quantitative analysis, where the effectiveness
of feedback from large language models (LLMs) was
juxtaposed against instructors’ evaluations in refining
argumentative texts. This comparison was based on
the predicted scores for various assessed categories
within the texts. Additionally, a qualitative analysis
was conducted in this phase, where instructors scru-
tinized the LLM-generated feedback. The objective
was to explore the potential of incorporating this feed-
back into educational methodologies to enhance stu-
dents’ argumentative writing abilities.
The second phase of the study focused on the
practical application of the LLM-generated feedback.
Students received this detailed, personalized advice
to revise their opinion articles, aiming to elevate the
quality of their work based on the guidance pro-
vided. Post-revision, these articles underwent a sec-
ond round of evaluation. This phase was pivotal in
assessing the students’ learning progression and the
impact of the feedback they received. The overar-
ching goal was to determine how the amalgamation
of LLM’s insights and instructor oversight can ef-
fectively foster students’ argumentation and writing
skills.
Additionally, the study seeks to create guidelines based
on human evaluators’ perspectives to improve the rel-
evance and acceptance of AI feedback. This approach
aims to link AI advancements with their educational
application, suggesting a method to effectively use
automated feedback for enhancing argumentation and
writing skills.
This paper is organized as follows: the next section presents a review of the literature on Generative AI in education and discusses the theoretical perspective adopted. Section 3 then details the
study’s methodology, the opinion articles analyzed,
and the competencies assessed, as well as the gen-
eration of automated feedback and the proposed eval-
uation method. Section 4 presents the findings in re-
sponse to the research questions, highlighting the ca-
pabilities and limitations of LLMs in the evaluative
context. Section 5 addresses the implications of the
results, the limitations identified, and suggests future
directions, concluding with a comprehensive view of
the impacts and the need for advancement in integrat-
ing LLMs into educational practice, and finally, in
Section 6, the conclusions are presented.
2 RELATED WORK
The significance of feedback in educational settings
is well-established, prompting researchers to delve
into the mechanisms by which feedback influences
learning and to define what constitutes effective feed-
back. The model proposed by Hattie and Timperley
(Hattie and Timperley, 2007) is notable in this area;
it’s widely recognized and has been employed exten-
sively to scrutinize textual feedback in various stud-
ies (Cavalcanti et al., 2020), (Lin et al., 2023). This
model (Hattie and Timperley, 2007) categorizes feedback into four levels: task-focused (FT), process-focused (FP), self-regulatory-focused (FR), and self-focused (FS), each targeting key learning pro-
cesses: setting learning objectives, evaluating current
performance, and charting the course to achieve de-
sired outcomes.
Within Automated Feedback Systems (AFSs),
there is a tendency to rely on pre-defined rule sets cre-
ated by domain experts (Pardo et al., 2018). While
these systems offer some relief to educators by au-
tomating feedback, their utility is compromised when
faced with open-ended tasks, such as student reports,
which exhibit a wide range of possible responses and
thus necessitate a large, complex set of rules (Jia et al.,
2022). In light of recent advances in artificial intel-
ligence, there is growing interest in leveraging pre-
trained language models to provide textual feedback
for more intricate tasks. The use of ChatGPT by Ay-
din and Karaarslan (Aydın and Karaarslan, 2022) in
academic writing tasks illustrates this trend. Their
findings indicate that ChatGPT is capable of generat-
ing literature reviews with minimal match rates when
evaluated by plagiarism tools, showcasing the poten-
tial of such models. The encouraging performance of
ChatGPT in this context makes it an intriguing sub-
ject for further investigation into its ability to offer
elaborate feedback in argumentative writing.
Automated assessment is a long-standing en-
deavor in the field of educational technology. The ini-
tial automated assessment tools were geared towards
solvable tasks, such as mathematics or programming
assignments, where evaluation typically relies on unit
tests or direct output comparisons (Hollingsworth,
1960); (Messer et al., 2023). These methods often
overlook less quantifiable yet crucial indicators of
learning and understanding, such as design quality,
code maintainability, or areas that may confuse stu-
dents. Modern tools like AutoGrader, which offers
real-time assessment for programming exercises, still
focus narrowly on output correctness and do not ade-
quately account for documentation or maintainability
(Liu et al., 2019).
Assessing students’ understanding from natural
language responses, however, poses different chal-
lenges and has undergone significant evolution. Early
Automated Short Answer Grading (ASAG) models
utilized statistical approaches or domain-specific neu-
ral networks (Heilman and Madnani, 2013); (Ri-
ordan et al., 2017). In recent years, Large Lan-
guage Models (LLMs) have been shown to outper-
form domain-specific language models (Mann et al.,
2020); (Chung et al., 2022). LLMs facilitate the
grading of open-ended task responses without the
need for task-specific fine-tuning (Cao, 2023); (Mizu-
moto and Eguchi, 2023); (Yoon, 2023). Yet, Kortemeyer (2023) disclosed that while LLMs like GPT-
4 can be useful for preliminary grading of introduc-
tory physics tasks, they fall short for natural language
responses required in comprehensive exam assess-
ments. Furthermore, while LLMs such as GitHub
Copilot streamline the code generation and review
process, they may fall short on more nuanced pro-
gramming tasks and open-ended evaluations (Finnie-
Ansley et al., 2022). Therefore, in their current state,
LLMs should be viewed as helpful but fallible tools,
with final assessments still under the purview of hu-
man instructors.
It is also essential to consider how students per-
ceive AI graders and the implementation of auto-
mated graders in educational settings (Zhu et al.,
2022). Many discussions revolve around the socio-
technical dynamics of automated assessment, includ-
ing the potential for machine bias introduction (Hsu
et al., 2021). Using NLP for short answer grading is
not a trivial task and has been established as an as-
sessment challenge in its own right.
3 METHOD
3.1 Opinion Article
The article examines a collection of 33 opinion pieces
written by students from Brazilian Army military
schools, focusing on National Defense topics. These
articles aim to persuade readers through strong argu-
mentative writing, drawing on viewpoints supported
by citations from esteemed authors, relevant experi-
ences, and thorough analysis. Authors are required
to present well-reasoned perspectives and demon-
strate deep understanding of the subject matter. They
must take clear stances, knowing their work will face
scrutiny and debate. This practice of argumentation
showcases the author’s analytical and critical skills
and aids in shaping a unified military discourse, fos-
tering a culture of informed and reflective national de-
fense.
3.2 Skills Assessed
In this study, the evaluation of articles was conducted
across multiple categories, each with specific criteria
and scoring systems:
Choice of Topic. This category examined the rel-
evance, timeliness, originality, and the author’s un-
derstanding of the research topic. Each aspect was
allocated a score of 0.25 points, totaling a maximum
of 1 point.
Presence and Relevance of Keywords. The as-
sessment in this category was based on the quantity of
keywords used, with the scoring ranging from 0 to 1
point.
Quality of the Introduction. Articles were eval-
uated on their introductory presentation, clarity of
objectives, research justification and importance, as
well as the underlying hypotheses and methodologies
used. Scores in this category varied from 0 to 2 points.
Development. This comprehensive category fo-
cused on the structured and detailed presentation of
the topic, the theoretical basis, research methodolo-
gies, normative and typographical aspects, and the
overall structure of the article, including pre-textual,
textual, and post-textual elements. Writing qual-
ity, encompassing cohesion, coherence, spelling, and
grammar, as well as adherence to ABNT standards,
was also scrutinized. The score in this category
ranged from 0 to 3 points.
Final Considerations. Here, the conclusion was
evaluated based on how well it interpreted the argu-
ments or elements discussed in the text, addressed the
objectives and hypotheses, and summarized the re-
search results and key information or arguments. The
scoring in this category varied from 0 to 2 points.
References. The final category assessed the ac-
curacy and completeness of the bibliographical refer-
ences, including the correct identification and alpha-
betical ordering of cited works. The possible score
ranged from 0 to 1 point.
In summary, the evaluation criteria included rele-
vance and originality in topic choice, keyword inclu-
sion, introduction presentation, formulation of the re-
search problem, objectives, military perspective, hy-
pothesis, methodology, and theoretical foundation.
The development section focused on content orga-
nization, data collection and analysis, result discus-
sion, and adherence to presentation standards, includ-
ing writing quality aspects like cohesion and gram-
mar. The final considerations assessed the author’s
ability to interpret arguments and summarize results.
References were checked for comprehensiveness, ac-
curacy, and adherence to formatting rules. This thor-
ough process aimed to ensure the academic quality of
the opinion pieces.
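To make the scoring scheme concrete, the rubric described above can be encoded as a small data structure. The sketch below is a minimal illustration in Python; the English category keys and the validation helper are our own naming, not part of the evaluation instrument itself.

```python
# Maximum score per rubric category (Section 3.2); the total is 10 points.
RUBRIC_MAX_SCORES = {
    "choice_of_topic": 1.0,        # four aspects at 0.25 points each
    "keywords": 1.0,
    "introduction": 2.0,
    "development": 3.0,
    "final_considerations": 2.0,
    "references": 1.0,
}

TOTAL_MAX_SCORE = sum(RUBRIC_MAX_SCORES.values())  # 10.0


def total_score(scores: dict[str, float]) -> float:
    """Validate each category score against the rubric bounds and
    return the article's total score."""
    total = 0.0
    for category, maximum in RUBRIC_MAX_SCORES.items():
        value = scores.get(category, 0.0)
        if not 0.0 <= value <= maximum:
            raise ValueError(f"{category}: {value} outside [0, {maximum}]")
        total += value
    return total
```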
3.3 Generative Pre-Trained
Transformer
The Generative Pre-trained Transformer (GPT) se-
ries, particularly GPT-4, stands out among Large Lan-
guage Models (LLMs), marking a significant evolu-
tion from its first iteration, GPT-1. OpenAI’s Chat-
GPT, a specialized version of GPT tailored for conver-
sational interactions, is based on the GPT-3 model and
operates with 175 billion parameters, facilitating dy-
namic, interactive conversations. ChatGPT’s popular-
ity stems from its diverse capabilities in content gen-
eration and a user-friendly question-answering inter-
face. Its public accessibility and practical applications
across various domains have led to its widespread
adoption. In education, ChatGPT has been explored
for enhancing teaching and learning, offering person-
alized learning experiences, aiding concept compre-
hension, generating and explaining code, and support-
ing educational assessments (Biswas, 2023), (Firat,
2023), (Lo, 2023), (Rospigliosi, 2023).
In the article evaluation process, we employed the
GPT-4 model. For each competency detailed in Section 3.2, we developed specific prompts, aligned with
the nuances and requirements of each of these skills.
These prompts were carefully designed to interact
with the GPT-4, allowing for a detailed and contex-
tualized analysis of the articles.
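As a rough illustration of this setup, the sketch below shows one way such a per-competency prompt could be sent to GPT-4 through the OpenAI Python client (openai>=1.0). The prompt wording, the system instruction, and the helper name are assumptions made for illustration; the study's actual prompts are not reproduced here.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative rubric text for one competency (see Section 3.2);
# the prompts used in the study are paraphrased, not quoted.
INTRODUCTION_RUBRIC = (
    "Evaluate the quality of the introduction (0 to 2 points): presentation, "
    "clarity of objectives, research justification and importance, and the "
    "underlying hypotheses and methodology."
)


def evaluate_competency(article_text: str, rubric: str) -> str:
    """Ask GPT-4 for a score and a short guidance text for one competency."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # favour consistent, repeatable evaluations
        messages=[
            {"role": "system",
             "content": "You evaluate argumentative opinion articles. "
                        "Apply the rubric strictly and justify the score."},
            {"role": "user",
             "content": f"Rubric: {rubric}\n\nArticle:\n{article_text}\n\n"
                        "Return the score and a short paragraph of guidance "
                        "on how the author can improve this aspect."},
        ],
    )
    return response.choices[0].message.content
```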
3.4 Evaluation Method
The proposed method for analyzing the effectiveness
of large language models (LLM) in assessing argu-
mentative writing and generating feedback consists of
two distinct phases.
In the first phase, both quantitative and qualita-
tive analyses were conducted, where opinion articles
were evaluated by both an instructor and the LLM.
For the quantitative analysis, the evaluation produced scores for each category: the choice of topic, the number of keywords cited, the construction of the introduction, the development of the argument, the crafting of the concluding remarks, and the arrangement of the references. Depending on the score received, the arti-
cle was classified as complete, partially complete, or
incomplete. Moreover, at this stage, instructors ex-
amined the feedback generated by the LLM, which
included a score for each category and a guidance
text to help writers improve their writing. Figure 1
presents an example of this feedback provided by the
LLM model.
Figure 1: Feedback generated.
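The exact cut-off values behind the complete, partially complete, and incomplete labels are not stated in the text; the sketch below shows one plausible mapping from a category score to a label, with the threshold fractions flagged as assumptions.

```python
def classify(score: float, max_score: float,
             complete_at: float = 1.0, partial_at: float = 0.5) -> str:
    """Map a category score to a completeness label.

    The threshold fractions (full marks for 'complete', at least half the
    points for 'partially complete') are illustrative assumptions; the
    paper does not publish the cut-offs used by the instructors.
    """
    fraction = score / max_score
    if fraction >= complete_at:
        return "complete"
    if fraction >= partial_at:
        return "partially complete"
    return "incomplete"


# Example: an introduction scored 1.5 of its 2.0 possible points.
print(classify(1.5, 2.0))  # -> "partially complete"
```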
For the qualitative analysis, the study employed readability as a key measure to assess the quality of the text produced by the model, a common metric in evaluating written material. An instruc-
tor was enlisted to rate the feedback generated by the
Large Language Model (LLM) using a detailed five-
point scale. The scale is defined as follows: 0 for
Incomprehensible; 1 for Not fluent and incoherent; 2
for Somewhat fluent but incoherent; 3 for Fluent but
somewhat incoherent; and 4 for Fluent and coherent.
The instructor applied this scale across various cate-
gories, including topic selection, keyword definition,
introduction evaluation, development, final consider-
ations, and references, assigning scores to each. Con-
sequently, a readability score for each feedback piece
provided in the opinion article evaluations was deter-
mined based on this systematic approach (Jia et al.,
2022); (van der Lee et al., 2021).
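For reference, the readability rating procedure can be summarised as follows; the sketch encodes the five-point scale from the text, while the averaging step is only one plausible way to collapse the per-category ratings into a single readability score per feedback piece.

```python
# Five-point readability scale applied by the instructor to each feedback text.
READABILITY_SCALE = {
    0: "Incomprehensible",
    1: "Not fluent and incoherent",
    2: "Somewhat fluent but incoherent",
    3: "Fluent but somewhat incoherent",
    4: "Fluent and coherent",
}

CATEGORIES = ["choice_of_topic", "keywords", "introduction",
              "development", "final_considerations", "references"]


def readability_score(ratings: dict[str, int]) -> float:
    """Average the per-category readability ratings of one article's feedback.
    Averaging is an illustrative aggregation choice, not the paper's formula."""
    assert all(ratings[c] in READABILITY_SCALE for c in CATEGORIES)
    return sum(ratings[c] for c in CATEGORIES) / len(CATEGORIES)
```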
The purpose of this phase was to explore how
this feedback could be integrated into educational
methodologies to improve students’ skills in argu-
mentative writing.
In the second phase, the students put the feedback they had received into practice. They used the detailed and personalized feedback generated by the Large Language Model (LLM) to refine their opinion pieces, with the aim of raising
the quality of their work in line with the guidance they
had received.
To evaluate the effectiveness of the feedback pro-
vided by GPT-4, we used the theoretical feedback
framework developed by Hattie and Timperley. This
framework facilitated the analysis of the feedback
components to ensure that they were constructive and
beneficial according to the standards set out in the
model. In addition, students were tasked with evaluating the feedback through the lens of the four-level model proposed by Hattie and Timperley, namely task, process, regulation, and self. This allowed for a compre-
hensive understanding of the impact of such feedback
on the learning process (Hattie and Timperley, 2007).
The four-level model of feedback, as proposed by
Hattie and Timperley, provides a structured approach
to evaluating feedback across task relevance, process
clarity, regulatory impact, and personal affirmation.
It scrutinizes how feedback aligns with the task ob-
jectives, aids in information processing and learn-
ing, bolsters the learner’s self-regulation, and sup-
ports their self-esteem and motivation. This compre-
hensive assessment ensures feedback is not only in-
formative and task-specific but also constructive and
empowering for the student (Hattie and Timperley,
2007).
The overall goal of the study was to determine
how the combination of LLM knowledge and teacher
supervision can effectively improve students’ argu-
mentation and writing skills.
4 RESULTS
The results achieved in this study were organized
based on the three research questions previously es-
tablished.
In response to RQ1: What is the effectiveness
of Large Language Model (LLM) systems in evalu-
ating argumentative writings?, the chart in Figure 2
illustrates the distribution of scores assigned to opin-
ion articles by two different evaluators: the Instructor
and GPT-4.
Figure 2: The distribution of article evaluation scores.
The evaluations carried out by both the instruc-
tor and GPT-4 demonstrate the subjectivity inherent
in the process of evaluating argumentative writing.
There is a significant variation in the scores awarded
by the instructor, which is clearly illustrated in the
boxplot by longer whiskers and a larger box, indicat-
ing a greater dispersion of the data. This may reflect
individual differences in the interpretation of the eval-
uation criteria or in the perception of the quality of the
articles. On the other hand, the evaluations conducted
by GPT-4 show less variation, with shorter whiskers
and a narrower box in the boxplot, which suggests a
more uniform application of the evaluation criteria.
The higher median of the GPT-4 scores also implies
a tendency of the language model to assign generally
higher evaluations, perhaps due to a more objective
analysis or a different sensitivity to quality elements
in the writing.
The consistency observed in the GPT-4 evalua-
tions highlights the potential of large-scale language
models to provide uniform evaluations of argumenta-
tive writing. This tendency towards a more consis-
tent evaluation may stem from the model’s ability to
apply evaluation criteria in a systematic way, with-
out the variations that can arise from human subjec-
tive judgments. However, GPT-4’s tendency to assign
higher average scores suggests that it may be less crit-
ical or interpret the quality criteria differently com-
pared to the human instructor. Furthermore, the pres-
ence of outliers in the instructor’s evaluations under-
scores the importance of human judgment, particu-
larly in cases where contributions are atypical or ex-
hibit unique characteristics that may not be fully cap-
tured by an automated language model.
Also in response to RQ1: What is the effectiveness of Large Language Model (LLM) systems in evaluating argumentative writings?, the table in Figure 3
shows the distribution of instructor and GPT-4 gener-
ated feedback.
Figure 3: The distribution of instructor and GPT-4 gener-
ated feedback.
The table presents a comparison between the as-
sessments of a human instructor and those conducted
by GPT-4, a large-scale language model, across six
distinct categories of writing evaluation criteria. Pre-
cision and recall metrics for the GPT-4’s evaluations
are also provided.
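For readers checking the figures, the precision and recall values reported below can be computed per label by treating the instructor's classification as the reference standard. The sketch uses invented toy labels purely to show the mechanics of the calculation.

```python
def precision_recall(instructor: list[str], gpt4: list[str], label: str):
    """Per-label precision and recall, with the instructor as reference."""
    tp = sum(1 for h, m in zip(instructor, gpt4) if h == label and m == label)
    predicted = sum(1 for m in gpt4 if m == label)
    actual = sum(1 for h in instructor if h == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy data (not the study's articles), only to illustrate the computation:
human = ["complete", "partially complete", "complete", "incomplete"]
model = ["complete", "complete", "complete", "partially complete"]
print(precision_recall(human, model, "complete"))  # -> (0.666..., 1.0)
```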
Regarding the category of topic choice, GPT-4
identified all 33 topics as complete, while the instruc-
tor classified only 16 as complete, with 12 being par-
tially complete and 5 incomplete. This suggests that
GPT-4 may have a broader or less stringent criterion
for considering a topic as complete. The precision is
0.53, indicating that more than half of GPT-4’s com-
plete classifications were correct, and the recall is 1,
indicating that GPT-4 identified all the complete cases
that the instructor classified.
In analyzing the evaluation of the introduction’s
quality, a significant discrepancy was observed, with
GPT-4 classifying more introductions as complete
(33) compared to the instructor (7). Both agree on
21 cases of partially complete introductions, but they
differ on incomplete introductions. A precision of 0.6
for complete and a recall of 0.57 for partially com-
plete indicate a moderate accuracy of GPT-4 aligned
with the instructor’s classifications.
On the other hand, in the development category,
GPT-4 and the instructor are almost aligned in the de-
velopment assessments, with GPT-4 marking none as
incomplete. Precision is perfect (1) for complete and
partially complete categories, while recall is 0.9 for
partially complete, which is very high.
The biggest discrepancy was in the final consid-
erations category, where GPT-4 classified a large ma-
jority as complete (26) as opposed to the instructor
(6). Recall is perfect (1) for complete, but precision
is very low (0.23), and recall is practically nil (0.017)
for partially complete, which suggests that GPT-4 has
difficulty evaluating this criterion properly compared
to the instructor.
Overall, GPT-4 exhibits a lenient tendency when classifying items as Complete, rarely relegating work to the Partially Complete or Incomplete categories. The high recall scores across several categories confirm that it misses few of the cases the instructor judged Complete. Conversely, the fluctuating precision rates imply that GPT-4 may prematurely or inaccurately assign Complete or Partially Complete statuses in certain scenarios.
In response to RQ2: What is the level of read-
ability of the feedback provided by Large Language
Model (LLM)?, the chart in Figure 4 shows the human
instructor’s assessment of feedback texts generated by
GPT-4 across various evaluated competencies.
The line graph depicted in Figure 4 provides a de-
tailed visualization of the evaluation scores for each
of the 33 articles across multiple categories. The
Choice of topic category consistently achieves the
top score, indicating that the Large Language Model
(LLM) consistently assesses the topics’ relevance and
appropriateness very positively.
In the categories of Keywords and Development,
the feedback from the LLM also receives high scores,
although there are minor fluctuations. These variations suggest that there might be slight inconsistencies in the LLM's feedback regarding these specific aspects of the articles.
Figure 4: The human instructor's assessment of feedback texts generated by GPT-4 across various evaluated competencies.
The Introduction and Final considerations cate-
gories exhibit more significant score variations, with
noticeable dips in the ’Introduction’ scores at certain
intervals. This pattern suggests that the LLM may
struggle with consistently capturing an effective be-
ginning of an argument or providing a convincing
conclusion.
The References category displays the most signif-
icant score fluctuations, highlighting potential incon-
sistencies in how the LLM evaluates the adequacy and
relevance of the references within the articles. This
could indicate that the LLM’s ability to provide feed-
back on references is less reliable and may require
further refinement.
The high readability scores awarded by the in-
structor to the feedback from the Large Language
Model (LLM) indicate that language models like
GPT-4 hold promise for text evaluation tasks. Despite
this, the observed variability in the scores for the ’In-
troduction’ suggests that the LLM’s processing and
feedback generation capabilities for certain text ele-
ments could be enhanced.
While the LLM’s feedback generally received
high scores across most categories, this could either
demonstrate its competence in producing coherent
and pertinent content or suggest a possible inclina-
tion toward favorable evaluations. To ensure the feed-
back’s accuracy and prevent potential bias, a discern-
ing approach is warranted.
For a thorough and nuanced assessment of argu-
mentative writing, it may be beneficial to refine the
LLM’s algorithms or to employ a hybrid approach
that combines the model’s automated evaluation with
human review. This would help in maintaining the
integrity and applicability of the feedback provided
by the LLM, especially in areas like the introductory
quality of articles where personalized and context-
aware analysis is crucial.
Table 1: The distribution of four levels in the feedback provided by GPT-4.

Levels        Quantity   Frequency
Task          33         100%
Process       21         64%
Regulation     0         0%
Self           0         0%
In response to RQ3: How does the feedback
generated by Large Language Model (LLM) evalu-
ations contribute to students’ learning in developing
more effective argumentative writing skills?, the feed-
back generated was assessed using Hattie and Timper-
ley’s four-level model. In this model, the feedback is
evaluated in relation to each of the four levels: task,
processing, regulatory, and self (Hattie and Timper-
ley, 2007).
At the task level, the focus is on evaluating the rel-
evance of the feedback to the specific activity, check-
ing the accuracy and usefulness of the information
provided to improve the student’s performance in that
specific task. At the processing level, the assessment
concentrates on the feedback’s ability to enhance the
student’s understanding and processing of informa-
tion, considering the clarity, specificity, and relevance
of the feedback to the learning process.
At the regulatory and self levels, the assessment
of feedback takes a broader perspective. At the reg-
ulatory level, the feedback is analyzed for its effec-
tiveness in improving the student’s ability to regulate
their own learning, including aspects such as self-
assessment, motivation, and confidence. At the self
level, the focus is on the impact of the feedback on
the student’s self-esteem and self-image, assessing
whether the feedback provides appropriate recogni-
tion and encouragement. This four-level approach to
feedback assessment aims to carefully analyze its ef-
fectiveness across various dimensions of student per-
formance and learning.
Task-level feedback was consistently present, as
indicated by the 100% figure in the table. This means
that the Large Language Model (LLM), in this case
GPT-4, provided relevant feedback for the task in all
evaluations. This is in line with the model’s objec-
tive to assess the accuracy and utility of information
to enhance the student’s performance on the specific
task.
Process-level feedback was generated 64% of the
time, indicating that more than half of the reports re-
ceived comments aimed at improving the student’s
comprehension and information processing. This
suggests a significant capability of the model to pro-
vide clear and specific guidance, which is crucial for
the learning process.
Notably, there was no feedback generated at the
regulatory and self levels. This may point to a lim-
itation of the model in addressing aspects of self-
regulation, motivation, confidence, self-esteem, and
self-image. The absence of feedback at these levels
might indicate that the LLM is not equipped to rec-
ognize or generate comments that directly influence
these more subjective and personal areas of student
learning.
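The percentages in Table 1 follow directly from the level counts over the 33 evaluated articles; a minimal tally is shown below only to make the arithmetic explicit.

```python
# Counts of feedback items per Hattie and Timperley level (Table 1).
level_counts = {"Task": 33, "Process": 21, "Regulation": 0, "Self": 0}
TOTAL_ARTICLES = 33

for level, count in level_counts.items():
    print(f"{level}: {count}/{TOTAL_ARTICLES} = {count / TOTAL_ARTICLES:.0%}")
# Task: 33/33 = 100%; Process: 21/33 = 64%; Regulation and Self: 0%.
```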
Table 1 suggests that the LLM is effective
in providing feedback related to tasks and processes,
which is positive for helping students improve their
argumentative writing skills. However, for a more
holistic development of student skills, it would be
beneficial if the model could also generate feedback
at the regulatory and self levels, which requires future
attention to enhance the model or to combine its use
with the supervision of a human instructor.
5 DISCUSSION
5.1 Implications
The implications of this study are significant for the
application of Large Language Models (LLM) like
GPT-4 in educational settings, particularly in the con-
text of evaluating argumentative writing.
Firstly, the study demonstrates the effectiveness of
LLMs in providing consistent feedback on task-level
elements of writing. This indicates that LLMs can
potentially be used to augment the evaluation process,
offering students immediate, objective, and consistent
feedback on certain technical aspects of their writing.
Such support could be instrumental in helping stu-
dents to understand and meet the basic requirements
of argumentative writing tasks.
The variance in scoring between GPT-4 and hu-
man instructors highlights another important implica-
tion: while LLMs might provide a uniform evalua-
tion, they may do so with a bias towards leniency or
may not align with human instructors’ more critical
and nuanced assessment of quality. This discrepancy
underscores the need for human oversight to ensure
the quality and criticality of evaluations.
Moreover, the study suggests that LLMs could be
refined to better address the full spectrum of feedback
categories, perhaps by incorporating more advanced
natural language understanding and generation capa-
bilities. Until such advancements are achieved, a hy-
brid evaluation approach that combines LLM feed-
back with human review might offer a more balanced
and comprehensive assessment strategy.
This study suggests that while LLMs have the po-
tential to significantly enhance the educational assess-
ment process, particularly for argumentative writing,
they currently require human collaboration to provide
a holistic and nuanced evaluation. The integration of
LLMs into educational practices should therefore be
approached with caution, ensuring that the technol-
ogy is used to complement and not replace the invalu-
able insights provided by human instructors.
5.2 Limitations
This study reveals several limitations in the use of
Large Language Models (LLMs) like GPT-4 for eval-
uating argumentative writings and generating feed-
back.
The first limitation identified is GPT-4’s inability
to provide feedback on the regulatory and self lev-
els, which are essential for students’ comprehensive
development. The absence of feedback in these areas
suggests that GPT-4 may not be fully equipped to han-
dle the more subjective aspects of learning, such as
self-regulation, motivation, and self-confidence. This
indicates that LLMs still need to evolve to provide a
holistic assessment that goes beyond technical con-
tent and addresses the personal and emotional dimen-
sions of learning.
Another limitation observed is the variation in the
precision and recall of the feedback from GPT-4, sug-
gesting that the model may be less critical and have
a tendency to prematurely or inaccurately classify
works as complete. This could lead to overly positive
evaluations that do not adequately challenge students
to improve their writing skills.
Moreover, the discrepancy between the evalua-
tions of GPT-4 and human instructors highlights the
variability and subjectivity in the assessment of ar-
gumentative writing. This underscores the need for
human review to ensure critical and detailed evalua-
tions, particularly in atypical cases or those that ex-
hibit unique characteristics not fully captured by the
automated model.
The research also points to the need to refine the
algorithms of LLMs to improve feedback generation
on specific text elements, such as the introduction
and conclusion, where there was greater variability in
evaluation scores.
While GPT-4 and other LLMs show consider-
able potential to assist in educational assessment, this
study highlights several areas that require attention
and future development. Human collaboration is still
essential to ensure the quality and relevance of the
feedback provided to students, especially in learn-
ing aspects that go beyond the current capabilities of
LLMs.
6 CONCLUSION
The exploration of Large Language Models (LLMs)
like GPT-4 in the assessment of argumentative writing
within educational settings has yielded both promis-
ing potentials and notable limitations. The study con-
firms that LLMs are effective in providing systematic
and consistent feedback on task-level elements, which
is crucial for the development of students’ argumen-
tative writing skills. The ability of these models to of-
fer immediate and objective feedback can greatly en-
hance the learning experience by providing students
with clear guidance on technical aspects of their writ-
ing tasks.
However, the study also highlights critical lim-
itations that need to be addressed. The LLMs, in
their current state, lack the capability to engage with
the more nuanced areas of learning, particularly self-
regulation and personal development, which are vi-
tal for a student’s holistic growth. The discrepancy
in evaluation between human instructors and GPT-4
underscores the subjectivity and complexity of argu-
mentative writing assessment, indicating the necessity
for human insight to achieve a critical and nuanced
evaluation.
Furthermore, variations in the precision and recall
of feedback from LLMs point to a possible leniency
in their evaluations, which could potentially lead to
an inadequate challenge for students to improve their
writing. The study also calls attention to the need for
improving LLMs’ algorithms, particularly in gener-
ating feedback for specific text elements that require
more sophisticated analysis, such as introductions and
conclusions.
In conclusion, while LLMs present a significant
step forward in supporting educational assessments,
the study suggests that they should not replace but
rather complement human instruction. Human exper-
tise remains indispensable in providing quality feed-
back, especially for aspects that transcend the current
capabilities of LLMs. Future developments in LLM
technology should aim for a more holistic assessment
tool that can adequately address all levels of feed-
back and be integrated seamlessly into educational
methodologies. As we advance, a collaborative ap-
proach, combining the strengths of LLMs with human
oversight, appears to be the most effective strategy for
enriching student learning and enhancing argumenta-
tive writing skills.
REFERENCES
Almasri, A., Ahmed, A., Almasri, N., Abu Sultan, Y. S.,
Mahmoud, A. Y., Zaqout, I. S., Akkila, A. N., and
Abu-Naser, S. S. (2019). Intelligent tutoring systems
survey for the period 2000-2018.
Aydın, Ö. and Karaarslan, E. (2022). Openai chatgpt generated literature review: Digital twin in healthcare. Available at SSRN 4308687.
Biswas, S. (2023). Role of chatgpt in computer programming: Chatgpt in computer programming. Mesopotamian Journal of Computer Science, 2023:8–16.
Blair, J. A. (2011). Groundwork in the theory of argumenta-
tion: Selected papers of J. Anthony Blair, volume 21.
Springer Science & Business Media.
Cao, C. (2023). Leveraging large language model and
story-based gamification in intelligent tutoring sys-
tem to scaffold introductory programming courses:
A design-based research study. arXiv preprint
arXiv:2302.12834.
Cavalcanti, A. P., Diego, A., Mello, R. F., Mangaroska, K.,
Nascimento, A., Freitas, F., and Gašević, D. (2020).
How good is my feedback? a content analysis of writ-
ten feedback. In Proceedings of the tenth international
conference on learning analytics & knowledge, pages
428–437.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fe-
dus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S.,
et al. (2022). Scaling instruction-finetuned language
models. arXiv preprint arXiv:2210.11416.
Ferretti, R. P. and Graham, S. (2019). Argumentative writ-
ing: Theory, assessment, and instruction. Reading and
Writing, 32:1345–1357.
Ferretti, R. P. and Lewis, W. E. (2018). Argumentative writ-
ing. Best practices in writing instruction, 135.
Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly,
A., and Prather, J. (2022). The robots are com-
ing: Exploring the implications of openai codex on
introductory programming. In Proceedings of the
24th Australasian Computing Education Conference,
pages 10–19.
Firat, M. (2023). What chatgpt means for universities: Per-
ceptions of scholars and students. Journal of Applied
Learning and Teaching, 6(1).
Gage, J. T. (1987). The shape of reason: Argumentative writing in college.
Gero, K. I., Liu, V., and Chilton, L. (2022). Sparks: Inspira-
tion for science writing using language models. In De-
signing interactive systems conference, pages 1002–
1019.
Hattie, J. and Timperley, H. (2007). The power of feedback.
Review of educational research, 77(1):81–112.
Heilman, M. and Madnani, N. (2013). Ets: Domain adapta-
tion and stacking for short answer scoring. In Second
Joint Conference on Lexical and Computational Se-
mantics (* SEM), Volume 2: Proceedings of the Sev-
enth International Workshop on Semantic Evaluation
(SemEval 2013), pages 275–279.
Hollingsworth, J. (1960). Automatic graders for pro-
gramming classes. Communications of the ACM,
3(10):528–529.
Hsu, S., Li, T. W., Zhang, Z., Fowler, M., Zilles, C., and
Karahalios, K. (2021). Attitudes surrounding an im-
perfect ai autograder. In Proceedings of the 2021 CHI
conference on human factors in computing systems,
pages 1–15.
Jia, Q., Young, M., Xiao, Y., Cui, J., Liu, C., Rashid, P., and
Gehringer, E. (2022). Insta-reviewer: A data-driven
approach for generating instant feedback on students’
project reports. International Educational Data Min-
ing Society.
Kennedy, M. (1998). Theorizing composition: A criti-
cal sourcebook of theory and scholarship in contem-
porary composition studies. Bloomsbury Publishing
USA.
Kleemola, K., Hyytinen, H., and Toom, A. (2022). The
challenge of position-taking in novice higher educa-
tion students’ argumentative writing. In Frontiers in
education, volume 7, page 885987. Frontiers.
Kortemeyer, G. (2023). Can an ai-tool grade assignments
in an introductory physics course? arXiv preprint
arXiv:2304.11221.
Lee, M., Liang, P., and Yang, Q. (2022). Coauthor: De-
signing a human-ai collaborative writing dataset for
exploring language model capabilities. In Proceed-
ings of the 2022 CHI conference on human factors in
computing systems, pages 1–19.
Lin, J., Dai, W., Lim, L.-A., Tsai, Y.-S., Mello, R. F., Khos-
ravi, H., Gasevic, D., and Chen, G. (2023). Learner-
centred analytics of feedback content in higher educa-
tion. In LAK23: 13th International Learning Analyt-
ics and Knowledge Conference, pages 100–110.
Liu, X., Wang, S., Wang, P., and Wu, D. (2019). Automatic
grading of programming assignments: an approach
based on formal semantics. In 2019 IEEE/ACM 41st
International Conference on Software Engineering:
Software Engineering Education and Training (ICSE-
SEET), pages 126–137. IEEE.
Lo, C. K. (2023). What is the impact of chatgpt on edu-
cation? a rapid review of the literature. Education
Sciences, 13(4):410.
Lovejoy, A. O. (2011). The great chain of being: A study of the history of an idea. New Brunswick.
Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal,
P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A.,
Agarwal, S., et al. (2020). Language models are few-
shot learners. arXiv preprint arXiv:2005.14165.
Messer, M., Brown, N. C., Kölling, M., and Shi, M. (2023).
Automated grading and feedback tools for program-
ming education: A systematic review. arXiv preprint
arXiv:2306.11722.
Mizumoto, A. and Eguchi, M. (2023). Exploring the po-
tential of using an ai language model for automated
essay scoring. Research Methods in Applied Linguis-
tics, 2(2):100050.
Pardo, A., Bartimote, K., Buckingham Shum, S., Daw-
son, S., Gao, J., Gašević, D., Leichtweis, S., Liu, D., Martínez-Maldonado, R., Mirriahi, N., et al.
(2018). Ontask: Delivering data-informed, personal-
ized learning support actions.
Riordan, B., Horbach, A., Cahill, A., Zesch, T., and Lee,
C. (2017). Investigating neural architectures for short
answer scoring. In Proceedings of the 12th workshop
on innovative use of NLP for building educational ap-
plications, pages 159–168.
Rospigliosi, P. a. (2023). Artificial intelligence in teaching
and learning: what questions should we ask of chat-
gpt?
van der Lee, C., Gatt, A., van Miltenburg, E., and Krahmer,
E. (2021). Human evaluation of automatically gener-
ated text: Current trends and best practice guidelines.
Computer Speech & Language, 67:101151.
Wambsganss, T., Kueng, T., Soellner, M., and Leimeister,
J. M. (2021). Arguetutor: An adaptive dialog-based
learning system for argumentation skills. In Proceed-
ings of the 2021 CHI conference on human factors in
computing systems, pages 1–13.
Yoon, S.-Y. (2023). Short answer grading using one-shot
prompting and text similarity scoring model. arXiv
preprint arXiv:2305.18638.
Zhu, X., Wu, H., and Zhang, L. (2022). Automatic
short-answer grading via bert-based deep neural net-
works. IEEE Transactions on Learning Technologies,
15(3):364–375.