enhanced BERT with Disentangled Attention,
improves upon BERT through a disentangled
attention mechanism combined with an enhanced
mask decoder. It represents each word with two
distinct vectors, one encoding content and one
encoding position, and computes attention weights
from disentangled matrices over these
representations. The model further employs two
unsupervised tasks during pre-training to strengthen
its semantic understanding (Asri, 2025).
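Concretely, in the notation of the original DeBERTa paper (He et al., 2021), the attention score between tokens i and j decomposes into content-to-content, content-to-position, and position-to-content terms; the following is a sketch of that decomposition, with symbols following the cited paper rather than this study:

```latex
\tilde{A}_{i,j} =
  \underbrace{Q^{c}_{i}\,{K^{c}_{j}}^{\top}}_{\text{content-to-content}}
+ \underbrace{Q^{c}_{i}\,{K^{r}_{\delta(i,j)}}^{\top}}_{\text{content-to-position}}
+ \underbrace{K^{c}_{j}\,{Q^{r}_{\delta(j,i)}}^{\top}}_{\text{position-to-content}}
```

where Q^c and K^c project the content vectors, Q^r and K^r project the shared relative-position embeddings, and δ(i,j) is the bucketed relative distance from token i to token j.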
The relevance of DeBERTa v3 for Automated
Essay Scoring (AES) has been empirically
validated: studies on English-language datasets
report strong performance (Wang et al., 2024).
For the Indonesian language in particular, where
earlier models such as BERT and IndoBERT
typically achieved QWK scores in the range of 0.52–
0.55 (Setiawan et al., 2021), DeBERTa v3 has
demonstrated a clear advantage. A recent study
by Rahmat et al. (2024) reported that a DeBERTa v3
model fine-tuned on Indonesian student essays
achieved an average QWK score of 0.558. This score
reflects a moderate level of agreement with human
raters (Landis & Koch, 1977) and outperforms
popular baseline models for Indonesian AES.
Therefore, this study proposes the use of an optimized
DeBERTa v3 model with QWK as the primary metric
to improve accuracy and fairness in automated essay
evaluation.
Against this backdrop, the present study
explores the application of the DeBERTa v3 model
for the automatic scoring of essays written in
Indonesian. It particularly focuses on optimizing the
model's performance using the Quadratic Weighted
Kappa (QWK) metric, which is well-suited for
assessing agreement on ordinal scales and serves as a
robust indicator of alignment between machine-
predicted and human-assigned scores.
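For concreteness, QWK can be computed directly from a confusion matrix over the ordinal scale; the following is a minimal NumPy sketch for a 1–5 scale. The function name and the toy ratings are illustrative, not taken from this study:

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=5):
    """QWK for ordinal ratings on [min_rating, max_rating]."""
    n = max_rating - min_rating + 1
    # Observed matrix O[i, j]: count of items rated i by humans and j by the model.
    O = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        O[t - min_rating, p - min_rating] += 1
    # Quadratic penalty: disagreements weighted by squared distance on the scale.
    idx = np.arange(n)
    w = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    # Expected matrix under chance agreement, scaled to the same total count.
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (w * O).sum() / (w * E).sum()

# Toy check: four off-by-one disagreements on a 1-5 scale.
print(quadratic_weighted_kappa([1, 2, 3, 4, 5], [2, 3, 4, 5, 5]))  # ≈ 0.8
```

Because the weights grow quadratically with distance, a prediction two points away from the human score is penalized four times as heavily as a prediction one point away, which is what makes QWK well suited to ordinal essay scores.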
Figure 1: Overview of how Large Language Models
(LLMs) process text input and generate meaningful output.
2 METHODOLOGY
2.1 Overview of the Methodology
This study implements an AES system using a state-
of-the-art Transformer-based model, namely
DeBERTa v3 (Decoding-enhanced BERT with
disentangled attention), which is known for its
superior ability to understand textual context and
semantics. The proposed system consists of three
major stages: (1) data preprocessing, (2) training and
fine-tuning of the DeBERTa v3 model, and (3)
performance evaluation using the Quadratic
Weighted Kappa (QWK) metric. The research
process is conducted systematically to ensure
objectivity, reliability, and measurable outcomes.
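As an illustration of stage (2), fine-tuning could proceed along the following lines using the Hugging Face transformers library. The checkpoint name (microsoft/deberta-v3-base), the hyperparameters, and the helper functions are assumptions made for this sketch, not the study's actual configuration:

```python
# Sketch: fine-tuning DeBERTa v3 as a 5-class classifier over essay scores.
# Instructor scores 1-5 are mapped to zero-based class ids 0-4.

def score_to_label(score: int) -> int:
    """Map an instructor score (1-5) to a zero-based class id (assumed mapping)."""
    return score - 1

def label_to_score(label: int) -> int:
    """Map a predicted class id back to the 1-5 scale."""
    return label + 1

def finetune(train_texts, train_scores):
    # Heavy imports kept local so the mapping helpers above stay lightweight.
    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              TrainingArguments, Trainer)

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base", num_labels=5)

    enc = tokenizer(train_texts, truncation=True, padding=True, max_length=256)

    class EssayDataset(torch.utils.data.Dataset):
        def __init__(self, enc, scores):
            self.enc = enc
            self.labels = [score_to_label(s) for s in scores]
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    args = TrainingArguments(output_dir="aes-deberta", num_train_epochs=3,
                             per_device_train_batch_size=8, learning_rate=2e-5)
    Trainer(model=model, args=args,
            train_dataset=EssayDataset(enc, train_scores)).train()
    return model, tokenizer
```

Treating the ordinal scale as five classes is one common design choice; a regression head with rounded outputs is an equally plausible alternative that the text does not rule out.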
2.2 Dataset Description
The dataset used in this study comprises 500 open-
ended essay responses written in Bahasa Indonesia by
undergraduate students of the Informatics Program at
Universitas Sulawesi Barat. These essays were
collected from actual coursework and assessments
and reflect authentic student writing in an academic
setting. Each essay was manually scored by
university instructors using an ordinal scale ranging
from 1 to 5. Although the specific rubric dimensions
were not standardized across all courses, the scores
generally reflected aspects such as content relevance,
argumentation, coherence, and grammar. These
instructor-assigned scores served as ground truth
labels for supervised training.
In total, the dataset contained 500 data points,
with essay lengths ranging from approximately 20 to
120 words. After cleaning and preprocessing, all
samples were retained for model training and
evaluation. A 5-fold cross-validation scheme was
employed, with roughly 100 essays per fold, to ensure
stability and generalizability of the results across data
partitions. Inter-rater agreement was not directly
computed between multiple human scorers; however,
model performance was benchmarked using the
Quadratic Weighted Kappa (QWK) metric to capture
alignment between ordinal predictions and human
scores.
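The 5-fold scheme described above can be sketched as follows. The stratified variant (keeping the 1–5 score distribution similar across folds) and the balanced toy labels are assumptions, since the text does not specify how the folds were drawn:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-ins for the 500 essays and their 1-5 scores (illustrative, balanced classes).
essays = np.array([f"essay_{i}" for i in range(500)])
scores = np.repeat(np.arange(1, 6), 100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for train_idx, val_idx in skf.split(essays, scores):
    # Each fold: ~400 essays for fine-tuning, ~100 held out for evaluation.
    fold_sizes.append(len(val_idx))

print(fold_sizes)  # → [100, 100, 100, 100, 100]
```

Stratification matters here because with only ~100 essays per fold, an unlucky plain split could leave a fold with very few essays at the extreme scores (1 or 5), distorting the per-fold QWK.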
This dataset represents a realistic and domain-
specific resource for advancing Automated Essay
Scoring (AES) research in Bahasa Indonesia.
2.3 Data Preprocessing
The first step in this study involves preprocessing of
Indonesian-language essay texts. This stage is crucial
because raw data often contains noise such as spelling