Assessing Grade Levels of Texts via Local Search over Fine-Tuned LLMs
Changfeng Yu and Jie Wang
Richard Miner School of Computer and Information Sciences, University of Massachusetts, Lowell, MA, U.S.A.
Keywords:
Automatic Grade Assessment, Linguistic Features, Large Language Models, Local Search.
Abstract:
The leading method for determining the grade level of a written work involves training an SVC model on
hundreds of linguistic features (LFs) and a predicted grade generated by a fine-tuned large language model (FT-
LLM). When applied to a diverse dataset of materials for grades 3 through 12 spanning 33 genres, however,
this approach yields a poor accuracy of less than 51%. To address this issue, we devise a novel local-search
algorithm, called LS-LLM, that is independent of LFs. LS-LLM employs different FT-LLMs to identify a genre,
predict a genre-aware grade, and compare the readability of the text to a randomly selected set of annotated works
from the same genre and grade level. We demonstrate that LS-LLM significantly improves accuracy, exceeding
65%, and achieves over 92% accuracy within a one-grade error margin, making it viable for certain practical
applications. To further validate its robustness, we show that LS-LLM also enhances the performance of the
leading method on the WeeBit dataset used in prior research.
1 INTRODUCTION
The leading method for automatic grade assessment
(Lee et al., 2021) trains a multi-label SVC model on
all 255 known LFs and grade predictions from a fine-
tuned BERT model, which produces the best results
to date on the datasets of WeeBit (Vajjala and Meur-
ers, 2012) and Newsela (Xia et al., 2016). WeeBit
consists of texts categorized into five age groups and
spans a limited range of genres, while Newsela con-
tains only news articles. These datasets fall short of
our requirements for evaluating grade levels across di-
verse genres of written materials.
To address this need, we collected all freely avail-
able written works from the CommonLit Digital Li-
brary (CommonLit.org) along with their genres and
grade levels. This results in a dataset of 1,654 written
works spanning 33 genres for U.S. students in grades
3 through 12. We refer to this dataset as CLDL1654,
or simply CLDL.
Applying the leading method using the code pro-
vided by Lee et al., we train a multi-label SVC model
with all 255 LFs on CLDL and grade levels pre-
dicted by FT-M, with M being, respectively, BERT,
RoBERTa, BART, and GPT-4o. These models all ex-
hibit low accuracy below 51%. We further show that
using only about 10% of the LFs, varying for different
LLMs, the trained SVC model can achieve accuracy
levels comparable to those obtained using all 255 LFs.
This calls for a new approach independent of
LFs. Initially, we attempted to fine-tune a GPT-
4o classifier and use few-shot prompting with exam-
ples of texts at each grade level and genre. How-
ever, experimental results show that the accuracy of
these two approaches is below that of the SVC-based models, which is likely due to the complexity introduced by genre variation: texts from different genres at the
same grade level can vary significantly in style, struc-
ture, and vocabulary. Furthermore, a single few-shot
prompt cannot capture all representative examples,
and even if it could, GPT-4o may be influenced by
conflicting signals across genres.
This suggests the necessity of a new way to lever-
age the vast knowledge repository and strong infer-
ence capability of an LLM. To this end, we devise a
local-search method called LS-LLM that employs a
number of FT-LLMs, each tailored to a specific task.
LS-LLM falls in the framework of AI-oracle ma-
chines (Wang, 2025), which decomposes the grade as-
sessment into sub-tasks of genre identification, grade
assessment for texts of a specific genre, and readabil-
ity comparison for texts in the same genre. We ad-
dress each sub-task using an FT-LLM and apply a
local-search algorithm to determine the appropriate
grade level for a given text through an iterative pro-
cess, guided by the outputs of these sub-tasks.
We show that LS-LLM consistently outperforms
the leading method on CLDL and WeeBit with GPT-
4o and freely available BERT and RoBERTa as the
underlying LLMs.
This paper is organized as follows: Section 2 pro-
vides a brief overview of prior works. Section 3 eval-
uates the prior leading method. Sections 4 and 5 de-
scribe LS-LLM in detail and report evaluation results.
Section 6 concludes the paper.
2 RELATED WORK
Early systems for automatic readability assessment
include Dale-Chall (Chall and Dale, 1995)
and Fog (Gunning, 1969), which use linear regres-
sions to estimate readability based on lexical fea-
tures of word length, sentence length, syllable count,
and word frequencies. These features, however, fall
short in addressing semantics, discourse structure,
and other nuanced elements of language. Feng et al.
(Feng et al., 2009) analyzed a broader set of cogni-
tively motivated features, such as the number of enti-
ties in a sentence. Tonelli et al. (Tonelli et al., 2012)
reported a set of syntactic features related to part of
speech, phrasal structure, and dependency structure
of the text. These more complex features have been
shown to correlate better with part-of-speech usage
and complex nominal construction.
More sophisticated systems were later developed
using machine learning techniques. For example,
Schwarm and Ostendorf (Schwarm and Ostendorf, 2005) employed linguistic features (LFs) such as syn-
tactic complexity, semantic difficulty, and discourse
coherence to train an SVM model for predicting text
readability. The performance of these methods de-
pends heavily on how well the LFs capture the infor-
mation related to text readability (Lu, 2010).
Lee et al. (Lee et al., 2021) presented the lead-
ing method that trains an SVC model on 255 LFs
combined with a grade level of a written work pre-
dicted by an FT-PLM. SVC was chosen as the non-
neural classifier as it performs well on classification
with small training datasets. They evaluated their
method using WeeBit (Vajjala and Meurers, 2012)
and Newsela (Xia et al., 2016) as training data. Like-
wise, Deutsch et al. (Deutsch et al., 2020) showed that
incorporating only 86 LFs into LLMs can improve the
accuracy, especially with small training datasets. Re-
cent advances in LLMs have led to interest in reliably assessing and manipulating the readability of text, including measuring and modifying it with LLMs (Trott and Rivière, 2024; Engelmann et al., 2024).
3 GRADE ASSESSING WITH LFS
LFs can be computed for any input text using the Python library at https://github.com/brucewlee/lingfeat. We use the code provided by Lee et al. (Lee
et al., 2021) to train an SVC model using all 255 LFs,
employing various FT-LLMs to predict the grade level
of a written work. In particular, we divide CLDL
into a standard 80-20 split for training and testing,
and leverage the Scikit-Learn library. All subsequent
model training, fine-tuning, and evaluation will be
performed using this same 80-20 split.
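As an illustrative sketch of this setup (not the authors' released code), suppose the 255 LFs have already been extracted and the FT-LLM grade predictions are available; the synthetic arrays below are stand-ins for those inputs:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_texts = 1654                                            # size of CLDL
lf_matrix = rng.normal(size=(n_texts, 255))               # stand-in for the 255 LFs per text
labels = rng.integers(3, 13, size=n_texts)                # annotated grades 3-12
llm_grades = labels + rng.integers(-2, 3, size=n_texts)   # stand-in FT-LLM grade predictions

# The leading method appends the FT-LLM grade prediction to the handcrafted LFs.
X = np.hstack([lf_matrix, llm_grades.reshape(-1, 1)])

# Standard 80-20 split, as used throughout the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print("AD-0 accuracy:", accuracy_score(y_test, model.predict(X_test)))
```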
We fine-tune BERT, RoBERTa, BART, and GPT-
4o separately so that each can assign a grade to a given
written work. To fine-tune BERT, RoBERTa, and
BART, we apply the 5-fold cross validation method
using Hugging Face's transformers library with 10 epochs and a batch size of 1. We use fastai's learn.lr_find() to find the optimal learning rate during fine-tuning. To fine-tune GPT-4o, we use the default settings of GPT-4o and the following prompt template (note that in all prompts we specify that the user is an experienced assessor of the language and literature curricula for public K-12 schools in the US):
User: Your task is to determine the grade level
of the following text. {text}
Assistant: {grade level}
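For illustration, this template can be serialized into the JSONL chat format used by OpenAI's fine-tuning API; the system message wording and file name below are our assumptions, not taken from the paper:

```python
import json

def build_finetune_file(records, path="cldl_grade_ft.jsonl"):
    """Write (text, grade) pairs as chat-format fine-tuning examples."""
    system = ("You are an experienced assessor of the language and literature "
              "curricula for public K-12 schools in the US.")
    with open(path, "w", encoding="utf-8") as f:
        for text, grade in records:
            example = {"messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": "Your task is to determine the grade "
                                            f"level of the following text. {text}"},
                {"role": "assistant", "content": str(grade)},
            ]}
            f.write(json.dumps(example) + "\n")

# Example usage with two toy records.
build_finetune_file([("Once upon a time ...", 3), ("The mitochondrion ...", 9)])
```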
We name the corresponding SVC classifiers as
SVC-255/M, where M represents, respectively, FT-
BERT, FT-RoBERTa, FT-BART, and FT-GPT-4o. We
generalize this notation to SVC-k/M to represent a
model trained using k LFs with an FT-LLM M. Fig-
ure 1 depicts the fine-tuning and training processes
and the application of the models.
In addition to exact matches, where the predicted
grade aligns perfectly with the true grade, referred to as adjacent distance-0 (AD-0), we also include cases
where the predicted grade has an error margin of one
grade level, referred to as adjacent distance-1 (AD-1)
(Heilman et al., 2008). This adjustment accounts for
possible inconsistencies and potential imperfections
in human evaluations, providing a more nuanced as-
sessment. Using the same notation, we can define ad-
jacent distance-2 (AD-2) similarly.
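The AD-k measures can be computed directly from predicted and true grades, as in the following small helper (illustrative, not from the authors' code):

```python
def adjacent_accuracy(y_true, y_pred, k=0):
    """Fraction of predictions within k grade levels of the true grade (AD-k)."""
    hits = sum(abs(t - p) <= k for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# AD-0 is exact match; AD-1 tolerates a one-grade error.
y_true = [3, 5, 8, 12]
y_pred = [3, 6, 8, 10]
print(adjacent_accuracy(y_true, y_pred, k=0))  # 0.5
print(adjacent_accuracy(y_true, y_pred, k=1))  # 0.75
```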
Table 1: Evaluation of the leading method.
Model AD-0 AD-1 AD-2
SVC-255/BERT 0.4988 0.8871 0.9153
SVC-255/RoBERTa 0.5022 0.8915 0.9262
SVC-255/BART 0.4932 0.8902 0.9226
SVC-255/GPT-4o 0.5024 0.8891 0.9324
FT-GPT-4o (no LFs) 0.4512 0.8611 0.8922
Figure 1: Schematics of training and application of the hybrid model.
Table 1 shows the results on the test data of CLDL using the four SVC classifiers trained on the training data of CLDL, as well as the result generated by the fine-tuned GPT-4o without LFs, where fine-tuning is carried out using the training data of CLDL. The numbers in boldface indicate the largest value in each column.
It is evident that SVC-255/GPT-4o achieves the
highest AD-0 accuracy. This result can be regarded
as the performance ceiling when all LFs are incorpo-
rated.
We observe that among the 255 LFs, some are es-
sential, others are redundant, and a few are even coun-
terproductive. This observation motivates the follow-
ing investigation into how many of these features are
primarily responsible for the model’s performance.
According to their definitions, we select 100 LFs
that appear to be more significant and use them,
instead of all 255 LFs, to train an SVC model with an
FT-LLM for predicting grade levels using the leading
method. Our results demonstrate that the SVC model
using these 100 LFs achieves the same accuracy as
that trained on all 255 LFs when paired with the same
FT-LLM for predicting grades. These 100 LFs, along
with their feature names and definitions, are available at https://github.com/readability-assessment/ARA/blob/main/LFs.pdf, classified into four categories:
semantic, discourse-based, syntactic, and lexical
features.
We further observe that not all these 100 LFs are
necessary to achieve the same level of accuracy. To
identify how many LFs from these 100 LFs are es-
sential, we intend to carry out a grid search as fol-
lows: Enumerate all combinations of these 100 LFs,
and identify the smallest number of LFs such that an
SVC trained on them reaches the performance upper
bound. However, this approach results in an expo-
nential blowup, rendering it intractable to implement.
Moreover, our experiments also indicate that, because
different PLMs are trained differently, the essential
LFs may vary across different PLMs.
To reduce computation time, instead of exhaus-
tively evaluating all combinations of LFs, we conduct
a constrained grid search as follows: (1) For each
PLM, select an approximately equal number of LFs
from each category independently at random, starting
from 0 to max, where max is the largest number of
LFs in a category, with a total number of LFs from
0 to 100. (2) Train an SVC model using these LFs in the same manner as before. (3) After training the SVC classifier with the reduced number of LFs, we assess its performance on the test data of CLDL. To ensure robustness, we repeat the experiment three times for each number k of selected LFs, and the final accuracy reported is the average across these three runs.
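A sketch of this constrained search is given below; the category dictionary and the evaluate_svc scoring helper are assumed, illustrative interfaces rather than the authors' code:

```python
import random

def constrained_search(lf_by_category, evaluate_svc, step=4, max_k=36, trials=3):
    """Sample an approximately equal number of LFs from each category at random,
    score an SVC trained on them, and average over several runs.

    lf_by_category: dict mapping category name -> list of LF names (illustrative).
    evaluate_svc:   callable taking a list of LF names and returning AD-0 accuracy.
    """
    results = {}
    n_cat = len(lf_by_category)
    for k in range(0, max_k + 1, step):
        per_cat = k // n_cat                       # roughly equal share per category
        scores = []
        for _ in range(trials):
            chosen = []
            for feats in lf_by_category.values():
                chosen += random.sample(feats, min(per_cat, len(feats)))
            scores.append(evaluate_svc(chosen))
        results[k] = sum(scores) / trials          # report the 3-run average
    return results
```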
Table 2 depicts the evaluation results of AD-0.
It is evident that increasing the number of LFs does
not necessarily lead to improved AD-0 accuracy, as
some LFs can be counterproductive. For example,
SVC/GPT-4o with 20 LFs achieves an AD-0 accuracy
of 50.5%, which drops to 50.3% when using 24 LFs,
and ultimately falls further to 50.24% when all LFs
are used, as shown in Table 1, where “SVC/M with k
LFs” is defined in the same manner as “SVC/M with
all LFs,” and k represents the number of LFs used.
4 LS-LLM
Let M be the LLM chosen for fine-tuning genre as-
sessors, grade assessors (one for each genre), and text
comparators (one for each genre).
4.1 Genre Assessor
We observe that it is more appropriate to compare
readability between written works in the same genre,
as texts from different genres such as poem and biog-
raphy can vary significantly, even at the same grade
level. To support this, we fine-tune M to create a genre
assessor that predicts the genre of a given text.
Table 2: Evaluation of SVC models with k LFs (AD-0 accuracy).

Model            k=0    k=4    k=8    k=12   k=16   k=20   k=24   k=28   k=32   k=36
SVC-k/BERT       0.432  0.463  0.487  0.489  0.496  0.504  0.501  0.493  0.502  0.498
SVC-k/RoBERTa    0.420  0.455  0.461  0.475  0.489  0.506  0.495  0.499  0.502  0.506
SVC-k/BART       0.428  0.452  0.469  0.481  0.484  0.498  0.501  0.499  0.501  0.493
SVC-k/GPT-4o     0.451  0.473  0.489  0.502  0.499  0.505  0.503  0.498  0.499  0.502
If M is a generative model such as BART and
GPT-4o, we fine-tune M using the following prompt
template, where {genre list} consists of all genres in CLDL, as listed in Table 3:
User: Your task is to determine the genre of the following text. {text}
The list of genres is given below: {genre list}
Assistant: {the genre of the text}
If M is a non-generative transformer such as
BERT and RoBERTa, we fine-tune M as a classifier
following the standard procedure.
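For a non-generative M such as BERT, the standard procedure amounts to fine-tuning a sequence classifier over the genre labels; a minimal sketch with Hugging Face's transformers (epochs and batch size as in Section 3, data and learning rate as placeholders) is:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Placeholder training data: texts and integer-encoded genre labels from CLDL.
texts, labels = ["An example passage ..."], [0]
num_genres = 33

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_genres)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True,
                                 padding="max_length", max_length=512),
            batched=True)

# The paper selects the learning rate with fastai's lr_find; 2e-5 is a placeholder.
args = TrainingArguments(output_dir="genre-assessor", num_train_epochs=10,
                         per_device_train_batch_size=1, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=ds).train()
```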
4.2 Partitioning
It is evident from Table 3 that texts in CLDL are un-
evenly distributed across genres, and for certain gen-
res, there is an insufficient number of texts spanning
all grade levels. To address these issues, we group
texts by similar genres to ensure that each genre group
contains an adequate number of texts at each grade
level. To do so, let E = [e_1, e_2, ..., e_n] denote the list of n genres for the underlying dataset (n = 33 for CLDL), sorted in descending order according to the percentage p_i of the number of texts with genre e_i over the total number of texts in the dataset. Let K be the smallest number such that

∑_{i=1}^{K} p_i ≥ τ,  for a chosen threshold τ ∈ (1/2, 1].

We partition E into K clusters C_1, C_2, ..., C_K, with genre e_i ∈ C_i for i = 1, ..., K. We call e_i the base genre of C_i. For each remaining genre e_{K+1}, ..., e_n, we place it in the cluster C_i whose base genre e_i it is most similar to among all clusters. The similarity of two genres is calculated as the cosine similarity of the BERT embeddings of sentences describing the respective genres. We generate these sentences using GPT-3.5 with the following prompt template:
User: Your task is to generate an explanation of the genre {name of the genre} in one sentence.
Denote by D_i, for i = 1, ..., K, the subset of texts and the corresponding grades whose genres are in C_i, as shown in Figure 2.
Figure 2: Grouping texts according to genre participation.
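As an illustrative sketch of the assignment step (the mean-pooled BERT embedding below is our assumption for "the BERT embeddings of sentences," and the description dictionaries are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    """Mean-pooled BERT embedding of a genre description sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state      # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)               # (768,)

def assign_to_clusters(base_descriptions, other_descriptions):
    """Attach each remaining genre to the base genre with the highest cosine similarity.

    base_descriptions / other_descriptions: dicts genre -> GPT-3.5 description sentence.
    """
    base_vecs = {g: embed(s) for g, s in base_descriptions.items()}
    clusters = {g: [g] for g in base_descriptions}       # each C_i starts with its base genre
    for genre, sent in other_descriptions.items():
        v = embed(sent)
        best = max(base_vecs,
                   key=lambda b: torch.cosine_similarity(v, base_vecs[b], dim=0).item())
        clusters[best].append(genre)
    return clusters
```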
4.3 Grade Assessors
If M is a generative model, we fine-tune M to predict grade levels for written works in each subset D_i using the following prompt template, resulting in a grade assessor denoted as GA_i:
User: Your task is to determine the grade level of the following text. {text}
Assistant: {grade level}
If M is a non-generative model, we fine-tune M as a classifier to classify grade levels for files in D_i following the standard procedure, still denoted as GA_i.
4.4 Text Comparators
From each D_i, we select independently at random m (e.g., m = 10) written works at a given grade level g to create a set of reference texts, denoted by

F_{i,g} = { f_{i,g,j} | j = 1, ..., m }.    (1)

Next, we construct a labeled pairwise dataset for fine-tuning M as follows: For each grade level g ∈ [g_min, g_max − ℓ] with ℓ ≥ 1, where g_min and g_max denote, respectively, the lowest and the highest grade levels in the dataset (e.g., in CLDL, g_min = 3 and g_max = 12), let

P^+_{i,g+j} = { ((x, y), +1) | (x, y) ∈ F_{i,g} × F_{i,g+j} },
P^-_{i,g+j} = { ((x, y), −1) | (x, y) ∈ F_{i,g+j} × F_{i,g} },

where +1 and −1 are labels, and 1 ≤ j ≤ ℓ sets the range of grade levels. Let

P_i = ∪_{g_min ≤ g ≤ g_max−ℓ, 1 ≤ j ≤ ℓ} ( P^+_{i,g+j} ∪ P^-_{i,g+j} ).    (2)

Finally, if M is a generative model, we fine-tune it on P_i to create a text comparator, denoted by TC_i, with the following prompt template:
Table 3: The genres in CLDL in descending order of percentage, where “R” represents the ranking of a genre in terms of the
number of texts in that genre.
R Genre % R Genre % R Genre %
1 Information text 0.3622 12 Fable 0.0160 23 Science fiction 0.0046
2 Poem 0.1709 13 Psychology 0.0153 24 Religious text 0.0038
3 Short story 0.1041 14 Fantasy 0.0122 25 Political theory 0.0038
4 Essay 0.1041 15 Folktale 0.0122 26 Allegory 0.0030
5 Fiction 0.0574 16 Opinion 0.0115 27 Autobiography 0.0030
6 Speech 0.0428 17 Myth 0.0076 28 Legal document 0.0023
7 Biography 0.0383 18 Primary source doc 0.0076 29 Satire 0.0022
8 News 0.0214 19 Historical fiction 0.0068 30 Letter 0.0015
9 Memoir 0.0176 20 Philosophy 0.0067 31 Main ideas 0.0007
10 Non-fiction 0.0161 21 Drama 0.0054 32 Magical realism 0.0007
11 Interview 0.0161 22 Historical document 0.0053 33 Skill lesson 0.0007
User: You are provided with a pair of texts delimited with XML tags. Your task is to determine which of the two texts is more difficult to read.
<text 1> {x_i} </text 1>
<text 2> {y_i} </text 2>
Assistant: {<text 1> or <text 2>}
If M is a non-generative model, we fine-tune M on P_i as a binary classifier to determine which of the two input texts is more difficult to read, following the standard procedure. Figures 3 and 4 depict the process of fine-tuning these models.
Figure 3: A schematic for fine-tuning the genre assessor.
Figure 4: A schematic for fine-tuning genre-aware grade
assessors and text comparators.
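The construction of the reference sets and the pairwise dataset P_i in Equations (1) and (2) can be sketched as follows, assuming texts_by_grade maps each grade level of D_i to its texts (an illustrative interface):

```python
import random

def build_reference_sets(texts_by_grade, m=10, seed=0):
    """Sample m reference texts F_{i,g} for each grade level g (Equation 1)."""
    rng = random.Random(seed)
    return {g: rng.sample(texts, min(m, len(texts)))
            for g, texts in texts_by_grade.items()}

def build_pairwise_dataset(ref_sets, g_min=3, g_max=12, ell=2):
    """Build the labeled pairs P_i of Equation (2): label +1 when the pair is
    (lower-grade text, higher-grade text), and -1 for the reversed order."""
    pairs = []
    for g in range(g_min, g_max - ell + 1):        # g ranges over [g_min, g_max - ell]
        for j in range(1, ell + 1):                # 1 <= j <= ell
            lower, higher = ref_sets[g], ref_sets[g + j]
            pairs += [((x, y), +1) for x in lower for y in higher]   # P+_{i,g+j}
            pairs += [((x, y), -1) for x in higher for y in lower]   # P-_{i,g+j}
    return pairs

# Example: pairs for one cluster of CLDL with grades 3-12 and ell = 2.
# refs = build_reference_sets(d_i_texts_by_grade)   # d_i_texts_by_grade is assumed
# p_i = build_pairwise_dataset(refs)
```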
4.5 The Local-Search Algorithm
Let F be an input text. Figure 5 depicts the data flow of LS-LLM.
Figure 5: Data flow of LS-LLM.
1. Use the genre assessor to predict F's genre, denoted by e.
2. Case 1: e ∈ C_i for some i (1 ≤ i ≤ K).
(a) Use the grade assessor GA_i to predict an initial grade level of F, denoted as g, and use it as the starting point for carrying out the local search.
(b) Select at random m files from D_i with grade level g, denoted by F_{i,g}.
(c) Use the text comparator TC_i to compare F with each file in F_{i,g}. Let n^L_{i,g} denote the number of texts in F_{i,g} that TC_i judges to be easier to read than F. Define the relative difficulty index (RDI) by

RDI_{i,g} = n^L_{i,g} / m.    (3)

Case 1.1: RDI_{i,g} < 0.4. If g > g_min, then set g ← g − 1, one grade lower, and repeat the algorithm. Otherwise, the local search concludes with the output "F is easier than Grade g_min."
Case 1.2: 0.4 ≤ RDI_{i,g} ≤ 0.6. The local search concludes with the current value of g being the final grade of F.
Case 1.3: RDI_{i,g} > 0.6. If g < g_max, then set g ← g + 1, one grade higher, and repeat the algorithm. Otherwise, the local search concludes with the output "F is harder than Grade g_max."
3. Case 2: e ∉ E. Namely, e is unseen in the training data. We identify the existing genre that is most similar to e using the same similarity method applied in genre clustering and proceed as in Case 1, applying the genre-aware grade assessor and text comparator associated with that genre.
Remark. While we may randomly select reference works from the training data for a given grade on the fly, independent of those used for fine-tuning the text comparators, our experiments show that this approach achieves almost the same accuracy.
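A sketch of the search loop of Case 1 is given below; genre_assessor, grade_assessors, comparators, and ref_texts are assumed wrappers around the fine-tuned models, and the comparator is assumed to return which of its two inputs it judges more difficult:

```python
import random

def ls_llm(text, genre_assessor, grade_assessors, comparators, ref_texts,
           g_min=3, g_max=12, m=10, low=0.4, high=0.6):
    """Local search over grade levels guided by the relative difficulty index (RDI)."""
    cluster = genre_assessor(text)          # index i of the cluster of F's predicted genre
    g = grade_assessors[cluster](text)      # initial grade from GA_i
    while True:
        refs = random.sample(ref_texts[cluster][g], m)
        # Count reference texts judged easier to read than the input text.
        n_lower = sum(1 for r in refs
                      if comparators[cluster](text, r) == "text 1")
        rdi = n_lower / m                   # Equation (3)
        if rdi < low:                       # Case 1.1: move one grade down
            if g == g_min:
                return f"easier than Grade {g_min}"
            g -= 1
        elif rdi > high:                    # Case 1.3: move one grade up
            if g == g_max:
                return f"harder than Grade {g_max}"
            g += 1
        else:                               # Case 1.2: settled on the final grade
            return g
```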
5 EVALUATION
We first evaluate the accuracy of the genre assessor,
grade assessor, and text comparator on CLDL. We
then evaluate the overall performance of LS-LLM/M
on both CLDL and WeeBit. We would like to apply
LS-LLM to Newsela, but we have not received permission to access Newsela at the time of writing.¹
For brevity, we sometimes refer to a model as a
D-based model if it is trained or fine-tuned on the
dataset D. We carry out evaluations for WeeBit in
two settings: (1) Repeat the same fine-tuning pro-
cess for WeeBit as for CLDL but with five levels of
readability using a genre-agnostic grade assessor. (2)
Apply the CLDL-based genre assessor, genre-aware
grade assessors, and genre-aware text comparators
to WeeBit. Finally, we compare the performance of
genre-agnostic LS-LLM with genre-aware LS-LLM
on both CLDL and WeeBit, as well as the number of
visits to LLMs and the actual running time.
5.1 Evaluation of CLDL-Based Models
The test sets for the genre assessor and grade asses-
sor are the test data of CLDL. The test set for the text
comparator is constructed in the same way as for constructing P_i (see Equation (2)) with the following settings. For CLDL: g_min = 3, g_max = 12, and ℓ = 2. For WeeBit: g_min = 1, g_max = 5, and ℓ = 1, where the readability level is treated as the grade level. We use average precision to measure accuracy. For the genre assessor, a predicted genre is considered correct if it falls in the correct cluster of genres. Table 4 shows the evaluation results, where GenA stands for "genre assessor," GraA for "grade assessor," and TexC for "text comparator."
It can be seen that the CLDL-based genre asses-
sor, genre-aware grade assessor, and genre-aware text
comparator using GPT-4o achieve the highest accu-
racy compared to other LLMs, with accuracies ex-
ceeding 82%, 45%, and 85%, respectively. We will
use the CLDL-based genre assessor using GPT-4o as
¹ Access to Newsela requires permission, as does WeeBit.
Table 4: Evaluation of CLDL-based models.
Model GenA GraA TexC
BERT 0.7435 0.3912 0.7692
RoBERTa 0.7847 0.3968 0.8121
BART 0.7422 0.3975 0.8010
GPT-3.5 0.7833 0.4017 0.8244
GPT-4o 0.8206 0.4512 0.8538
the default genre assessor for its highest accuracy. It
is worth noting that the genre assessor may generate
a new genre not present in the training data.
5.2 Evaluation of LS-LLM on CLDL
and WeeBit
WeeBit does not provide genre information. To resolve this, we use the genre assessor trained on CLDL
to generate genres for all 3,115 written works in
WeeBit. Table 5 depicts the results. A total of 857
written works have generated genres not included in
CLDL, highlighting the advantage of using a genera-
tive model over a traditional classifier.
Table 5: Statistical results of predicted genres for WeeBit, where "#" represents the number of texts.

R  Genre              #     R  Genre               #
1  Info text          1424  14 Lang lesson         18
2  News               386   15 Short story         18
3  Advertisement      365   16 Mathematics         14
4  Interview          345   17 Poem                8
5  Information        99    18 Biography           8
6  Science            78    19 Drama               7
7  Statement          71    20 FLLR                5
8  Info technology    68    21 Religious           3
9  Summary            49    22 Recipe              2
10 Education          47    23 Joke                2
11 Literary analysis  37    24 Case study          1
12 Philosophy         30    25 Character analysis  1
13 Opinion            29
Selecting τ = 0.65 and τ = 0.75, respectively, for CLDL and WeeBit yields K = 4 for both datasets in genre partitioning, which means that each dataset is partitioned into four groups, with the top four genres in Tables 3 and 5 being, respectively, the base genres of the underlying clusters. This partition provides a sufficient number of works in each D_i spanning all grade levels. We set m = 10 to construct the set of reference works F_{i,g} (see Equation (1)) for each D_i.
Table 6 depicts the evaluation results, where GPT-4o (direct) generates grade levels using a few-shot prompt, LS-L stands for LS-LLM, /3.5 and /4o stand for /GPT-3.5 and /GPT-4o, and SVC-255/4o is trained, respectively, on CLDL and WeeBit.
It can be seen that, for both CLDL and WeeBit
under both AD-0 and AD-1 accuracy, LS-LLM/M
Table 6: Evaluation results of various models trained or
fine-tuned on their respective datasets.
Model              AD-0 (CLDL)  AD-0 (WeeBit)  AD-1 (CLDL)  AD-1 (WeeBit)
GPT-4o (Direct) 0.4378 0.7623 0.8420 0.8220
FT-GPT-4o 0.4512 0.8950 0.8611 0.9050
SVC-255/4o 0.5024 0.9187 0.8891 0.9532
LS-L/BERT 0.6387 0.9195 0.9103 0.9593
LS-L/RoBERTa 0.6516 0.9221 0.9179 0.9611
LS-L/BART 0.6425 0.9250 0.9101 0.9678
LS-L/3.5 0.6526 0.9316 0.9174 0.9668
LS-L/4o 0.6542 0.9327 0.9202 0.9697
for all M outperforms the leading method trained
with all LFs, which in turn outperforms fine-tuned
GPT-4o, and fine-tuned GPT-4o outperforms out-of-
the-box GPT-4o. In particular, under the measure
of AD-0, for CLDL, LS-LLM/GPT-4o achieves a
substantial 23.20% improvement. Even the lowest-performing model, LS-LLM/BERT, surpasses the leading method with a notable 21.34% improvement.
For WeeBit, LS-LLM/GPT-4o achieves a 1.50% im-
provement over the leading method.
It can also be seen that all models achieve higher
accuracy on WeeBit compared to CLDL. This is likely
because WeeBit features coarser readability levels, al-
lowing certain grade predictions that are incorrect for
CLDL to be correct for WeeBit.
Table 7: Evaluation of CLDL-based models on WeeBit.
Model AD-0 AD-1
SVC-255/GPT-4o 0.4412 0.5929
LS-LLM/BERT 0.4648 0.6246
LS-LLM/RoBERTa 0.4701 0.6290
LS-LLM/BART 0.4677 0.6263
LS-LLM/GPT-3.5 0.4711 0.6302
LS-LLM/GPT-4o 0.4716 0.6308
Next, we evaluate the transferability of CLDL-based LS-LLM/M to WeeBit. For a written work F in the test set of CLDL, if LS-LLM predicts "F is easier than Grade 3," we classify F as belonging to Grade 3. Similarly, if LS-LLM predicts "F is harder than Grade 12," we classify F as belonging to Grade 12. We map the grade predicted by LS-LLM/M to WeeBit levels as follows (see the helper sketch after this paragraph): (1) Texts easier than Grade 3 are classified as Level 1. (2) Texts at Grades 3 and 4 are classified as Level 2. (3) Texts at Grades 5 and 6 are classified as Level 3. (4) Texts at Grades 7, 8, and 9 are classified as Level 4. (5) Texts at Grades 10, 11, and 12, and those harder than Grade 12, are classified as Level 5. Table 7 presents the evaluation results, where SVC/GPT-4o with all LFs is trained on CLDL.
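For clarity, the mapping above can be written as a small helper (illustrative):

```python
def cldl_grade_to_weebit_level(pred):
    """Map an LS-LLM grade prediction (or out-of-range verdict) to a WeeBit level."""
    if pred == "easier than Grade 3":
        return 1
    if pred == "harder than Grade 12":
        return 5
    if pred in (3, 4):
        return 2
    if pred in (5, 6):
        return 3
    if pred in (7, 8, 9):
        return 4
    return 5  # Grades 10-12 and "harder than Grade 12"
```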
5.3 The Role of Genres
We compare the performance of LS-LLM/GPT-4o using genre-agnostic versus genre-aware grade assessors (GA) and text comparators (TC), where genre-agnostic models are fine-tuned without organizing the training data according to genre. Table 8 depicts the evaluation results.
Table 8: The average AD-0 accuracy of LS-LLM/GPT-4o.
Method            AD-0 (CLDL)   AD-0 (WeeBit)
Genre-agnostic    65.22%        93.25%
Genre-aware       65.42%        93.27%
It appears that the genre-aware method performs
slightly better; however, the advantage is marginal,
which is somewhat counterintuitive. This may be at-
tributed to the imbalance in the dataset across genres,
where a few dominant genres disproportionately in-
fluence the results. To enable a fairer comparison, a
more balanced dataset is necessary for future studies.
Table 9 (a) and (b) show, respectively, the maxi-
mum and average numbers of visits to LLMs and the
running time of LS-LLMs.
Table 9: The number of visits to LLMs and the running time, where G-AG and G-AW stand for, respectively, genre-agnostic and genre-aware.

(a) Number of visits to LLMs
        Maximum            Average
        CLDL   WeeBit      CLDL   WeeBit
G-AG    41     21          14.8   12.9
G-AW    32     22          12.6   12.5

(b) Running time
                Worst-case time     Average time
                CLDL    WeeBit      CLDL    WeeBit
BERT    G-AG    28.23   14.13       10.21   9.13
        G-AW    23.26   15.60       9.04    9.01
GPT-4o  G-AG    42.34   21.70       15.38   13.33
        G-AW    33.06   22.76       13.22   12.92
It can be seen that, in general, the genre-agnostic
approach requires more visits to the underlying fine-
tuned PLM models compared to the genre-aware ap-
proach. This is expected, as the genre-aware approach
is confined to a smaller set of genres, which results in
a faster local search process. Consequently, the ac-
tual running time of LS-LLM using fine-tuned PLMs
like BERT or RoBERTa, which run locally, is signifi-
cantly shorter compared to LS-LLM using fine-tuned
commercial PLMs such as the GPT-4o API. Table 9 (b) depicts the running-time comparison, where the fine-tuned BERT models are run on an NVIDIA GeForce RTX 3090 GPU.
6 CONCLUSIONS
We presented a novel local search method for read-
ability assessment, leveraging fine-tuned models over
a selected PLM for various tasks. Our experiments
demonstrated that the proposed local search method
significantly enhances automatic readability assessment (ARA) accuracy over the leading method. Investigations for further improvements
of accuracy can be carried out along the following
lines: (1) Construct a dataset that is larger and more
balanced than CLDL. Specifically, for each genre, we
aim to collect a sufficient number of written works
that are evenly distributed across all grade levels. This
will eliminate the need to partition the dataset by sim-
ilar genres and enable fairer comparisons between
genre-agnostic and genre-aware grade assessment and
readability evaluation methods. (2) Explore alterna-
tive black-box LLMs with improved fine-tuning ca-
pabilities to enhance the accuracy of various tasks.
(3) Investigate white-box LLMs, such as the LLaMA
models, to optimize fine-tuning for specific tasks.
REFERENCES
Collins-Thompson, K. (2014). Computational assessment of text readability: A survey of current and future research.
Deutsch, T., Jasbi, M., and Shieber, S. (2020). Linguis-
tic features for readability assessment. In Burstein,
J., Kochmar, E., Leacock, C., Madnani, N., Pilán, I.,
Yannakoudakis, H., and Zesch, T., editors, Proceed-
ings of the Fifteenth Workshop on Innovative Use of
NLP for Building Educational Applications, pages 1–
17.
Engelmann, B., Kreutz, C. K., Haak, F., and Schaer, P.
(2024). ARTS: Assessing readability and text simplicity. In Proceedings of EMNLP.
Feng, L., Elhadad, N., and Huenerfauth, M. (2009). Cog-
nitively motivated features for readability assessment.
In Lascarides, A., Gardent, C., and Nivre, J., editors,
Proceedings of the 12th Conference of the European
Chapter of the ACL (EACL 2009), pages 229–237.
Filighera, A., Steuer, T., and Rensing, C. (2019). Auto-
matic Text Difficulty Estimation Using Embeddings
and Neural Networks, pages 335–348.
Gunning, R. (1969). The fog index after twenty years. Journal of Business Communication, 6(2):3–13.
Hale, J. (2016). Information-theoretical complexity metrics. Language and Linguistics Compass, 10:397–412.
Heilman, M., Collins-Thompson, K., and Eskenazi, M.
(2008). An analysis of statistical models and features
for reading difficulty prediction. Proceedings of the
Third Workshop on Innovative Use of NLP for Build-
ing Educational Applications.
Holtgraves, T. (1999). Comprehending indirect replies:
When and how are their conveyed meanings acti-
vated? Journal of Memory and Language, 41(4):519–
540.
Chall, J. S. and Dale, E. (1995). Readability Revisited: The New Dale-Chall Readability Formula. Brookline Books.
Lee, B. W., Jang, Y. S., and Lee, J. (2021). Pushing on text
readability assessment: A transformer meets hand-
crafted linguistic features. In Moens, M.-F., Huang,
X., Specia, L., and Yih, S. W.-t., editors, Proceedings
of the 2021 Conference on Empirical Methods in Nat-
ural Language Processing, pages 10669–10686. Association for Computational Linguistics.
Lee, B. W. and Lee, J. (2020). Lxper index 2.0: Improving
text readability assessment for l2 English learners in
South Korea.
Lu, X. (2010). Automatic analysis of syntactic complexity
in second language writing. International Journal of
Corpus Linguistics, 15:474–496.
Peabody, M. A. and Schaefer, C. (2016). Towards semantic
clarity in play therapy. International Journal of Play
Therapy, 25:197–202.
Schwarm, S. and Ostendorf, M. (2005). Reading level as-
sessment using support vector machines and statisti-
cal language models. In Knight, K., Ng, H. T., and
Oflazer, K., editors, Proceedings of the 43rd Annual
Meeting of the Association for Computational Lin-
guistics (ACL’05), pages 523–530.
Tonelli, S., Tran Manh, K., and Pianta, E. (2012). Mak-
ing readability indices readable. In Williams, S., Sid-
dharthan, A., and Nenkova, A., editors, Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations, pages 40–48. Montréal, Canada.
Trott, S. and Rivière, P. (2024). Measuring and modifying the readability of English texts with GPT-4. In Shardlow, M., Saggion, H., Alva-Manchego, F., Zampieri, M., North, K., Štajner, S., and Stodden, R., editors, Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024), pages 126–134. Association for Computational Linguistics.
Vajjala, S. and Meurers, D. (2012). On improving the accuracy of readability classification using insights from second language acquisition. pages 163–173.
Wang, J. (2025). AI-oracle machines for intelligent computing. AI Matters, 11:8–11.
Xia, M., Kochmar, E., and Briscoe, T. (2016). Text
readability assessment for second language learn-
ers. In Tetreault, J., Burstein, J., Leacock, C., and
Yannakoudakis, H., editors, Proceedings of the 11th
Workshop on Innovative Use of NLP for Building Ed-
ucational Applications, pages 12–22.