Leveraging Large Language Models for Semantic Evaluation of RDF Triples
André Gomes Regino 1 a, Fernando Rezende Zagatti 1,4 b, Rodrigo Bonacin 1 c,
Victor Jesus Sotelo Chico 2 d, Victor Hochgreb 2 e and Julio Cesar dos Reis 3 f
1 Center for Information Technology Renato Archer, Campinas, São Paulo, Brazil
2 GoBots, Campinas, São Paulo, Brazil
3 Institute of Computing, University of Campinas, Campinas, São Paulo, Brazil
4 Department of Computing, UFSCar, São Carlos, Brazil
a https://orcid.org/0000-0001-9814-1482, b https://orcid.org/0000-0002-7083-5789,
c https://orcid.org/0000-0003-3441-0887, d https://orcid.org/0000-0001-9245-8753,
e https://orcid.org/0000-0002-0529-7312, f https://orcid.org/0000-0002-9545-2098
Keywords: LLM as a Judge, RDF Triple Generation, RDF Triple Validation.
Abstract: Knowledge Graphs (KGs) depend on accurate RDF triples, making the quality assurance of these triples a
significant challenge. Large Language Models (LLMs) can serve as graders for RDF data, providing scalable
alternatives to human validation. This study evaluates the feasibility of utilizing LLMs to assess the quality
of RDF triples derived from natural language sentences in the e-commerce sector. We analyze 12 LLM con-
figurations by comparing their Likert-scale ratings of triple quality with human evaluations, focusing on both
complete triples and their individual components (subject, predicate, object). We employ statistical correla-
tion measures (Spearman and Kendall Tau) to quantify the alignment between LLM and expert assessments.
Our study examines whether justifications generated by LLMs can indicate higher-quality grading. Our find-
ings reveal that some LLMs demonstrate moderate agreement with human annotators and none achieve full
alignment. This study presents a replicable evaluation framework and emphasizes the current limitations and
potential of LLMs as semantic validators. These results support efforts to incorporate LLM-based validation
into KG construction processes and suggest avenues for prompt engineering and hybrid human-AI validation
systems.
1 INTRODUCTION
Knowledge Graphs (KGs) are organized representa-
tions of information that support semantic reason-
ing and knowledge discovery in various fields. KGs
consist of RDF (Resource Description Framework)
triples formatted as (subject, predicate, object) (Bon-
atti et al., 2019). These triples represent factual state-
ments and relationships between entities in a format
that machines can read, which is important for appli-
cations like search engines, recommendation systems,
and question answering. Building and maintaining
a KG effectively relies on the precise extraction and
verification of these triples from unstructured or semi-
structured text sources.
In recent years, Large Language Models (LLMs)
like GPT-4 (Achiam et al., 2023), Llama (Touvron
et al., 2023) and DeepSeek (Liu et al., 2024) have
shown significant success in various Natural Lan-
guage Processing (NLP) tasks. These models capture
complex linguistic and contextual patterns, leading to
strong effectiveness in summarization (Zhang et al.,
2025), translation (Elshin et al., 2024), and question
answering (Zhuang et al., 2023). Due to their ad-
vanced generative and reasoning abilities, LLMs are
being investigated as tools for automating the creation
of RDF triples from natural language (Regino et al.,
2023), offering a scalable method for constructing and
enhancing KGs.
The outputs of the LLM generation need thorough
evaluation, particularly regarding RDF triples. The
reliability of these triples is not guaranteed, as LLMs
may hallucinate, overlook contextual constraints, or
generate outputs that are incompatible with the un-
derlying ontology. Without careful inspection, these
triples can propagate factual errors, biases, or vio-
late domain constraints, compromising the reliability
and trustworthiness of the entire KG. This assessment
is used to establish trust in the system, as users and
downstream applications depend on the accuracy and
relevance of the knowledge produced.
LLM-generated triples must conform to the on-
tological structure and semantics of the target KG.
The evaluation process is typically manual, necessi-
tating expert human reviewers, which renders it time-
consuming, expensive, and challenging to implement
at scale in practical applications. The systematic eval-
uation of these triples, particularly those generated
by LLMs, remains an under-researched area (Regino
and dos Reis, 2025; Khorashadizadeh et al., 2024).
There is a lack of established metrics, protocols, and
frameworks for assessing triple quality in alignment
with human expectations and ontological consistency.
This intersection presents a valuable opportunity to
investigate whether LLMs can serve as evaluators, or
“judges”, of structured data, potentially reducing the
need for human reviewers and the heavy workload that
falls on them.
Based on our exploratory literature review, we
observe that there is currently no widely accepted
framework implemented as a software tool for auto-
matically assessing the quality of RDF triples prior
to their integration into a KG. The absence of auto-
mated and reliable evaluation mechanisms, particu-
larly those that can closely mimic human judgment
with quality, poses an obstacle in the construction and
validation of KGs for real-world applications.
This article examines the effectiveness of LLMs
as automated evaluators, or “judges”, of RDF triples
generated from natural language text. Although
LLMs frequently create structured data from unstruc-
tured sources (Racharak et al., 2024), their ability to
assess the semantic adequacy, clarity, and coherence
of this structured output is not well-studied. We aim
to assess whether current LLMs can achieve human-
level evaluations of RDF triples in terms of their cor-
rectness and fidelity to the original sentences. Rec-
ognizing this capability can reduce the human efforts
needed for triple validation, which is an obstacle in
the construction and maintenance of adequate KGs.
We design and develop an evaluation pipeline
with 12 configurations of LLMs, which include three
model families (Gemma, Qwen, Sabia), two model
sizes (small, large), and two prompting strategies
(zero-shot, few-shot). Our study assesses each con-
figuration using 300 Portuguese sentence-triple pairs,
scoring them on four dimensions: subject, predicate,
object, and the complete triple. Ten human annota-
tors independently rated these triples to establish a
gold standard. We analyze the ordinal ratings from the
LLMs against the human judgments using two rank-
based correlation metrics to determine the extent to
which LLMs align with human preferences. Our in-
vestigation conducts a meta-evaluation to assess the
clarity and coherence of the justifications provided by
LLMs, thereby enhancing the quality assurance pro-
cess.
The result analysis demonstrates that LLMs pro-
vide lower average scores than human annotators
overall, except for the “subject” dimension, where
LLMs rate more generously. Few-shot prompting,
small models, and Sabia-family LLMs align most
closely with human evaluations. This suggests that
LLMs can capture some aspects of semantic plausi-
bility; nevertheless, the overall correlation levels re-
main below those typically expected for high-stakes
or expert-level annotation tasks. This highlights that,
although promising, LLMs may not yet match human
evaluators in accurately judging RDF triples across
the dimensions of subject, predicate, object, and over-
all fidelity. Among the four evaluation dimensions,
LLMs showed the strongest agreement with humans
when assessing full triples, suggesting better effec-
tiveness in holistic judgments compared to granular
components, such as predicates or objects. The meta-
evaluation confirms that few-shot prompts resulted in
more coherent and interpretable outputs.
This study offers three key and original contribu-
tions:
1. Benchmark Design: We introduce a new bench-
mark consisting of 300 sentence–triple pairs in
Portuguese, annotated by experts across four
semantic dimensions (subject, predicate, object, and
full triple), to facilitate systematic evaluation of LLMs;
2. Comprehensive Evaluation: We conduct orig-
inal empirical comparisons of 12 LLM config-
urations against human judgments using ordinal
scales and correlation metrics, revealing patterns
of alignment between models and human evalua-
tions;
3. Meta-Evaluation Strategy: We establish a novel
meta-evaluation step that employs a larger LLM
to evaluate the quality of justifications from
smaller models, emphasizing clarity and coher-
ence.
The remainder of this article is organized as fol-
lows: Section 2 describes related work on LLM
judges in the RDF triple domain. Section 3 describes
the experimental setup. Section 4 presents the results;
Section 5 discusses the results; and Section 6 draws
concluding remarks.
2 RELATED WORK
Zheng et al. (2023) explored the use of LLMs as judges of
other LLMs’ outputs. They introduced two bench-
marks: MT-Bench, assessing multi-turn conversa-
tional and reasoning abilities, and Chatbot Arena,
a crowdsourced platform for chatbot comparison.
Their work investigated whether LLMs like GPT-4
can reliably approximate human preferences in dia-
logue evaluation. Through over 6,000 human judg-
ments, they demonstrated that LLMs can achieve over
80% agreement with humans, comparable to human-
human alignment, positioning the “LLM-as-a-judge”
approach as a scalable alternative for conversational
AI evaluation. While both their study and ours explore LLMs
for evaluation, our present study differs in domain,
method, and focus. We target the semantic assess-
ment of RDF triples derived from natural language, a
task rooted in the Semantic Web rather than dialogue.
Instead of agreement percentages, we measure align-
ment via correlation metrics (Spearman and Kendall
Tau) on fine-grained components (subject, predicate,
object, and whole triple).
Another relevant study (Guerdan et al., 2025)
addressed the challenges of validating LLM-as-a-
judge systems, particularly in cases where human rat-
ings are ambiguous or lack clear gold labels. Their
work proposed a theoretical and empirical framework
for understanding how rating task design, elicitation,
aggregation schemes, and agreement metrics affect
judge system validation. The authors demonstrated
that current validation methods can select suboptimal
judge systems, sometimes performing up to 34%
worse than alternatives, because they overlook the im-
pact of task indeterminacy, where multiple valid rat-
ings may exist. They demonstrated this through an
empirical study using five commercial LLMs to eval-
uate “toxicity” on the Civil Comments dataset, high-
lighting how existing validation pipelines can produce
misleading conclusions about both judge and target
system performance. Their concern lies in the vali-
dation of LLM judges in the absence of clear ground
truth. In contrast, our study focuses on the semantic
evaluation of RDF triples, where the alignment be-
tween LLM and human judgments is assessed through
statistical correlation.
Another relevant study explored the role of LLMs
as judges in the context of KG construction (Huang
et al., 2024). The proposed GraphJudge framework
leverages open-source and closed-source LLMs to
improve the quality of triples extracted from natural
language. To address typical challenges such as infor-
mation noise, domain knowledge gaps, and hallucina-
tions, GraphJudge introduces three key components:
a module for cleaning irrelevant information, another
to adapt LLMs for triple validation tasks, and one for
filtering incorrect triples. Their experiments on both
general and domain-specific datasets demonstrated
state-of-the-art effectiveness over baselines, achiev-
ing over 90% accuracy in triple validation. Graph-
Judge uses LLMs to directly filter and improve ex-
tracted triples as part of KG construction. Our present
study analyzes the ability of LLMs to judge the se-
mantic quality of given RDF triples against human
evaluations. Their evaluation is benchmarked through
classification accuracy, while ours relies on correla-
tion metrics between human and LLM assessments.
Another recent study examined the use of LLMs
as curators for validating RDF triple insertions into
existing KGs (Regino and dos Reis, 2025). The au-
thors proposed a systematic validation method cov-
ering four key aspects of RDF validation: class and
property alignment, URI standardization, semantic
consistency, and syntactic correctness. Using prompts
to guide LLMs through each stage, they evaluated
four models across these tasks. Results indicated that
larger models, such as Llama-3 (70B Instruct), consis-
tently outperform smaller ones, achieving high pre-
cision and recall, particularly in syntactic validation
(accuracy of 0.99). They further highlighted practical
challenges, including domain generalization, seman-
tic drift, and the trade-off between cost and accuracy.
To the best of our knowledge, this study is the
first to systematically investigate LLMs as semantic
judges for RDF triples. It provides a detailed evalua-
tion at the subject, predicate, object, and triple levels,
and directly compares LLM judgments with human
assessments through correlation analysis. Our cur-
rent research investigates the impact of model size,
language specificity, and prompting strategies, while
also conducting a secondary evaluation of clarity and
coherence. Our study provides new insights into the
reliability, limitations, and practical applications of
LLMs in KG curation and quality control, address-
ing a gap in previous research that has focused solely
on extraction or structural validation.
3 EXPERIMENTAL EVALUATION
METHOD
This section presents our evaluation method devel-
oped to assess the ability of LLMs to judge the se-
mantic correctness of RDF triples extracted from nat-
ural language texts (source code available at:
https://github.com/andreregino/llm-as-a-judge). The method compares machine-
generated judgments to an original gold standard of
human annotations, allowing an investigation of the
reliability, consistency, and alignment of LLM-based
evaluators in the context of semantic information ex-
traction. Figure 1 presents the evaluation methodol-
ogy.
The goal of this evaluation is to determine how
accurately LLMs can act as evaluators of RDF triple
quality. Given a text and its corresponding RDF triple,
LLMs are prompted to judge the semantic plausibility
of each triple and its components.
Subsection 3.1 describes the dataset used in the
evaluation; Subsection 3.2 describes the LLMs; Sub-
section 3.3 reports on the prompt design; Subsec-
tion 3.4 shows the gold standard annotated by hu-
mans, Subsection 3.5 refers to the evaluation proce-
dure; Subsection 3.6 presents the used evaluation met-
rics and Subsection 3.7 illustrates an example of the
evaluation.
3.1 Dataset
The dataset (component 1 of Figure 1) used in this
evaluation was constructed from real-world Brazil-
ian e-commerce and marketplace platforms. It con-
tains summarized text inputs and corresponding RDF
triples generated by the [blind review], a question-
answer-to-RDF-triple generation system developed
for the e-commerce domain [blind review].
Each data instance represents a question-and-
answer pair related to a product, automatically sum-
marized into a single sentence, along with a seman-
tic triple extracted from that summary. The source
content includes customer questions, seller answers,
and product names. These were processed to produce
concise, representative sentences and corresponding
structured triples.
The dataset comprises 300 Portuguese sen-
tence–triple pairs collected from multiple stores and
randomly sampled to ensure a diverse representation.
It includes three distinct intent types: compatibility
(e.g., “Is this tyre compatible with my vehicle?”),
specification (e.g., “What is the memory capacity of
this phone?”), and availability (e.g., “Is this item in
stock in size M?”). Each intent type is represented
equally, with 100 sentence–triple pairs per category.
The examples span 20 different product domains, in-
cluding automotive, electronics, apparel, and house-
hold goods, and reflect real usage scenarios from
2023.
The RDF triples follow the standard format
subject-predicate-object and aim to capture the core
semantics of the summarized statement. Figure 2
presents two examples of texts and their correspond-
ing RDF triples from the dataset. We use the triple
parts and the triple as a whole in the judgment pro-
cess (component 2 of Figure 1).
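The internal representation of each dataset instance is not prescribed by the paper; the following is a minimal sketch, assuming hypothetical field names, of how one sentence–triple pair could be encoded for the judging pipeline described next.

from dataclasses import dataclass

@dataclass
class SentenceTriplePair:
    """One dataset instance: a summarized sentence plus its extracted RDF triple."""
    sentence: str   # summarized question-answer text (in Portuguese)
    subject: str
    predicate: str
    obj: str
    intent: str     # "compatibility", "specification", or "availability"

# Hypothetical instance in the spirit of Figure 2 (not taken from the dataset)
pair = SentenceTriplePair(
    sentence="O pneu Aro 15 é compatível com o veículo informado.",
    subject="pneu Aro 15",
    predicate="compatível com",
    obj="veículo informado",
    intent="compatibility",
)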
3.2 Evaluated Large Language Models
A diverse set of LLMs was selected to act as graders
(component 3 of Figure 1), automated judges respon-
sible for evaluating the semantic correctness of RDF
triples based on input texts. The motivation for us-
ing LLMs as judges stems from the nature of the
content being evaluated. RDF triples are machine-
readable structures intended for system interoperabil-
ity, ontology alignment, and semantic reasoning,
tasks primarily executed by machines rather than hu-
mans. Therefore, assessing whether a triple correctly
captures the meaning of a sentence is, in this context,
more naturally aligned with machine interpretation.
If the evaluated output were intended for human con-
sumption (e.g., a natural language summary), then a
human-centric evaluation would be more appropriate.
The chosen models vary in architecture, size, and
source, enabling a comparative analysis across dif-
ferent model families and capacities. The following
models were used in the main evaluation:
Gemma 2 9B and 27B: Google’s open-weight
transformer models with 9 and 27 billion parame-
ters;
Qwen 2.5 7B and 32B: 7B and 32B parame-
ter models from Alibaba’s Qwen series, optimized
for instruction following;
Sabiazinho 3 and Sabiá 3.1: compact and
more powerful Brazilian LLMs fine-tuned for Por-
tuguese understanding and instruction following.
The rationale behind selecting these models was:
first, they cover a range of parameter sizes (from 3B to
32B), which enables an exploration of how model ca-
pacity influences evaluation quality; second, they rep-
resent different linguistic and regional orientations:
while some models are general-purpose and multi-
lingual (e.g., Qwen), others are specifically tuned for
Portuguese and Brazilian contexts (e.g., Sabiazinho
and Sabiá), Portuguese being the language of the dataset used
in the evaluation process.
In addition to the graders listed above, DeepSeek
R1 with 685B parameters, a much larger model, was
used in a complementary role to assess the inter-
nal consistency of the judgments. Specifically, this
model was tasked with evaluating whether the grading
scores assigned by each LLM were clear and coherent
with the textual justifications provided by those same
LLMs. We call this a meta-evaluation model.
Figure 1: Methodology to evaluate LLMs’ capacity in judging RDF generation from text. The numbers represent components
and the letters the actions among them. A) Evaluation Components Definition; B) LLMs Setup Definition; C) Gold Standard
Creation; D) LLM Judgment; E) Metrics Evaluation.
Figure 2: Two examples of summarized texts and their cor-
responding RDF triples.
This meta-evaluation model was chosen due to its
significantly higher capacity, which allows for deeper
reasoning and a stronger ability to detect logical in-
consistencies between a justification and its corre-
sponding evaluation (So et al., 2025). This consis-
tency check was conducted separately from the main
evaluation to preserve fairness and avoid biasing the
comparative results (cf. Section 3.5).
3.3 Prompt Design
To evaluate the semantic plausibility of RDF triples,
we designed two types of prompts (component 3 of
Figure 1): one for the main evaluation LLMs (acting
as graders), and another for a larger LLM used as a
meta-evaluator (consistency checker).
The primary prompt was written in Portuguese
and designed to simulate a structured evaluation sce-
nario. It instructs the model to behave as an ontol-
ogy and knowledge representation expert tasked with
judging whether a given RDF triple accurately reflects
the information in a corresponding natural language
sentence. The input to this prompt consists of a sum-
marized sentence derived from an e-commerce QA
interaction and an RDF triple automatically extracted
from the sentence.
The model is asked to provide four Likert-scale
ratings (from 1 to 5), assessing the correctness of the
subject, predicate, and object, as well as the overall
fidelity of the triple to the source sentence. A textual
justification, with a maximum of 30 words, is also re-
quired to explain the rationale behind the scores.
Prompt for Judging RDF Triple Generation
from Text
You are an expert in natural language un-
derstanding and RDF. Your task is to judge
the quality of RDF triples generated from
texts. In other words, given a descriptive sen-
tence and an RDF triple, you should act as a
judger, grading the quality of the triples.
Here is the explanation of each grade:
{likert grades and explanations}
Here is each of the judgments:
{list of judgments}
Below are some examples of the sentences,
triples and their corresponding judgments:
Examples: {examples}
With the examples above as reference, gener-
ate the grades for the following sentence and
triple:
Text: {text}
Triple: {triple}
Judgments:
This prompt format standardizes the evaluation
across all grader LLMs and supports later compari-
son with human annotations.
Two versions of the grader prompt were used in
the experiments:
Zero-Shot Prompt: Contains only the evaluation
instruction and the specific text–triple pair;
Few-Shot Prompt: Includes two randomly
selected examples of evaluated sentences and
triples, followed by the new instance to be judged.
Few-shot prompting, a form of in-context learning
(Brown et al., 2020), was chosen over tuning-based
approaches due to its flexibility and lower data re-
quirements. Fine-tuning would demand thousands of
annotated examples, which is impractical in this case.
Additionally, by testing both zero-shot and few-shot
configurations, we aim to assess whether the presence
of examples influences grading behavior or score in-
flation.
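As an illustrative sketch only (the helper names and the random selection of the two exemplars are assumptions, not the authors' released code), the two prompt variants could be assembled as follows:

import random

def build_grader_prompt(text, triple, likert_explanations, examples=None):
    """Assemble a zero-shot (examples=None) or few-shot grader prompt."""
    parts = [
        "You are an expert in natural language understanding and RDF. "
        "Given a descriptive sentence and an RDF triple, act as a judge and "
        "grade the quality of the triple.",
        "Here is the explanation of each grade: " + likert_explanations,
    ]
    if examples:  # few-shot variant: prepend two previously judged pairs
        for ex in random.sample(examples, k=2):
            parts.append(
                f"Text: {ex['text']}\nTriple: {ex['triple']}\n"
                f"Judgments: {ex['judgments']}"
            )
    parts.append(f"Text: {text}\nTriple: {triple}\nJudgments:")
    return "\n\n".join(parts)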
To assess the internal consistency of the LLM-
generated judgments, we employed a separate
prompt that receives as input the original sentence,
the extracted RDF triple, and the judgment provided
by another LLM, consisting of four scores and a jus-
tification.
The task of this consistency evaluator is to assess
two aspects:
1. Clarity of the Justification: Is the justification
provided by the LLM understandable and well-
written?
2. Coherence of the Ratings: Do the numeric
scores logically match the justification?
This additional layer of evaluation is intended to
validate whether the LLMs’ justifications are not only
syntactically well-formed but also logically consistent
with their numerical evaluations. A larger model was
chosen for this step to provide deeper reasoning capa-
bilities and stronger metalinguistic analysis.
Both aspects are rated using a 1–5 Likert scale,
and the output is structured as the following example:
{
"EVALUATION 1": 4,
"EVALUATION 2": 5
}
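A minimal sketch of how such an answer could be parsed and range-checked on the caller side (the function name is ours, not part of the published pipeline):

import json

def parse_meta_evaluation(raw_output: str) -> dict:
    """Parse the meta-evaluator's JSON answer and enforce the 1-5 Likert range."""
    scores = json.loads(raw_output)
    clarity = int(scores["EVALUATION 1"])    # clarity of the justification
    coherence = int(scores["EVALUATION 2"])  # coherence of scores and justification
    if not (1 <= clarity <= 5 and 1 <= coherence <= 5):
        raise ValueError(f"Scores outside the 1-5 Likert range: {scores}")
    return {"clarity": clarity, "coherence": coherence}

print(parse_meta_evaluation('{"EVALUATION 1": 4, "EVALUATION 2": 5}'))
# {'clarity': 4, 'coherence': 5}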
3.4 Gold Standard
To serve as a point of comparison for the LLM-
generated judgments, we constructed a gold standard
dataset composed of human evaluations (component
4 of Figure 1). These annotations were performed by
a group of ten human evaluators, each one with prior
exposure to Semantic Web technologies, Knowledge
Graphs, and RDF modeling. The group includes pro-
fessionals from the public sector, graduate students
from academic institutions, and employees from pri-
vate companies with technical backgrounds.
Each human evaluator was randomly assigned
30 summarized sentences and their corresponding
RDF triples, totaling 300 unique sentence–triple pairs
across the full dataset. The evaluators were instructed
to assess the quality of the transformation from sen-
tence to triple according to the same four dimensions
used in the LLM evaluation prompts:
1. Subject Correctness: Was the subject correctly
extracted?
2. Predicate Correctness: Does the predicate accu-
rately represent the relationship expressed in the
sentence?
3. Object Correctness: Does the object reflect the
target of the relation?
4. Triple Fidelity: Does the complete triple faith-
fully represent the sentence’s meaning?
For each question, the evaluators assigned a score
on a 5-point Likert scale, which was clearly defined
in the annotation protocol to minimize ambiguity:
1: Incorrect
2: Many errors
3: Approximately half correct
4: Few errors
5: Correct
Given the cognitive efforts required to make accu-
rate semantic judgments, the number of evaluations
per annotator (30) was chosen to balance annotation
reliability with fatigue reduction. The total number of
annotated instances (300) reflects the practical con-
straints of recruiting and coordinating qualified hu-
man annotators familiar with the nuances of RDF
triple structures and the Semantic Web, expertise that
is not widespread even among computer science re-
searchers.
3.5 Evaluation Procedure
The evaluation procedure was designed to systemati-
cally compare the judgments of human evaluators and
multiple LLMs across 300 text-to-triple transforma-
tions. Both types of evaluators assessed the same se-
mantic dimensions, allowing for direct comparison.
We evaluated three different LLM fami-
lies (Gemma, Qwen, and Sabiá), each in two
sizes (a smaller and a larger version), and under
two prompt engineering strategies: zero-shot and
few-shot. This resulted in 12 distinct evaluation
setups:
3 models × 2 sizes × 2 prompt styles = 12 setups
Each of these 12 setups was used to independently
evaluate all 300 sentence–triple pairs across the four
semantic dimensions, generating a total of 14,400 in-
dividual scores (12 setups × 300 triples × 4 ratings),
along with 3,600 textual justifications.
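The setup enumeration and the score counts can be reproduced with a short sketch (the setup identifiers are illustrative, not the authors' naming):

from itertools import product

FAMILIES = ["gemma", "qwen", "sabia"]
SIZES = ["small", "large"]
PROMPTS = ["zero_shot", "few_shot"]

# 3 model families x 2 sizes x 2 prompt styles = 12 setups
SETUPS = [f"{f}_{s}_{p}" for f, s, p in product(FAMILIES, SIZES, PROMPTS)]
assert len(SETUPS) == 12

N_PAIRS, N_DIMENSIONS = 300, 4
print(len(SETUPS) * N_PAIRS * N_DIMENSIONS)  # 14400 individual scores
print(len(SETUPS) * N_PAIRS)                 # 3600 textual justifications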
To quantify agreement levels, we calculated the
average scores for each of the four dimensions in the
human-annotated gold standard. Similarly, we com-
puted the average scores produced by each of the 12
LLM setups. These aggregate values are compared in
Section 4, allowing us to analyze how closely LLM
judgments align with human expectations.
For each of the 3,600 judgments generated by the
grader models, the DeepSeek model was used as a meta-
evaluator and prompted to assess the criteria pre-
sented in Section 3.3.
3.6 Metrics
For each of the four evaluation dimensions (subject,
predicate, object, and full triple), we computed the
correlation between the human scores and the scores
produced by each of the 12 LLM setups across all
300 sentence–triple pairs (component 6 of Figure 1).
The use of rank-based correlation metrics allows us
to quantify the strength and direction of association
between human and machine-generated judgments,
even when the absolute values differ.
Spearman’s ρ (Spearman, 1987) evaluates the
monotonic relationship between two variables. It
is useful when we are interested in whether the or-
dering of scores (rather than their exact values) is
preserved between annotators. This is appropri-
ate in our case, where some variation in scoring
behavior is expected, but consistency in ranking
is desirable. The coefficient ranges from –1 to
1, where 1 indicates perfect agreement in ranking
(i.e., identical order), 0 indicates no correlation,
and –1 indicates a perfectly reversed ranking.
Kendall’s τ (Kendall, 1938) measures the num-
ber of concordant and discordant pairs between
two rankings. It is more conservative than Spear-
man’s ρ and more sensitive to small rank differ-
ences, making it a complementary metric that pro-
vides robustness to the analysis. Like Spearman’s
ρ, Kendall’s τ ranges from –1 to 1.
Both metrics are well-suited for ordinal data such
as Likert-scale judgments, and together they offer a
comprehensive view of how similarly LLMs and hu-
man evaluators perceive the semantic quality of RDF
triples.
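Both coefficients are available in standard statistics libraries; the following is a minimal sketch, assuming the 300 human and LLM ratings for one dimension are held in two parallel lists:

from scipy.stats import spearmanr, kendalltau

def rank_agreement(human_scores, llm_scores):
    """Spearman's rho and Kendall's tau between two lists of ordinal ratings."""
    rho, _ = spearmanr(human_scores, llm_scores)
    tau, _ = kendalltau(human_scores, llm_scores)
    return rho, tau

# Toy example with six Likert ratings (the real comparison uses 300 per dimension)
human = [5, 4, 3, 5, 2, 4]
llm = [5, 3, 3, 4, 2, 5]
rho, tau = rank_agreement(human, llm)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")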
3.7 Running Example
To illustrate the evaluation method in practice, we
present a representative example from the dataset, in-
cluding the original sentence, the corresponding RDF
triple, and the evaluation provided by one of the LLM
graders under the few-shot prompting condition.
Summarized Sentence. “O produto Kit Quadros
Decorativos Sala Quarto Moldura está disponível em
tamanho maior, desde que seja solicitado o tamanho
desejado.” [original sentence in Portuguese; in English:
“The Decorative Picture Kit Living Room Bedroom Frame
product is available in a larger size, as long as the desired
size is requested.” – translated by the authors]
RDF Triple. Kit Quadros Decorativos Sala Quarto
Moldura — disponível — tamanho maior [original triple
in Portuguese; in English: Decorative Picture Kit Living
Room Bedroom Frame — available — larger size]
LLM Judgments. The LLM evaluated the semantic
plausibility of the triple based on four distinct dimen-
sions, assigning the following Likert-scale scores:
Subject Correctness: 5;
Predicate Correctness: 5;
Object Correctness: 5;
Triple Fidelity: 5.
Justification. “The triple correctly represents the
sentence, with the correct subject, predicate, and ob-
ject well extracted from the context.”
This example illustrates a high-agreement case,
where the LLM accurately interprets and assesses the
semantic alignment between the natural language in-
put and the RDF triple. All dimensions received the
maximum score, and the justification shows logical
clarity and alignment with the numeric judgments.
4 RESULTS
This section presents the obtained results of the evalu-
ation experiments, organized into three parts. We first
report average scores given by human and LLM eval-
uators (Subsection 4.1). Next, we analyze the corre-
lation between LLM and human scores using Spear-
man’s ρ and Kendall’s τ (Subsection 4.2). Finally,
we present the results of the meta-evaluation using a
larger language model to assess the clarity and coher-
ence of the LLMs’ judgments (Subsection 4.3).
4.1 Average Ratings from Humans and
LLMs
Human annotators (Gold Standard, GS) and each
of the 12 LLM setups evaluated the same 300 sen-
tence–triple pairs across four grading dimensions:
subject, predicate, object, and full triple. The first step
in our analysis was to compute mean scores for each
grading dimension, both for human evaluators and for
the combined LLMs.
The overall average score assigned by all LLMs
across all dimensions was 4.300, whereas the aver-
age score from the GS was 4.498. This suggests
2
The Decorative Picture Kit Living Room Bedroom
Frame product is available in a larger size, as long as the
desired size is requested. – translated by the authors.
3
Decorative Picture Kit Living Room Bedroom Frame
— available — larger size
that human annotators rated the triple transformations
slightly more favorably than the LLMs.
When disaggregating the averages by evaluation
dimension, we observed a pattern (Figure 3): LLMs
assigned higher scores than humans only in the sub-
ject dimension. For predicate, object, and triple, hu-
man evaluators assigned higher grades.
Figure 3: Grading average by dimensions (triple part) using
Likert scale.
We computed average scores grouped by three
comparative axes: (i) model family (Gemma, Qwen,
Sabia), (ii) model size (small or large), and (iii)
prompt type (zero-shot vs. few-shot). For each
model family, averages were calculated over all four of
its setups (e.g., all versions of Gemma). Figure 4 presents
these comparisons. The closest match to the human
gold standard came from:
Few-shot prompts (µ = 4.418),
Sabia models (µ = 4.451), and
Small models (higher than large counterparts).
Despite these proximities, no LLM setup sur-
passed the human average in any of the categories.
Figure 4: Comparison among average gradings of Gold
Standard dataset (green), LLM family (red), LLM size
(blue), and prompt type (yellow).
4.2 Correlation Between LLMs and
Human Ratings
To quantify the alignment between human and LLM
ratings, we computed Spearman’s ρ and Kendall’s τ
correlation coefficients for each of the 12 LLM se-
tups, across each of the four evaluation dimensions.
Table 1 (Spearman) and Table 2 (Kendall) present the
attained results.
Table 1: Spearman scores for each component of the triple.
The highest value in each column is highlighted in bold. In
the first column, the LLM setup identifier is composed of
three parts: the LLM (e.g., GE for Gemma), the size (e.g.,
SM for small), and the prompt type (e.g., ZS for zero-shot).
LLM Subject Predicate Object Triple
GE SM ZS 0.112 0.216 0.350 0.372
GE SM FS 0.119 0.225 0.435 0.435
GE LA ZS 0.165 0.244 0.428 0.480
GE LA FS 0.134 0.198 0.414 0.443
QW SM ZS 0.075 0.110 -0.055 0.057
QW SM FS 0.104 0.147 0.406 0.426
QW LA ZS 0.560 0.172 0.350 0.358
QW LA FS 0.252 0.183 0.303 0.396
SA SM ZS 0.137 0.157 0.392 0.467
SA SM FS 0.078 0.148 0.460 0.421
SA LA ZS 0.265 0.241 0.310 0.375
SA LA FS 0.333 0.193 0.330 0.491
Table 2: Kendall Tau scores for each component of the
triple. The highest value in each column is highlighted in
bold. In the first column, the LLM setup identifier is com-
posed of three parts: the LLM (e.g., GE for Gemma), the
size (e.g., SM for small), and the prompt type (e.g., ZS for
zero-shot).
LLM Subject Predicate Object Triple
GE SM ZS 0.107 0.199 0.315 0.334
GE SM FS 0.115 0.214 0.405 0.395
GE LA ZS 0.157 0.217 0.388 0.424
GE LA FS 0.129 0.184 0.382 0.395
QW SM ZS 0.070 0.101 -0.050 0.051
QW SM FS 0.100 0.140 0.366 0.382
QW LA ZS 0.530 0.155 0.316 0.314
QW LA FS 0.241 0.167 0.272 0.354
SA SM ZS 0.128 0.143 0.353 0.423
SA SM FS 0.075 0.143 0.436 0.386
SA LA ZS 0.250 0.226 0.293 0.339
SA LA FS 0.322 0.183 0.309 0.451
Highest Correlations per Dimension
Subject: Highest ρ = 0.560 (Qwen Large Zero-
Shot), τ = 0.530 (same setup).
Predicate: Highest ρ = 0.244 (Gemma Large
Zero-Shot), τ = 0.217 (same setup).
Object: Highest ρ = 0.460 (Sabia Small Few-
Shot), τ = 0.436 (same setup).
Triple: Highest ρ = 0.491 (Sabia Large Few-
Shot), τ = 0.451 (same setup).
No correlation score exceeded 0.56 in any case,
indicating a moderate agreement at best. To facilitate
interpretation, we computed:
Mean correlation per LLM setup (averaging
across the four dimensions), yielding 12 Spear-
man and 12 Kendall values (Figure 5).
Mean correlation per evaluation dimension
(averaging across the 12 setups), yielding four
mean Spearman and four mean Kendall scores
(Figure 6).
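These aggregations amount to simple group-wise means over the per-setup, per-dimension correlation table; a small sketch using a subset of the values from Tables 1 and 2:

import pandas as pd

# Long-format excerpt of Tables 1 and 2 (one row per setup-dimension pair)
corr = pd.DataFrame({
    "setup":     ["GE_SM_ZS", "GE_SM_ZS", "SA_LA_FS", "SA_LA_FS"],
    "dimension": ["subject", "triple", "subject", "triple"],
    "spearman":  [0.112, 0.372, 0.333, 0.491],
    "kendall":   [0.107, 0.334, 0.322, 0.451],
})

per_setup = corr.groupby("setup")[["spearman", "kendall"]].mean()          # Figure 5
per_dimension = corr.groupby("dimension")[["spearman", "kendall"]].mean()  # Figure 6
print(per_setup)
print(per_dimension)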
Figure 5: Correlation results by each setup. The x-axis shows
the 12 setups (e.g., GE SM ZS stands for the Gemma small
zero-shot setup) and the y-axis shows the values of Kendall
and Spearman.
Figure 6: Correlation results by each dimension. The x-axis
shows the four dimensions and the y-axis shows the values of
Kendall and Spearman.
The majority of LLM setups (11 out of 12) yielded
mean correlations in the range of 0.2–0.35. When ag-
gregated by dimension, we found that correlations for
subject and predicate were notably lower than for
object and triple. The highest average correlation ap-
peared in the triple dimension:
ρ = 0.393, τ = 0.354
These results indicate that LLMs tend to align
more closely with human evaluators when judging the
semantic quality of the complete triple rather than its
individual components. The fact that both correlation
metrics point to the same pattern strengthens the reli-
ability of this conclusion.
Figure 7: Clarity and coherence values by each setup.
4.3 Meta-Evaluation Using a Large
Language Model
To assess the quality of the judgments produced by
the 12 LLM setups, we employed DeepSeek R1 as a
meta-evaluator. This model assessed each LLM judg-
ment based on two criteria:
1. Clarity of the justification, and
2. Coherence of the ratings with the justification.
Figure 7 presents the results. Average scores per
LLM setup ranged from 4.245 to 4.930 for both di-
mensions.
Coherence scores were consistently higher than
clarity scores across all models.
Few-shot setups (denoted by “FS” in Fig-
ure 7) achieved higher average scores than their
zero-shot counterparts.
These results suggest that the judgments provided
by the smaller LLMs were, on average, both inter-
pretable and internally consistent—particularly when
few-shot prompting was used.
5 DISCUSSION
This study explored an original evaluation frame-
work that combines expert human judgments with au-
tomated assessments from LLMs across twelve con-
figurations. Each configuration differed in model
family (Gemma, Qwen, and Sabia), model size (small
vs. large), and prompting strategy (zero-shot vs. few-
shot). This design enabled a thorough examination
of the study’s core question: Can LLMs reliably eval-
uate structured outputs, such as RDF triples? Our
results indicate that although LLMs show promising
capabilities, they do not consistently achieve human-
level discernment across all evaluation criteria.
The credibility of the gold standard annotations
arises from the expertise of all annotators in Semantic
Web technologies and KG construction. A carefully
designed Likert scale facilitated nuanced evaluations
across four dimensions: subject, predicate, object,
and full triple. Each annotator assessed 30 distinct in-
stances to maintain a balanced distribution of labor,
and the overall design emphasized consistency de-
spite the sample size (N=300). Recruiting experts for
this specialized task presented logistical challenges,
highlighting the need to explore automated grading
alternatives while preserving the rigor of the gold
standard.
The LLMs’ efficacy varied significantly across
the 12 evaluation setups. Larger models tended to
produce scores more aligned with human judgments,
particularly under the few-shot prompting paradigm.
Few-shot setups provided the models with exemplars
that likely improved semantic grounding and scoring
consistency. However, some smaller models unex-
pectedly demonstrated sharper precision in isolated
cases, particularly in predicate evaluation. These in-
consistencies highlight that model size alone is not
a sufficient predictor of alignment; prompting con-
text and model architecture also play important roles.
The results underscore that while LLMs can replicate
some human-like judgment behavior, their reliability
fluctuates depending on setup.
We used Spearman and Kendall Tau metrics to
measure the alignment between human and LLM
scores across all four evaluation axes. Some configu-
rations, particularly those with large models and few-
shot prompting, displayed moderate positive correla-
tions, especially in subject and full triple evaluations.
Many setups produced low or inconsistent correla-
tions, indicating that LLMs, in their current state and
configuration (open-source, mid-sized models in the
e-commerce domain), are not yet suitable for fully re-
placing human evaluators. Their best application may
be as copilots in the RDF triple generation pipeline,
aiding in draft evaluations that humans subsequently
review.
We utilized DeepSeek to analyze the quality of
the justifications (up to 30 words) provided by LLM graders, val-
idating their reliability. This secondary meta-evalua-
tion focused on two aspects: the clarity of justifica-
tions and the coherence of assigned scores. Notably,
most setups received high scores from DeepSeek, in-
dicating that LLMs often articulated their reasoning
coherently, even when their ratings differed from hu-
man assessments. This finding highlights that the
quality of justification may improve trust and trans-
parency, despite potential misalignment in scores.
However, DeepSeek’s evaluations did not strongly
correlate with human ratings, revealing a gap between
the clarity of explanations and the accuracy of judg-
ments.
Our findings indicate significant implications for
using LLMs in Semantic Web evaluation tasks. Al-
though LLMs cannot yet replace expert annotators,
they can serve as valuable auxiliary evaluators, es-
pecially for large-scale RDF generation from text.
Their capability to produce structured justifications
is particularly beneficial in hybrid pipelines, where
automated scoring complements human review. Fu-
ture research should investigate multi-agent architec-
tures, enabling several LLMs to collaboratively as-
sess and vote on outputs, which may enhance eval-
uation reliability. Improvements through optimized
prompt design, focused fine-tuning, or human-in-the-
loop frameworks can lead to more trustworthy and ef-
fective systems in semantic evaluation contexts.
Both the dataset and the Sabia models employed
in this study were optimized for Portuguese. This in-
troduces a language-specific constraint that may limit
the generalizability of our results. While the find-
ings provide strong evidence within the Portuguese
domain, further work is needed to validate whether
similar outcomes hold across other languages. As a
natural next step, future research should explore mul-
tilingual evaluation settings, testing the robustness of
the proposed approaches in diverse linguistic contexts
and broadening their applicability to global Semantic
Web scenarios.
6 CONCLUSION
Knowledge Graphs are central to organizing struc-
tured information, making the automation of their val-
idation important for scaling their real-world appli-
cations. Large Language Models can interpret both
natural and formal languages, offering an alternative
to the labor-intensive process of human evaluation.
This study examined 12 configurations of LLMs as
graders of RDF triples generated from e-commerce
texts, comparing their scores to those from expert hu-
man annotators. We measured the alignment between
LLM and human evaluations using Spearman and
Kendall Tau correlations, analyzing both complete
triples and their components (subject, predicate, ob-
ject). The results indicated that while some configu-
rations achieved moderate correlation, no model con-
sistently matched human reliability across all com-
ponents. We also assessed justification-based rat-
ings using the DeepSeek R1 reasoning model, which
highlighted potential indicators of grader quality. We
found that these indicators did not strongly correlate
with human judgments. Our findings indicate that
while LLMs cannot fully replace expert reviewers,
they serve as promising supportive tools for seman-
tic validation tasks, providing scalability and facilitat-
ing preliminary assessments. Importantly, the mod-
erate correlation scores highlight that the proposed
approach is not yet ready for deployment in high-
stakes KG validation scenarios where accuracy and
reliability are critical. At present, LLM-based valida-
tion should be regarded as complementary, augment-
ing but not substituting expert judgment. Our eval-
uation study lays the groundwork for future
research that integrates model reasoning, prompt re-
finement, multilingual testing, and human-in-the-loop
systems to improve the accuracy, trustworthiness, and
efficiency of RDF validation workflows.
ACKNOWLEDGMENTS
We thank the National Council of Technological
and Scientific Development (CNPq), Brazil, grant
#301337/2025-0.
REFERENCES
Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I.,
Aleman, F. L., Almeida, D., Altenschmidt, J., Altman,
S., Anadkat, S., et al. (2023). Gpt-4 technical report.
arXiv preprint arXiv:2303.08774.
Bonatti, P. A., Decker, S., Polleres, A., and Presutti,
V. (2019). Knowledge graphs: New directions
for knowledge representation on the semantic web
(dagstuhl seminar 18371). Dagstuhl reports, 8(9):29–
111.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. (2020). Language models are few-
shot learners. Advances in neural information pro-
cessing systems, 33:1877–1901.
Elshin, D., Karpachev, N., Gruzdev, B., Golovanov, I.,
Ivanov, G., Antonov, A., Skachkov, N., Latypova, E.,
Layner, V., Enikeeva, E., et al. (2024). From gen-
eral llm to translation: How we dramatically improve
translation quality using human evaluation data for
llm finetuning. In Proceedings of the Ninth Confer-
ence on Machine Translation, pages 247–252.
Guerdan, L., Barocas, S., Holstein, K., Wallach, H. M.,
Wu, Z. S., and Chouldechova, A. (2025). Validating
llm-as-a-judge systems in the absence of gold labels.
CoRR.
Huang, H., Chen, C., He, C., Li, Y., Jiang, J., and
Zhang, W. (2024). Can llms be good graph judger
for knowledge graph construction? arXiv preprint
arXiv:2411.17388.
Kendall, M. G. (1938). A new measure of rank correlation.
Biometrika, 30(1-2):81–93.
Khorashadizadeh, H., Amara, F. Z., Ezzabady, M., Ieng,
F., Tiwari, S., Mihindukulasooriya, N., Groppe, J.,
Sahri, S., Benamara, F., and Groppe, S. (2024). Re-
search trends for the interplay between large lan-
guage models and knowledge graphs. arXiv preprint
arXiv:2406.08223.
Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C.,
Zhao, C., Deng, C., Zhang, C., Ruan, C., et al.
(2024). Deepseek-v3 technical report. arXiv preprint
arXiv:2412.19437.
Racharak, T., Wang, T., and Jearanaiwongkul, W. (2024).
An automated medical rdf knowledge graph construc-
tion from text using in-context learning. In 2024 16th
International Conference on Knowledge and System
Engineering (KSE), pages 465–471. IEEE.
Regino, A. and dos Reis, J. C. (2025). Can llms be knowl-
edge graph curators for validating triple insertions?
In Gesese, G. A., Sack, H., Paulheim, H., Meroño-Peñuela, A.,
and Chen, L., editors, Proceedings of the Workshop
on Generative AI and Knowledge Graphs (GenAIK)
co-located with the 31st International Conference on
Computational Linguistics (COLING 2025), pages
87–99, Abu Dhabi, UAE. International Committee on
Computational Linguistics.
Regino, A. G., Caus, R. O., Hochgreb, V., and dos Reis,
J. C. (2023). From natural language texts to rdf triples:
A novel approach to generating e-commerce knowl-
edge graphs. In Coenen, F., Fred, A., Aveiro, D.,
Dietz, J., Bernardino, J., Masciari, E., and Filipe,
J., editors, Knowledge Discovery, Knowledge Engi-
neering and Knowledge Management, pages 149–174.
Springer Nature Switzerland.
So, C. C., Sun, Y., Wang, J.-M., Yung, S. P., Loh, A.
W. K., and Chau, C. P. (2025). Are large language
models capable of deep relational reasoning? insights
from deepseek-r1 and benchmark comparisons. arXiv
preprint arXiv:2506.23128.
Spearman, C. (1987). The proof and measurement of asso-
ciation between two things. The American journal of
psychology, 100(3/4):441–471.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux,
M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro,
E., Azhar, F., et al. (2023). Llama: Open and ef-
ficient foundation language models. arXiv preprint
arXiv:2302.13971.
Zhang, H., Yu, P. S., and Zhang, J. (2025). A systematic sur-
vey of text summarization: From statistical methods
to large language models. ACM Computing Surveys,
57(11):1–41.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z.,
Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.
(2023). Judging llm-as-a-judge with mt-bench and
chatbot arena. Advances in Neural Information Pro-
cessing Systems, 36:46595–46623.
Zhuang, Y., Yu, Y., Wang, K., Sun, H., and Zhang, C.
(2023). Toolqa: A dataset for llm question answering
with external tools. Advances in Neural Information
Processing Systems, 36:50117–50143.