Ontology Semantic Disambiguation by LLM

Anastasiia Riabova¹, Rémy Kessler¹ and Nicolas Béchet²
¹ Univ. d’Avignon, LIA, 339 chemin des Meinajariès, 84911 Avignon, France
² Univ. Bretagne Sud, CNRS, IRISA, Rue Yves Mainguy, 56000 Vannes, France
Rémy Kessler: https://orcid.org/0000-0002-9947-3048
Nicolas Béchet: https://orcid.org/0000-0001-9425-5570
Keywords:
LLM, CamemBERT, Ontology, Zero/Few-Shot, CoT, Self-Consistency, Prompt Engineering.
Abstract:
Within the BPP project, a combination of statistics and word n-gram extraction enabled the creation of a
bilingual (French/English) ontology in the field of e-recruitment. The produced dataset was of good quality,
but it still contained errors. In this paper, we present an approach that explores the use of large language
models (LLMs) to automate the validation and enrichment of ontologies and knowledge graphs. Starting with
a naive prompt and using small language models (SLMs), we tested various approaches, including zero-shot,
few-shot, chain-of-thought (CoT) reasoning, and self-consistency (SC) decoding. The preliminary results are
encouraging, demonstrating the ability of LLMs to make complex distinctions and to finely evaluate the relationships derived from our ontology.
1 INTRODUCTION
Integrating knowledge from various sources into on-
tologies and knowledge graphs remains a complex
problem. The Web is now a major resource, with
vast and diverse information, digital encyclopedias,
forums, blogs, public websites, social tagging, and
networks, enabling the generation of ontologies and
knowledge graphs. Yet, the exponential growth of
these bases makes manual verification and validation
increasingly time-consuming.
Within the BPP project (Butterfly Predictive
Project), statistics and n-gram extraction supported
the creation of an ontology for e-recruitment in En-
glish and French, covering 440 job titles across 27
sectors, accessible at https://www-labs.iro.umontreal.ca/lapalme/LBJ/BPPontologie/. Part of it was manually evaluated with good results (0.8 precision), but the large number of candidate terms prevented full validation.
The rise of Large Language Models (LLMs) opens
new perspectives for ontology enrichment. They can
detect errors, inconsistencies, ill-defined concepts,
and missing relations (Petroni et al., 2019). Lever-
aging their analysis and generation abilities makes
creating consistent and user-friendly knowledge bases
more feasible.
This study presents initial results to aid semantic
disambiguation of concepts and relations within this
noisy ontology. After related work (Section 2), we
present the data (Section 3), then detail our approach
(Section 4) and first results (Section 5), before con-
cluding (Section 6).
2 RELATED WORKS
Recent advances in natural language processing
(NLP) have improved the modeling of semantic re-
lationships in ontologies. Resources like ESCO or
ROME provide structured information, but keeping
them up to date requires significant manual work. To
address this, deep learning approaches based on trans-
formers have emerged (Vaswani et al., 2017), either encoder-only (e.g., BERT; Devlin et al., 2018) or decoder-only (e.g., GPT; Brown et al., 2020). En-
coders project entities into a vector space to estimate
semantic proximity, while LLMs enable explicit rea-
soning methods such as zero-shot, few-shot, or chain-
of-thought prompting (CoT). In this article, we com-
pare these methods for semantic matching on noisy
recruitment ontology data.
LLM performance has grown with model scal-
ing, but larger models face limits in energy con-
sumption and deployment complexity. Alterna-
tive methods, such as few-shot prompting (Brown et al., 2020), CoT (Wei et al., 2022; Kojima et al., 2022),
instruction-tuned CoT (Ranaldi and Freitas, 2024), or self-consistency (Wang et al., 2022; Chen et al., 2023), improve results without increasing model size. Given
the time-intensive nature of ontology construction,
adopting LLMs appears both expected and justified.
Several works explore this direction. Meyer et al. (2024) tested ChatGPT for query generation and knowledge extraction. Kommineni et al. (2024) combined ChatGPT for competency questions with Mixtral 8x7B for entity extraction, building a knowledge graph via a RAG-based workflow. Abolhasani and Pan (2024) developed OntoKGen, which uses an iterative CoT algorithm with user validation for automated graph generation in Neo4j.
Unlike these works, which focus on building on-
tologies from scratch, we address an already popu-
lated ontology, emphasizing LLM-based validation of
entities and relationships.
3 DATA AND STATISTICS
In this section, we present the data from the BPP
project. A bilingual (English/French) ontology of 440
occupations from 27 activity domains was developed
for the e-recruitment sector.
Each occupation is linked to the necessary skills
for its practice, totaling approximately 6,000 different
skills. This data is organized according to the ESCO
modeling (le Vrang et al., 2014), a multilingual European classification project for skills, occupations, and qualifications (European Skills, Competences and Occupations: https://ec.europa.eu/esco/), aiming to create European harmonization in recruitment. Table 1 presents some descriptive statistics for this ontology.
Table 1: Descriptive statistics for the ontology.

                     in French    in English
Unigrams                 9,335         3,810
Bigrams                  5,995         3,785
Trigrams                   305         2,421
Unique skills            2,962         4,015
No. of occupations         312           127
Table 2 presents an example of evaluated n-grams for the occupation ’Analyste financier’ (Financial analyst), categorized into transversal skills (soft skills: personal and social skills, oriented towards human interactions, which can be considered relevant regardless of the occupation) and technical skills (hard skills: formally demonstrable skills resulting from technical learning, often academic, and evidenced by grades, diplomas, or certificates). Each occupation is thus linked to a set of word n-grams (from 1 to 3) ranked by TF-IDF.
4 METHODOLOGY
This section is divided into three subsections, cor-
responding to the three phases of our experiments.
First, we present the methodology used for fine-
tuning a BERT-based model. Second, we describe the methodology behind our experiment involving prompt engineering. Third, we present our final pipeline, which yielded the best results.
Fine-Tuning an Encoder-Only Model. This sub-
section presents experiments conducted with encoder-
only models, whose objective was to measure the se-
mantic similarity between occupation and skill de-
scriptions. These models represent each element (oc-
cupation or skill) as a vector in a latent space and esti-
mate their proximity using a measure such as the co-
sine similarity. To train an encoder model to classify
occupation-skill pairs as relevant or not, we formu-
lated the problem as a binary classification task. Posi-
tive examples were extracted directly from the ESCO
ontology. To generate negative examples, we com-
pared three techniques: negative random sampling,
easy negative mining and hard negative mining. Fig-
ure 1 illustrates the three methods.
Random Negative Sampling. For each occupation,
we randomly selected unrelated skills in ESCO, con-
sidering them as irrelevant. This method assumes that relationships absent from the ontology correspond to an absence of semantic link, which can introduce noise if some relevant skills are simply not listed.
Easy Negative Mining. A more controlled variant
of random negative sampling consists of selecting,
among the skills unrelated to a given occupation,
those that are most semantically distant in the vector
space. This method, known as easy negative mining,
enables the creation of high-quality negative exam-
ples while minimizing the risk of false negatives.
To this end, we used vector representations derived from the [CLS] token of CamemBERT (Martin et al., 2020), which serves as a global summary of the sequence.
Table 2: List of skills, classified by unigrams, bigrams, and trigrams, for the occupation ’financial analyst’ obtained from 497 job offers. N-grams considered irrelevant have been struck through. Skills are sorted in descending order of score.

Financial analyst
Soft skills (unigrams): financial, business, support, management, process, reports, data, project, including, projects ...
Hard skills (unigrams): accounting, analysis, finance, reporting, cpa, cma, budget, cga, end, forecast ...
Soft skills (bigrams): analytical skills, communication skills, problem solving, ability work, internal external, real estate, decision making, interpersonal skills, financial services, verbal written ...
Hard skills (bigrams): financial analyst, financial analysis, financial reporting, financial statements, variance analysis, finance accounting, accounting finance, balance sheet, journal entries, financial modelling ...
Soft skills (trigrams): analytical problem solving, problem solving skills, verbal written communication, ability work independently, key performance indicators, fast paced environment, oral written communication, time management skills, communication interpersonal skills, interpersonal communication skills ...
Hard skills (trigrams): financial planning analysis, ad hoc reporting, financial reporting analysis, financial analysis reporting, financial statement preparation, year end close, consolidated financial statements, planning budgeting forecasting, business case analysis, possess strong analytical ...
Each occupation and skill was individually
encoded and represented by the [CLS] token vector
extracted from the model’s final layer. We then com-
puted the cosine distance between each occupation’s
[CLS] vector and those of all unrelated skills in the
ESCO ontology. For each occupation, the most dis-
tant skills were selected as easy negatives and labeled
as 0 in our classification dataset.
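For illustration, this selection step can be sketched as follows; this is a minimal version assuming occupations and skills are available as plain-text labels, and the model name and the number k of negatives per occupation are illustrative choices rather than the exact experimental settings.

```python
# Minimal sketch of easy negative mining with CamemBERT [CLS] embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModel.from_pretrained("camembert-base")
model.eval()

def cls_embedding(text: str) -> torch.Tensor:
    """Return the [CLS] vector from the model's last hidden layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token

def easy_negatives(occupation: str, unrelated_skills: list[str], k: int = 10):
    """Pick the k skills most distant (lowest cosine similarity) from the occupation."""
    occ_vec = cls_embedding(occupation)
    scored = []
    for skill in unrelated_skills:
        sim = torch.cosine_similarity(occ_vec, cls_embedding(skill), dim=0).item()
        scored.append((skill, sim))
    scored.sort(key=lambda x: x[1])                      # most distant first
    return [(occupation, skill, 0) for skill, _ in scored[:k]]  # label 0 = negative
```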
Hard Negative Mining. To complement the pre-
vious methods, we explored a hard negative mining
strategy, inspired by contrastive learning (Robinson et al., 2021), to generate more difficult negative ex-
amples, i.e. skills that are semantically close to the
occupations but not actually related. The objective is
to encourage the model to learn finer distinctions be-
tween truly relevant and ambiguous cases.
We used the positive-aware hard negative mining
proposed in de Souza P. Moreira et al. (2025), in
particular Top-k with percentage to positive thresh-
old (TopKPercPos). This method helps to reduce the
number of potential false negatives, which represent a
fairly common problem when performing hard nega-
tive mining, taking advantage of information from the
positive relevance score (percentage in this case).
Unlike the previous method, which relied on
CamemBERT for embeddings, here we employed the
intfloat/multilingual-e5-large model, a multilingual
encoder trained with contrastive learning and compat-
ible with the Sentence Transformers library (Reimers and Gurevych, 2019). This choice ensured consis-
tency with the family of models used in our reference
article.
For each occupation, we proceeded as follows:
1. We computed the embeddings of all skills and occupations using the E5 model.
2. We identified the highest cosine similarity score among the positive skills for the given profession.
3. We then selected, among the unrelated skills, those whose score was less than 95% of this positive score, as hard negatives.
However, with a threshold set at 95% of the pos-
itive score, we observed that many relevant, yet not
explicitly related, skills were falsely considered neg-
ative. We therefore lowered the threshold to 90% of
the positive score, which preserved the difficulty of
the negative examples while reducing the occurrence
of false negatives.
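A minimal sketch of this positive-aware selection is given below, assuming each occupation comes with its list of ESCO positive skills; the E5 "query:"/"passage:" prefixes and the value of k are illustrative choices.

```python
# Sketch of TopKPercPos-style hard negative mining with the E5 encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

def hard_negatives(occupation, positive_skills, candidate_skills, k=10, perc=0.90):
    # E5 expects "query:"/"passage:" prefixes for asymmetric matching
    occ_vec = model.encode(f"query: {occupation}", normalize_embeddings=True)
    pos_vecs = model.encode([f"passage: {s}" for s in positive_skills],
                            normalize_embeddings=True)
    cand_vecs = model.encode([f"passage: {s}" for s in candidate_skills],
                             normalize_embeddings=True)

    # Highest similarity among the occupation's known positive skills
    max_pos_sim = float(np.max(pos_vecs @ occ_vec))
    threshold = perc * max_pos_sim          # 0.90 mirrors the relaxed threshold

    # Keep the hardest candidates whose similarity stays below the threshold
    sims = cand_vecs @ occ_vec
    below = [(s, sim) for s, sim in zip(candidate_skills, sims) if sim < threshold]
    below.sort(key=lambda x: x[1], reverse=True)   # hardest (closest) first
    return [s for s, _ in below[:k]]
```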
4.1 Prompt Engineering
In this subsection, we present the LLMs used for
evaluating occupation–skill relationships. Unlike en-
coder models, which produce vector representations
for static pairs, LLMs are used here in a generative
setting, responding to carefully designed prompts.
This setup enables us to leverage their ability to fol-
low instructions, reason over input, generalize, and
generate explanations.
We used several models, belonging to different
families and sizes: open-source models that can be
deployed locally (Mistral, Gemma, DeepSeek, Qwen,
Phi) via the Ollama tool, as well as proprietary mod-
els accessible via a web interface and API (GPT, Le
Chat (Mistral)).
They were tested with different prompt configura-
tions: zero-shot, few-shot, chain-of-thought, and self-
consistency. These configurations are discussed in more detail in the following subsections.
Ontology Semantic Disambiguation by LLM
177
Figure 1: Schematic representation of the negative sampling methods in the embedding space for a given occupation (yellow
circle).
4.1.1 Zero-Shot and Few-Shot Prompting
We began in a zero-shot setting, i.e., without pro-
viding examples or detailed instructions. This ap-
proach is not only computationally lightweight but
also allows us to directly evaluate the model’s implicit
knowledge and potential biases.
Few-shot prompting involves presenting the
model with a small number of annotated examples di-
rectly within the prompt, in order to guide its behav-
ior without additional training. As demonstrated by
Brown et al. (2020) with GPT-3, this
approach often achieves results comparable to fine-
tuning, while avoiding the costs associated with cre-
ating annotated datasets.
In our study, few-shot prompting was used as an
intermediate step between zero-shot prompting and
more advanced techniques such as chain-of-thought
prompting. The objective was twofold: to assess the
potential improvement over the zero-shot setting; to
observe the effects of example formatting and distri-
bution on model responses.
Few-shot prompting was subsequently reused in
a more elaborate form in the context of chain-of-
thought prompting.
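As an illustration, both settings can be driven through the Ollama Python client as sketched below; the prompt wording and the demonstration pairs are simplified stand-ins for the prompts actually used.

```python
# Minimal sketch of zero-shot and few-shot querying through Ollama.
import ollama

ZERO_SHOT = "Does the occupation '{occ}' require the skill '{skill}'? Answer yes or no."

FEW_SHOT = """You evaluate occupation-skill pairs. Answer yes or no.
Occupation: network engineer / Skill: routing protocols -> yes
Occupation: network engineer / Skill: team spirit -> no
Occupation: {occ} / Skill: {skill} ->"""

def ask(prompt: str, model: str = "mistral:7b") -> str:
    """Send one prompt to a locally served model and return its lowercased reply."""
    response = ollama.chat(model=model,
                           messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"].strip().lower()

answer_zero = ask(ZERO_SHOT.format(occ="analyste financier", skill="variance analysis"))
answer_few = ask(FEW_SHOT.format(occ="analyste financier", skill="variance analysis"))
```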
4.1.2 Chain-of-Thought Prompting
Conventional zero-shot and few-shot prompting ap-
proaches show their limitations when a task requires
explicit reasoning. To address this shortcoming,
Wei et al. (2022) introduced Chain-of-
Thought (CoT) prompting, which involves enriching
the prompt with a natural language reasoning chain
that explicitly exposes the logical steps leading to the
answer. This method has achieved significant im-
provements on several complex reasoning tasks, but
primarily with very large models (over 100 billion pa-
rameters).
By contrast, small models (Small Language Mod-
els, SLMs) tend to produce superficially coherent but
logically incorrect reasoning (Gudibande et al., 2023),
often leading to worse performance than conventional
prompting. To address this limitation without re-
sorting to resource-intensive techniques such as fine-
tuning or knowledge distillation, we adopted a strat-
egy of directly injecting pre-constructed reasoning
chains into the few-shot prompt.
Three variants of CoT prompting were tested:
1. Zero-shot Chain-of-Thought (zero-CoT), based
on the approach of Kojima et al. (2022), simply adds the phrase “Let’s think step
by step” to the original prompt.
2. CoT generated by Mistral Le Chat (in-family):
We asked the model to generate complete justi-
fications for occupation–skill pairs, which were
then inserted into a few-shot prompt. The goal
was to test the hypothesis proposed by Ranaldi
and Freitas (2024), namely that a student model
benefits more from exposure to reasoning gener-
ated by a model from the same family (in-family
alignment).
3. ChatGPT-generated CoT (out-of-family): This
variant followed the same approach as above, but
used ChatGPT-4 as the generator. GPT-4 acted
here as an out-of-family teacher model, following
a logic inspired by knowledge distillation with-
out training, but by injecting curated reasoning
examples. All generated outputs were manually
reviewed by an expert annotator.
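The sketch below illustrates how such pre-constructed reasoning chains (variants 2 and 3) can be injected into a few-shot prompt; the demonstrations shown are placeholders standing in for the manually reviewed rationales.

```python
# Sketch of a few-shot CoT prompt built from teacher-generated, human-validated rationales.
COT_DEMOS = [
    {"occ": "network engineer", "skill": "routing protocols",
     "reasoning": "Configuring routers is a core daily task of a network engineer, "
                  "and routing protocols are the technical knowledge this requires.",
     "answer": "yes"},
    {"occ": "network engineer", "skill": "team spirit",
     "reasoning": "Team spirit is a transversal, behavioral skill rather than a "
                  "technical one, so it should not be kept here.",
     "answer": "no"},
]

def build_cot_prompt(occ: str, skill: str) -> str:
    """Concatenate the reviewed demonstrations, then append the pair to evaluate."""
    parts = []
    for d in COT_DEMOS:
        parts.append(f"Occupation: {d['occ']}\nSkill: {d['skill']}\n"
                     f"Reasoning: {d['reasoning']}\nAnswer: {d['answer']}\n")
    parts.append(f"Occupation: {occ}\nSkill: {skill}\nReasoning:")
    return "\n".join(parts)
```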
4.1.3 Self-Consistency Decoding
The self-consistency (SC) method, introduced by
Wang et al. (2022), aims to improve the
robustness of CoT prompting. Instead of relying on a
single greedy response, this approach samples multi-
ple reasoning chains via stochastic decoding and de-
termines the final answer by majority voting. The un-
derlying intuition is that for complex tasks, correct
reasoning—though diverse—tends to converge on the
same conclusion more often than incorrect reasoning.
Figure 2: Self-consistency using a CoT prompt generated
by the teacher LLM and validated by a human.
In this study, we did not use the classic probabilis-
tic decoding mechanism, but instead adapted the self-
consistency concept for our context. More specifi-
cally, we implemented two variants:
Majority Voting. For each occupation–skill pair,
we generated nine independent responses from the
same model using an identical few-shot CoT prompt.
The final prediction (yes or no) was determined by a
majority vote across the nine responses.
Universal Self-Consistency (USC). Inspired by
Chen et al. (2023), this variant consists of
submitting the nine generated responses to a follow-
up prompt in which the same model is asked to select
the most consistent answer, according to its judgment.
Figure 2 illustrates the overall process of imple-
menting SC using a CoT prompt generated by the
teacher LLM.
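Both variants can be sketched as follows, reusing the ask and build_cot_prompt helpers from the previous sketches; the parsing of the final yes/no label is an illustrative convention.

```python
# Sketch of majority-vote self-consistency and a USC-style follow-up prompt.
from collections import Counter

def majority_vote(occ: str, skill: str, n: int = 9, model: str = "mistral:7b") -> str:
    """Sample n answers with the same few-shot CoT prompt and keep the majority label."""
    answers = []
    for _ in range(n):
        out = ask(build_cot_prompt(occ, skill), model=model)
        answers.append("yes" if "yes" in out.split("answer:")[-1] else "no")
    return Counter(answers).most_common(1)[0][0]

def universal_self_consistency(occ: str, skill: str, n: int = 9, model: str = "mistral:7b") -> str:
    """Let the same model pick the most consistent answer among its own samples."""
    samples = [ask(build_cot_prompt(occ, skill), model=model) for _ in range(n)]
    followup = ("Here are several candidate answers to the same question:\n"
                + "\n---\n".join(samples)
                + "\nSelect the most consistent final answer. Reply yes or no.")
    return ask(followup, model=model)
```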
4.1.4 Reformulating the Instructions
Although the techniques explored in the previous sub-
sections improved model performance, the results re-
mained insufficient to achieve satisfactory filtering
quality. In particular, the Mistral 7B model (Jiang et al., 2023), despite its efficiency and speed, frequently produced contradictory responses, sometimes
accompanied by incorrect or internally inconsistent
explanations.
As also observed by Gudibande et al. (2023), SLMs tend to mimic the reasoning struc-
ture of large teacher models without really under-
standing the underlying logic. This limitation reduces
their ability to accurately detect errors in noisy data.
We then formulated two hypotheses to explain the
persistence of false positives:
Prompt Wording: Our initial prompt asked
whether a “skill” was required for a given job, while
some of the inputs to be evaluated were not skills at
all (e.g., “job search”, “10”, “thread”). This lexical
bias may have led the model to validate such terms by
default.
Model Size: Despite recent advances, a model’s
overall knowledge and reasoning capabilities remain
strongly correlated with its size. Smaller models
struggle to generalize or to effectively exploit limited
contextual clues.
To test the first hypothesis, we designed a revised
prompt that emphasizes the automatic and potentially
noisy nature of the candidate terms to be assessed.
The aim was to free the model from the implicit as-
sumption that “this term is a skill” and encourage it to
more readily reject vague or irrelevant inputs:
You are a job market expert. You are given a
job and a candidate skill. This skill was ex-
tracted automatically and may be incorrect, ir-
relevant, or too vague. Your task is to deter-
mine whether this skill is:
yes: a technical skill that is truly necessary for
this job; no: a behavioral or transversal skill,
or a skill that does not correspond to this job
(e.g., vague, redundant, irrelevant, etc.).
Answer only “yes” or “no”, followed by a
short explanation.
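A sketch of this revised instruction wrapped as a reusable template is given below; the model tag and the parsing of the reply into a binary label are illustrative.

```python
# Sketch of the revised prompt applied to one occupation-skill pair via Ollama.
REVISED_PROMPT = """You are a job market expert. You are given a job and a candidate skill.
This skill was extracted automatically and may be incorrect, irrelevant, or too vague.
Your task is to determine whether this skill is:
yes: a technical skill that is truly necessary for this job;
no: a behavioral or transversal skill, or a skill that does not correspond to this job
(e.g., vague, redundant, irrelevant, etc.).
Answer only "yes" or "no", followed by a short explanation.

Job: {occ}
Candidate skill: {skill}"""

def classify(occ: str, skill: str, model: str = "gemma3:12b") -> str:
    """Return 'yes' or 'no' for one pair, using the ask helper from the earlier sketch."""
    reply = ask(REVISED_PROMPT.format(occ=occ, skill=skill), model=model)
    return "yes" if reply.startswith("yes") else "no"
```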
4.2 Ensembling LLMs
To test the second hypothesis (the effect of model
size), we compared the performance of Mistral 7B
with five larger models, each evaluated using both the
initial and the revised prompt.
Then, inspired by the performance gains observed
with self-consistency, we explored the potential of
model ensembling using these larger models, which
already demonstrated satisfactory results individually.
For each occupation–skill pair, we performed a ma-
jority vote across the predictions of the five models. Figure 3 illustrates the overall principle of this voting process.
Figure 3: Schematic representation of the ensembling.
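A minimal sketch of this vote is shown below; the Ollama model tags are placeholders for the five models retained, and classify is the helper from the previous sketch.

```python
# Sketch of the majority vote across several locally served models.
from collections import Counter

ENSEMBLE = ["gemma3:12b", "deepseek-r1:14b", "phi4:14b", "qwen3:14b", "mistral:7b"]  # placeholder set

def ensemble_vote(occ: str, skill: str) -> str:
    """Query each model with the revised prompt and return the majority label."""
    votes = [classify(occ, skill, model=m) for m in ENSEMBLE]
    return Counter(votes).most_common(1)[0][0]
```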
5 EXPERIMENTS
In this section, we present our experiments and re-
sults, organized into three parts corresponding to the
main stages of our methodology.
5.1 Experimental and Evaluation
Protocol
We focused on a subset of the complete ontology
by selecting only the occupations and skills from the
“Telecoms, Hosting, Internet” activity domain. This
subset includes 294 skills linked to 5 occupations. Af-
ter a thorough manual analysis, we determined that
some repeated skills could be grouped by occupation,
resulting in a final set of 289 distinct skills. A man-
ual evaluation by three domain experts was performed
on the obtained results. The inter-annotator agree-
ment, measured using Fleiss’ Kappa (Fleiss, 1971), reached 0.75, indicating a substantial level of agree-
ment, while also highlighting the inherent difficulty
of the task. We conducted a second review to reassess
the points of disagreement. Most of these centered on
skills that also tended to confuse the language mod-
els, falling primarily into two categories: soft skills
and vague or ambiguous terms. Given that the ontol-
ogy already provides a separate list of soft skills for
each occupation, we chose to retain only hard (techni-
cal) skills. Consequently, vague terms such as “inte-
gration” or “infrastructure”, as well as soft skills like “team leader”, were annotated as “no”.
Given that the subset of data used is imbalanced,
with 173 instances labeled “yes” and 116 “no”, we report Precision, Recall, and F-score metrics for both
classes and overall accuracy to ensure a more reliable
evaluation.
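For reference, the agreement score can be computed as sketched below; the label array is an illustrative placeholder for the three annotators' votes on the 289 pairs.

```python
# Sketch of the inter-annotator agreement computation with statsmodels.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = evaluated occupation-skill pairs, columns = the three annotators
labels = np.array([
    ["yes", "yes", "yes"],
    ["no",  "yes", "no"],
    ["no",  "no",  "no"],
    # ... one row per evaluated pair
])

counts, _ = aggregate_raters(labels)   # per-item counts of each category
kappa = fleiss_kappa(counts)           # 0.61-0.80 is conventionally "substantial"
print(f"Fleiss' kappa = {kappa:.2f}")
```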
5.2 Results
5.2.1 Fine-Tuning an Encoder-Only Model
Since the data subset is in French, we selected
CamemBERT-base (Martin et al., 2020), a RoBERTa-
based model, for fine-tuning. The dataset (occupa-
tion–skill pairs) was framed as a binary classifica-
tion problem: 1 = positive, 0 = negative. Positive
pairs came from ESCO. As ESCO lacks negatives, we
tested three negative mining strategies (cf. method-
ology). Two dataset versions were used: (i) a pre-
processed one with lexical noise; (ii) a cleaned ver-
sion where skills were rewritten automatically (e.g.,
“respect echeanciers” → “respect des échéanciers”).
Training used identical hyperparameters (3
epochs, batch size: 32, learning rate: 1e-5). Results
are shown in Table 3.
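A minimal sketch of this fine-tuning setup with the stated hyperparameters is given below; encoding each occupation–skill pair as a text pair and the placeholder training rows are illustrative choices.

```python
# Sketch of binary fine-tuning of CamemBERT on occupation-skill pairs.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = AutoModelForSequenceClassification.from_pretrained("camembert-base", num_labels=2)

# Placeholder rows: positives from ESCO (label 1), mined negatives (label 0)
train_pairs = [("analyste financier", "analyse financière", 1),
               ("analyste financier", "soudure à l'arc", 0)]

def encode(batch):
    # Encode occupation and skill together as a sentence pair
    return tokenizer(batch["occupation"], batch["skill"],
                     truncation=True, padding="max_length", max_length=64)

ds = Dataset.from_dict({"occupation": [p[0] for p in train_pairs],
                        "skill": [p[1] for p in train_pairs],
                        "labels": [p[2] for p in train_pairs]}).map(encode, batched=True)

args = TrainingArguments(output_dir="camembert-skill-clf", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=1e-5)
Trainer(model=model, args=args, train_dataset=ds).train()
```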
5.2.2 Prompt Engineering
Zero-Shot. With Mistral 7B, the basic prompt “Does
occupation X require skill Y?” revealed: (i) a strong
bias toward “yes”; (ii) systematic inclusion of soft
skills; (iii) lexical variability in outputs; (iv) sensitiv-
ity to wording.
Few-Shot. Using four annotated examples, Mis-
tral reproduced the demo format but also inherited the
class imbalance, yielding extreme “yes” bias (only 15
“no” predictions).
CoT and SC. Chain-of-Thought methods gave
modest gains but increased costs. Comparing Uni-
versal Self-Consistency (USC) and majority-vote SC
showed SC was usually superior with zero-CoT
prompts. Table 4 shows results.
New Prompt. To improve results, we designed a
revised prompt (see Methodology). Tested zero-shot
with several LLMs, it significantly boosted accuracy
(Table 5).
5.3 Discussion
5.3.1 Negative Mining Strategies
For encoder-only models, we observed that: (i) easy negatives caused overfitting, with near-perfect accuracy on ESCO but poor generalization; (ii) random negatives offered the best trade-off, with balanced precision/recall and stable training; (iii) hard negatives slowed convergence but improved robustness, especially on our manual dataset.
Fine-tuned models consistently performed better
on Mistral-corrected inputs, confirming the role of
lexical clarity. BERT models underperform LLMs,
partly because ESCO skills are long and generic,
while our ontology emphasizes concise, technical
terms. This limits BERT’s vocabulary alignment and
cross-dataset generalization.
5.3.2 LLM-Based Evaluations
LLMs showed both strengths and limitations. Small
models (e.g., Mistral 7B) lacked reasoning depth and
were prompt-sensitive. Larger ones (12–14B) showed
biases and often accepted vague terms. Intra-model
variability remained an issue across runs.
These findings motivated an ensemble strategy us-
ing majority voting, which reduced inconsistencies.
Compared with GPT-4o/4.1, large open-source mod-
els were already competitive, yet our ensemble con-
sistently outperformed both them and individual GPT
baselines (Table 6).
Table 3: Results with fine-tuned CamemBERT.

Model       Data      Method        P(yes)  P(no)  R(yes)  R(no)  F1(yes)  F1(no)  Accuracy
CamemBERT   Preproc.  random neg.    0.63    0.54   0.87    0.23   0.73     0.33    0.61
                      easy neg.      0.59    0.09   0.94    0.01   0.72     0.02    0.57
                      hard 95%       0.66    0.53   0.75    0.42   0.70     0.47    0.62
                      hard 90%       0.81    0.35   0.65    0.55   0.72     0.43    0.63
            Cleaned   random neg.    0.63    0.77   0.97    0.17   0.77     0.28    0.65
                      easy neg.      0.60     –     1.00    0.00   0.75      –      0.60
                      hard 95%       0.65    0.64   0.89    0.29   0.75     0.40    0.65
                      hard 90%       0.66    0.78   0.95    0.27   0.78     0.39    0.68
Table 4: Performance of Mistral 7B with prompting methods.

Model       Method          P(yes)  P(no)  R(yes)  R(no)  F1(yes)  F1(no)  Accuracy
Mistral 7B  zero-shot        0.62    0.53   0.87    0.22   0.72     0.31    0.61
            few-shot         0.65    0.74   0.94    0.25   0.77     0.37    0.66
            zero-shot CoT    0.68    0.56   0.75    0.47   0.71     0.51    0.64
            CoT Mistral      0.65    0.65   0.90    0.28   0.76     0.39    0.65
            CoT ChatGPT      0.65    0.80   0.96    0.24   0.78     0.37    0.67
            CoT SC           0.67    0.70   0.90    0.34   0.77     0.46    0.67
            CoT USC          0.64    0.60   0.89    0.25   0.74     0.35    0.63

Table 5: Performance with old vs. new zero-shot prompt.

Model            Prompt  P(yes)  P(no)  R(yes)  R(no)  F1(yes)  F1(no)  Accuracy
Mistral 7B       old      0.62    0.53   0.87    0.22   0.72     0.31    0.61
                 new      0.74    0.66   0.80    0.58   0.77     0.61    0.71
Gemma-3 12B      old      0.68    0.69   0.89    0.36   0.77     0.48    0.68
                 new      0.88    0.91   0.95    0.81   0.91     0.86    0.89
DeepSeek-R1 14B  old      0.68    0.53   0.68    0.53   0.68     0.53    0.62
                 new      0.85    0.85   0.91    0.77   0.88     0.81    0.85
Phi-4 14B        old      0.61    0.42   0.55    0.48   0.58     0.45    0.52
                 new      0.88    0.90   0.94    0.82   0.91     0.86    0.89
Qwen-3 14B       old      0.72    0.51   0.58    0.65   0.64     0.58    0.61
                 new      0.91    0.81   0.86    0.87   0.88     0.84    0.86

Table 6: Ensemble vs GPT models with new zero-shot prompt.

Model     Prompt  P(yes)  P(no)  R(yes)  R(no)  F1(yes)  F1(no)  Accuracy
Ensemble  new      0.93    0.91   0.94    0.90   0.94     0.90    0.92
GPT-4o    new      0.77    0.92   0.97    0.57   0.86     0.70    0.81
GPT-4.1   new      0.86    0.88   0.93    0.77   0.89     0.82    0.86
6 CONCLUSION
This study explored several strategies to improve
the automatic validation of occupation–skill relations
in an ontology, combining fine-tuned encoder-based
models and prompt-based LLM evaluations. We
demonstrated that hard negative mining yields more
robust classification for encoder models, especially
when coupled with input correction. In parallel,
prompt engineering and reasoning-based prompting
(CoT, self-consistency) improved LLM performance,
though limitations persisted—particularly in smaller
models. To address these, we proposed an ensemble
approach that outperformed all individual models, in-
cluding proprietary LLMs like GPT-4, highlighting its
potential as a lightweight yet effective alternative for
ontology curation tasks.
Despite remaining challenges, our work opens
promising directions for automating knowledge base
validation and enrichment. In future work, we aim to
investigate fine-tuning strategies for LLMs to improve
their reasoning on domain-specific tasks. Another
perspective involves adapting our methods to differ-
ent domains and ontological structures. We also see
potential in integrating external knowledge sources,
such as curated databases of occupations and skills, to
enhance LLM interpretability and decision-making.
Finally, assessing the impact of these methods on real-
world applications, like recommendation systems or
career guidance platforms, would be an essential step
toward validating their practical value.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge InterMEDIUS
for the partial funding of this work.
REFERENCES
Abolhasani, M. S. and Pan, R. (2024). Leveraging llm for
automated ontology extraction and knowledge graph
generation.
Brown, T. B. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].
Chen, X., Aksitov, R., Alon, U., Ren, J., Xiao, K., Yin,
P., Prakash, S., Sutton, C., Wang, X., and Zhou,
D. (2023). Universal Self-Consistency for Large
Language Model Generation. arXiv e-prints, page
arXiv:2311.17311.
de Souza P. Moreira, G., Osmulski, R., Xu, M., Ak, R.,
Schifferer, B., and Oldridge, E. (2025). Nv-retriever:
Improving text embedding models with effective hard-
negative mining.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova,
K. (2018). BERT: Pre-training of Deep Bidirec-
tional Transformers for Language Understanding.
arXiv:1810.04805 [cs] version: 1.
Jiang, A. Q., et al. (2023). Mistral 7B.
Fleiss, J. L. (1971). Measuring nominal scale agree-
ment among many raters. Psychological Bulletin,
76(5):378–382. Place: US Publisher: American Psy-
chological Association.
Gudibande, A., Wallace, E., Snell, C., Geng, X., Liu, H.,
Abbeel, P., Levine, S., and Song, D. (2023). The
False Promise of Imitating Proprietary LLMs. arXiv
e-prints, page arXiv:2305.15717.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y.
(2022). Large language models are zero-shot reason-
ers. In Koyejo, S., Mohamed, S., Agarwal, A., Bel-
grave, D., Cho, K., and Oh, A., editors, Advances in
Neural Information Processing Systems, volume 35,
pages 22199–22213. Curran Associates, Inc.
Kommineni, V. K., König-Ries, B., and Samuel, S. (2024).
From human experts to machines: An llm supported
approach to ontology and knowledge graph construc-
tion.
le Vrang, M., Papantoniou, A., Pauwels, E., Fannes, P., Van-
densteen, D., and De Smedt, J. (2014). Esco: Boosting
job matching in europe with semantic interoperability.
Computer, 47(10):57–64.
Martin, L., Muller, B., Ortiz Suárez, P. J., Dupont, Y., Romary, L., de la Clergerie, É., Seddah, D., and Sagot, B.
(2020). CamemBERT: a tasty French language model.
In Jurafsky, D., Chai, J., Schluter, N., and Tetreault,
J., editors, Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 7203–7219, Online. Association for Computa-
tional Linguistics.
Meyer, L.-P., Stadler, C., Frey, J., Radtke, N., Junghanns,
K., Meissner, R., Dziwis, G., Bulert, K., and Martin,
M. (2024). Llm-assisted knowledge graph engineer-
ing: Experiments with chatgpt. In Zinke-Wehlmann,
C. and Friedrich, J., editors, First Working Con-
ference on Artificial Intelligence Development for a
Resilient and Sustainable Tomorrow, pages 103–115,
Wiesbaden. Springer Fachmedien Wiesbaden.
Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu,
Y., Miller, A. H., and Riedel, S. (2019). Language
models as knowledge bases? arXiv preprint
arXiv:1909.01066.
Ranaldi, L. and Freitas, A. (2024). Aligning Large and
Small Language Models via Chain-of-Thought Rea-
soning. In Graham, Y. and Purver, M., editors, Pro-
ceedings of the 18th Conference of the European
Chapter of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 1812–1827,
St. Julian’s, Malta. Association for Computational
Linguistics.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sen-
tence embeddings using Siamese BERT-networks. In
Inui, K., Jiang, J., Ng, V., and Wan, X., editors, Pro-
ceedings of the 2019 Conference on Empirical Meth-
ods in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 3982–3992, Hong
Kong, China. Association for Computational Linguis-
tics.
Robinson, J., Chuang, C.-Y., Sra, S., and Jegelka, S. (2021).
Contrastive learning with hard negative samples.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin,
I. (2017). Attention is all you need. In Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fer-
gus, R., Vishwanathan, S., and Garnett, R., editors,
Advances in Neural Information Processing Systems,
volume 30. Curran Associates, Inc.
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E.,
and Zhou, D. (2022). Self-Consistency Improves
Chain of Thought Reasoning in Language Models.
arXiv:2203.11171 [cs] version: 1.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E.,
Le, Q., and Zhou, D. (2022). Chain of Thought
Prompting Elicits Reasoning in Large Language Mod-
els. arXiv:2201.11903 [cs] version: 1.