Ontology-Grounded Language Modeling: Enhancing GPT-Based

Philosophical Text Generation with Structured Knowledge

Claire Ponciano

, Markus Schaffert

and Jean-Jacques Ponciano

i3mainz, University of Applied Sciences, Germany

Keywords:

Ontology-Grounded Language Modeling, GPT, Knowledge-Enhanced Text Generation, Retrieval-Augmented

Generation, Spinoza, Linked Open Data, Historical Text Synthesis, Philosophical Language Modeling,

BERTScore Evaluation, Structured Knowledge Integration, Latin Text Generation, Large Language Models,

Text Style Transfer, Semantic Conditioning, Canonical Corpus Fine-Tuning.

Abstract:

We present an ontology-grounded approach to GPT-based text generation aimed at improving factual ground-

ing, historical plausibility, and stylistic ﬁdelity in a case study: Baruch Spinoza’s Latin writings. We construct

a compact ontology from Linked Open Data (Wikidata/DBpedia) augmented with expert-curated facts, seri-

alize triples into natural-language statements, and interleave these with a canonical Latin corpus during ﬁne-

tuning of a GPT-2 (124M) model. At inference, retrieval-augmented generation (RAG) prepends ontology-

derived facts and lightweight stylistic instructions, guiding the model toward historically consistent continua-

tions in Spinoza’s register. Evaluation follows an 80/20 paragraph split of Ethica: we generate continuations

for the 80% of segments retained and measure the semantic similarity (BERTScore) with the 20% omitted.

This evaluation is completed by an expert assessment of historical plausibility and cosine similarity scores

computation for the stylistic authenticity. Relative to a GPT-2 baseline trained only on the Latin corpus, our

ontology-grounded variant achieves higher BERTScore and produces fewer factual and conceptual errors, pre-

serving Latin rhetorical structure. These results indicate that structured knowledge integration is a feasible and

effective way to make generative models more reliable for cultural-heritage text.

1 INTRODUCTION

The preservation of cultural-heritage texts is ham-

pered by losses due to deterioration and historical

events, and by restoration workﬂows that rely on ex-

pert inference, cross-referencing, and fragment inter-

pretation—processes that are time-intensive and sub-

jective. Recent advances in NLP and large language

models (LLMs) offer automation, but lack the seman-

tic precision needed to reproduce intricate philosoph-

ical and scientiﬁc texts. Ontological knowledge bases

provide the required structure and contextual ground-

ing.

We propose integrating dynamic ontology gener-

ation with LLMs, building on our ODKAR frame-

work (Ontology-Based Dynamic Knowledge Acqui-

sition and Automated Reasoning)(Prudhomme et al.,

2024), which uses NLP, OWL, and SWRL to con-

struct ontologies from text. ODKAR-derived triples

https://orcid.org/0000-0001-8883-8454

https://orcid.org/0000-0002-7970-9164

https://orcid.org/0000-0001-8950-5723

are serialized as natural-language statements and sup-

plied to the LLM to guide reconstruction. Using

Spinoza as a case study, we target plausible, style-

consistent reconstructions of missing passages rather

than literal recovery of lost works.

Our goals are historical authenticity, semantic

consistency, and linguistic-philosophical coherence.

We evaluate on a comprehensive corpus with approx-

imately 30% held out, asking the system to recon-

struct withheld segments under predeﬁned structures.

Results quantify restoration accuracy and qualitative

coherence, and show that ontology-grounded gener-

ation: (i) combines structured semantics with gen-

erative modeling for historical restoration, (ii) main-

tains semantic ﬁdelity and logical consistency via au-

tomated processing, and (iii) yields a robust, repro-

ducible framework applicable beyond this case study.

Ponciano, C., Schaffert, M. and Ponciano, J.-J.

Ontology-Grounded Language Modeling: Enhancing GPT-Based Philosophical Text Generation with Structured Knowledge.

DOI: 10.5220/0013864400003985

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 21st International Conference on Web Information Systems and Technologies (WEBIST 2025), pages 459-467

ISBN: 978-989-758-772-6; ISSN: 2184-3252

459

2 RELATED WORK

2.1 Personality-Aware Text Generation

Persona-grounded generation conditions models on

proﬁle traits, starting with Persona-Chat (Zhang et al.,

2018). Transformer baselines (e.g., GPT-2) and

adapters such as PsychAdapter (Liu et al., 2023) in-

ject continuous trait embeddings (e.g., Big Five) to

yield stylistic consistency (Zheng et al., 2023). Ef-

ﬁcient control uses Contrastive Activation Steering

and LoRA for style adaptation without full retraining

(Zheng et al., 2023; Hu et al., 2021). Evaluation typ-

ically combines automatic metrics (BLEU, ROUGE)

with human judgments for persona alignment and co-

herence (Papineni et al., 2002; Lin, 2004; Zhang et al.,

2018).

2.2 Multilingual and Low-Resource

Persona Modeling

Work has focused largely on English; XPersona

broadened coverage and showed the promise of mul-

tilingual transformers (Lin et al., 2020). Zero-shot

cross-lingual transfer remains difﬁcult due to cul-

tural/linguistic variation (Majumder et al., 2020; Lin

et al., 2021; Zheng et al., 2021). For low-resource

settings, researchers rely on machine translation, mul-

tilingual pretraining (e.g., mT5, XLM-R), and care-

ful ﬁne-tuning or prompting, though methods tailored

speciﬁcally to sparse supervision are still limited (Lin

et al., 2020; Majumder et al., 2020; Hedderich et al.,

2021).

2.3 Ontology and Linked Open Data

(LOD) for Knowledge-Aware

Generation

Ontologies and LOD enable structured data-to-text

with semantic rigor (Gardent et al., 2017; Shimorina

and Gardent, 2019). Knowledge-graph sources (DB-

pedia, Wikidata) guide neural generators toward fac-

tual ﬁdelity and coverage (Gardent et al., 2017; Fer-

reira et al., 2020). Transformer-era systems achieve

strong accuracy and completeness on WebNLG-style

benchmarks, highlighting the value of explicit struc-

ture for generation (Gardent et al., 2017; Shimorina

and Gardent, 2019).

2.4 Integrating Ontologies with LLMs

LLMs are ﬂuent but prone to hallucinations (Ji et al.,

2023). Integrations that surface structured knowl-

edge—e.g., knowledge-enhanced prompting (KELP)

and historically informed models (Kongzi)—improve

factuality, semantic consistency, and contextual ade-

quacy (Liu et al., 2024; Yao et al., 2023). Multilingual

LOD further supplies language-agnostic context that

beneﬁts low-resource scenarios (Gardent et al., 2017;

Ferreira et al., 2020). Overall, combining ontologi-

cal structure with LLMs is a promising route to reli-

able, context-sensitive generation in cultural-heritage

applications.

3 METHODOLOGY:

ONTOLOGY-INTEGRATED

LLM PIPELINE

3.1 Ontology Construction

We ﬁrst constructed a structured ontology of Baruch

Spinoza’s life, works, and intellectual milieu lever-

aging Linked Open Data (LOD) resources such as

DBpedia(Auer et al., 2007) and Wikidata (Vrande

and Kr

otzsch, 2014). These resources were aug-

mented with manually curated historical facts to ﬁll

critical gaps (e.g., Spinoza’s Portuguese-Jewish an-

cestry, his emigration to Amsterdam due to religious

persecution, and his excommunication from the Jew-

ish community in 1656). The resulting knowledge

was formalized into RDF/OWL triples (McGuinness

and Van Harmelen, 2004), capturing semantic rela-

tionships such as inﬂuencedBy (Descartes), hasEth-

nicBackground (Portuguese-Jewish), and authored-

Work (Ethica). This structured representation facili-

tated precise semantic querying and integration with

the language model.

3.2 Ontology-Grounded Pretraining

and Fine-Tuning

We ground the GPT-based model in structured knowl-

edge by converting ontology triples into textual state-

ments and integrating them directly into the ﬁne-

tuning corpus (Logan IV et al., 2019; Liu et al., 2021).

Triple-to-Text Conversion Strategy. A

lightweight rule-based pipeline maps RDF triples

(subject, predicate, object) to grammatical English:

• Predicate Splitting: split camel/PascalCase

predicates (e.g., influencedBy → “inﬂuenced

by”; excommunicatedOn → “excommunicated

on”) (Binkley et al., 2009; Allamanis et al., 2021).

• Template Construction:

WEBIST 2025 - 21st International Conference on Web Information Systems and Technologies

460

– if the predicate is verbal, concatenate subject +

(auxiliary) + predicate + object.

Example: (Ethica, authoredBy, Spinoza) →

“The Ethica was authored by Spinoza.”

– if the predicate is adjectival/nominal, use

attributive/possessive verbs (has/had/is).

Example: (Spinoza, ethnicBackground,

Portuguese-Jewish) → “Spinoza had a

Portuguese-Jewish ethnic background.”

• Named Entities: preserve canonical capitaliza-

tion and formatting (Spinoza, Ethica, Descartes).

This yields scalable, interpretable factual sentences

while preserving ontology semantics.

Corpus Integration and Dataset Preparation.

For each training batch, ontology sentences are

randomly interleaved with authentic Latin passages

so the model jointly learns factual structure and

Spinoza’s style. The corpus thus contains:

1. Original Latin: “Per Deum intelligo Ens abso-

lute inﬁnitum, hoc est, substantiam constantem in-

ﬁnitis attributis.”

2. Ontology-Grounded Fact: “Spinoza originally

published many of his works posthumously to

avoid religious persecution.”

This explicit structuring reduces ambiguity, enforces

consistency, and encourages the model to internalize

historical relations rather than infer them implicitly.

Model Fine-Tuning Procedure. We ﬁne-tune

GPT-2 small (124M) with AdamW (Loshchilov and

Hutter, 2019); learning rate 5×10

−5

(linear schedule,

10% warm-up), batch size 8, for 5–10 epochs. Early

stopping monitors validation perplexity on a 10%

held-out subset of the Latin corpus to maintain

stylistic coherence while injecting factual grounding

(Liu et al., 2021).

Example Training Instance. Input: “Spinoza had

a Portuguese-Jewish ethnic background. He was

excommunicated on 1656. He was inﬂuenced by

Descartes. Ethica was authored by Spinoza. He pub-

lished posthumously.”

Target (Continuation from Original Corpus): “Per

Deum intelligo Ens absolute inﬁnitum, hoc est, sub-

stantiam constantem inﬁnitis attributis.”

The GPT-2 model is autoregressive and takes a

single concatenated sequence; “input/target” above il-

lustrates the intended continuation context rather than

separate encoder/decoder inputs.

3.3 Training Data Clariﬁcation

Our ﬁne-tuning corpus comprises (i) Spinoza’s Latin

texts (Ethica, TTP, selected letters) and (ii) ontology-

derived triple-to-text sentences generated from our

knowledge graph. No additional external prose cor-

pora were used. The ontology sentences expose fac-

tual relations (e.g., inﬂuencedBy, authoredBy) explic-

itly; the Latin corpus imparts style and rhetoric.

3.4 Ontology-Conditioned Inference

During inference, the GPT-based model leveraged

retrieval-augmented generation (RAG) techniques

(Lewis et al., 2020) to dynamically condition gen-

erated outputs on pertinent ontological knowledge.

This process ensured that the model’s generated text

maintained historical accuracy and factual grounding

by explicitly referencing contextually relevant knowl-

edge stored within the ontology.

Dynamic Ontology Retrieval: Given an initial tex-

tual prompt provided by the user or an application

context, the inference procedure began by querying

the ontology dynamically. These queries were exe-

cuted using standard semantic web querying proto-

cols (e.g., SPARQL) or embedding-based semantic

retrieval methods. For example, to generate text “un-

der persecution shortly before Spinoza’s death,” the

system performed the following SPARQL query to re-

trieve relevant historical facts:

PREFIX onto:

SELECT ?event ?date ?detail WHERE {

?event onto:concernsPerson onto:Spinoza .

?event onto:occurredOnDate ?date .

?event onto:hasDetail ?detail .

FILTER(?date >= "1656"ˆˆxsd:gYear

&& ?date <= "1677"ˆˆxsd:gYear)

FILTER regex(?detail,

"persecution|excommunication|censorship",

"i")}

This retrieval resulted in triples such as:

• (Spinoza, excommunicatedOn, 1656)

• (Spinoza, publishedPosthumously, true)

• (Ethica, originalLanguage, Latin)

• (Spinoza, inﬂuencedBy, Descartes)

Embedding-based retrieval methods alternatively

allowed querying via vector similarity, especially use-

ful when handling natural language prompts. For

example, embedding the query “Spinoza persecution

and death” allowed rapid semantic retrieval of re-

lated facts without explicit SPARQL syntax, facilitat-

ing more ﬂexible retrieval scenarios.

Ontology-Grounded Language Modeling: Enhancing GPT-Based Philosophical Text Generation with Structured Knowledge

461

Prompt Construction with Retrieved Facts:

Once retrieved, the ontological facts were synthe-

sized into a structured natural-language context that

was prepended directly to the inference prompt

provided to the GPT model. This contextual prompt

explicitly informed the model of essential historical

details, guiding the subsequent text generation. An

explicit example of prompt construction from the

above retrieved facts is as follows:

Constructed Prompt (Contextual Introduc-

tion):

“Baruch Spinoza was excommunicated by the

Jewish community in Amsterdam in 1656 due

to his radical philosophical views. Due to fear

of religious persecution, he published many

of his writings posthumously, including the

Ethica, originally composed in Latin. Deeply

inﬂuenced by Descartes, Spinoza further ex-

tended rationalist philosophy. The following

text, written shortly before his death under

persecution, reﬂects his philosophical reason-

ing and stylistic approach:”

This detailed contextualization signiﬁcantly en-

hanced the generated text’s ﬁdelity to Spinoza’s his-

torical situation and philosophical lineage.

Prompt Engineering for Stylistic Alignment: Be-

yond factual grounding, explicit instructions were in-

cluded to encourage the model to mimic Spinoza’s

distinctive philosophical and rhetorical style. These

prompt engineering techniques were critical in condi-

tioning the model’s generative process. For example,

explicit stylistic directions embedded within the infer-

ence prompt included:

“The following text should emulate the

philosophical argumentation style of Baruch

Spinoza, characterized by structured logical

reasoning, extensive use of Latin philosoph-

ical terminology, and geometric method pre-

sentation.”

In this study, prompts encouraging stylistic align-

ment were manually crafted based on domain exper-

tise. However, prompts of comparable effectiveness

can also be generated automatically using retrieval-

augmented methods or embedding-based similarity

techniques. Speciﬁcally, by encoding known samples

of Spinoza’s writing style into vector embeddings,

automatic retrieval can identify representative stylis-

tic patterns. These identiﬁed patterns can then form

the basis of automatically generated prompts that in-

struct the language model to produce outputs closely

aligned with Spinoza’s original rhetoric and philo-

sophical methodology. Such automation potentially

enhances scalability, reduces manual effort, and en-

sures consistency across numerous inference tasks.

Such explicit stylistic instructions, coupled with

factual grounding provided by retrieved ontology

triples, ensured both the historical accuracy and lin-

guistic authenticity of generated texts.

Example of Final Inference Prompt: A compre-

hensive inference prompt incorporating both factual

context and stylistic instruction is exempliﬁed below:

Final Prompt Provided to the GPT Model:

“Baruch Spinoza was excommunicated by the

Jewish community in Amsterdam in 1656 due

to his radical philosophical views. Fearing

religious persecution, he chose to publish

many works posthumously, including the

Ethica, originally composed in Latin. Deeply

inﬂuenced by Descartes, Spinoza extended

rationalist thought signiﬁcantly beyond his

predecessor’s bounds. The following Latin

text, composed shortly before his death under

persecution, must demonstrate Spinoza’s

philosophical reasoning, structured logical

argumentation, and characteristic Latin

rhetorical style:”

[The model-generated Latin philosophical

text follows here.]

This carefully structured prompt ensured the lan-

guage model’s response adhered strictly to histori-

cal events, intellectual contexts, and stylistic expec-

tations.

Generation Procedure and Model Parameters:

The GPT model generated text using nucleus (top-p)

sampling (Holtzman et al., 2019), with p = 0.9, ensur-

ing a balance between textual coherence and lexical

diversity. We set the maximum generation length to

256 tokens, effectively constraining the model to pro-

duce concise, historically plausible narratives with-

out deviation or content drift. Additionally, repetition

penalties and controlled decoding methods were used

to avoid redundant phrasing and enforce linguistic

variability consistent with Spinoza’s authentic works.

Through this ontology-conditioned inference pro-

cess, the language model reliably produced histori-

cally coherent and stylistically accurate outputs. Such

grounding methodology effectively mitigated com-

mon generative model issues like hallucinations and

factual inaccuracies, ensuring each generated piece

maintained high scholarly integrity and consistency

with known historical data.

WEBIST 2025 - 21st International Conference on Web Information Systems and Technologies

462

3.5 Canonical Corpus Fine-Tuning

In parallel with ontology grounding, we ﬁne-tune

the model on a curated corpus of Spinoza’s authen-

tic Latin to reinforce rhetoric and argument struc-

ture. The corpus covers Ethica, Tractatus Theologico-

Politicus (TTP), and selected letters, capturing his ge-

ometric method and epistolary register.

Corpus Collection and Selection. Texts were

drawn from reliable repositories (Project Gutenberg,

Wikisource) and scholarly digitizations to ensure ﬁ-

delity. Composition:

• Ethica, ordine geometrico demonstrata (1677):

complete treatise with axioms, propositions,

corollaries.

• Tractatus Theologico-Politicus (1670): sustained

theological–political argumentation.

• Letters (1661–1676): selections exhibiting stylis-

tic and rhetorical variation.

Text Preprocessing and Normalization. We (i) re-

move marginalia/OCR artifacts; (ii) minimally nor-

malize 17

-century orthography (e.g., ciuitas →

civitas, vnus → unus); (iii) segment into sen-

tences/propositions. Example segmentation:

Original: “Per Deum intelligo Ens absolute

inﬁnitum, hoc est substantiam constantem in-

ﬁnitis attributis. Unumquodque attributum ex-

primit certam inﬁnitam essentiam aeternam.”

Segments: (1) “Per Deum intelligo Ens abso-

lute inﬁnitum, hoc est substantiam constantem

inﬁnitis attributis.”

(2) “Unumquodque attributum exprimit cer-

tam inﬁnitam essentiam aeternam.”

Tokenization Using Byte-Pair Encoding (BPE).

A BPE tokenizer (Sennrich et al., 2016) trained on

the Latin corpus captures morphological regularities

typical of philosophical Latin. Example:

“substantiam constantem inﬁnitis attributis”

→ [substant, iam, constant, em, inﬁnit, is, at-

tribut, is]

Fine-Tuning Procedure and Hyperparameters.

We ﬁne-tune GPT-2 small (124M) with AdamW

(Loshchilov and Hutter, 2019); LR 3 × 10

−5

, weight

decay 0.01, batch size 8, for 5–10 epochs, using

early stopping on validation perplexity (10% held-

out). This stabilizes convergence on a relatively small

corpus while preserving stylistic coherence.

Illustrative Training Example: An explicit exam-

ple of a ﬁne-tuning training instance is illustrated be-

low:

Input (prompt):

”Per Deum intelligo Ens absolute inﬁnitum,”

Target (continuation):

”hoc est substantiam constantem inﬁnitis at-

tributis, quorum unumquodque aeternam et

inﬁnitam essentiam exprimit.”

This explicit input-target training format enabled

the GPT model to learn detailed continuations charac-

teristic of Spinoza’s logical argumentation structure,

linguistic style, and speciﬁc vocabulary.

Outcome and Intended Effect. Stylistic ﬁne-

tuning consolidates Spinoza’s Latin (vocabulary, syn-

tax, geometric exposition). Combined with ontology-

grounded facts, the model produces historically

grounded, stylistically faithful generations closely

aligned with known texts.

3.6 Text Generation Evaluation

To assess the performance and efﬁcacy of our

ontology-grounded GPT model, we conducted an ex-

tensive evaluation across three distinct but comple-

mentary dimensions: stylistic alignment, historical

plausibility, and factual grounding. Each dimension

utilized speciﬁc methods, metrics, and expert valida-

tion processes to ensure comprehensive coverage of

evaluation criteria.

1. Stylistic Alignment: Stylistic alignment as-

sessed how closely generated texts conformed to

Spinoza’s authentic linguistic and rhetorical style. To

quantify stylistic similarity objectively, we employed

sentence-level embedding similarity metrics using

pretrained multilingual language models (e.g., mul-

tilingual Sentence-BERT) (Reimers and Gurevych,

2019). Embeddings of generated texts were compared

against embeddings from authentic Spinoza texts to

calculate cosine similarity scores. For instance:

Generated Latin text:

”Ens inﬁnitum absolute intellegi debet, cuius

substantia inﬁnitis attributis exprimitur...”

Original Spinoza text:

”Per Deum intelligo Ens absolute inﬁnitum,

hoc est substantiam constantem inﬁnitis at-

tributis.”

Computed Cosine Similarity Score: 0.92

A higher similarity score indicated stronger stylistic

coherence.

Ontology-Grounded Language Modeling: Enhancing GPT-Based Philosophical Text Generation with Structured Knowledge

463

2. Historical Plausibility: Historical plausibil-

ity evaluation ensured the generated text accurately

reﬂected the historical context and scenarios of

Spinoza’s life and work. This dimension primarily

relied on expert review by professional historians and

philosophers specialized in Spinoza’s biography and

historical period (17th-century Europe).

Evaluators examined each text speciﬁcally for:

• Correct temporal referencing (e.g., no references

beyond Spinoza’s death in 1677).

• Consistency with known historical events (e.g.,

persecution and excommunication facts).

• Absence of anachronistic references (modern

terms or historically inaccurate details).

An illustrative example of historically plausible

generated content evaluated positively is:

”Anno 1656 ex communitate judaica expul-

sus sum, quod opiniones meae rationis limites

transcendebant et doctrinam Cartesianam ul-

tra propagavi.”

Translation: ”In the year 1656, I was expelled

from the Jewish community because my opin-

ions transcended traditional rational bound-

aries and extended Cartesian doctrine further.”

Evaluators assigned a plausibility score on a Lik-

ert scale (1–5), where 5 indicated high historical plau-

sibility (as shown above), and 1 indicated clear histor-

ical inaccuracies or anachronisms.

3. Factual Grounding (Concrete Evaluation Pro-

cedure): The factual grounding evaluation quanti-

tatively assessed how accurately generated text re-

ﬂected facts explicitly deﬁned in the constructed on-

tology. In practice, each sentence generated by the

model was systematically compared to corresponding

ontology triples, verifying the correctness of stated

facts.

The evaluation involved the following concrete

steps:

1. Extraction and Comparison: Facts explicitly

mentioned in the generated text were identiﬁed

and compared against corresponding ontology

triples. Each extracted fact was categorized as ei-

ther correct (True Positive), incorrect or unveriﬁ-

able (False Positive), or omitted (False Negative).

For example, given the generated sentence:

“Spinoza was deeply inﬂuenced by Cartesian

philosophy and published the Ethica posthu-

mously in Latin.”

we explicitly veriﬁed its accuracy against the on-

tology triples:

• (Spinoza, inﬂuencedBy, Descartes) → Correct

(True Positive)

• (Ethica, authoredBy, Spinoza) → Implied

Correctly (True Positive)

• (Ethica, publishedPosthumously, true) → Cor-

rect (True Positive)

• (Ethica, originalLanguage, Latin) → Correct

(True Positive)

• (Spinoza, excommunicatedOn, 1656) → Miss-

ing (False Negative)

Here, while multiple facts were correctly iden-

tiﬁed, certain relevant ontology triples were not

mentioned, resulting in less-than-perfect recall.

2. Quantitative Metrics (Precision, Recall, F1-

score): Precision measured the proportion of cor-

rectly stated facts compared to all stated facts.

Recall evaluated the proportion of ontology facts

correctly reﬂected compared to all relevant on-

tology facts. The F1-score provided a balanced

combination of precision and recall, reﬂecting the

trade-off between completeness and accuracy.

3. Automated Validation with QuestEval: To

complement manual assessments, we utilized

the automated question-answering framework

QuestEval (Scialom et al., 2021). QuestEval gen-

erates targeted factual questions from ontology

triples and scores the model’s answers based on

correctness and completeness.

For instance:

Question: “In what year was Spinoza ex-

communicated?”

Expected Answer: “1656”

The QuestEval framework quantitatively mea-

sured the model’s factual grounding accuracy

across multiple generated texts, reﬂecting realis-

tically varying levels of precision and recall.

This structured evaluation provided an objective

measure of factual grounding, capturing realistic lim-

itations and strengths in the model’s outputs.

Comparative Baseline Evaluation: To contextual-

ize our ontology-integrated approach, we conducted a

comparative evaluation against a baseline GPT model

ﬁne-tuned solely on Spinoza’s textual corpus with-

out ontological grounding. Comparative results high-

lighted clear advantages in all three evaluation dimen-

sions. The ontology-grounded model consistently

demonstrated:

• Higher stylistic similarity scores (average embed-

ding cosine similarity increase from 0.74 to 0.88).

WEBIST 2025 - 21st International Conference on Web Information Systems and Technologies

464

• Signiﬁcantly improved historical plausibility

scores (average expert rating increased from 3.2

to 4.7).

• Enhanced factual grounding accuracy (average

QuestEval score improvement from 0.61 to 0.92).

These systematic comparisons underscore the

ontology-integrated approach’s effectiveness, validat-

ing the hypothesis that structured knowledge integra-

tion signiﬁcantly enhances the quality, accuracy, and

authenticity of text generation.

4 EVALUATION

To systematically validate our ontology-enhanced

GPT-based model, we conducted an evaluation fo-

cused on assessing the impact of our ontology-

grounded approach on the generation quality. We

structured our evaluation into a comparative study,

training the model on a carefully split corpus derived

from Spinoza’s Ethica, and evaluating text generation

performance quantitatively using the widely adopted

metric BERTScore (Zhang et al., 2020).

4.1 Dataset Preparation and Splitting

We use Spinoza’s Ethica as the canonical Latin corpus

and split it 80/20 at paragraph level:

• Train: 80% randomly sampled paragraphs for

ﬁne-tuning (coverage across the whole text).

• Test: remaining 20% held out as ground-truth ref-

erences for generation evaluation.

This split tests the model’s ability to regenerate un-

seen segments coherently and accurately.

4.2 Experimental Setup

We compare:

1. Baseline (no Ontology): GPT-2 small (124M)

ﬁne-tuned only on the 80% corpus.

2. Ontology-Grounded (Ours): same GPT-2 archi-

tecture, ﬁne-tuned on the same 80% plus triple-to-

text facts (Sec. 3).

For broader comparison, we also evaluate GPT-3 and

GPT-3.5 (API) with and without ontology-augmented

prompts.

4.3 Evaluation Metric: BERTScore

We report BERTScore (Zhang et al., 2020) using mul-

tilingual BERT-base embeddings: cosine similarity at

the token level, aggregated as precision (P), recall (R),

and F1 between generated outputs and withheld refer-

ences. Higher values indicate greater semantic close-

ness and coherence.

4.4 Evaluation Procedure

For each withheld paragraph: (1) provide its ini-

tial sentence or short context as prompt; (2)

generate continuations with each model (baseline,

ontology-grounded, and public GPTs); (3) compute

BERTScore P/R/F1 against the corresponding refer-

ence.

Example.

Prompt (from Held-Out): “Deus sive sub-

stantia constans inﬁnitis attributis exprimit.”

Ground-Truth Continuation: “aeternam et

inﬁnitam essentiam, quae necessario existit et

a nulla alia substantia dependet.”

Generated Output (Ontology-Grounded):

“aeternam essentiam inﬁnitam, quae neces-

sario existit neque ex alia causa pendet.”

The generated text aligns conceptually and termi-

nologically with the ground-truth, yielding a high

BERTScore.

4.5 Results and Comparative Analysis

The evaluation results summarized below demon-

strate clear improvements achieved through ontology-

grounded training:

Table 1: Model performance (Precision, Recall, F1). O =

Ontology-based prompting.

Model P R F1

GPT-2 0.781 0.769 0.775

GPT-2+O (ours) 0.892 0.881 0.886

GPT-3 0.823 0.807 0.815

GPT-3.5 0.847 0.832 0.839

GPT-3.5+O 0.878 0.865 0.871

The results indicate that our ontology-grounded

GPT-2 model consistently outperformed the baseline

GPT-2 without ontology integration, demonstrating

substantial improvements in semantic coherence and

textual accuracy (11.1% increase in F1-score). More-

over, while large-scale models like GPT-3 and GPT-

3.5 naturally achieved strong performance, ontology-

enhanced prompting still improved results signiﬁ-

cantly (GPT-3.5 improvement of 3.2% in F1-score).

Ontology-Grounded Language Modeling: Enhancing GPT-Based Philosophical Text Generation with Structured Knowledge

465

4.6 Qualitative Insights

Qualitative inspection of generated texts revealed that

ontology-grounded models produced outputs exhibit-

ing fewer factual inaccuracies and greater historical

ﬁdelity. Example qualitative comparison:

Baseline GPT-2 Output (Without Ontol-

ogy):

”Ens inﬁnitum appellamus quod non potest

existere nisi ut idea mentis nostrae.”

(Translation: ”We call inﬁnite being that

which cannot exist except as an idea in our

minds.”) – Conceptually incorrect relative to

Spinoza.

Ontology-Grounded GPT-2 Output (Ours):

”Ens inﬁnitum appellamus substantiam cuius

essentia necessaria et inﬁnita existentia est.”

(Translation: ”We call inﬁnite being the sub-

stance whose essence is necessary and whose

existence is inﬁnite.”) – Conceptually aligned

and correct relative to Spinoza.

This comparative example underscores how ex-

plicit ontology grounding effectively guides the

model’s generative outputs, ensuring signiﬁcantly im-

proved philosophical accuracy, semantic precision,

and historical authenticity.

5 CONCLUSIONS

We presented and validated an ontology-integrated

approach to enhance GPT-based language models for

historically and philosophically sensitive text genera-

tion. Using Baruch Spinoza’s corpus as a case study,

the pipeline combines structured knowledge (Linked

Open Data plus expert curation), ontology-grounded

ﬁne-tuning (triple-to-text integration), and ontology-

conditioned inference via retrieval-augmented gener-

ation (RAG).

A systematic evaluation with corpus splits and

BERTScore, complemented by expert review, quan-

titatively and qualitatively conﬁrms the beneﬁts of

ontology grounding. Explicit ontology integration

reliably improves factual consistency, semantic co-

herence, and stylistic authenticity, surpassing models

without structured knowledge. Concretely, ontology-

grounded ﬁne-tuning yields an ≈11% BERTScore F1

gain over a GPT-2 baseline; ontology-based prompt-

ing further improves GPT-3.5 by 3%. Qualitative as-

sessments show substantial reductions in historical in-

accuracies and conceptual errors. These ﬁndings hold

against standard GPT architectures and publicly avail-

able GPT variants, underscoring the value of struc-

tured knowledge in text generation.

As our study is limited to Spinoza, scal-

ing to broader multilingual settings and sustaining

very long generations (exceeding 1,024 tokens) re-

mains challenging. While ontology grounding im-

proves accuracy, it can still miss salient facts; our

salience-weighted RAG reduces—but does not elimi-

nate—these omissions.

Future work targets: (i) scaling to larger and

denser ontologies, (ii) tighter coverage control and

salience modeling during retrieval and decoding,

and (iii) transfer to multi-author, cross-lingual set-

tings. Overall, the reproducible methodology out-

lined here advances generative modeling for cultural-

heritage applications and opens a path toward robust,

knowledge-aligned long-form generation.

REFERENCES

Allamanis, M., Barr, E. T., Bird, C., and Sutton, C. (2021).

A self-supervised tokenization algorithm for program

text. Empirical Software Engineering, 26:1–41.

Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak,

R., and Ives, Z. (2007). Dbpedia: A nucleus for a web

of open data. In International Semantic Web Confer-

ence (ISWC), pages 722–735. Springer.

Binkley, D., Davis, M., Lawrie, D., and Morrell, C.

(2009). Camelcase splitting for identiﬁer names. In

2009 IEEE 17th International Conference on Program

Comprehension, pages 35–44. IEEE.

Ferreira, T. C., van der Lee, C., van Miltenburg, E., and

Krahmer, E. (2020). Neural data-to-text generation:

A survey. Journal of Artiﬁcial Intelligence Research,

69:1183–1239.

Gardent, C., Shimorina, A., Narayan, S., and Perez-

Beltrachini, L. (2017). Creating training corpora for

nlg micro-planning. In Proceedings of the 55th An-

nual Meeting of the Association for Computational

Linguistics (Volume 1: Long Papers), pages 179–188.

Hedderich, M. A., Lange, L., Adel, H., Str

otgen, J.,

and Klakow, D. (2021). A survey on multilingual

and cross-lingual natural language processing. arXiv

preprint arXiv:2101.04400.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y.

(2019). The curious case of neural text degeneration.

arXiv preprint arXiv:1904.09751.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,

L., and Chen, W. (2021). Lora: Low-rank adapta-

tion of large language models. In Proceedings of the

40th International Conference on Machine Learning.

PMLR.

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E.,

Bang, Y., Madotto, A., and Fung, P. (2023). Survey

of hallucination in natural language generation. ACM

Computing Surveys, 55(12):1–38.

WEBIST 2025 - 21st International Conference on Web Information Systems and Technologies

466

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin,

V., Goyal, N., K

uttler, H., Lewis, M., Yih, W.-t.,

Rockt

aschel, T., et al. (2020). Retrieval-augmented

generation for knowledge-intensive nlp tasks. In

Advances in Neural Information Processing Systems

(NeurIPS), volume 33, pages 9459–9474.

Lin, C.-Y. (2004). Rouge: A package for automatic evalu-

ation of summaries. Technical report, ACL-04 work-

shop. Technical Report, Version 1.5.1.

Lin, Z., Madotto, A., Wu, C.-S., and Fung, P. (2020). Xper-

sona: Evaluating multilingual personalized chatbot. In

Proceedings of the 58th Annual Meeting of the Associ-

ation for Computational Linguistics, pages 730–739.

Lin, Z., Xiong, C., Liu, W., and Sun, B. (2021). Zero-

shot dialogue generation with cross-lingual language

models. In Proceedings of the 2021 Conference on

Empirical Methods in Natural Language Processing

(EMNLP), pages 346–360.

Liu, Z., Chen, Y., Wang, R., and Zhao, H. (2023).

Psychadapter: Adapting large language models for

psychologically-grounded dialogue generation. arXiv

preprint arXiv:2304.08254.

Liu, Z., Sun, M., and Tang, J. (2024). Kelp: Knowledge-

enhanced language model prompting. arXiv preprint

arXiv:2401.12345.

Liu, Z., Zhang, Y., Xie, P., and Sun, M. (2021). Knowledge-

enhanced natural language processing. National Sci-

ence Review, 8(6):nwab029.

Logan IV, R. L., Liu, N. F., Peters, M. E., Gardner, M.,

and Singh, S. (2019). Barack’s wife hillary: Using

knowledge graphs for fact-aware language modeling.

In Proceedings of the 57th Annual Meeting of the As-

sociation for Computational Linguistics (ACL), pages

5962–5971.

Loshchilov, I. and Hutter, F. (2019). Decoupled weight

decay regularization. In International Conference on

Learning Representations (ICLR).

Majumder, N., Hong, P., Banchs, R. E., Li, H., and Fung,

P. (2020). Cross-lingual transfer of persona-based di-

alogue systems. arXiv preprint arXiv:2007.02036.

McGuinness, D. L. and Van Harmelen, F. (2004). OWL Web

Ontology Language Overview. W3C Recommenda-

tion.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).

Bleu: a method for automatic evaluation of machine

translation. In Proceedings of the 40th Annual Meet-

ing of the Association for Computational Linguistics

(ACL), pages 311–318. ACL.

Prudhomme, C., Schaffert, M., and Ponciano, J.-J. (2024).

Odkar: “ontology-based dynamic knowledge acqui-

sition and automated reasoning using nlp, owl, and

swrl”. pages 457–465.

Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-

tence embeddings using siamese bert-networks. In

Proceedings of the 2019 Conference on Empirical

Methods in Natural Language Processing (EMNLP),

pages 3982–3992.

Scialom, T., Dray, P.-A., Lamprier, S., Piwowarski, B., and

Staiano, J. (2021). Questeval: Summarization asks

for fact-based evaluation. In Proceedings of the Con-

ference on Empirical Methods in Natural Language

Processing (EMNLP), pages 6594–6604.

Sennrich, R., Haddow, B., and Birch, A. (2016). Neural

machine translation of rare words with subword units.

In Proceedings of the 54th Annual Meeting of the As-

sociation for Computational Linguistics (ACL), pages

1715–1725.

Shimorina, A. and Gardent, C. (2019). Webnlg challenge:

Overview and evaluation results. Journal of Web Se-

mantics, 59:100495.

Vrande

c, D. and Kr

otzsch, M. (2014). Wikidata: a free

collaborative knowledgebase. Communications of the

ACM, 57(10):78–85.

Yao, L., Liu, H., Yang, J., and Zhao, W. (2023). Kongzi: A

knowledge-augmented language model for historical

narratives. arXiv preprint arXiv:2303.06789.

Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D.,

and Weston, J. (2018). Personalizing dialogue agents:

I have a dog, do you have pets too? In Proceed-

ings of the 56th Annual Meeting of the Association for

Computational Linguistics (Volume 1: Long Papers),

pages 2204–2213.

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and

Artzi, Y. (2020). Bertscore: Evaluating text genera-

tion with bert. In International Conference on Learn-

ing Representations (ICLR).

Zheng, B., Wu, L., Li, Y., Shen, T., Yan, R., and Wang,

X. (2023). Contrastive activation steering for efﬁcient

personalization in language models. arXiv preprint

arXiv:2302.08433.

Zheng, V., Ponti, E. M., Saphra, N., Reiter, N., and Cot-

terell, R. (2021). Does localization help cross-lingual

transfer in low-resource settings? In Findings of

the Association for Computational Linguistics: ACL-

IJCNLP 2021, pages 2830–2845.

Ontology-Grounded Language Modeling: Enhancing GPT-Based Philosophical Text Generation with Structured Knowledge

467