AutoVU-KG: Automated Validation and Updates for Knowledge Graphs
with Web-Search-Augmented LLMs
Amel Gader and Alsayed Algergawy
Chair of Data and Knowledge Engineering, University of Passau, Passau, Germany
Keywords:
Knowledge Graphs, Web-augmented LLMs, KG Update.
Abstract:
Knowledge Graphs (KGs) offer a powerful framework for representing and managing structured information
in many applications. However, when it comes to frequently changing facts, KGs often lag behind real-
world updates. Large Language Models (LLMs) hold promise for enriching and updating KGs, but their
capabilities are limited by static training cutoffs and a tendency to hallucinate or produce outdated information.
To address these concerns, we introduce AutoVU-KG: Automated Validation and Updates for Knowledge
Graphs with Web-Search-Augmented LLMs. Our approach comprises: a classification module that identifies
facts likely to change and therefore needing updates; an LLM-driven validation and update pipeline, enhanced
with real-time web retrieval to ground assertions in current external sources; and an entity matching and
alignment component that ensures updates maintain internal consistency within the KG. Evaluation on subsets
of Wikidata demonstrates that the proposed approach achieves high accuracy and significantly outperforms
vanilla LLMs. Additionally, it reduces the number of outdated facts by up to 60% on one of the datasets. The
source code is available at https://github.com/amal-gader/autovu-kg.
1 INTRODUCTION
A Knowledge Graph (KG) is a structured data model
used for knowledge representation and organization.
It serves as the backbone of web-scale knowledge
and supports a variety of downstream applications,
such as recommender systems, question answering,
and information retrieval (Peng et al., 2023). To en-
sure these applications produce accurate and relevant
results, it is essential that the underlying KGs remain
high-quality and are continuously updated, especially
given the rapid pace at which information evolves and
the increasing dependence on KGs across domains.
One of the most prominent open-source KGs is Wikidata (Vrandečić and Krötzsch, 2014), a huge,
free knowledge base that is built and edited collaboratively through crowdsourcing and community-driven
contributions. Wikidata supports applications like Wikipedia (https://en.wikipedia.org/wiki/Main_Page) and
many semantic web tools. Despite its scale and utility, Wikidata often contains outdated information,
largely due to its reliance on manual updates by users. With millions of entities and facts, many of which
are dynamic in nature, manual maintenance becomes increasingly impractical (Shenoy et al., 2022).
A dynamic fact refers to a piece of information
that changes over time such as the current president of
a country or an organization, the stock price of a firm,
or the list of drugs used for a specific treatment. In
contrast, static facts, like the birthplace of a football
player or the capital of a country, remain generally
unchanged over time. Dynamic facts are especially
common in financial and medical domains, where up-
dates may be required on a weekly, daily, or even real-
time basis.
Since their emergence, Large Language Models (LLMs), including closed-source models such as GPT-4
(https://platform.openai.com/docs/models/gpt-4.1) and open-source models like LLaMA (Touvron et al., 2023)
and DeepSeek (https://www.deepseek.com/), have shown impressive performance across a wide range of Natural
Language Processing (NLP) tasks. Their integration into tasks such as KG construction, completion, and
refinement has shown great promise (Zhu et al., 2024). The synergy between KGs and LLMs has been explored
bidirectionally, with growing focus on LLM-enhanced KGs, where LLMs are used to enrich, verify, or
complete knowledge representations (Agrawal
et al., 2023; Feng et al., 2023; Wei et al., 2024; Yao
et al., 2025).
However, LLMs also come with notable limita-
tions. They are prone to hallucinations, generat-
ing false information and are constrained by a fixed
knowledge cutoff, which limits their usefulness in dy-
namic or time-sensitive applications (Liu et al., 2024;
Mousavi et al., 2024; Sriramanan et al., 2024). These
limitations reduce their reliability as autonomous
agents for updating and maintaining KGs. Figure 1 depicts an example of an outdated dynamic fact on
Wikidata which an LLM may not be able to update.
To address this challenge, Retrieval-Augmented
Generation (RAG) systems have emerged as a hybrid
solution. RAG architectures combine the generative
power of LLMs with real-time retrieval from external,
up-to-date sources such as web search engines, docu-
ment databases, or structured repositories to mitigate
knowledge gaps (Asai et al., 2024; Xie et al., 2024).
Figure 1: A motivating example illustrating the limitations
of LLMs in handling outdated facts in knowledge graphs.
The prompt was submitted to LLaMA3-70B, DeepSeek,
and GPT-3.5 at the time of the experiment, and all mod-
els returned similar responses. (Note: Answers may differ
if models have since been updated.)
Building on this idea, we propose a framework
that leverages LLMs augmented with web-retrieved
information to check, validate, and update outdated
or erroneous facts in KGs. In our approach, the LLM
acts as a reasoning agent, resolving conflicts that may
arise in retrieved data by analyzing associated meta-
data such as timestamps, titles, and source credibility.
Our main contributions are:
- We present a lightweight, practical framework, with publicly available code, for the automated validation and updating of dynamic facts in encyclopedic knowledge graphs such as Wikidata.
- We provide a set of SPARQL queries to extract time-sensitive facts, which serve as benchmarks for evaluating update methods.
- We evaluate the framework on these benchmarks and conduct a comparative analysis of different model configurations in terms of update accuracy and efficiency.
2 RELATED WORK
In this section, we review the most relevant research
related to our study. First, we examine works that ex-
plore the integration of web search with Large Lan-
guage Models (LLMs). Next, we discuss previous ef-
forts aimed at automating the process of Knowledge
Graph (KG) updating.
2.1 Web-augmented LLMs
Multiple studies have aimed to train LLMs to mimic human-like web search behavior. One of the earliest
efforts is WebGPT by OpenAI (Nakano et al., 2022), which introduced a web-browsing environment for a
fine-tuned version of GPT-3. The model was trained using two types of data: demonstrations, consisting of
human-generated web search sessions used during supervised learning, and comparisons, which involve human
feedback on model-generated answers and are used in the reinforcement learning phase to optimize performance.
Two other approaches follow a similar paradigm.
WebGLM (Liu et al., 2023) builds on the WebGPT
framework by using a more efficient model architec-
ture and replacing costly human feedback with user
likes from online Q&A as a quality signal. Similarly,
AutoWebGLM (Lai et al., 2024) adopts a two-step
pipeline: an interaction step, where information is re-
trieved from the web, followed by an action step, in
which the language model generates a response to the
query. Its training also relies on reinforcement learn-
ing.
UNI-WEB (Li et al., 2023) belongs to the same
line of research on web-enhanced LLMs. A key
strength of this work is the introduction of a self-
assessment mechanism that allows the model to eval-
uate its confidence in its own answers, with uncer-
tainty quantified using entropy. When the confidence
is low, it queries web search APIs to retrieve addi-
tional information and improve its answers.
2.2 Automating the Update of KGs
Extensive research has been dedicated to augment-
ing LLMs with web search; however, fewer studies
have addressed the challenge of automating updates
to knowledge graphs.
A notable study is (Tang et al., 2019) which aims
to leverage continuous news streams to dynamically
update and enrich KGs. The proposed approach in-
volves training an encoder-decoder model, where the
encoder integrates relational graph attention mechanisms with text-based attention to ensure that
information from the news snippet propagates along the KG structure.
The encoder is used to generate representations for
entities, and for each entity pair, a multi-layer percep-
tron classifier evaluates the probability of a new link
being added or an existing link being removed. To de-
termine the specific type of relation between entities,
the model employs DistMult (Yang et al., 2015) as the
decoder.
Another related work is (Babaiha et al., 2023); it focuses on enriching biomedical KGs by automatically
extracting causal relationships from biomedical literature. The approach begins with a keyword-based
search of PubMed (https://pubmed.ncbi.nlm.nih.gov/) abstracts using search APIs
to retrieve relevant literature. Information extraction
is then performed on the collected abstracts, followed
by the training of an NLP-based extractor relying on
named entity recognition and relation extraction tech-
niques to identify meaningful biomedical entities and
their interactions. Extracted relations are evaluated by
human experts.
Both methods start with information retrieval
from external sources and update the KG when a rele-
vant change is detected. However, our goal is to build
a system capable of proactively identifying and up-
dating all potentially outdated facts in the knowledge
graph.
We plan to merge the goals of the aforementioned
works by updating KGs with web-augmented LLMs.
We are aware of a related work that shares the same
objective (Hatem et al., 2024), but we are taking a dif-
ferent approach for knowledge retrieval and exploring
additional aspects.
3 METHODOLOGY
In this section, we define the task and detail the dif-
ferent steps of our framework.
3.1 Preliminaries
Knowledge Graphs (KGs) evolve continuously over
time. As noted by (Polleres et al., 2023), this evolu-
tion can be analyzed by treating time either as data
(explicitly encoded in the triples) or as metadata (at-
tached to the triples externally).
3.1.1 KG Evolution Dimensions
Temporal Knowledge Graphs explicitly incorporate
time into the triple structure, typically represented as
quadruples (h,r,t,τ), where h, r, and t denote the head
entity, relation, and tail entity of the factual triple, and
τ represents the associated timestamp or time interval
that indicates the validity period of a fact.
In contrast, Time-varying Knowledge Graphs rep-
resent temporal information as metadata, indicating
the transaction time or the time of insertion, rather
than embedding it within the triple itself. Time-
varying KGs can be further categorized into:
- Dynamic KGs, which preserve the full history of changes as a set G = {(h, r, t, τ_i) | τ_i ∈ T}, where τ_i is a metadata field representing the time at which the triple was added or modified.
- Versioned KGs, which store discrete snapshots of the graph at specific time points as G = {G_{t_1}, G_{t_2}, ..., G_{t_n}}.
3.1.2 KG Update Paradigms
To update a KG, different strategies can be em-
ployed. For instance, Wikidata primarily relies on
community-driven contributions, where users manu-
ally edit facts. Bots can assist with certain routine
tasks (e.g., property validation), but a fully automated
update mechanism is not in place. We propose a
framework in which Web-augmented Large Language
Models (LLMs) automatically perform KG updates
by retrieving, validating, and integrating new facts
from external sources.
Given an incoming new fact (h, r, t_1) observed at time τ_1, the update strategy depends on the KG type:
- For dynamic KGs, the system retains historical information. If a prior version (h, r, t_0) was added at τ_0, the updated KG becomes G_{τ_1} = G_{τ_0} ∪ {(h, r, t_1, τ_1)}.
- For versioned KGs, older facts are replaced in the new snapshot. The KG at τ_1 is updated as G_{τ_1} = (G_{τ_0} \ {(h, r, t_0)}) ∪ {(h, r, t_1)}.
In this work, we assume a versioned KG setting.
The dynamic scenario can be addressed analogously,
differing in that new facts are created with associated
timestamps rather than removing outdated entries.
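To make the two paradigms concrete, the following minimal Python sketch (our own illustration; the set-of-tuples representation and function names are hypothetical, not code from the AutoVU-KG repository) applies an incoming fact under each strategy:

# Minimal sketch of the two update paradigms (illustrative only).

def update_dynamic(kg, h, r, t_new, timestamp):
    """Dynamic KG: keep history by appending the new timestamped fact."""
    return kg | {(h, r, t_new, timestamp)}

def update_versioned(kg, h, r, t_old, t_new):
    """Versioned KG: the new snapshot replaces the outdated triple."""
    return (kg - {(h, r, t_old)}) | {(h, r, t_new)}

# Example: a company's CEO changes from "Alice" to "Bob".
g0 = {("AcmeCorp", "chief_executive_officer", "Alice")}
g1 = update_versioned(g0, "AcmeCorp", "chief_executive_officer", "Alice", "Bob")
print(g1)  # {('AcmeCorp', 'chief_executive_officer', 'Bob')}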
Figure 2: AutoVU-KG: Overview of the Proposed Framework. The pipeline consists of three main components: (1) clas-
sification of facts as static or dynamic, (2) extraction of up-to-date facts from the web, and (3) alignment and integration of
entities into the KG.
3.2 AutoVU-KG: Proposed Framework
Our overarching goal, as stated above, is to pro-
pose a robust framework for validating and updating
facts in Knowledge Graphs using a web-augmented
LLM-based approach. The process begins with the
classification of facts into static and dynamic cate-
gories. We then extract a subgraph containing only
dynamic facts. For each of these, we retrieve relevant
context from the web and attach appropriate metadata.
This information is incorporated into the LLM prompt
to assess or generate updated facts. Finally, newly
predicted entities and relationships are integrated into
the Knowledge Graph, and existing links are updated
accordingly. Figure 2 describes the main steps of the
pipeline.
3.2.1 Classification Module
Knowledge Graphs (KGs) often contain both static
and dynamic facts, with the latter being subject to
change over time. The goal of our framework is to
track and update these dynamic facts as they evolve.
Rather than examining every fact in the KG, we clas-
sify facts based on their associated relation type. Our
assumption is that the relation can indicate whether
a fact is likely to change. For instance, the rela-
tion born in is typically static, while occupation or
plays f or are dynamic, as they can change over time.
This classification step is crucial because it signifi-
cantly narrows the scope of facts that need valida-
tion, reducing both computational cost and reliance
on web retrieval. Although a language model agent
could theoretically decide when to fetch web data, our
classification-based approach proves to be more effi-
cient and produces more accurate results. Consider,
for instance, the HumanWiki dataset (Rosso et al.,
2021), which contains 221 relations in total. By clas-
sifying these relations, we can limit the search space
to just 47.8%, focusing only on facts associated with
the 145 dynamic relations out of the 221.
For this task, we use a pre-trained Large Language Model (LLM) as the backbone: we feed it the relation to
classify, together with few-shot examples and a clear definition of static and dynamic relations in the
prompt, as depicted in Figure 2.
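For illustration, such a few-shot classification prompt could look like the sketch below (a hypothetical prompt of our own; the exact wording used in our experiments may differ):

# Hypothetical few-shot prompt for static/dynamic relation classification.
FEW_SHOT_PROMPT = """\
A relation is STATIC if the fact it expresses rarely changes over time
(e.g., place of birth). It is DYNAMIC if the fact is expected to change
(e.g., a currently held position). Answer with one word: static or dynamic.

Relation: born in                 -> static
Relation: plays for               -> dynamic
Relation: capital of              -> static
Relation: chief executive officer -> dynamic
Relation: {relation}              ->"""

def build_classification_prompt(relation):
    return FEW_SHOT_PROMPT.format(relation=relation)

print(build_classification_prompt("chairperson"))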
3.2.2 Web-Augmented LLMs for Up-to-Date Fact Retrieval
It has been widely acknowledged that Large Lan-
guage Models (LLMs) are limited by their fixed train-
ing cutoff and can significantly benefit from access
to external sources. In time-sensitive tasks, LLMs
often struggle to provide up-to-date information and
may hallucinate facts (Liu et al., 2024; Sriramanan
et al., 2024; Mousavi et al., 2024). To address this, we integrate Google Search APIs (https://serper.dev/)
to retrieve relevant, real-time context. We use standard search engines, like Google, since they have strong
ranking capabilities and take the freshness of the data into account
(https://developers.google.com/search/docs/appearance/ranking-systems-guide). We reformulate a natural
language query emphasizing the need for recent information, e.g., "Who is the current CEO of <company
name>?" or "Who is the current governor of <national bank name>?". This query is used to retrieve relevant
passages, which may include candidate answers or indicators. We concatenate the top-k results (with k being
a tunable hyperparameter based on the model's maximum input context length). From each retrieved result, we
extract the webpage title, publication date (if available), and a summarized version of the content. The
title and date are crucial, as they help the model assess the credibility and recency of the source, which
is especially important when dealing with conflicting or recently updated information.
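A minimal sketch of this retrieval step is shown below, assuming the Serper search endpoint and its usual JSON response fields (organic, title, snippet, date); the context formatting is our own illustration:

import requests

def retrieve_web_context(query, api_key, k=10):
    """Fetch the top-k Serper search results and format them as LLM
    context, keeping title and publication date as metadata."""
    resp = requests.post(
        "https://google.serper.dev/search",
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        json={"q": query},
        timeout=30,
    )
    resp.raise_for_status()
    results = resp.json().get("organic", [])[:k]
    lines = []
    for rank, r in enumerate(results, start=1):
        lines.append(
            f"[{rank}] title: {r.get('title', 'n/a')} | "
            f"date: {r.get('date', 'unknown')} | "
            f"content: {r.get('snippet', '')}"
        )
    return "\n".join(lines)

# Example query emphasizing recency:
# context = retrieve_web_context("Who is the current CEO of LISI?", api_key="...")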
To reduce hallucinations and enhance the reliability of the results, we prompt the model to provide a
confidence score for each prediction, which can be High, Medium, or Low. Prior research has shown that LLMs
are often capable of estimating their own certainty, a capability known as black-box Confidence Elicitation
(Cash et al., 2024; Xiong et al., 2023). We leverage this self-assessed confidence score to determine which
facts to update in the knowledge graph, prioritizing changes supported by high-confidence predictions.
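For illustration, the answer prompt and response parsing could look as follows (a hypothetical template and helper of our own; the exact prompt used in our experiments may differ):

# Hypothetical prompt template with black-box confidence elicitation.
ANSWER_PROMPT = """\
Use ONLY the web results below to answer. Prefer recent, credible
sources; each result carries its title and publication date as metadata.

Web results:
{context}

Question: {question}

Reply exactly in the format:
Answer: <entity>
Confidence: <High|Medium|Low>"""

def parse_response(text):
    """Extract the predicted entity and the self-assessed confidence."""
    answer = confidence = None
    for line in text.splitlines():
        if line.startswith("Answer:"):
            answer = line.split(":", 1)[1].strip()
        elif line.startswith("Confidence:"):
            confidence = line.split(":", 1)[1].strip()
    return answer, confidence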
3.2.3 KG Update: Entity Matching and
Alignment
After receiving a response from the model, we extract both the predicted tail entity t_pred and the
associated confidence score c. The prediction corresponds to the tail entity in a triple (h, r, t), where
the head h and relation r are given by the original query. We then assess the confidence score c to
determine how to proceed. If c is high, we check whether t_pred already exists in the current knowledge
graph G using fuzzy matching to handle minor name variations.
If no matching entity is found in G, we normalize t_pred using the Wikidata API
(https://www.wikidata.org/w/api.php) to retrieve its canonical identifier, and a new node is created
accordingly. If a matching entity exists, the graph is updated by replacing the existing triple
(h, r, t_old) with the new one (h, r, t_pred), where t_old is the original tail entity. No update is
performed if t_pred is identical to t_old. In cases where the confidence score c is not high, the
prediction is flagged for manual review to ensure data integrity. Algorithm 1 summarizes this procedure.
Input: t_pred, t_old, c, G
Output: G_updated
if c is high then
    Perform fuzzy matching with existing entities in G;
    if t_pred ∉ G then
        Normalize t_pred using the Wikidata API;
        Create a new node for t_pred;
    else if t_pred ∈ G and t_pred ≠ t_old then
        Update G: replace (h, r, t_old) with (h, r, t_pred);
    else if t_pred = t_old then
        return (no update required);
    end
else
    Flag for manual review;
end
Algorithm 1: Entity Matching and Alignment in KG Update.
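A possible Python rendering of Algorithm 1 is sketched below, using difflib for the fuzzy matching step and the public Wikidata wbsearchentities endpoint for normalization (a simplified illustration under our own assumptions, not the exact AutoVU-KG implementation):

import difflib
import requests

def normalize_with_wikidata(name):
    """Look up a canonical Wikidata identifier (QID) for an entity label."""
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbsearchentities", "search": name,
                "language": "en", "format": "json"},
        timeout=30,
    )
    hits = resp.json().get("search", [])
    return hits[0]["id"] if hits else None

def apply_update(kg, h, r, t_old, t_pred, confidence, cutoff=0.9):
    """Sketch of Algorithm 1: update the triple (h, r, t_old) only for
    high-confidence predictions, after fuzzy entity matching."""
    if confidence != "High":
        return kg, "flagged for manual review"
    entities = {t for (_, _, t) in kg}
    matches = difflib.get_close_matches(t_pred, entities, n=1, cutoff=cutoff)
    if not matches:
        # t_pred is not in G: normalize it and create a new node.
        qid = normalize_with_wikidata(t_pred)
        kg = (kg - {(h, r, t_old)}) | {(h, r, t_pred)}
        return kg, f"new node created ({qid}); triple updated"
    if matches[0] == t_old:
        return kg, "no update required"
    kg = (kg - {(h, r, t_old)}) | {(h, r, matches[0])}
    return kg, "triple updated"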
Table 1: Subset Statistics and Outdated Fact Rates.

Subset                  Size    % Outdated Facts
CEOs (Companies)        339     46.4
Bank Governors          25      52.2
Nat'l Football Teams    50      100.0
Int'l Org. Leaders      325     61.6
4 EXPERIMENTAL EVALUATION
In this section, we present our experimental setup, implementation details, and findings.
4.1 Settings
Datasets. The HumanWiki dataset, introduced by
(Rosso et al., 2021), is derived from Wikidata by
extracting facts involving entities of type human
(wd:Q5). This knowledge graph contains 221 dis-
tinct relations. Using our classification module, we
categorize these relations into static and dynamic.
For our experiments, we focus on a subset of dy-
namic relations, namely: chief executive officer
(wdt:P169), chairperson (wdt:P488), and officeholder
(wdt:P1308).
To construct focused sample datasets, we extract
subgraphs composed of facts where these dynamic
relations are used as predicates. This process is car-
ried out using the Wikidata Query Service (WDQS, https://query.wikidata.org/), which supports
SPARQL-based querying over the Wikidata knowledge graph. We design SPARQL queries to retrieve the most
recent and relevant facts corresponding to these roles.
We extract four distinct subgraphs: (1) Current
CEOs of companies (2) Current governors of central
banks (3) Current leaders of international organiza-
tions (4) Top-ranked national football teams.
Table 1 summarizes the statistics of the extracted
subsets used in our experiments. The National Foot-
ball Teams Ranking subset is included to illustrate the
varying degrees of data dynamicity, with 100% of the
records requiring updates. For the remainder of our
experiments, we focus on the first three datasets, as
they are less prone to rapid changes.
Models. For the classification module, we employ the pre-trained large language model (LLM) LLaMA3.1-70B
(Touvron et al., 2023). For the new fact retrieval module, we evaluate three different LLMs: (1)
LLaMA3.1-70B, (2) DeepSeek-R1-Distill-Llama-70B, a distilled variant of the R1 model based on
LLaMA3.3-70B-Instruct, and (3) GPT-4o-mini, a cost-efficient, lightweight model. The open-source models are
augmented with web search capabilities via the Serper API (https://serper.dev/), retrieving up to 10
relevant passages along with their metadata (including title and publication date). The GPT-4o-mini model
leverages a recently introduced web search preview tool with a low search context size
(https://platform.openai.com/docs/guides/tools-web-search). The models are denoted llama3.1, r1.llama, and
gpt-4o-mini, respectively.
Evaluation. We evaluate model performance using
the standard accuracy metric, which is the ratio of
correct predictions to the total number of predictions.
Each prediction is manually assessed by one human
annotator, who verifies its correctness and temporal
relevance by consulting reliable web sources.
4.2 Main Results
Table 2 presents the results of the retrieval step, com-
paring the performance of the three models across the
three datasets, both with and without integrated web
search. As expected, the models perform poorly with-
out web access, often returning outdated or incorrect
answers, or generic responses such as "unknown" or "not available", since the queries concern the current
status of entities.
Introducing contextual web snippets significantly
improves performance. For instance, the r1.llama
model achieves an accuracy boost of up to 48.8% on
the CEOs dataset. On the Bank Governors dataset,
the gpt-4o-mini model reaches 100% accuracy, likely
due to the dataset’s small size and the recency of the
reference sources used for verification.
On the Int’l Org. Leaders dataset, the models
perform similarly, with gpt-4o-mini leading slightly
by 0.9% over r1.llama, which itself outperforms
llama3.1 by 1.4%. Notably, r1.llama slightly outper-
forms gpt-4o-mini on the CEOs dataset.
The stronger results on the CEOs dataset, com-
pared to the Int’l Org. Leaders dataset, may be at-
tributed to the higher public visibility of CEOs, mak-
ing them more likely to appear in pretraining data.
In contrast, the leaders of international organizations
are less well-known, presenting a greater challenge
for the models. Without web search, performance on
this dataset is particularly low, ranging from 18.9%
to 27.4% with r1.llama performing best and llama3.1
the worst, reflecting the difficulty the models face
in retrieving accurate information about these lesser-
known figures.
4.3 Model Comparison
As shown in Table 3, the models differ notably in
their response times and costs. Specifically, r1.llama
takes approximately 19 seconds on average to gen-
erate an answer, compared to about 2.5 seconds for
llama3.1 and 4 seconds for gpt-4o-mini. This differ-
ence can be attributed to the format of the returned
output: r1.llama includes a detailed explanation of the
reasoning behind its answer. Although we instructed
the model to omit this explanation, it still provides it
within special tokens <think> and </think>, which
adds to the processing time. The Google Search Serper API (https://serper.dev/), which we use alongside the
r1.llama and llama3.1 models, costs $1 per 1,000 queries.
trast, leveraging the web search feature provided by
OpenAI with the gpt-4o-mini model incurs a cost of
around $30 per 1,000 queries, broken down into $25
for the search itself and approximately $5 for model
input and output tokens.
In terms of the percentage of correctly updated facts relative to the total number of outdated facts, all
three models perform well across the datasets. However, gpt-4o-mini shows an edge on the Int'l Org. Leaders
dataset with 64.8%, while r1.llama performs best on the CEOs dataset with 90.4%. For the Bank Governors
dataset, all models achieve the same update accuracy of 100%, effectively updating all outdated facts in
that sample.
Table 2: Fresh Fact Retrieval Accuracy (%) of Models Across Datasets With and Without Web Search Integration.

Subset               Model         +web    -web
CEOs                 llama3.1      86.7    45.7
                     r1.llama      92.6    43.8
                     gpt-4o-mini   91.1    47.9
Bank Governors       llama3.1      95.7    52.2
                     r1.llama      95.5    52.2
                     gpt-4o-mini   100.0   60.9
Int'l Org. Leaders   llama3.1      70.1    18.9
                     r1.llama      72.5    27.4
                     gpt-4o-mini   73.4    21.5
Table 3: Efficiency and Update Effectiveness of Models Across Datasets.

             Correct Updates (%)
Model        Bank Gov.   CEOs   Int'l Leaders   Time (s)   Cost/$1K
llama3.1     100.0       81.5   59.3            2.5        $1
r1.llama     100.0       90.4   62.8            19.0       $1
gpt-4o-mini  100.0       84.1   64.8            4.0        $30
Query: Who is the current CEO of LISI?

Initial LLM Prediction (w/o web): Florent Germain, Confidence: Medium

Web Search Results:
- Governance - LISI Group: Lionel Rivet listed as CEO
- LinkedIn: Emmanuel Viellard as Directeur Général
- Craft.co: Emmanuel Neildez – CEO of LISI Aerospace
- Bloomberg Markets (2023): Viellard confirmed as CEO since 2016
- FII.FR: Viellard listed as CEO of LISI SA
- LISI Automotive: François Liotard as division CEO

Model Reasoning: Some names refer to division heads (e.g., Aerospace or Automotive). Emmanuel Viellard
appears consistently across the most recent and credible sources as the CEO of the overall LISI Group.

Final Answer: Emmanuel Viellard, Confidence: High

Figure 3: Summarized reasoning by r1.llama in handling conflicting CEO data.
4.4 Confidence Elicitation Reliability
The confidence score serves as a key indicator of the
model’s certainty in its predictions. To assess the re-
liability of these scores, we analyze the proportion
of false predictions made with high confidence ver-
sus those made with low confidence, as presented in
Table 4.
Our analysis reveals that llama3.1 exhibits the
lowest rate of high-confidence false predictions
(False-High) and the highest rate of low-confidence
false predictions (False-Low). In contrast, gpt-4o-
mini shows the highest False-High rate, reaching
90.7% on the Int’l Org. Leaders dataset, which sug-
gests a tendency toward overconfidence. This vari-
ation can be explained by the number of references
each model uses when generating its final predictions.
Both r1.llama and llama3.1 consider around 10 differ-
ent passages from diverse sources, which may some-
times conflict and thus provide the model with cues
of uncertainty. Conversely, gpt-4o-mini, limited by a
smaller search context, relies on only one or a few ref-
erences, often resulting in higher confidence scores.
As part of our analysis of confidence elicitation
and interpretability, we present a case study illus-
trating the model’s reasoning process (Figure 3). In
this example, the r1.llama model correctly identifies
the CEO when provided with contextual information
Table 4: Confidence score reliability: percentage of false predictions made with high confidence
(False-High) versus with low confidence (False-Low).

Model         Dataset              % False-High   % False-Low
r1.llama      CEOs                 54.5           4.5
              Int'l Org. Leaders   51.7           19.1
llama3.1      CEOs                 26.7           13.3
              Int'l Org. Leaders   27.1           38.5
gpt-4o-mini   CEOs                 86.7           3.3
              Int'l Org. Leaders   90.7           0.0
from the web. Despite encountering conflicting can-
didate names across sources, the model uses accom-
panying text and metadata such as titles and publica-
tion dates to make an accurate prediction with high
confidence.
4.5 Limitations and Discussion
In this work, we present a proof of concept for au-
tomating Knowledge Graph (KG) updates using web-
augmented large language models (LLMs). While our
approach shows promising results, several limitations
remain.
First, the datasets used in our experiments are rel-
atively small and do not reflect the scale of real-world
KGs, which can contain billions of triples. However,
our classification step helps narrow the scope by fo-
cusing on a subset of relations, making the task more
manageable. In practice, practitioners can further pri-
oritize validation by targeting older facts, high-impact
entities, or frequently queried relations.
Second, certain cases present notable challenges
for the models, particularly when entity representa-
tions evolve over time. For instance, "OL Group" was rebranded as "The Eagle Football Group", which can
lead to confusion in entity matching. Similarly, ambiguous abbreviations complicate disambiguation. In the
CEOs dataset, "Ada" originally refers to Ada Motors, yet the web results also included unrelated entities
sharing the same abbreviation, like the American Diabetes Association.
Third, the confidence elicitation mechanism used
in this study has shown limitations in reliability. More
robust alternatives such as entropy-based uncertainty
estimation, attention-based confidence scores, or en-
semble methods could be employed, particularly in
domains where precision and trustworthiness are crit-
ical.
Fourth, while gpt-4o-mini generally outperforms
the other models in terms of accuracy, it presents chal-
lenges related to transparency and control. Its re-
liance on internal sources within the OpenAI ecosys-
tem limits our ability to inspect or influence the ref-
erences used during inference. In contrast, models
like r1.llama and llama3.1 are augmented using ex-
ternal web search APIs, offering greater control and
traceability. Additionally, gpt-4o-mini is significantly
more expensive compared to the other alternatives.
In summary, the key challenges faced by our ap-
proach involve scaling to real-world KG sizes and en-
suring the reliability and interpretability of both the
web context and the model’s reasoning process.
5 CONCLUSIONS
Our proposed framework, AutoVu-KG, highlights the
potential of web-augmented large language models
(LLMs) for automating the validation and update of
knowledge graphs (KGs). Due to their inherent train-
ing data cutoffs, LLMs alone cannot be fully trusted
for up-to-date or accurate predictions. To address this,
we integrate real-time external sources, such as the
web, to enhance their reliability. Our experiments
demonstrate that open-source solutions can match,
and perform on par with closed-source counterparts,
while offering significant advantages in cost and in-
terpretability. This work underscores the powerful ca-
pabilities of LLM agents when combined with mech-
anisms for external control and enrichment, particu-
larly in the domains of data management, quality as-
surance, and knowledge graph evolution. As a future
direction, we aim to scale our approach to larger and
domain-specific knowledge graphs.
ACKNOWLEDGEMENTS
This work was partially supported by the University
of Passau through Project Kapital 1527 (Title 42951
UT 02).
REFERENCES
Agrawal, G., Kumarage, T., Alghamdi, Z., and Liu, H.
(2023). Can knowledge graphs reduce hallucinations
in llms?: A survey. arXiv preprint arXiv:2311.07914.
Asai, A., Wu, Z., Wang, Y., Sil, A., and Hajishirzi, H.
(2024). Self-RAG: Learning to retrieve, generate, and
critique through self-reflection. In The Twelfth Inter-
national Conference on Learning Representations.
Babaiha, N. S., Elsayed, H., Zhang, B., Kaladharan, A.,
Sethumadhavan, P., Schultz, B., Klein, J., Freuden-
sprung, B., Lage-Rupprecht, V., Kodamullil, A. T.,
Jacobs, M., Geissler, S., Madan, S., and Hofmann-
Apitius, M. (2023). A natural language processing
system for the efficient updating of highly curated
pathophysiology mechanism knowledge graphs. Ar-
tificial Intelligence in the Life Sciences, 4:100078.
Cash, T. N., Oppenheimer, D. M., and Christie, S. (2024).
Quantifying uncertainty: Testing the accuracy of llms’
confidence judgments.
Feng, C., Zhang, X., and Fei, Z. (2023). Knowledge solver:
Teaching llms to search for domain knowledge from
knowledge graphs. arXiv preprint arXiv:2309.03118.
Hatem, S., Khoriba, G., Gad-Elrab, M. H., and ElHelw,
M. (2024). Up to date: Automatic updating knowl-
edge graphs using llms. Procedia Computer Science,
244:327–334. 6th International Conference on AI in
Computational Linguistics.
Lai, H., Liu, X., Iong, I. L., Yao, S., Chen, Y., Shen, P.,
Yu, H., Zhang, H., Zhang, X., Dong, Y., and Tang,
J. (2024). Autowebglm: A large language model-
based web navigating agent. In Proceedings of the
30th ACM SIGKDD Conference on Knowledge Dis-
covery and Data Mining, KDD ’24, page 5295–5306,
New York, NY, USA. Association for Computing Ma-
chinery.
Li, J., Tang, T., Zhao, W. X., Wang, J., Nie, J.-Y., and Wen,
J.-R. (2023). The web can be your oyster for improv-
ing large language models.
Liu, F., Liu, Y., Shi, L., Huang, H., Wang, R., Yang, Z.,
Zhang, L., Li, Z., and Ma, Y. (2024). Exploring and
evaluating hallucinations in llm-powered code gener-
ation.
Liu, X., Lai, H., Yu, H., Xu, Y., Zeng, A., Du, Z., Zhang,
P., Dong, Y., and Tang, J. (2023). Webglm: Towards
an efficient web-enhanced question answering system
with human preferences.
Mousavi, S. M., Alghisi, S., and Riccardi, G. (2024). Dy-
know: Dynamically verifying time-sensitive factual
knowledge in llms.
Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L.,
Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saun-
ders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger,
G., Button, K., Knight, M., Chess, B., and Schul-
man, J. (2022). Webgpt: Browser-assisted question-
answering with human feedback.
Peng, C., Xia, F., Naseriparsa, M., and Osborne, F. (2023).
Knowledge graphs: Opportunities and challenges. Ar-
tificial Intelligence Review, 56(11):13071–13102.
Polleres, A., Pernisch, R., Bonifati, A., Dell’Aglio, D., Do-
briy, D., Dumbrava, S., Etcheverry, L., Ferranti, N.,
Hose, K., Jiménez-Ruiz, E., et al. (2023). How does
knowledge evolve in open knowledge graphs? Trans-
actions on Graph Data and Knowledge, 1(1):11–1.
Rosso, P., Yang, D., Ostapuk, N., and Cudré-Mauroux, P.
(2021). Reta: A schema-aware, end-to-end solution
for instance completion in knowledge graphs. In Pro-
ceedings of the Web Conference 2021, pages 845–856.
Shenoy, K., Ilievski, F., Garijo, D., Schwabe, D., and
Szekely, P. (2022). A study of the quality of wikidata.
Journal of Web Semantics, 72:100679.
Sriramanan, G., Bharti, S., Sadasivan, V. S., Saha, S., Kat-
takinda, P., and Feizi, S. (2024). Llm-check: Inves-
tigating detection of hallucinations in large language
models. In Globerson, A., Mackey, L., Belgrave, D.,
Fan, A., Paquet, U., Tomczak, J., and Zhang, C., edi-
tors, Advances in Neural Information Processing Sys-
tems, volume 37, pages 34188–34216. Curran Asso-
ciates, Inc.
Tang, J., Feng, Y., and Zhao, D. (2019). Learning to up-
date knowledge graphs by reading news. In Inui,
K., Jiang, J., Ng, V., and Wan, X., editors, Proceed-
ings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th Inter-
national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 2632–2641, Hong
Kong, China. Association for Computational Linguis-
tics.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux,
M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro,
E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E.,
and Lample, G. (2023). Llama: Open and efficient
foundation language models.
Vrandečić, D. and Krötzsch, M. (2014). Wikidata: a
free collaborative knowledgebase. Commun. ACM,
57(10):78–85.
Wei, Y., Huang, Q., Kwok, J. T., and Zhang, Y. (2024).
Kicgpt: Large language model with knowledge in
context for knowledge graph completion. arXiv
preprint arXiv:2402.02389.
Xie, W., Liang, X., Liu, Y., Ni, K., Cheng, H., and
Hu, Z. (2024). Weknow-rag: An adaptive approach
for retrieval-augmented generation integrating web
search and knowledge graphs.
Xiong, M., Hu, Z., Lu, X., Li, Y., Fu, J., He, J., and Hooi, B.
(2023). Can llms express their uncertainty? an empir-
ical evaluation of confidence elicitation in llms. arXiv
preprint arXiv:2306.13063.
Yang, B., Yih, W.-t., He, X., Gao, J., and Deng, L. (2015).
Embedding entities and relations for learning and in-
ference in knowledge bases.
Yao, L., Peng, J., Mao, C., and Luo, Y. (2025). Exploring
large language models for knowledge graph comple-
tion. In ICASSP 2025-2025 IEEE International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP), pages 1–5. IEEE.
Zhu, Y., Wang, X., Chen, J., Qiao, S., Ou, Y., Yao, Y.,
Deng, S., Chen, H., and Zhang, N. (2024). Llms for
knowledge graph construction and reasoning: Recent
capabilities and future opportunities. World Wide Web,
27(5):58.