FactTrace: Designing a News Fact-Checking Tool with Large Language Models
Andy Wahyu Mardiansyah, Tiyas Yulita, Susila Windarta, Rahmat Purwoko and I Gede Maha Putra
Cyber Security Engineering, National Cyber and Crypto Polytechnic, Ciseeng, Bogor Regency, Indonesia
Keywords: FactTrace, Large Language Model, Retrieval Augmented Generation, Web Chatbot.
Abstract: The development of information technology has driven digitalization, which has had a positive impact on the
dissemination of information. However, this has also contributed to the increase in hoaxes. The spread of
hoaxes in Indonesia has increased year by year. This can cause division among Indonesian society. The main
problem is the lack of tools to automatically verify news validity. This study developed a news fact-checking
tool using large language models (LLMs) such as GPT-3.5 Turbo 0125, GPT-4o, Llama 3.1-8B, and
DeepSeek-R1-7B, integrated with retrieval-augmented generation (RAG). The approach used to improve
RAG is a hybrid retrieval and reranking method. The news validity verification tool (FactTrace) was designed
using the Waterfall model of the System Development Lifecycle (SDLC) and developed with the n8n platform. Testing was conducted using evaluation metrics to assess the performance of the four LLMs: accuracy, precision, recall, and F1 score. Based on the test results, FactTrace was implemented as a web chatbot using the GPT-4o LLM, which achieved an accuracy of 0.989, precision of 0.991, recall of 0.987, and an F1 score of 0.989.
1 INTRODUCTION
Digitalization in information technology allows
information to be disseminated widely and quickly,
enabling people to access information anywhere
(Kaliyar et al., 2021). This development has had a
positive impact on humanity, but the information
obtained may not be entirely accurate, which could
increase the spread of fake news (Rahmanto et al.,
2023). Hoax news is false information that is believed to be true without validation, causing misunderstanding (Phan et al., 2023).
The Indonesian Ministry of Communication and
Digital Affairs recorded 1,923 pieces of fake news circulating in Indonesia during 2024 (Kementerian Komunikasi
dan Digital, n.d.). This shows that there is a large
spread of hoaxes in Indonesia. Hoaxes can influence
public opinion, leading to divisions between
individuals, groups, or classes (Kuntarto et al., 2021).
One of the efforts that can be made to curb the surge
in hoax news is the monitoring and control of
information in cyberspace by each individual
(Sahputra et al., 2023).
These efforts have been carried out in the form of
research, with several studies attempting to develop
automated news fact-checking tools based on large
language models (LLMs) and retrieval augmented
generation (RAG). One study compared the performance of LLMs without RAG against LLMs integrated with RAG for fact-checking (Nezafat & Samet, 2024). Another compared the GPT-3.5 model with GPT-4, both integrated with RAG, for fact-checking (Baltes et al., 2024). A further study compared a GPT model integrated with RAG against GPT without RAG (Uhm et al., 2025). None of these studies provided a news fact-checking tool tailored to Indonesian-language news.
Based on related research, this study developed a
fact-checking tool called FactTrace with four LLM
models, namely GPT-3.5 Turbo 0125, GPT-4o,
Llama 3.1-8B, and DeepSeek-R1-7B, which are
integrated with RAG. The RAG technique used is a
hybrid retriever that combines a dense retriever
(based on vector representations) and a sparse
retriever (based on keywords), and employs reranking
techniques to reorganize the retrieved information
based on the relevance of the information used (Zhao
et al., 2024). FactTrace was developed using n8n,
employing the Waterfall software development
lifecycle (SDLC) method. The n8n platform is an
open-source automation platform that enables users
to connect various applications and services to
automate business operations by leveraging artificial intelligence capabilities, without requiring users to write source code (n8n, n.d.).
FactTrace was implemented into a web chatbot.
Previous studies have primarily focused on
evaluating the accuracy of Large Language Models
(LLMs) on general natural language processing tasks
or developing chatbot-based fact-checking systems
using a single model. However, there remains a lack
of research directly comparing the performance of
different LLMs in detecting factual versus hoax news,
particularly within the Indonesian news ecosystem.
Moreover, few studies have attempted to bridge the
gap between model benchmarking and real-world
application by integrating evaluation results into a
practical system. This study aims to address these
gaps by conducting a comparative performance
analysis of four state-of-the-art LLMs in identifying
factual and hoax news, and developing an integrated
chatbot system that leverages the best-performing
model for real-time news verification using the
Indonesian language. By combining empirical
evaluation with practical implementation, this work
provides both scientific insights into LLM
performance and a direct contribution to combating
misinformation through a usable tool.
2 METHODOLOGY
This research is a quantitative study that utilizes data
obtained by evaluating the performance of four LLM models using evaluation metrics that include accuracy, precision, recall, and F1 score. The study
was designed using the waterfall SDLC model, which
is a method for developing software with sequential
and interrelated stages (Saravanos & Curinga, 2023).
The stages of research design are shown in Figure 1.
2.1 Analysis
This stage includes a literature review and
observation to design the FactTrace system. The
literature review is used to gather information about
LLM- and RAG-based news fact-checking
technology from 2020 to 2025. The sources used
include journals, conference papers, books, articles,
and websites. Observations were conducted to
determine the components used for FactTrace,
including actual news sources and hoaxes, embedding
models, retriever enhancement methods for RAG,
LLM models, and evaluation metrics for assessing classification performance.
Figure 1: The Waterfall Software Development Lifecycle.
2.2 Design
This stage involves designing the FactTrace system.
FactTrace operates by creating a FactTrace function
that is executed when the user provides an input.
When the user inputs a news topic to be verified, the
input is processed by FactTrace through several
integrated techniques: embedding, which converts
text into vectors; RAG with a hybrid retrieval
approach that combines dense retrievers with sparse
retrievers; and a reranking approach to reorder the
retrieved documents based on their relevance scores.
The system then utilizes an LLM as a generator to
analyze the user’s input in relation to the retrieved
information from the news database. After FactTrace
processes the input, the result is displayed on the
FactTrace interface. The design of FactTrace
incorporates four LLM models, which are compared
to determine the best-performing model for news
classification using evaluation metrics consisting of
accuracy, precision, recall, and F1-score. Figure 2
illustrates the design of FactTrace, which connects
the user input with the FactTrace function.
Figure 2: FactTrace workflow design.
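To make the retrieval design concrete, the following minimal sketch illustrates hybrid retrieval with score fusion followed by reranking. It is illustrative only: FactTrace realizes these steps as n8n nodes, and the libraries (rank_bm25 for the sparse retriever, a sentence-transformers cross-encoder for reranking), the fusion weight, and the reranker model named below are assumptions rather than details of the actual implementation.

```python
# Illustrative sketch of hybrid retrieval + reranking (not the n8n implementation).
# Assumes document vectors and the query vector were produced beforehand by an
# embedding model such as Text-Embedding-3-Small.
import numpy as np
from rank_bm25 import BM25Okapi                 # sparse, keyword-based retriever
from sentence_transformers import CrossEncoder  # reranker (example choice)

def hybrid_retrieve(query, docs, doc_vecs, query_vec, k=10, alpha=0.5):
    """Fuse dense (cosine) and sparse (BM25) relevance to pre-select k candidates."""
    dense = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse = np.array(bm25.get_scores(query.lower().split()))
    sparse = sparse / (sparse.max() + 1e-9)        # scale BM25 scores to [0, 1]
    fused = alpha * dense + (1 - alpha) * sparse   # simple weighted score fusion
    return np.argsort(fused)[::-1][:k]             # indices of the top-k candidates

def rerank(query, docs, candidate_ids, top_n=3):
    """Reorder the candidates by a cross-encoder relevance score."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
    scores = reranker.predict([(query, docs[i]) for i in candidate_ids])
    order = np.argsort(scores)[::-1][:top_n]
    return [docs[candidate_ids[i]] for i in order]
```

The fused candidates capture both semantic and keyword matches; the reranking step then keeps only the passages most relevant to the claim before they are passed to the LLM.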
2.3 Implementation
The FactTrace system was implemented on n8n by
integrating all the necessary techniques and
components. The system was developed using four different LLM models, whose performance in checking news validity was compared using the same news source database.
Before creating FactTrace, a news database was
compiled, comprising factual news and hoax news.
Factual news was obtained from Detik.com and
Kompas Online, the two most widely used news
platforms in Indonesia, according to the Digital News
Report 2025 by the Reuters Institute for the Study of
Journalism. Meanwhile, hoax news was obtained
from TurnBack Hoax, which has official
authorization to publish hoax-related content. The
news data was limited to the period from January 1, 2024, to July 22, 2025. The news source collection process was carried
out using an automated scraping program developed
in the Python programming language and the
Selenium library. The scraping results were stored in
CSV format, with each article labeled as either "fact"
for factual news or "fake" for hoax news. The CSV
file is then converted into a vector using an
embedding technique, which transforms text into a
vector representation. The embedding process is
implemented in the n8n backend using the Text-
Embedding-3-Small model. Next, the vector values
from each news article are automatically entered into
a vector database by n8n. This database is then used
as a knowledge source for the FactTrace system.
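As a rough illustration of this collection step, the sketch below scrapes article headlines with Selenium and writes them to a labeled CSV file. The URLs, CSS selectors, and function names are hypothetical placeholders; the actual scraper uses site-specific logic for Detik.com, Kompas Online, and TurnBack Hoax, and the subsequent embedding and vector-database insertion are handled by n8n rather than by this script.

```python
# Hypothetical sketch of the scraping step (real selectors and URLs differ per site).
# Requires a local ChromeDriver for Selenium.
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_headlines(url, css_selector, label):
    """Collect headline texts from one listing page and tag them with a label."""
    driver = webdriver.Chrome()
    driver.get(url)
    rows = [{"text": el.text.strip(), "label": label}
            for el in driver.find_elements(By.CSS_SELECTOR, css_selector)]
    driver.quit()
    return rows

# "fact" items from a news portal, "fake" items from the hoax-debunking site.
rows = (scrape_headlines("https://news.detik.com/indeks", "h3.media__title", "fact")
        + scrape_headlines("https://turnbackhoax.id/", "h3.entry-title", "fake"))

with open("news_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)
```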
Figure 3 shows the specific workflow of the
FactTrace system implemented through n8n.
Figure 3: FactTrace system workflow.
As shown in Figure 3, the user begins by entering a news topic in the chat column. The input is passed to the FactTrace system. FactTrace first
converts the input into a vector to facilitate
information retrieval using RAG. The embedding
process uses the Text-Embedding-3-Small model.
After being converted into a vector, the input is used
as a reference in information retrieval by RAG with a
hybrid retrieval approach that combines dense
retrievers and sparse retrievers to optimize
information retrieval based on semantic meaning and
keywords from the document (Mandikal & Mooney,
2023). The retrieved information is that which is similar to the user's input, both in semantic meaning (through the vector values) and in matching keywords. After the information is retrieved, it is reranked according to its relevance to the user's input. Reranking helps prioritize the information most similar to the user's input, so that the LLM can process it appropriately. After sorting, the information is processed by the LLM. The LLM works with prompts engineered by the developer; prompt engineering assigns specific tasks to the LLM. In this study, the LLM is
instructed to classify the user input as factual or hoax news, based on information obtained from the news source database. Because the LLM is integrated with RAG, it makes decisions based on the information obtained from the retrieval process. After the LLM processes the input and the retrieved information, it produces output in the form of a hoax or factual news classification, accompanied by a specific explanation grounded in the news source database. This output is displayed in the FactTrace chat column for the user.
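The generation step can be summarized with the short sketch below, which assembles the reranked passages and the user's claim into a prompt that constrains the LLM to a fact-or-hoax verdict with an explanation. The exact system prompt used in FactTrace is not reproduced in this paper; the wording, function name, and use of the OpenAI Python client here are assumptions for illustration.

```python
# Illustrative sketch of the prompt-engineered classification step (not the exact
# FactTrace prompt). Assumes the OPENAI_API_KEY environment variable is set.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are FactTrace, a fact-checking assistant for Indonesian news. "
    "Using ONLY the retrieved news passages provided, classify the user's claim "
    "as 'fact' or 'hoax' and give a brief explanation citing the passages."
)

def classify_claim(claim, passages):
    """Return the model's fact/hoax verdict and explanation for a single claim."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the classification deterministic
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Retrieved passages:\n{context}\n\nClaim: {claim}"},
        ],
    )
    return response.choices[0].message.content
```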
2.4 Testing
This stage involves testing to compare the
performance of four LLM models (GPT-3.5 Turbo
0125, GPT-4o, Llama 3.1-8B, and DeepSeek-R1-7B)
using evaluation metrics that include accuracy, precision, recall, and F1 score. Testing was conducted by providing inputs to the FactTrace system for each LLM model. The inputs were drawn from the database of scraped news sources, so that the output generated by FactTrace for each model could be checked against the ground-truth labels in that database.
Of the 10,073 news items obtained, 8,000 were selected for testing. This input consisted of 4,000 factual news items and 4,000 hoax news items.
Testing was conducted by feeding each item into an automated program connected to the FactTrace system. FactTrace processed each input using the predetermined workflow, and the results obtained for each LLM model were matched against the ground-truth labels in the news source database. This matching was used to compute the confusion matrix for each LLM model, from which the evaluation metrics were calculated. The performance of the models was then compared using the resulting metric values.
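A minimal sketch of this testing loop is given below. The facttrace_predict callable is a hypothetical placeholder for a call to the deployed n8n workflow; the CSV layout follows the "text"/"label" format described in Section 2.3.

```python
# Sketch of the evaluation loop: compare FactTrace outputs with ground-truth labels
# and derive accuracy, precision, recall, and F1 (fact = positive, hoax = negative).
# facttrace_predict is a hypothetical callable wrapping the deployed n8n workflow.
import csv

def evaluate(dataset_path, facttrace_predict):
    tp = tn = fp = fn = 0
    with open(dataset_path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pred = facttrace_predict(row["text"])   # returns "fact" or "fake"
            truth = row["label"]
            if truth == "fact" and pred == "fact":
                tp += 1
            elif truth == "fake" and pred == "fake":
                tn += 1
            elif truth == "fake" and pred == "fact":
                fp += 1                              # hoax classified as fact
            else:
                fn += 1                              # fact classified as hoax
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```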
2.5 Maintenance
This step helps maintain the FactTrace system in the
event of system failures, whether due to feature issues
or errors during input processing. This ensures that
the FactTrace system remains available to users at all
times.
3 RESULTS
In testing the 8,000 inputs for each model, the confusion matrix values were calculated with facts as the positive class and hoaxes as the negative class: a True Positive (TP) when both the ground-truth label and the FactTrace result were facts, a True Negative (TN) when both were hoaxes, a False Positive (FP) when the ground-truth label was a hoax but the result was a fact, and a False Negative (FN) when the ground-truth label was a fact but the result was a hoax. Based on these counts, the evaluation metrics (accuracy, precision, recall, and F1 score) for each model were computed using the standard definitions given below. The resulting values are shown in Table 1.
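For reference, the four metrics follow their standard definitions in terms of these confusion-matrix counts (the formulas below are the usual ones and are not reproduced from the original paper):

```latex
\begin{align}
  \text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
  \text{Precision} &= \frac{TP}{TP + FP} \\
  \text{Recall}    &= \frac{TP}{TP + FN} \\
  \text{F1 score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```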
Table 1: Evaluation metrics for each LLM model.

Model                 Accuracy   Precision   Recall   F1 Score
GPT-3.5 Turbo 0125    0.957      0.972       0.941    0.956
GPT-4o                0.989      0.991       0.987    0.989
Llama 3.1-8B          0.937      0.937       0.938    0.937
DeepSeek-R1-7B        0.940      0.941       0.938    0.940
Table 1 presents the evaluation results. GPT-4o achieved the highest performance (accuracy = 0.989, F1 = 0.989), followed by GPT-3.5 Turbo 0125 (0.956), DeepSeek-R1-7B (0.940), and Llama 3.1-8B (0.937). GPT-4o therefore has the highest score on every evaluation metric, and it was selected as the LLM model for the FactTrace system and implemented in the FactTrace web chatbot. Its use is explained as follows.
Figure 4: FactTrace interface.
Figure 4 shows the initial display of FactTrace. To
use it, enter the summary, keywords, or title of the
news item you want to check in the input field at the
bottom of FactTrace. After entering the input,
FactTrace will process it using the flow shown in
Figure 3.
Figure 5: FactTrace output for hoax news.
Figure 6: FactTrace output for fact news.
After processing the input, FactTrace provides
output in the form of a classification indicating
whether the input is fake news or a fact, along with an
accompanying explanation. Figures 5 and 6 show the FactTrace output for fake news and factual news, respectively.
4 DISCUSSION
Based on Table 1, GPT-4o yields the best results in classifying hoaxes and facts, achieving the highest scores on all evaluation metrics. GPT-3.5 Turbo 0125 ranks second on every metric; its strongest score is precision, indicating that it is particularly effective at minimizing false positives. Next is DeepSeek-R1-7B, which yields stable results across all metrics despite ranking below the two GPT models. Last is Llama 3.1-8B, although its scores also remain stable across all metrics, indicating that the model classifies fake news and factual news consistently.
5 CONCLUSIONS
Based on the analysis conducted, the LLM model
with the best performance among the four models
tested is GPT-4o. This model achieved the highest
and most consistent scores across all evaluation
metrics compared to the other models. This indicates
that GPT-4o is accurate in classifying fake news and
facts, capable of minimizing both false positives and
false negatives, and consistently produces classifications that match the ground-truth labels. For these reasons, GPT-4o was chosen as the LLM model for FactTrace, providing a reliable solution for quickly and accurately verifying the validity of news.
REFERENCES
Baltes, B. A., Cardinale, Y., & Arroquia-Cuadros, B.
(2024). Automated Fact-checking based on Large
Language Models: An application for the press.
Kaliyar, R. K., Fitwe, K., Rajarajeswari, P., & Goswami, A.
(2021). Classification of Hoax/Non-Hoax News
Articles on Social Media using an Effective Deep
Neural Network. Proceedings - 5th International
Conference on Computing Methodologies and
Communication, ICCMC 2021, 935–941.
https://doi.org/10.1109/ICCMC51019.2021.9418282
Kementerian Komunikasi dan Digital. (n.d.). Siaran Pers
No. 08/HM-KKD/01/2025. Retrieved July 20, 2025,
from https://www.komdigi.go.id/berita/siaran-pers/detail/komdigi-identifikasi-1923-konten-hoaks-sepanjang-tahun-2024
Kuntarto, Widyaningsih, R., & Chamadi, M. R. (2021). The
Hoax of Sara (Tribe, Religion, Race, and Intergroup) as
a Threat to The Ideology of Pancasila Resilence. Jurnal
Ilmiah Peuradeun, 9(2), 413–434.
https://doi.org/10.26811/peuradeun.v9i2.539
Mandikal, P., & Mooney, R. (2023). Sparse Meets Dense:
A Hybrid Approach to Enhance Scientific Document
Retrieval. https://priyankamandikal.github.io/
n8n. (n.d.). Welcome to n8n Docs. Retrieved May 29, 2025,
from https://docs.n8n.io/#where-to-start
Nezafat, M. V., & Samet, S. (2024). Fake News Detection
with Retrieval Augmented Generative Artificial
Intelligence. 2024 2nd International Conference on
Foundation and Large Language Models (FLLM), 160–
167.
https://doi.org/10.1109/FLLM63129.2024.10852474
Phan, H. T., Nguyen, N. T., & Hwang, D. (2023). Fake
news detection: A survey of graph neural network
methods. In Applied Soft Computing (Vol. 139).
Elsevier Ltd.
https://doi.org/10.1016/j.asoc.2023.110235
Rahmanto, A. N., Yuliarti, M. S., & Naini, A. M. I. (2023).
Fact Checking dan Digital Hygiene: Penguatan Literasi
Digital sebagai Upaya Mewujudkan Masyarakat Cerdas
Anti Hoaks. PARAHITA : Jurnal Pengabdian Kepada
Masyarakat, 3(2), 77–85.
https://doi.org/10.25008/parahita.v3i2.85
Sahputra, I., Pratama, A., Fachrurrazi, S., & Ari Saptari, M.
(2023). Meningkatkan Semangat Literasi Digital Pada
Generasi Millenial Dalam Penangkalan Berita Hoaks.
Jurnal Malikussaleh Mengabdi, 2(1), 2829–6141.
https://doi.org/10.29103/jmm.v2i1.12358
Saravanos, A., & Curinga, M. X. (2023). Simulating the
Software Development Lifecycle: The Waterfall
Model. Applied System Innovation, 6(6).
https://doi.org/10.3390/asi6060108
Uhm, M., Kim, J., Ahn, S., Jeong, H., & Kim, H. (2025).
Effectiveness of retrieval augmented generation-based
large language models for generating construction
safety information. Automation in Construction, 170.
https://doi.org/10.1016/j.autcon.2024.105926
Zhao, P., Zhang, H., Yu, Q., Wang, Z., Geng, Y., Fu, F.,
Yang, L., Zhang, W., Jiang, J., & Cui, B. (2024).
Retrieval-Augmented Generation for AI-Generated
Content: A Survey. http://arxiv.org/abs/2402.19473