Towards Developing an Agent-Based Framework for Validating the
Trustworthiness of Large Language Models
Johannes Bubeck, Janick Greinacher, Yannik Langer, Tobias Roth and Carsten Lanquillon
Business Information Systems, Heilbronn University of Applied Sciences, Heilbronn, Germany
https://orcid.org/0000-0002-9319-1437
Keywords:
Large Language Models, ChatGPT, Prompting, Trustworthiness, Validation, Verification.
Abstract:
Large language models (LLMs) have revolutionized the field of generative artificial intelligence and strongly
affect human-computer interaction based on natural language. Yet, it is difficult for users to assess how
trustworthy LLM outputs are. Therefore, this paper develops an agent-based framework by exploring approaches,
methods, and the integration of external data sources. The framework contributes to AI research and usage
by enabling future users to consider LLM outputs more efficiently and critically.
1 INTRODUCTION
Large Language Models (LLMs) have significantly
transformed the fields of artificial intelligence (AI)
and natural language processing (NLP), leading to the
creation of advanced conversational models. With the
increasing complexity and capabilities of these mod-
els, the necessity to ensure their outputs are trust-
worthy and verifiable becomes paramount (Ali et al.,
2023). Trust is increasingly recognized as an es-
sential element in user-AI interactions, especially as
more opaque and complex machine learning models
become widespread in AI (Jacovi et al., 2021).
Moreover, the field of LLMs is experiencing an
exponential growth in the number of papers published
each month (Zhao et al., 2023). As LLMs gain more
prominence, the validation of their outputs for trust-
worthiness becomes increasingly crucial. This trend
aligns with the strong focus on ethical considerations
in AI and machine learning, urging the need to ensure
trustworthiness in both current and future AI systems
and applications (Future of Life Institute (FLI), 2021).
A viable solution to this challenge is the valida-
tion of LLM outputs. This validation process ensures
that the models operate as intended, producing accu-
rate and reliable results. Validation, in this context,
involves verifying the trustworthiness of a model’s
outputs against established criteria or standards (Bow-
man and Dahl, 2021; Srivastava et al., 2022).
GPT-4 from OpenAI is one of the most advanced
language models currently available. It clearly exemplifies the critical need for validating LLM outputs.
Although ChatGPT, with its web interface that facili-
tates easy access to different GPT versions, has shown
remarkable ability in generating coherent and contex-
tually appropriate responses to user queries, instances
of the model delivering inconsistent outputs have also
been reported (Jang and Lukasiewicz, 2023).
Hence, validating the outputs of LLMs is vital to
ensure their reliability across various scenarios. This
leads to our research question:
How can we construct a framework that en-
hances LLM outputs by incorporating exist-
ing validation approaches and external data
sources to increase their trustworthiness?
To advance our research, we have identified the fol-
lowing subsidiary questions:
1. What suitable validation approaches and external
data sources already exist in the literature?
2. What are a reasonable architecture and process for
developing such a framework?
3. How can a prototype be developed based on this
framework?
The primary contribution of this paper is the devel-
opment of an agent-based framework that augments
the outputs of LLMs, providing users with feedback
on the trustworthiness of these outputs. We have de-
veloped a prototype to test and evaluate our proposed
framework. To address these research questions, we
have created the following artifacts:
a literature review identifying suitable validation
approaches utilizing external data,
a multi-agent validation framework,
a prototype implementation of the framework, and
an evaluation of the prototype.
The structure of this paper is organized as follows:
First, we discuss the scientific foundations and theo-
retical background. This is followed by a detailed pre-
sentation of our research methodology and our find-
ings. Finally, we conclude with a discussion of our
results and an exploration of potential future research
avenues.
2 THEORETICAL BACKGROUND
2.1 Large Language Models
LLMs are advanced AI models with an extensive
number of parameters. They are pre-trained on vast
text corpora using self-supervised learning methods,
and are often further refined with task-specific exam-
ples in supervised settings. Their development began
to significantly advance around 2018, with both their
parameter count and performance showing exponen-
tial growth since then (Birhane et al., 2023).
Current LLMs employ deep neural networks and
leverage sophisticated machine learning techniques,
such as the transformer model (Vaswani et al., 2017).
They utilize substantial computational resources and
large datasets, including internet content, enabling
them to develop a comprehensive understanding of
language and respond in contextually relevant and co-
herent ways (Teubner et al., 2023).
A major advantage of LLMs is their versatility
in tasks like text composition, question answering,
translation, chatbot operations, and program code
generation. They are distinguished by their scala-
bility and high parameter counts, which contribute
to their cutting-edge performance in natural language
processing. Notably, LLMs can initially learn from
unlabeled data, a self-supervised learning approach
that yields impressive outcomes and broadens their
applicability without extensive supervised training
(Vaswani et al., 2017; Huang et al., 2022).
However, there are several concerns associated
with LLMs. These include challenges in captur-
ing scientific interpretation and meaning, potential
neglect of value judgments in scientific texts, risks
of producing inaccurate content, so-called hallucina-
tions, and the potential erosion of trust in the peer re-
view process. These issues underscore the need for
cautious and critical assessment, as well as further re-
search to ensure responsible and appropriate use of
LLMs in scientific settings (Birhane et al., 2023).
2.2 Trust in Artificial Intelligence
Trust is a fundamental factor in the adoption and ef-
fective utilization of AI technologies. This is par-
ticularly true for the complex models used in natu-
ral language processing, where high levels of trust
are imperative. Trust enables researchers, develop-
ers, and users to depend on the accuracy and relia-
bility of AI-generated results. It lays the groundwork
for applying AI in diverse areas, including machine
translation, text generation, and information retrieval.
To establish trust, an AI system’s transparency, inter-
pretability, and accountability need to be thoroughly
addressed. This is essential to ensure that an AI sys-
tem is not only reliable but also ethically responsible
(Toreini et al., 2020).
However, the task of ensuring AI systems’ trust-
worthiness is fraught with challenges, primarily due
to the inherent black box nature of these systems. In
particular, LLMs with their complex deep neural net-
work architectures and extensive parameter spaces are
trained on massive datasets. The opacity of these deep
learning systems presents a significant challenge, as
their decision-making processes typically remain elu-
sive. The lack of clarity and understanding of these
processes casts doubt on the extent to which these
systems can be trusted. As such, transparency is
paramount in building trust in AI, ensuring that users
can have confidence in their outcomes (von Eschen-
bach, 2021).
3 RESEARCH METHODOLOGY
For the scientific design of our research process, we
adhere to the established methodology outlined in
(Peffers et al., 2007). Our research process is seg-
mented into five distinct steps, as depicted in Figure 1
and described in greater detail below.
Our literature review methodology is guided by
(Webster and Watson, 2002) and (vom Brocke et al.,
2009). We conducted title searches using the
terms [“fake checking” AND (“methods” OR “ap-
proaches”)] and [“fact checking” AND (“methods”
OR “approaches”)], with no date restrictions, across
several literature databases including IEEE Xplore
Digital Library, ScienceDirect, SpringerLink, and
Emerald Insight. In addition, we conducted forward and
backward searches (Webster and Watson, 2002).
[Figure 1 summarizes the five-step research process: a literature search (vom Brocke et al. 2009; Webster and Watson 2002)
across 4 literature and conference databases, narrowing 111 identified papers to 64 unique papers (exclusion of duplicates),
25 relevant papers (analysis of the abstracts), and 34 final papers (forward and backward search); a literature analysis
(Mayring 2010; Cruzes and Dyba 2011) assigning text segments from the 34 papers and clustering 21 approaches connected
to external data into feature extraction approaches (5): TF-IDF, Word2Vec, Fasttext, GloVe, BERT; and modelling
approaches (16): graph, knowledge graph, logistic regression, linear SVM, KNN, neural network, RNN, CNN, LSTM,
anomaly detection, decision tree/random forest, naïve Bayes, XG-Boost, question answering, cosine similarity, evidence
retrieval; the development of the framework for validating the trustworthiness of responses from LLMs (based on
Peffers et al. 2007); and the implementation, evaluation, and demonstration of the prototype (Peffers et al. 2007).]
Figure 1: Overview of the research process (Peffers et al., 2007).
In the second step, the relevant papers are read thoroughly, and the approaches they describe are implemented. For the coding process of the approaches, the procedure of qualitative content analysis as proposed by (Mayring and Fenzl,
2010) is performed. All publications are analyzed to
identify approaches and external data sources. Only
text segments describing approaches connected to ex-
ternal data are considered here. Based on this, we
identify several text segments according to the the-
matic synthesis process by (Cruzes and Dyba, 2011).
These segments are condensed into various character-
istics and assigned to high-level categories (Mayring
and Fenzl, 2010). Subsequently, in the third step, we
summarize the segments and the associated publica-
tions in a concept matrix (Webster and Watson, 2002).
The fourth step is dedicated to conceptualizing the
framework. This entails analyzing the identified ap-
proaches related to external data sources and integrat-
ing them within the framework. A critical aspect of
this conceptual design is the recognition that not all
identified approaches could be considered in the ini-
tial implementation of the framework. However, the
agent-based structure that we propose will offer the
flexibility to easily incorporate additional approaches
in future iterations.
In the final step, the design and development as well
as the demonstration and evaluation phases
are completed according to (Peffers et al., 2007).
The prototype is being implemented, evaluated, and
demonstrated based on our proposed framework. Par-
ticular emphasis is placed on the structure of the
agents, ensuring that they align well with the over-
arching framework and meet the objectives set out in
our research.
4 FINDINGS
This section delves into the detailed presentation and
discussion of the results and findings. It encompasses
the literature review, the conceptual design of the
framework, and the development of the prototype.
4.1 Literature Review
The literature search process resulted in 111 publica-
tions. After a thorough review of the abstracts and
keywords, we narrowed the selection down to 25 publications that
were directly relevant to our research focus. The for-
ward and backward searches increased the final num-
ber of pertinent publications to 34.
We identified two main categories: feature ex-
traction and modeling approaches. These categories
form the foundation for conceptualizing and develop-
ing both the framework and the prototype. As de-
picted in Figure 1 (step 2), the category feature ex-
traction comprises five approaches, while modeling
approaches includes 16.
We conducted a comprehensive analysis and eval-
uation of these approaches to assess their suitability.
Some approaches are incompatible with our current
research focus, which lies primarily on the conceptualization of a prototypical framework. Thus, we
carefully selected suitable approaches and integrated
them into the initial iteration of our project. Nonethe-
less, the knowledge gained from the literature review
is expected to be extremely valuable for future re-
search and the continued development of the frame-
work in its subsequent iterations.
4.2 Framework
The development of our framework is based on the
results and analyses of approaches identified in the
literature review. We opted for an agent-based frame-
work as the foundational conceptual model because
it allows seamless integration of the approaches we
identified without the need for complex adaptations.
This ensures the framework’s modularity and scala-
bility. We consider agents as independent entities,
each with specific tasks, capable of interacting within
the framework. The analysis of the trustworthiness of
LLM outputs is facilitated by individual agents, each
calculating scores for different tasks. A neural net-
work, referred to as the scoring agent, is then used to
compute a weighted overall score.
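As a minimal sketch of this agent abstraction and the weighted overall score (with illustrative names that do not correspond to the prototype's actual code), the core idea can be expressed as follows:

    from abc import ABC, abstractmethod

    class Agent(ABC):
        """An independent entity with a specific task that returns a score in [0, 1]."""

        @abstractmethod
        def run(self, llm_output: str) -> float:
            ...

    def overall_score(scores: list[float], weights: list[float]) -> float:
        """Weighted overall trustworthiness score computed by the scoring agent."""
        return sum(w * s for w, s in zip(weights, scores))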
For the frontend, we selected Streamlit due to its
ability to support the rapid development of visually
appealing prototypes. Figure 2 illustrates our frame-
work, which will be discussed in more detail in the
subsequent sections.
The OpenAI API component serves as a gateway
to the OpenAI API, enabling interaction with GPT-
3.5-Turbo or GPT-4 for text generation. This interface
facilitates sending requests to the OpenAI API and
receiving the corresponding text responses. As a part
of our research, we approached OpenAI to share our
research methodology and were subsequently granted
access to GPT-4, allowing us to test our framework
with its more sophisticated outputs.
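A minimal sketch of such a gateway, assuming the pre-1.0 openai Python package and illustrative function and variable names, could look as follows:

    import os
    import openai

    openai.organization = os.environ["OPENAI_ORG_ID"]  # organization details, also used for billing
    openai.api_key = os.environ["OPENAI_API_KEY"]

    def generate(prompt: str, model: str = "gpt-4") -> str:
        """Send a request to the chat completion endpoint and return the text response."""
        response = openai.ChatCompletion.create(
            model=model,  # "gpt-3.5-turbo" or "gpt-4"
            messages=[{"role": "user", "content": prompt}],
        )
        return response["choices"][0]["message"]["content"]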
The checkworthiness agent assesses whether statements are worth checking, utilizing the OpenAI API to process each request. Its run method analyzes the model's response
and produces a list indicating a statement’s verifiabil-
ity along with detailed explanations.
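A possible sketch of this agent is shown below; the prompt wording and output handling are hypothetical, since the actual prompt and format are not published:

    class CheckworthinessAgent:
        """Decides whether an LLM output contains claims that can be verified."""
        PROMPT = (
            "For each factual claim in the following text, state whether it is "
            "verifiable against external sources and briefly explain why:\n\n{output}"
        )

        def __init__(self, llm):
            self.llm = llm  # e.g. the generate() gateway sketched above

        def run(self, llm_output: str) -> str:
            return self.llm(self.PROMPT.format(output=llm_output))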
Core is the central component of the LLM valida-
tion framework. It orchestrates various agents, such
as the checkworthiness agent, information extraction
agent, website agent, Wikipedia agent, Google API
agent, GPT queries agent, and scoring agent. Trig-
gered from the Streamlit frontend after confirming a
statement’s trustworthiness, the core’s main function
orchestrates the sequence flow. This enables the inte-
gration of data from different agents and acts as the
central control point of the framework.
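A simplified sketch of this orchestration is given below; the function signature and the agent names are illustrative assumptions rather than the prototype's actual interface:

    def validate(llm_output: str, agents: dict) -> dict:
        """Run the validation pipeline; `agents` maps names to objects with a run() method."""
        info = agents["information_extraction"].run(llm_output)  # papers, authors, websites, keywords
        scores = {
            name: agent.run(info)
            for name, agent in agents.items()
            if name not in ("information_extraction", "scoring")
        }
        total = agents["scoring"].run(list(scores.values()))  # weighted overall score
        return {"agent_scores": scores, "trustworthiness": total}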
In the information extraction agent, an interface
connects to the OpenAI API to conduct specific
searches for papers, authors, websites, and keywords
used in the model’s outputs. This agent processes this
information with comprehensive error handling to ad-
dress potential inaccuracies or incorrect feedback.
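A possible sketch of this extraction step, with a hypothetical prompt requesting JSON output and simple error handling, is shown below:

    import json

    class InformationExtractionAgent:
        """Asks the LLM to list the papers, authors, websites, and keywords it referred to."""
        PROMPT = (
            "List the papers, authors, websites, and keywords mentioned in the following "
            'text as JSON with the keys "papers", "authors", "websites", "keywords":\n\n{output}'
        )

        def __init__(self, llm):
            self.llm = llm  # e.g. the generate() gateway sketched above

        def run(self, llm_output: str) -> dict:
            raw = self.llm(self.PROMPT.format(output=llm_output))
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                # Error handling for inaccurate or malformed feedback from the model.
                return {"papers": [], "authors": [], "websites": [], "keywords": []}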
The Wikipedia agent extracts information using
keywords from the information extraction agent as
search queries. It scans the initial articles using these
keywords to identify authors’ names and paper titles.
It then compares these details with known authors and
titles. After this comparison, a scoring mechanism is
applied to measure the level of similarity or relevance
between them.
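The following sketch approximates this behavior using the public MediaWiki search API and a fuzzy string comparison; matching against search-result titles and the ratio-based scoring are simplifying assumptions, not the agent's exact algorithm:

    from difflib import SequenceMatcher
    import requests

    def wikipedia_score(keywords: list[str], authors: list[str], titles: list[str]) -> float:
        hits = []
        for keyword in keywords:
            response = requests.get(
                "https://en.wikipedia.org/w/api.php",
                params={"action": "query", "list": "search", "srsearch": keyword, "format": "json"},
                timeout=10,
            )
            hits += [item["title"] for item in response.json()["query"]["search"]]
        known = authors + titles
        if not hits or not known:
            return 0.5  # neutral score when no useful information is found
        matches = sum(
            1 for name in known
            if any(SequenceMatcher(None, name.lower(), hit.lower()).ratio() > 0.8 for hit in hits)
        )
        return matches / len(known)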
The website agent scores a given LLM output
based on matching websites and a thorough analysis
of their content and status. The agent sends
requests to individual websites and analyzes the re-
sulting HTTP status codes as they indicate whether
a website is functioning correctly, undergoing redi-
rection, offering informative content, or facing errors.
The final score is determined based on the number of
matching websites found and their status categoriza-
tion.
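A minimal sketch of this status-based scoring is shown below; the exact weighting of the status categories is an assumption:

    import requests

    def website_score(urls: list[str]) -> float:
        if not urls:
            return 0.5  # neutral score when no websites are mentioned
        total = 0.0
        for url in urls:
            try:
                status = requests.get(url, timeout=10, allow_redirects=False).status_code
            except requests.RequestException:
                status = None
            if status is not None and 200 <= status < 300:
                total += 1.0  # functioning and offering content
            elif status is not None and 300 <= status < 400:
                total += 0.5  # undergoing redirection
            else:
                total += 0.0  # error or unreachable
        return total / len(urls)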
The Google API agent leverages the Google
Books API to review books and their respective au-
thors. It examines authors and book titles by dispatching API requests, evaluating the outcomes, and
compiling distinct lists of verified authors and books.
Finally, the agent calculates a score based on the fre-
quency of successful Google Books API queries.
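The following sketch illustrates such a check against the public Google Books API v1; the query format and the pairing of authors with titles are simplifying assumptions:

    import requests

    def google_books_score(authors: list[str], titles: list[str]) -> float:
        pairs = list(zip(authors, titles))
        if not pairs:
            return 0.5  # neutral score when nothing can be checked
        verified = 0
        for author, title in pairs:
            response = requests.get(
                "https://www.googleapis.com/books/v1/volumes",
                params={"q": f'intitle:"{title}" inauthor:"{author}"'},
                timeout=10,
            )
            if response.ok and response.json().get("totalItems", 0) > 0:
                verified += 1
        return verified / len(pairs)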
The SpringerLink agent gathers information
through an advanced search on the SpringerLink web-
site, explicitly searching for certain authors and ti-
tles. The results it receives are then compared with
the data extracted by the information extraction agent,
which includes details of authors and papers. Similar
to the process in the Wikipedia agent, a score is calcu-
lated for this agent based on the number of successful
matches found between the authors and papers iden-
tified by the information extraction agent.
The GPT queries agent employs the OpenAI
model to generate trust scores. This agent initializes
the OpenAI object and uses a predefined text as a tem-
plate for evaluating trust. Unlike other agents that de-
rive their scores from external data, this agent calculates the LLM's trustworthiness score internally. The
LLM assesses this score using information extracted
by the information extraction agent. Therefore, its inputs are solely its own previously generated outputs,
without any external data.
Figure 2: Overview of identified approaches.
The scoring agent and the optimizer component
are responsible for calculating the overall trustwor-
thiness score and optimizing it. To do so, the scor-
ing agent activates a neural network module with the
individual agents’ scores. The overall trustworthi-
ness score is computed as a weighted sum of the in-
dividual agent scores. A neural network optimizes
the weights by predicting the specific weight for each
agent’s score based on the available data. Users can
contribute feedback on the trustworthiness score’s ac-
curacy, which is then incorporated into the neural net-
work’s training data. This retraining occurs when the
framework is initially launched.
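A minimal sketch of this weighting mechanism, assuming PyTorch and an illustrative network architecture that is not specified in the paper, is shown below:

    import torch
    import torch.nn as nn

    class WeightPredictor(nn.Module):
        """Predicts one weight per agent score; weights are normalized via softmax."""
        def __init__(self, n_agents: int = 5):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_agents, 16),
                nn.ReLU(),
                nn.Linear(16, n_agents),
            )

        def forward(self, scores: torch.Tensor) -> torch.Tensor:
            return torch.softmax(self.net(scores), dim=-1)

    def weighted_total(model: WeightPredictor, agent_scores: list) -> float:
        """Overall trustworthiness score as a weighted sum of the agent scores."""
        s = torch.tensor(agent_scores, dtype=torch.float32)
        with torch.no_grad():
            weights = model(s)
        return float((weights * s).sum())

    def retrain(model: WeightPredictor, history: list) -> None:
        """Fit the weight predictor to user feedback collected at framework start-up.
        `history` holds (agent_scores, feedback_score) pairs with feedback in [0, 1]."""
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()
        for agent_scores, feedback in history:
            s = torch.tensor(agent_scores, dtype=torch.float32)
            predicted = (model(s) * s).sum()
            loss = loss_fn(predicted, torch.tensor(feedback, dtype=torch.float32))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()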
Currently, the overall trustworthiness score in-
cludes assessments from five agents, each determin-
ing its score individually. These individual scores
are based on how closely their findings align with the
LLM’s output, for which the trustworthiness score is
being calculated. The scores range from 0 to 1, with
0.5 indicating a neutral position. Each score reflects
the degree to which the LLM’s output is considered
trustworthy. If an agent is unable to gather useful
information, it adapts by assigning a neutral score
of 0.5. To maintain transparency and comparability,
the individual agents’ algorithms for calculating their
scores always follow a consistent pattern based on the
number of matches and their respective quality.
4.3 Implementation of the Prototype
The implementation of the prototype closely follows
the conceptual framework’s description. Figure 3 dis-
plays a section of our sequence diagram. Given the
prototype’s complexity, which includes a multitude of
agents and their interactions, the sequence diagram is
essential for structuring the prototype and the code
flow and enhancing its comprehensibility.
With Python being the programming language of
our choice, an object-oriented approach was adopted,
and each agent is implemented as a distinct object or
class. JSON was employed to consolidate the results
and scores, with only the core module having admin-
istrative rights. Once the framework is initialized,
the core module assumes full control over all other
modules, orchestrating the individual results from the
agents.
The validation process is initiated only if the
LLM’s statement is deemed verifiable by the check-
worthiness agent. Access to the LLM is facilitated
through the OpenAI API, where requests are sent to
the designated chat completion endpoint. Authentica-
tion to the API is secured using organization details
and API keys, which are also essential for API billing
purposes. This combination of authentication meth-
ods is crucial to ensure that the model operates within
the context of the initially provided prompt, which is
key to our prompt design strategy. As mentioned in
the previous section, we gained access to the GPT-4
model from OpenAI for this research project, follow-
ing a request and a description of our research focus.
LLMVerificationFramework
User
OpenAIAPI
JSON
Core CheckworthinessAgent InformationExtractionAgent WikipediaAgent WebsiteAgent GoogleAPIAgent SpringerLinkAgent GPTQueriesAgent ScoringAgent ScoreOptimizer NeuralNetwork
TrainData
Prompt
Prompt
PromptandOutput
LLMOutput
TriggerCheckworthinessScan
CheckLLMOutput
Checkable/NotCheckable
Checkable/NotCheckable
StartFramework
LLMResponse
ExtractionPrompts
DeliverextractedInformation
ReturnPapers,Authors,WebsitesandKeywords
Papers,Authors,Websites,Keywords
Keywords,Authors,Papers
WikipediaScore
Papers,Authors,Score
Websites
WebsiteScore
Websites,Score
Authors,Papers
GoogleScore
Papers,Authors,Score
Authors,Papers
SpringerlinkScore
Papers,Authors,Score
JSONData
request
SendScore
GPTQueriesScore
Score
SendAgent-Scores
SendAgent-Scores
PredictWeights
ReturnpredictedWeights
ReturnpredictedWeights
ReturncalculatedtotalScore
TotalScore
TrustworthyScoreandadditionalInformation
DisplaytrustworthyScoreandadditionalInformation
RatingAccuracyofTrustworthinessScore
AppendScores,WeightsandFeedbackScore
TrainModel
Figure 3: Part of the developed sequence diagram.
5 EVALUATION
This section focuses on evaluating the developed
framework and its prototype implementation. It aims
to determine the feasibility of successfully evaluating
the trustworthiness of LLM outputs.
The user workflow and the prototype are depicted
in Figure 4, using the example of querying whether
the 2016 U.S. presidential election was rigged. The
prototype, created in Streamlit, functions as a local
web server. Users can input any text or question into
the start screen as requests for an LLM. Once a prompt
is submitted, it is sent to the OpenAI API, and the
response from OpenAI’s GPT-4 model is displayed.
Initially, the user has the option to check if the out-
put is worthy of further scrutiny. A reasoned response
is then provided, indicating whether the output can
be checked. If the LLM’s output is deemed uncheck-
able, the user can initiate a new query. Otherwise, the
user can activate the framework, which then starts the
complex validation process in the background. This
involves the individual agents’ data processing and
information analysis. Once all agents have completed
their tasks and the overall trustworthiness score is cal-
culated, it is displayed appropriately.
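A simplified Streamlit sketch of this workflow is shown below; the widget labels and the helper functions generate, is_checkworthy, and validate are illustrative assumptions corresponding to the sketches above, not the prototype's actual code:

    import streamlit as st

    st.title("LLM Trustworthiness Validation")

    prompt = st.text_input("Enter a prompt for the LLM")
    if prompt:
        answer = generate(prompt)                # OpenAI API gateway
        st.write(answer)
        if st.button("Check checkworthiness"):
            st.write(is_checkworthy(answer))     # checkworthiness agent
        if st.button("Validate trustworthiness"):
            result = validate(answer)            # core orchestrates the agents
            st.metric("Trustworthiness score", round(result["trustworthiness"], 2))
            st.json(result["agent_scores"])      # underlying agent information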
Besides the output results, users can view all the
underlying information used in the agents’ calcula-
tions, as illustrated in Figure 5. The detailed results
are provided for questions such as: “Was the moon
landing real?”
Access to detailed information about the output
and the corresponding score enables users to not only
receive the LLM’s answer to their initial prompt, but
also gain additional insights about it. This approach
empowers users to delve into and scrutinize the rele-
vant sources and authors associated with the informa-
tion provided.
6 CONCLUSION
Our research methodology has been an effective foun-
dation for developing an agent-based framework to
validate the trustworthiness of LLMs. In this section,
we summarize and critically analyze the results, while
also identifying potential avenues for future research.
Figure 4: Interaction process of the user with the framework.
Figure 5: Detailed results of the calculated trustworthiness score.
In our analysis of the literature, we identified 21
potential approaches for prototypically implementing
an agent-based framework. By adopting the selected
methodologies and integrating external data sources,
our prototype framework computes a trustworthiness
score. This score is calculated based on the vary-
ing weights assigned to the evaluations from individ-
ual agents. It serves as a measure for assessing the
reliability of OpenAI’s GPT-4 model in generating
responses. The trustworthiness score, which ranges
from 0 to 1, acts as a metric to determine the de-
gree of trust that can be placed in the model’s outputs.
A score of 0 indicates a low level of trustworthiness,
whereas a score of 1 represents an exceptionally high
level of trustworthiness.
The agent-based framework we have conceptu-
alized offers a flexible and scalable architecture for
future development. Evaluating the system through
our implemented prototype has yielded valuable in-
sights into its functionality and usability. Since not
all agent approaches identified in our literature review
have been included in the prototype, there is room for
integrating additional agents in future iterations.
As stated above, our research methodology was
effective in developing the implementation of the
agent-based framework. However, this work is not
without its limitations. Additional approaches could
have been explored for use within the framework to
validate the trustworthiness of LLM outputs. While
we have successfully incorporated various approaches
in the form of agents, there is currently no scientifi-
cally robust justification for the selection of these ap-
proaches. Moreover, our framework is currently lim-
ited to six agents due to its prototype status. There-
fore, future developments could involve expanding
the framework by adding new agents, potentially de-
veloping and implementing a separate agent for each
method identified in our literature analysis.
To the best of our knowledge, this paper presents
a novel approach for validating the trustworthiness of
LLMs. The findings here lay the groundwork for sub-
sequent research. Future studies might focus on en-
hancing the proposed trustworthiness score calculation. For instance, our neural network could undergo
further iterations to incorporate diverse data sources,
thereby optimizing the weighting of agents in deter-
mining the score. Additionally, future work should
involve reconfiguring the architecture of the prototype
and subjecting it to empirical evaluation.
In the scope of our work, we have successfully
created a prototypical agent-based framework for as-
sessing the trustworthiness of LLMs. This prototype
establishes a robust foundation and signals promising
directions for future advancements. With this imple-
mentation, we have made a contribution in developing
a validation framework for LLM outputs, marking a
vital step towards its potential future application.
REFERENCES
Ali, S., Abuhmed, T., El-Sappagh, S., Muhammad, K.,
Alonso-Moral, J. M., Confalonieri, R., Guidotti, R.,
Del Ser, J., Díaz-Rodríguez, N., and Herrera, F.
(2023). Explainable Artificial Intelligence (XAI):
What we know and what is left to attain Trust-
worthy Artificial Intelligence. Information Fusion,
99:101805.
Birhane, A., Kasirzadeh, A., Leslie, D., and Wachter, S.
(2023). Science in the age of large language models.
Nature Reviews Physics, 5(5):277–280. Number: 5
Publisher: Nature Publishing Group.
Bowman, S. R. and Dahl, G. (2021). What Will it Take to
Fix Benchmarking in Natural Language Understand-
ing? In Proceedings of the 2021 Conference of the
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo-
gies, pages 4843–4855, Online. Association for Com-
putational Linguistics.
Cruzes, D. S. and Dyba, T. (2011). Recommended steps for
thematic synthesis in software engineering. In 2011
international symposium on empirical software engi-
neering and measurement, pages 275–284. IEEE.
Future of Life Institute (FLI) (2021). The Artificial Intelli-
gence Act.
Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., and
Han, J. (2022). Large Language Models Can Self-
Improve. arXiv:2210.11610 [cs].
Jacovi, A., Marasović, A., Miller, T., and Goldberg, Y.
(2021). Formalizing Trust in Artificial Intelligence:
Prerequisites, Causes and Goals of Human Trust in
AI. In Proceedings of the 2021 ACM Conference on
Fairness, Accountability, and Transparency, FAccT
’21, pages 624–635, New York, NY, USA. Associa-
tion for Computing Machinery.
Jang, M. and Lukasiewicz, T. (2023). Consistency Analysis
of ChatGPT. arXiv:2303.06273 [cs].
Mayring, P. and Fenzl, T. (2010). Qualitative inhaltsanalyse
[qualitative content analysis]. Qualitative Forschung
Ein Handbuch (Qualitative Research: A Handbook),
pages 468–475.
Peffers, K., Tuunanen, T., Rothenberger, M. A., and
Chatterjee, S. (2007). A Design Science Re-
search Methodology for Information Systems Re-
search. Journal of Management Information Sys-
tems, 24(3):45–77. Publisher: Routledge eprint:
https://doi.org/10.2753/MIS0742-1222240302.
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M.,
Abid, A., Fisch, A., Brown, A. R., Santoro, A.,
Gupta, A., Garriga-Alonso, A., et al. (2022). Be-
yond the imitation game: Quantifying and extrapolat-
ing the capabilities of language models. arXiv preprint
arXiv:2206.04615.
Teubner, T., Flath, C. M., Weinhardt, C., van der Aalst, W.,
and Hinz, O. (2023). Welcome to the Era of ChatGPT
et al. Business & Information Systems Engineering,
65(2):95–101.
Toreini, E., Aitken, M., Coopamootoo, K., Elliott, K., Ze-
laya, C. G., and van Moorsel, A. (2020). The rela-
tionship between trust in AI and trustworthy machine
learning technologies. In Proceedings of the 2020
Conference on Fairness, Accountability, and Trans-
parency, FAT* ’20, pages 272–283, New York, NY,
USA. Association for Computing Machinery.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention Is All You Need. arXiv:1706.03762
[cs].
vom Brocke, J., Simons, A., Niehaves, B., and Reimer,
K. (2009). Reconstructing the Giant: On the Impor-
tance of Rigour in Documenting the Literature Search
Process. In European Conference on Information Sys-
tems (ECIS) 2009.
von Eschenbach, W. J. (2021). Transparency and the Black
Box Problem: Why We Do Not Trust AI. Philosophy
& Technology, 34(4):1607–1622.
Webster, J. and Watson, R. T. (2002). Analyzing the Past to
Prepare for the Future: Writing a Literature Review.
MIS Quarterly, 26(2):xiii–xxiii.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou,
Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., Du, Y.,
Yang, C., Chen, Y., Chen, Z., Jiang, J., Ren, R., Li,
Y., Tang, X., Liu, Z., Liu, P., Nie, J.-Y., and Wen,
J.-R. (2023). A Survey of Large Language Models.
arXiv:2303.18223 [cs].