Automated Evaluation of Database Conversational Agents
Matheus O. Silva (1,a), Eduardo R. S. Nascimento (1,2,b), Yenier T. Izquierdo (2,c),
Melissa Lemos (1,2,d) and Marco A. Casanova (1,2,e)
(1) Department of Informatics, PUC-Rio, Rio de Janeiro, RJ, Brazil
(2) Tecgraf Institute, PUC-Rio, Rio de Janeiro, RJ, Brazil
(a) https://orcid.org/0009-0009-1932-296X
(b) https://orcid.org/0009-0005-3391-7813
(c) https://orcid.org/0000-0003-0971-8572
(d) https://orcid.org/0000-0003-1723-9897
(e) https://orcid.org/0000-0003-0765-9636
Keywords: Conversational Agents, Database Interfaces, ReAct, LLM.
Abstract:
Database conversational agents support dialogues to help users interact with databases in their jargon. A
strategy to construct such agents is to adopt an LLM-based architecture. However, evaluating agent-based
systems is complex and lacks a definitive solution, as responses from such systems are open-ended, with
no direct relationship between input and the expected response. This paper focuses on the problem of
evaluating LLM-based database conversational agents. It first introduces a tool that explores the schema
and the data values of the underlying database to construct test datasets for such agents. The paper then describes
an evaluation agent that behaves like a human user to assess the responses of a database conversational agent on
a test dataset. Finally, the paper includes a proof-of-concept experiment with an implementation of a database
conversational agent over two databases, the Mondial database and an industrial database in production at an
energy company.
1 INTRODUCTION
Database conversational agents support dialogues to
help users interact with databases in their jargon.
They typically accept user questions formulated as
natural language (NL) sentences and return results
phrased in NL. Such agents should also help users
formulate a question step-by-step and express a new
question by referring to an older one or to a previously
defined context.
In general, a conversational agent must maintain
the dialogue state and resolve, among other problems,
anaphora, ellipsis, ambiguities, and topic shifts (Gal-
imzhanova et al., 2023). Large Language Models
(LLMs) fine-tuned to follow instructions help address
these challenges, exactly because they are trained to
maintain context and resolve anaphora, among other
nuances of user dialogues.
In addition, a database conversational agent must
translate user questions, explicitly submitted as NL
sentences in the dialogue or inferred from the dialogue, into SQL queries over the underlying database.
This translation task is often referred to as text-to-
SQL. Therefore, tools that address this task help con-
struct database conversational agents because they are
designed to match user terms with the database vocab-
ulary and perform the translation.
Evaluating database conversational agents, espe-
cially those executing tool calls based on natural lan-
guage dialogues, is crucial to validate their effective-
ness and reliability in real-world scenarios. It is there-
fore necessary to adopt a systematic evaluation pro-
cess capable of measuring the alignment between user
intentions and the database conversational agent’s ac-
tions, including the correctness of the SQL queries
generated.
This paper then focuses on the problem of eval-
uating a database conversational agent over a given
database D.
To address this problem, the paper introduces a dialogue generation tool that creates a dialogue test dataset T_D for D, an evaluation agent that simulates a user interacting with the database conversational agent over D, guided by the dialogue test dataset T_D, and a collection of dialogue performance metrics. This contribution is essential to verify the performance of a database conversational agent over a
given database before it goes into production.
The dialogue generation tool traverses the
database schema and generates a set of joins that in-
duce a natural and coherent conversation flow. Fur-
thermore, the set of dialogues the tool generates cov-
ers all database tables. The evaluation agent simu-
lates realistic user interactions, incorporates iterative
feedback mechanisms, and leverages LLMs to judge
a database conversational agent’s ability to align with
the user’s intentions. The dialogue performance met-
rics cover the database conversational agent’s abil-
ity to handle realistic dialogues, and measure if the
text-to-SQL tool produced correct SQL queries when
compared to a set of ground-truth SQL queries.
The second contribution of the paper is a proof-of-concept experiment with an implementation of a database conversational agent over two databases, the Mondial database (https://www.dbis.informatik.uni-goettingen.de/Mondial/) and an industrial database in production at an energy company. The implementation
uses the LangGraph ReAct (Reasoning and Acting)
template, which features action planning (Yao et al.,
2023) and a memory component that stores the con-
versation history, and LLM-based text-to-SQL tools
(Oliveira et al., 2025; Nascimento et al., 2025a). The
experiment suggests that the dialogue test datasets the
dialogue generation tool creates and the user simu-
lations the evaluation agent implements are valuable
in assessing the performance of a database conversa-
tional agent over a given database, which is the central
problem the paper addresses.
This paper is organized as follows. Section 2 cov-
ers related work. Section 3 details the dialogue gener-
ation tool. Section 4 focuses on the evaluation agent.
Section 5 describes the proof-of-concept experiment.
Finally, Section 6 contains the conclusions.
2 RELATED WORK
Conversational Interfaces. Conversational inter-
faces support user dialogues and may be classified
as task-oriented or chatbots (Jurafsky and Martin,
2024). Task-oriented conversational interfaces sup-
port user dialogues to accomplish fixed tasks. Chat-
bots are designed to mimic the unstructured conversa-
tions characteristic of human-human interaction. The
database conversational agent used in the experiment
falls into the second category.
Quamar et al. (Quamar et al., 2020) proposed an
ontology-driven task-oriented interface, which uses a
classifier to identify user intentions and a set of tem-
plates to generate database queries. The siwarex platform (Fokoue et al., 2024) enables seamless natural
language access to both databases and APIs, typical
of an industrial setting, and was tested on an exten-
sion of the Spider text-to-SQL benchmark.
Wei et al. (Wei et al., 2024) explored what prompt
design factors help create chatbots. The authors as-
sessed four prompt designs in an experiment involv-
ing 48 users. The results indicated that the chatbots
covered 79% of the desired information slots during
conversations, and the designs of prompts and topics
significantly influenced the conversation flows and the
data collection performance.
There are many libraries to help build chatbots, such as Google’s DialogFlow (https://cloud.google.com/dialogflow), the Amazon Lex (LexBot) framework (https://aws.amazon.com/lex/), Rasa (https://github.com/RasaHQ/), and the Xatkit Bot Platform (https://github.com/xatkit-bot-platform). For example, a chatbot that combines the flexibility of an LLM with the LexBot framework is described in (Boris, 2024).
The implementation of the database conversational agent used in Section 5 adopted the LangGraph ReAct agent template (https://github.com/langchain-ai/react-agent). The ReAct (“reasoning and
acting”) framework (Yao et al., 2023) combines chain
of thought reasoning with external tool use. ReAct
served as a basis for several conversational frame-
works. For example, ReAcTable (Zhang et al., 2024b)
is a framework designed for the Table Question An-
swering task inspired by ReAct and relying on exter-
nal tools such as SQL and Python code executors. It
achieved an accuracy of 68.0% on the WikiTQ bench-
mark (Pasupat and Liang, 2015). RAISE (Liu et al.,
2024) is an enhancement of the ReAct framework
and incorporates a dual-component memory system
to maintain context and continuity in conversations.
Text-to-SQL Tools. Comprehensive surveys of text-
to-SQL strategies can be found in (Hong et al., 2025;
Shi et al., 2025), including a discussion of benchmark
datasets, prompt engineering, and fine-tuning meth-
ods. The Awesome Text2SQL Web site (https://github.com/eosphoros-ai/Awesome-Text2SQL) lists the best-performing text-to-SQL tools on several text-to-SQL benchmarks, and the DB-GPT-Hub project (https://github.com/eosphoros-ai/DB-GPT-Hub) explores
how to use LLMs for text-to-SQL. It should also be
remarked that the text-to-SQL problem is considered
far from solved for real-world databases (Floratou
et al., 2024; Lei et al., 2025).
The text-to-SQL tools (Nascimento et al., 2025a;
Oliveira et al., 2025) used in Section 5 achieved good
results over challenging benchmarks based on the in-
dustrial database and on the Mondial database, re-
spectively.
Benchmarks for Conversational Interfaces. There
are currently several datasets appropriate for training
and evaluating models for conversational interfaces.
Examples are the TREC Conversational Assistance
Track (CAsT) and the TREC Interactive Knowledge
Assistance Track (iKAT) 2023 (Aliannejadi et al.,
2024). MultiWOZ (Budzianowski et al., 2020) is a
fully labeled collection of human-human written con-
versations spanning multiple domains and topics. The
task-oriented dialogues were created from task tem-
plates using an ontology of the back-end database,
which turns out to be very simple. The Awesome NLP benchmarks for intent-based chatbots Web site (https://github.com/xatkit-bot-platform/awesome-nlp-chatbot-benchmarks) lists benchmarks to evaluate user intent matching and entity recognition.
DialogStudio (Zhang et al., 2024a) unified several
dataset collections and instruction-aware models for
conversational interfaces. The authors also developed
conversational AI models, using the dataset collection
to fine-tune T5 and Flan-T5. They also report the use
of ChatGPT to evaluate the quality of the responses
generated.
Contributions. The above conversational bench-
marks typically favor model fine-tuning and are not
designed to test conversational agents over more com-
plex databases. Thus, rather than relying on such
generic benchmarks, Section 3 introduces a dialogue generation tool to create a dialogue test dataset T_D specifically to assess the performance of a database conversational agent over a given database D. The approach taken is rather different from any of the above efforts to create benchmarks for conversational interfaces, insofar as the dialogue generation tool starts from the selected database D for which one wants to create a conversational interface and uses the database schema and the data stored to synthesize a dataset T_D containing dialogues that traverse D. The tool automatically extends the synthetic dialogues to include the user’s intentions and the expected SQL queries. The procedure is generic and automated, working with minimal human intervention.
Section 4 describes an evaluation agent to auto-
mate performance assessment, using T_D. The evaluation agent simulates realistic user behavior and systematically interacts with a database conversational agent, querying and providing feedback when necessary, following an experimental protocol that consumes the dialogues in T_D. The evaluation agent incorporates an “AI as Judge” mechanism (Zheng et al., 2023) to assess a database conversational agent’s ability to align with the user’s intentions.
The performance metrics cover the database con-
versational agent’s ability to handle dialogues involv-
ing context-dependent follow-up questions and in-
complete information that require inference based on
dialogue memory. They also measure if the text-to-
SQL tool produced correct SQL queries when com-
pared to a set of ground-truth SQL queries. These
metrics are novel and a contribution of the paper to the
assessment of database conversational agents working
over databases.
The procedure to generate a dialogue test dataset T_D for a given database D and the evaluation agent then help address the problem of evaluating a database conversational agent over D.
The experiment in Section 5 tested a database
conversational agent over two databases, the Mon-
dial database and an industrial database in produc-
tion at an energy company. It adopted the evaluation
agent and dialogue test datasets created by the dia-
logue generation tool specifically for these databases.
Therefore, the experiment illustrates how to evaluate
a database conversational agent over a given database
using the techniques introduced in the paper.
3 DIALOGUE TEST DATASETS
3.1 Definitions of Dialogue and
Extended Dialogue
For the purposes of this paper, a turn is either a user
turn or a system turn, each composed of NL sen-
tences, called utterances, produced by the user or by
the system. An interaction is a sequence of two or
more turns, starting with a user turn, ending in a sys-
tem turn, and alternating between a user turn and a
system turn. A dialogue is a sequence of interactions.
Intuitively, an interaction is a goal-oriented ex-
change between the user and the system and is typ-
ically centered around a specific question or task,
which may span several turns. Figure 1 shows a di-
alogue with four interactions, corresponding to Lines
1–4, 5–7, 8–12, and 13–15. All interactions consist of a user turn followed by a system turn, except the one in Lines 8–12, which has two pairs of a user turn followed by a system turn.
The utterances in a user turn express:
A greeting, that optionally signals the start of a
dialogue or an interaction.
1 User: "Can you show me a list of the major airports,
2 along with the names and capitals of the countries
3 they are located in?"
4 System: [The first 10 major airports are...]
5 User: "Now, can you show me the border details for
6 these countries?"
7 System: [A description of the data requested]
8 User: "I will now focus on The Ville Lumière."
9 System: "I am sorry. I have no location called
10 Ville Lumière."
11 User: "I meant Paris."
12 System: "OK."
13 User: "What is the population?"
14 System: Paris has an estimated population of about
15 2 million residents.
Figure 1: Example of a dialogue.
A question, that specifies a database request, such
as that in Lines 1–3.
A partial question, that partially specifies a
database request and that must be expanded into
a (complete) question using the dialogue context,
such as that in Lines 5–6 and Line 13.
A context modification, that modifies the next
questions in the dialogue, such as that in Line 8.
A clarification, that responds to a request from the
system for information, such as that in Line 11.
The utterances in a system turn express:
An acknowledgment, that signals that the system
understood the user, such as that in Line 12.
A response, that describes data retrieved from the
database, such as that in Lines 14–15.
A request, that asks the user to clarify his utter-
ance, such as that in Lines 9–10.
An error message, that signals when the user’s ut-
terance cannot be processed.
The notion of dialogue, introduced above, cap-
tures what the user observes when interacting with
the interface to retrieve data from the database. Still,
it is insufficient to capture what is needed to test a
database conversational agent. The notion of dialogue
is then extended so that a turn now includes additional
fields to help assess the dialogue performance metrics
introduced in Section 4.3:
The intention, which is an NL sentence describing
what the user’s utterance expresses.
A ground truth SQL query, which translates the
user utterance to SQL and, therefore, describes
the system response when the user utterance is a
question or a partial question.
Figure 2 shows, in JSON format, the extended di-
alogue corresponding to the dialogue of Figure 1 (for
brevity, the figure shows only the first two interac-
tions).
3.2 Generation of Dialogue Test
Datasets
A natural language database conversational agent
must perform two challenging tasks: (1) Transform
partial questions into complete questions; and (2)
Modify questions to include explicitly defined con-
text. A dialogue test dataset must then contain dia-
logues that test the ability of an interface to perform
such tasks. This section explains how to create dia-
logues to test an interface for the first task and outlines
how to test for the second task.
3.2.1 Testing the Processing of Partial Questions
The strategy to test the processing of partial questions consists of generating dialogues that traverse valid join combinations, where a valid join combination is a set of joins that induces a connected subgraph of the referential integrity graph of the database schema. Given a valid join combination c, a dialogue d_c is generated so that: (1) Each question in d_c is partial, incorporates a new join in c, and has as context the previous questions; (2) The final question corresponds to an SQL query expressing c. Thus, the dialogue d_c progressively incorporates the joins in c to induce a natural and coherent conversation flow. Furthermore, the set of dialogues must cover a diverse set of valid join combinations, that is, it must traverse the database schema.
The process of creating a dialogue test dataset
starts by identifying the valid join combinations,
which is implemented by an LLM prompt with the
following steps:
1. Define a structured format for the LLM output.
2. Instantiate an LLM model with advanced reason-
ing capabilities.
3. Provide the database schema.
4. Describe the task to the LLM, including examples
from the database and explaining how the joins
should be constructed.
This approach is based on the premise that an
LLM can effectively reason about the schema and the
task described to generate a list of valid join combina-
tions, as demonstrated by the successful use of LLMs
in tasks such as Named Entity Recognition (Wang
et al., 2025).
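As an illustration of steps 1–4, the following sketch shows one way such a prompt could be implemented with LangChain structured output and Pydantic; the class names, field names, and prompt wording are illustrative assumptions rather than the exact implementation, and the high reasoning-effort setting mentioned in Section 3.2.4 would be configured on the model instance.

from pydantic import BaseModel
from langchain_openai import ChatOpenAI

class Join(BaseModel):
    # A single join between two tables of the schema.
    left_table: str
    right_table: str
    condition: str  # e.g., "A.COUNTRY = C.CODE"

class JoinCombination(BaseModel):
    # A set of joins inducing a connected subgraph of the referential integrity graph.
    joins: list[Join]

class JoinCombinations(BaseModel):
    # Step 1: structured format for the LLM output.
    combinations: list[JoinCombination]

# Step 2: instantiate a model with advanced reasoning capabilities (illustrative choice).
llm = ChatOpenAI(model="o3-mini")
structured_llm = llm.with_structured_output(JoinCombinations)

def propose_join_combinations(schema_ddl: str, examples: str) -> JoinCombinations:
    # Steps 3 and 4: provide the database schema and describe the task, with examples.
    prompt = (
        "You are given the schema of a relational database:\n"
        f"{schema_ddl}\n\n"
        "Propose valid join combinations, i.e., sets of joins whose tables form a "
        "connected subgraph of the referential integrity graph of the schema.\n"
        f"Examples of joins over this database:\n{examples}\n"
    )
    return structured_llm.invoke(prompt)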
However, merely selecting a fixed number of join
combinations without additional constraints results in
1 { "experiment_id": "1",
2 "total_expected_interactions": 3,
3 "interactions": [
4 {
5 "utterance": "Can you show me a list of airports along with the names and
6 The capitals of the countries they are located in?",
7 "intention": "Show me a list of airports and include the country’s name and
8 capital for each airport.",
9 "ground_truth_sql":
10 "SELECT A.NAME AS airport_name, A.CITY AS airport_city,
11 C.NAME AS country_name, C.CAPITAL
12 FROM MONDIAL_AIRPORT A JOIN MONDIAL_COUNTRY C
13 ON A.COUNTRY = C.CODE;"
14 } },
15 {
16 "utterance": "Now, can you show me the border details for these countries,
17 such as which countries they border and the length of those borders?",
18 "intention": "Using the list of countries from the previous query (those with
19 airports), show me their bordering countries and the border lengths.",
20 "ground_truth_sql":
21 "SELECT C.NAME AS country_name,
22 B.COUNTRY2 AS neighboring_country,
23 B.LENGTH AS border_length
24 FROM MONDIAL_COUNTRY C JOIN MONDIAL_BORDERS B
25 ON C.CODE = B.COUNTRY1
26 WHERE C.CODE IN (SELECT DISTINCT COUNTRY
27 FROM MONDIAL_AIRPORT);"
28 } },
29 ...
Figure 2: Example of an extended dialogue.
an imbalanced dataset. Some tables become overrep-
resented, while others remain underrepresented or ab-
sent. This imbalance can introduce bias in the evaluation of the database conversational agent’s ability to generalize across different SQL queries. A structured
strategy for generating balanced join combinations is
then introduced, ensuring that:
Each join combination results in a unique ex-
tended dialogue.
The number of joins per dialogue varies to cover
different levels of complexity.
The selection of join combinations maintains a
balanced distribution of tables, preventing biases
in dataset representation.
After implementing the balanced join selection
strategy, the final join combinations are evaluated
based on three main criteria: (a) ensuring full schema
coverage; (b) achieving a balanced distribution of the
number of joins per combination; (c) distributing ta-
ble usage fairly to avoid overuse of specific tables.
The final process of creating dialogues to test the
processing of partial questions is implemented by an
LLM prompt with the following steps (see Figure 3):
1. Extract the required table definitions.
2. Collect sample rows to be incorporated when for-
mulating queries.
3. Load the set C of balanced join combinations.
4. For each join combination c in C, instruct the LLM to create an extended dialogue d_c. The instructions include:
(a) The table definitions and sample data involved in c.
(b) A brief explanation of c.
(c) A clarification indicating that, for each join c_i in c, there should be one interaction in d_c, with just one turn t_i, using the previous interactions as context.
(d) Indications to include in t_i the user utterance, the intention, and the ground-truth SQL query, and their descriptions.
5. Instruct the LLM on the required output format.
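A minimal sketch of how the instructions of step 4 could be assembled into a prompt string follows; the function and parameter names are hypothetical, and the wording only indicates the items (a)–(d) above.

def build_dialogue_prompt(table_ddls: str, sample_rows: str,
                          join_explanation: str, num_joins: int) -> str:
    # Assemble the per-combination instructions of step 4 (items (a)-(d)).
    return (
        # (a) table definitions and sample data involved in the combination
        f"Tables involved:\n{table_ddls}\n\nSample rows:\n{sample_rows}\n\n"
        # (b) brief explanation of the join combination
        f"Join combination to cover:\n{join_explanation}\n\n"
        # (c) one interaction per join, each using the previous interactions as context
        f"Write a dialogue with {num_joins} interactions, one per join. Each interaction "
        "has a single user turn, phrased as a partial question that relies on the "
        "previous interactions as context.\n"
        # (d) fields to include in every turn
        "For every turn, return the user utterance, the user's intention, and the "
        "ground-truth SQL query over the tables above.\n"
    )

The resulting string is sent to the LLM together with the structured output format required in step 5 (see Section 3.2.4).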
3.2.2 Testing the Processing of Context
Modifications
Very briefly, the strategy to test the processing of context modifications consists of generating dialogues that: (1) Define contexts by NL sentences that express SQL restrictions, such as that in Line 8 of Figure 1; (2) Include subsequent partial questions that implicitly refer to the context defined, as in Line 13.
Figure 3: Architecture of the dialogue generation tool.
The process of creating such dialogues starts by
sampling the database schema and the data values to
generate restrictions. The sampling process must gen-
erate restrictions from a wide selection of tables in the
schema, and not concentrate on just a few tables. The
process is implemented by an LLM prompt very sim-
ilar to that which creates dialogues to test the process-
ing of partial questions, and is omitted for brevity.
3.2.3 Constructing the Final Dialogue Test
Dataset
Finally, in its simplest form, the final dialogue test
dataset will contain dialogues that test the processing
of partial questions and dialogues that test the pro-
cessing of context modifications. A more sophisti-
cated strategy can also be developed by combining
both approaches to create dialogues, each of which
tests both tasks.
3.2.4 Implementation Details
The construction of the dialogue test datasets used in
Section 5 adopted the OpenAI o3-mini model (with
high reasoning effort mode) with a Pydantic output
template requiring a list of joins. The model yields unique single- and multi-join combinations, although not error-free. Sim-
ilar structured prompting proved effective for schema
reasoning tasks (Wang et al., 2025; Aquino, 2024).
A greedy heuristic incrementally selects joins
while optimizing three objectives: (1) 100% table
coverage; (2) near-uniform distribution over 2-, 3-,
and 4-way joins; and (3) bounded per-table frequency.
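The heuristic itself is not listed in the paper; the sketch below gives one plausible greedy formulation, under the simplifying assumption that each candidate combination is represented by the set of tables it joins and with illustrative scoring weights.

from collections import Counter

def select_balanced(combinations: list[set[str]], k: int,
                    max_per_table: int = 8) -> list[set[str]]:
    # Greedily pick k combinations, favoring uncovered tables (objective 1),
    # a near-uniform mix of combination sizes (objective 2),
    # and a bounded per-table frequency (objective 3).
    selected: list[set[str]] = []
    covered: set[str] = set()
    usage: Counter[str] = Counter()       # how often each table has been used
    size_count: Counter[int] = Counter()  # how many combinations of each size were picked

    def score(c: set[str]) -> float:
        if any(usage[t] >= max_per_table for t in c):
            return float("-inf")          # objective 3: cap table reuse
        new_tables = len(c - covered)     # objective 1: reward new coverage
        balance = -size_count[len(c)]     # objective 2: penalize overused sizes
        reuse_penalty = -sum(usage[t] for t in c)
        return 10 * new_tables + 2 * balance + reuse_penalty

    pool = list(combinations)
    for _ in range(min(k, len(pool))):
        best = max(pool, key=score)
        if score(best) == float("-inf"):
            break                         # every remaining candidate violates the cap
        pool.remove(best)
        selected.append(best)
        covered |= best
        usage.update(best)
        size_count[len(best)] += 1
    return selected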
For each join, a prompt was assembled containing:
the join context; CREATE TABLE DDLs; and up to 20
sample values per column. The LLM was prompted
to produce k user turns—one per hop—together with
validated SQL queries. A Pydantic Experiment
schema forces compliance.
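The exact Pydantic schema is not reproduced in the paper; a plausible sketch, with field names taken from Figure 2, is:

from pydantic import BaseModel

class Interaction(BaseModel):
    utterance: str         # the user turn, typically a partial question
    intention: str         # NL description of what the utterance expresses
    ground_truth_sql: str  # expected SQL translation of the utterance

class Experiment(BaseModel):
    experiment_id: str
    total_expected_interactions: int
    interactions: list[Interaction]

Requesting Experiment as the structured output format constrains the LLM to return exactly these fields for every generated dialogue.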
4 EVALUATION OF DATABASE
CONVERSATIONAL AGENTS
4.1 Overview
The methodology uses an evaluation agent that sim-
ulates realistic user behavior. This agent systemati-
cally interacts with the database conversational agent,
querying and providing feedback when necessary,
following an experimental protocol that consumes
predefined extended dialogues from a database test
dataset. Such an approach ensures the standardization
of interactions, enabling controlled evaluation condi-
tions across different queries and scenarios.
To determine whether the database conversational
agent’s responses fulfill the intentions, the evaluation
agent incorporates an automatic judgment mechanism
utilizing an LLM acting as evaluator, which is referred to as the “AI as Judge” methodology. This
methodology has been shown to effectively identify
alignment and correctness between generated queries
and user intentions, reducing human effort and poten-
tial biases during the evaluation process (Zheng et al.,
2023).
The evaluation agent leverages a state-based eval-
uation graph implemented using LangGraph, a frame-
work that orchestrates complex interactions among
language models, agents, and tools. LangGraph is
particularly suitable for managing stateful interac-
tions involving language models and external tools,
enabling the precise orchestration of complex con-
versational evaluation workflows (Chase and Team,
2023).
At the core of the evaluation architecture is a carefully designed internal state representation, defined as
a strongly typed dictionary that encapsulates all rele-
vant information and parameters needed to manage
the evaluation interactions.
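A minimal sketch of what such a typed state could look like is given below; the field names mirror the parameters and state variables mentioned in Section 4.2 but are otherwise illustrative.

from typing import Any, Dict, List, TypedDict

class EvalState(TypedDict, total=False):
    # control data taken from the prefix of the dialogue test dataset
    max_retries: int
    model_version: str
    debug: bool
    # progress through the current extended dialogue
    current_interaction: int
    retry_count: int
    go_next_interaction: bool
    proceed: bool
    # accumulated results and tracking fields
    interaction_metrics: List[Dict[str, Any]]
    timestamps: List[str]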
To control the evaluation flow, conditional edges
determine whether the evaluation process should con-
tinue or be concluded. These decisions depend on
clearly defined criteria, such as exceeding the maxi-
mum allowed retries for a given interaction or com-
pleting all intended interactions within an experiment
scenario.
To summarize, the implementation of the evalua-
tion agent using LangGraph facilitates precise control
over the evaluation workflow. This design decision
contributed to the reproducibility, clarity, and scal-
ability of evaluation experiments, offering valuable
insights into the performance of a database conver-
sational agent under varied interaction scenarios and
conditions.
4.2 Evaluation Agent
Figure 4 shows the architecture of the evaluation
agent, illustrating how each interaction progresses
through specific nodes and decision points within the
evaluation graph.
The evaluation agent receives as input a dialogue test dataset T_D for a database D, prefixed by control data.
It begins at the Test Control Node, which first initializes and configures all relevant parameters and states needed for the evaluation, using the control data in the prefix of T_D. The parameters include the maximum allowed retries per interaction, debugging preferences, and model version. The node also initializes detailed metrics and tracking fields, such as interaction metrics, timestamps, and initial debug states. After initialization, the Test Control Node passes each extended dialogue d_i in T_D to the User Interaction Node.
The User Interaction Node simulates user behavior by presenting the database conversational agent with a predefined natural language question from a turn of d_i (the ‘utterance’ field in Figure 2). The decision on which natural language question to send to the database conversational agent depends on an internal evaluation state. Specifically, the evaluation could proceed to a new interaction, or feedback could be required, based on the previous responses. The User Interaction Node manages these internal decisions based on the value of a Boolean state variable, called ‘state["go_next_interaction"]’.
Upon receiving a response from the database con-
versational agent, the evaluation moves to the Check
Response Node to analyze the response (see also the
definition of the dialogue performance metrics in Sec-
tion 4.3):
Successful Intention Alignment: the Check Response Node evaluates whether the natural language interpretation u′ of the user input u successfully aligned with the intention u′′. The interpretation u′ is the NL sentence sent to be translated to SQL to create a response for u, if u is a question or partial question, or the NL sentence sent back to the user, in all other cases. The intention u′′ associated with u is obtained from the ‘intention’ field (see Figure 2). The NL sentences u′ and u′′ are compared using the “AI as Judge” approach, employing an LLM (Zheng et al., 2023).
SQL Query Correctness: the Check Response Node evaluates whether the SQL query Q′_SQL, generated by the database conversational agent, reflects the SQL ground truth query Q_SQL, obtained from the ‘ground_truth_sql’ field (see Figure 2). The verification process compares Q′_SQL and Q_SQL as in (Nascimento et al., 2025b).
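A minimal sketch of how the intention-alignment judgment could be delegated to an LLM judge is shown below; the model name, prompt wording, and Verdict fields are illustrative assumptions rather than the exact prompt used by the evaluation agent.

from pydantic import BaseModel
from langchain_openai import ChatOpenAI

class Verdict(BaseModel):
    aligned: bool       # does the interpretation fulfill the intention?
    justification: str  # short explanation, useful for the evaluation log

judge = ChatOpenAI(model="gpt-4o").with_structured_output(Verdict)

def judge_alignment(interpretation: str, intention: str) -> Verdict:
    # Compare the agent's interpretation u' of the user input with the expected intention u''.
    prompt = (
        "You are evaluating a database conversational agent.\n"
        f"Expected user intention: {intention}\n"
        f"Agent's interpretation of the user input: {interpretation}\n"
        "Decide whether the interpretation fulfills the intention."
    )
    return judge.invoke(prompt)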
If the database conversational agent’s response is
inadequate, the Check Response Node records the
cause of failure (e.g., alignment failure, SQL correct-
ness failure, decoding errors), incrementing the retry
count accordingly. At the end, the Check Response
Node updates several variables that will define the
following steps, such as ‘state["proceed"]’ and ‘state["go_next_interaction"]’.
The final step of the evaluation workflow for d_i involves determining whether the evaluation of d_i should continue to the next interaction or be concluded. This decision is primarily based on two criteria:
Retries Exceeded: If the maximum number of re-
tries for a particular interaction is exceeded with-
out achieving a satisfactory response, the eval-
uation process terminates for that interaction,
recording it as unsuccessful.
Completion of Interaction Sequence: If all predefined interactions within d_i have been successfully evaluated or have reached termination conditions, the evaluation of d_i is concluded.
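A minimal sketch of how such an evaluation graph might be wired in LangGraph is shown below, with node names taken from Figure 4 and a compact state like the one sketched in Section 4.1; the node bodies are placeholders and the routing thresholds are illustrative.

from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class EvalState(TypedDict, total=False):
    max_retries: int
    retry_count: int
    current_interaction: int
    go_next_interaction: bool
    proceed: bool

def test_control(state: EvalState) -> dict:
    # Initialize parameters, metrics, and tracking fields from the control data.
    return {"current_interaction": 0, "retry_count": 0, "go_next_interaction": True}

def user_interaction(state: EvalState) -> dict:
    # Send the next utterance (or feedback) to the database conversational agent.
    return {}

def check_response(state: EvalState) -> dict:
    # Judge intention alignment and SQL correctness; update the control flags.
    return {"proceed": True}

def route(state: EvalState) -> str:
    # Conditional edge: stop on retry exhaustion or when all interactions are done.
    if state.get("retry_count", 0) > state.get("max_retries", 3):
        return "end"
    if not state.get("proceed", True):
        return "end"
    return "continue"

builder = StateGraph(EvalState)
builder.add_node("test_control", test_control)
builder.add_node("user_interaction", user_interaction)
builder.add_node("check_response", check_response)
builder.add_edge(START, "test_control")
builder.add_edge("test_control", "user_interaction")
builder.add_edge("user_interaction", "check_response")
builder.add_conditional_edges("check_response", route,
                              {"continue": "user_interaction", "end": END})
graph = builder.compile()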
Figure 4: The architecture of the evaluation agent and its relation to the database conversational agent.
Finally, the evaluation agent outputs an evaluation report with the result of processing the dialogue test dataset T_D. The report contains JSON objects, one for each extended dialogue in T_D, with experiment_id, total_expected_interactions, and an interactions list. Every interaction contains the user’s utterance, a refined intention, and a gold SQL query. This structure permits computing the dialogue performance metrics for T_D.
4.3 Dialogue Performance Metrics
The dialogue performance metrics should cover the
database conversational agent’s ability to handle re-
alistic dialogues, especially those involving context-
dependent follow-up questions and incomplete infor-
mation that require inference based on dialogue mem-
ory. They should also measure if the text-to-SQL tool
produced correct SQL queries when compared to a set
of ground-truth SQL queries.
Given a dialogue test dataset T_D for a database D, the following metrics cover this evaluation scenario.
SQL Query Correctness Rate: The percentage of questions in the user turns of the extended dialogues in T_D that were correctly translated to an SQL query.
The notion that a user question was correctly
translated into an SQL query was explained in Sec-
tion 4.2.
User Turn Intention Alignment Rate: The percentage of successfully aligned user turns of the extended dialogues in T_D.
The notion that a user turn was successfully
aligned was also explained in Section 4.2.
Dialogue Intention Alignment Rate: The percentage of successfully aligned extended dialogues in T_D. An extended dialogue d_i is successfully aligned iff every interaction in d_i has at least one successfully aligned user turn.
Note that there might be several user turns in an in-
teraction before reaching a correctly aligned one, but
these extra user turns do not affect the dialogue inten-
tion alignment rate, although they are accounted for
in the user turn intention alignment rate. Therefore, a
dialogue is considered successful only when all its in-
teractions have been adequately understood and han-
dled by the agent. This stringent measure ensures that
the agent can comprehend and appropriately respond
to all user requests within a dialogue.
Average Number of Turn Pairs per Interaction: The average number of turn pairs (a user turn followed by a system turn) per interaction in all dialogues in T_D.
This metric quantifies the efficiency of interac-
tions in terms of turns, providing insight into how
many exchanges are typically needed to satisfy a user
request. A lower average indicates better user inten-
tion comprehension, as the agent can understand and
execute the appropriate action with fewer clarifica-
tions or follow-ups.
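The paper does not fix the layout of the evaluation report; under the assumption that the report exposes, per interaction, the list of user turns with Boolean alignment and SQL-correctness flags, the four metrics could be computed as in the sketch below (field names are hypothetical).

from typing import Any, Dict, List

def dialogue_metrics(dialogues: List[Dict[str, Any]]) -> Dict[str, float]:
    # Assumed record layout: each dialogue has "interactions", each interaction has
    # "user_turns", and each user turn has an "aligned" flag and, for (partial)
    # questions, an "sql_correct" flag.
    user_turns = [t for d in dialogues for i in d["interactions"] for t in i["user_turns"]]
    questions = [t for t in user_turns if "sql_correct" in t]
    aligned_dialogues = [
        d for d in dialogues
        if all(any(t["aligned"] for t in i["user_turns"]) for i in d["interactions"])
    ]
    n_interactions = sum(len(d["interactions"]) for d in dialogues)
    return {
        "sql_query_correctness_rate":
            100 * sum(t["sql_correct"] for t in questions) / max(len(questions), 1),
        "user_turn_intention_alignment_rate":
            100 * sum(t["aligned"] for t in user_turns) / max(len(user_turns), 1),
        "dialogue_intention_alignment_rate":
            100 * len(aligned_dialogues) / max(len(dialogues), 1),
        "avg_turn_pairs_per_interaction":
            len(user_turns) / max(n_interactions, 1),
    }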
5 A PROOF-OF-CONCEPT
EXPERIMENT
5.1 A Database Conversational Agent
Implementation and a Baseline
Agent
Figure 5 depicts the architecture of the database con-
versational agent used in the experiments, organized
around the following components:
Dialogue Control Agent: controls the dialogue,
reformulates the user question, stores the dialogue
state, and interacts with the text-to-SQL tool.
Text-to-SQL Tool: controls the compilation of a
user question into SQL and passes the results back
to the dialogue control agent. It has three major
components (Shi et al., 2025): Schema-Linking,
SQL Compilation, and SQL Execution.
LLM: executes the tasks passed, guided by
prompts. The dialogue control agent and the text-
to-SQL tool may use different LLMs.
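To make the architecture concrete, the sketch below shows how a text-to-SQL tool could be registered with the LangGraph ReAct agent template; the tool body is a placeholder for the actual tools of (Oliveira et al., 2025; Nascimento et al., 2025a), and the model choice and question are illustrative.

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import create_react_agent

@tool
def text_to_sql(question: str) -> str:
    """Translate a self-contained NL question into SQL, run it, and return the results."""
    # Placeholder: the actual tools perform schema linking, SQL compilation,
    # and SQL execution over the underlying database.
    return "SELECT ..."

# The checkpointer provides the memory component that stores the conversation history.
agent = create_react_agent(
    ChatOpenAI(model="gpt-4o"),
    tools=[text_to_sql],
    checkpointer=MemorySaver(),
)

reply = agent.invoke(
    {"messages": [("user", "Show me the major airports and their countries.")]},
    config={"configurable": {"thread_id": "dialogue-1"}},
)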
As mentioned in Section 2, the implementation of
the dialogue control agent was based on the Lang-
Graph ReAct agent template, which provides short-
and long-term memory to retain data from recent and
past user sessions. The experiment adopted a different
LLM-based text-to-SQL tool tuned to each database
(Coelho et al., 2024; Oliveira et al., 2025).
The experiment compared the database conversa-
tional agent implementation just described against a
baseline agent without dialogue memory to demon-
strate the critical importance of context retention in
conversational interfaces.
5.2 Experiment with the Mondial
Database
5.2.1 The Mondial Dialogue Test Dataset and
Use of the Evaluation Agent
The experiments used a Mondial dialogue test
dataset, generated from the Mondial database, as ex-
plained in Section 3.2, with 50 extended dialogues
and a total of 149 interactions, where 17 extended dialogues have two interactions, 17 have three, and 16 have four. All 40 Mondial tables were covered, with a stan-
dard deviation of 1.8 in table frequency. Compared to
a naive sampling baseline, coverage improved from
68% to 100% of the Mondial tables.
The test dataset was manually analyzed to verify
how much its dialogues covered the Mondial schema,
how realistic the automatically generated user utter-
ances and intentions were, and whether the automati-
cally synthesized SQL ground truth queries were cor-
rect. The analysis indicated that the dialogues cov-
ered the Mondial schema, the user utterances and in-
tentions were adequate, but 5 of the 149 ground truth
SQL queries had to be redefined. Minor edits also im-
proved discourse coherence and question variety (ag-
gregations, subqueries, limits).
The experiments evaluated the database conversa-
tional agent implementation and the baseline agent
with the help of the evaluation agent of Section 4.2,
using the Mondial dialogue test dataset. Again, the
evaluation logs were manually analyzed to check if
the evaluation agent properly verified the intention
alignment and the SQL query correctness. The results
were manually adjusted, when deemed necessary, be-
fore computing the dialogue performance metrics.
A detailed analysis of the Mondial dialogue test
dataset and the evaluation agent logs can be found in
(Silva, 2025).
5.2.2 Results
Table 1 summarizes the results obtained with the
database conversational agent implementation, with
dialogue memory, and the baseline agent, without di-
alogue memory.
User Turn Intention Alignment Rate and SQL Query
Correctness Rate. The results demonstrate that the memory-enabled database conversational agent achieved a 100% user turn intention alignment rate, properly interpreting the user’s intention, and an SQL query correctness rate of 90.8%.
In stark contrast, the baseline agent, without dia-
logue memory, achieved only a 59.7% user turn align-
ment rate and a 32.6% SQL query correctness rate,
highlighting the severe degradation in performance
when the dialogue context is not maintained.
Dialogue Intention Alignment Rate. The memory-
enabled database conversational agent successfully
processed all dialogues (100%), reinforcing its ro-
bust ability to understand and correctly process user
requests across entire conversations, even in chal-
lenging scenarios involving context-dependent and
follow-up questions.
By comparison, without dialogue memory, the baseline agent achieved only a 68% dialogue intention alignment rate, with nearly a third of all conversations failing to meet the user’s information needs.
This difference emphasizes how crucial conversa-
tional memory is for maintaining dialogue coherence
and successfully handling multi-turn interactions.
Average Number of Turn Pairs per Interaction. The
results show that the memory-enabled database con-
versational agent reached an average of 1.01 turns per
interaction, which is close to the ideal of 1.0. This
indicates that the agent almost always correctly un-
derstood user requests on the first attempt, rarely re-
quiring additional turns for clarification or correction.
For the baseline agent, without dialogue memory, the results reveal an average of 1.73 turns per interaction. This significantly higher number of turns
demonstrates how the lack of dialogue memory im-
pacts dialogue efficiency, requiring more back-and-
forth exchanges to satisfy user information needs, and
sometimes not helping the agent to understand the
user’s real intention.
Figure 5: Experimental setup.
Table 1: Results of the database conversational agent for the Mondial database.
Metric With Memory Without Memory Improvement
User Turn Intention Alignment Rate 100.0% 59.7% +40.3%
SQL Query Correctness Rate 90.8% 32.6% +58.2%
Dialogue Intention Alignment Rate 100% 68.0% +32.0%
Average Number of Turn Pairs per Interaction 1.01 1.73 -42.0%
Table 2: Results of the database conversational agent for the industrial database.
Metric With Memory Without Memory Improvement
User Turn Intention Alignment Rate 95.3% 69.0% +26.3%
SQL Query Correctness Rate 82.6% 41.3% +41.3%
Dialogue Intention Alignment Rate 100% 92% +8.0%
Average Number of Turn Pairs per Interaction 1.06 1.45 -27%
Dialogue Memory Impact Analysis. The compara-
tive results between the memory-enabled agent and
the baseline agent without dialogue memory provide
clear evidence of the critical role that dialogue mem-
ory plays in text-to-SQL interfaces. Without the abil-
ity to maintain context across turns, the baseline agent
struggled with:
1. Follow-up questions: Unable to connect questions
to previous context, leading to misinterpreted in-
tentions.
2. Incomplete queries: Unable to fill in missing in-
formation from previous turns.
3. Conversational references: Failed to resolve ref-
erences like “their” and “that result”.
4. Query refinement: Unable to build upon or refine
previous queries.
These challenges resulted in the substantially
lower performance metrics observed in the baseline
agent, confirming the hypothesis that dialogue mem-
ory is essential for building conversational agents.
5.3 Experiment with an Industrial
Database
The industrial database chosen for the experiments
(Izquierdo et al., 2024) is in production at an energy
company and has a relational schema with 27 tables
and a total of 585 columns and 30 foreign keys, where
the largest table has 81 columns.
The experiments based on the industrial database
used the database conversational agent with the KwS-
assisted text-to-SQL tool described in (Nascimento
et al., 2025a). They applied the evaluation agent of
Section 4.2 to a dialogue test dataset generated as ex-
plained in Section 3.2, with 50 extended dialogues
and a total of 140 interactions, where 10 extended dialogues had two interactions, and 40 had three.
As in the experiment with Mondial, the dialogue
test dataset was manually analyzed. The analysis
found that a few gold-standard SQL queries were repeated in some turns, which does not represent a problem, and that two gold-standard SQL queries were incorrect and had to be manually rewritten.
Also, the evaluation agent log was manually ana-
lyzed. In the experiment with the database conversa-
tional agent with memory, the evaluation agent cor-
rectly classified all alignments. In the experiments
with the database conversational agent without mem-
ory, the evaluation agent correctly clarified most of the user questions in the first attempt, when the dialogue agent returned “I am sorry, but your question seems incomplete. Could you please provide more details or rewrite the question.”, or when the alignment was false.
tion agent made a mistake when simulating the user,
and, consequently, the dialogue control agent and the
text-to-SQL tool also failed.
Table 2 summarizes the results of the experiments,
which were similar to those in Table 1 and, therefore,
require no further comments.
6 CONCLUSIONS
This paper addressed the problem of testing a
database conversational agent for a given database D.
To achieve this goal, the paper introduced a tool to create a dialogue test dataset T_D for D, a collection of performance metrics, and an evaluation agent that simulates a user interacting with the database conversational agent over D, guided by the dialogue test dataset T_D. This contribution is essential to verify the
performance of a database conversational agent over
a given database before it goes into production.
The paper concluded with a proof-of-concept ex-
periment with a database conversational agent over
two databases, the Mondial database and an industrial
database. The implementation used the LangGraph
ReAct framework and LLM-based text-to-SQL tools
designed for the databases.
The experiment suggests that the dialogue test
datasets the dialogue generation tool creates and the
user simulations the evaluation agent implements are
valuable in assessing the performance of a database
conversational agent over a given database, which is
the central problem the paper addresses.
Lastly, as future work, experiments along the lines
of the setup described in Section 5 will be conducted
with other real-world databases that are in production
to equip them with a natural language conversational
interface.
ACKNOWLEDGEMENTS
This work was partly funded by FAPERJ un-
der grants E-26/204.322/2024, by CNPq under
grant 305.587/2021-8, and by Petrobras Contract
2018/00716-0.
REFERENCES
Aliannejadi, M., Abbasiantaeb, Z., Chatterjee, S., Dalton,
J., and Azzopardi, L. (2024). TREC iKAT 2023: A
test collection for evaluating conversational and inter-
active knowledge assistants. In Proc. 47th Interna-
tional ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, SIGIR ’24, page
819–829, New York, NY, USA. Association for Com-
puting Machinery. doi: 10.1145/3626772.3657860.
Aquino, P. (2024). How to get structured output from llms
applications using pydantic and instrutor. Medium.
https://medium.com/@pedro.aquino.se/how-to-get-
structured-output-from-llms-applications-using-
pydantic-and-instrutor-87d237c03073.
Boris, F. B. (Sep 2, 2024). Say goodbye to boring chatbots
by combining structure (bot frameworks) and flexibil-
ity (llms). AI Advances. https://ai.gopubby.com/say-
goodbye-to-boring-chatbots-by-combining-structure-
bot-frameworks-flexibility-llms-18ae70bc73c5.
Budzianowski, P., Wen, T.-H., Tseng, B.-H., Casanueva, I., Ultes, S., Ramadan, O., and Gašić, M. (2020). MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. https://arxiv.org/abs/1810.00278.
Chase, H. and Team, L. (2023). Langgraph documentation.
https://langchain-ai.github.io/langgraph/.
Coelho, G. M. C. et al. (2024). Improving the accuracy of text-to-SQL tools based on large language models for real-world relational databases. In Database and Expert Systems Applications: 35th International Conference, DEXA 2024, Naples, Italy, August 26–28, 2024, Proceedings, Part I, page 93–107, Berlin, Heidelberg. Springer-Verlag. doi: 10.1007/978-3-031-68309-1_8.
Floratou, A. et al. (2024). NL2SQL is a solved
problem... not! In 4th Annual Conference
on Innovative Data Systems Research (CIDR ’24).
https://vldb.org/cidrdb/papers/2024/p74-floratou.pdf.
Fokoue, A. et al. (2024). A system and bench-
mark for LLM-based Q&A on heterogeneous data.
https://arxiv.org/abs/2409.05735.
Galimzhanova, E., Muntean, C. I., Nardini, F. M.,
Perego, R., and Rocchietti, G. (2023). Rewrit-
ing conversational utterances with instructed large
language models. In 2023 IEEE/WIC Int’l. Conf.
on Web Intelligence and Intelligent Agent Technol-
ogy (WI-IAT), pages 56–63. doi: 10.1109/WI-
IAT59888.2023.00014.
Hong, Z., Yuan, Z., Zhang, Q., Chen, H., Dong, J., Huang,
F., and Huang, X. (2025). Next-Generation Database
Interfaces: A Survey of LLM-based Text-to-SQL.
http://arxiv.org/abs/2406.08426.
Izquierdo, Y. et al. (2024). Busca360: A search application
in the context of top-side asset integrity management
in the oil & gas industry. In Proc. 39th Brazilian Sym-
posium on Data Bases, pages 104–116, Porto Alegre.
SBC.
Jurafsky, D. and Martin, J. H. (2024). Speech and Lan-
guage Processing: An Introduction to Natural Lan-
guage Processing, Computational Linguistics, and
Speech Recognition with Language Models. Online
manuscript released August 20, 2024, 3rd edition.
https://web.stanford.edu/~jurafsky/slp3/.
Lei, F., Chen, J., Ye, Y., Cao, R., Shin, D., SU, H., SUO,
Z., Gao, H., Hu, W., Yin, P., Zhong, V., Xiong, C.,
Sun, R., Liu, Q., Wang, S., and Yu, T. (2025). Spider
2.0: Evaluating language models on real-world enter-
prise text-to-SQL workflows. In The Thirteenth In-
ternational Conference on Learning Representations.
https://openreview.net/forum?id=XmProj9cPs.
Liu, N., Chen, L., Tian, X., Zou, W., Chen, K., and
Cui, M. (2024). From LLM to conversational agent:
A memory enhanced architecture with fine-tuning of
large language models. CoRR, abs/2401.02777. doi:
10.48550/ARXIV.2401.02777.
Nascimento, E. R., Avila, C. V., Izquierdo, Y. T., Garcia, G. M., Feijó, L., Facina, M. S., Lemos,
M., and Casanova, M. A. (2025a). On the
text-to-SQL task supported by database keyword
search. In Proc. 27th International Conference
on Enterprise Information Systems - Volume 1:
ICEIS, pages 173–180. INSTICC, SciTePress. doi:
10.5220/0013126300003929.
Nascimento, E. R., García, G. M., Izquierdo, Y. T., Feijó, L., Coelho, G. M., Oliveira, A. R., Lemos, M., Garcia,
R. S., Leme, L. A. P., and Casanova, M. A. (2025b).
LLM-based text-to-SQL for real-world databases. SN
COMPUT. SCI., 6(2). doi: 10.1007/s42979-025-
03662-6.
Oliveira, A. et al. (2025). Small, medium, and large lan-
guage models for text-to-SQL. In Conceptual Model-
ing, pages 276–294, Cham. Springer Nature Switzer-
land. doi: 10.1007/978-3-031-75872-0_15.
Pasupat, P. and Liang, P. (2015). Compositional seman-
tic parsing on semi-structured tables. In Proc. 53rd
Annual Meeting of the Association for Computational
Linguistics and the 7th Int’l. Joint Conf. on Natural
Language Processing (Volume 1: Long Papers), pages
1470–1480, Beijing, China. Association for Computa-
tional Linguistics. doi: 10.3115/v1/P15-1142.
Quamar, A. et al. (2020). An ontology-based conversa-
tion system for knowledge bases. In Proc. 2020 ACM
SIGMOD International Conference on Management
of Data, SIGMOD ’20, page 361–376, New York,
NY, USA. Association for Computing Machinery. doi:
10.1145/3318464.3386139.
Shi, L., Tang, Z., Zhang, N., Zhang, X., and Yang, Z.
(2025). A survey on employing large language mod-
els for text-to-SQL tasks. ACM Comput. Surv. doi:
10.1145/3737873.
Silva, M. O. (2025). Automation in the generation and
evaluation of dialogues for text-to-SQL systems: An
agent-based approach. Master’s thesis, Department of
Informatics, PUC-Rio, Rio de Janeiro, Brazil.
Wang, S., Sun, X., Li, X., Ouyang, R., Wu, F., Zhang, T., Li,
J., Wang, G., and Guo, C. (2025). GPT-NER: Named
entity recognition via large language models. In
Chiruzzo, L., Ritter, A., and Wang, L., editors, Find-
ings of the Association for Computational Linguistics:
NAACL 2025, pages 4257–4275, Albuquerque, New
Mexico. Association for Computational Linguistics.
doi: 10.18653/v1/2025.findings-naacl.239.
Wei, J., Kim, S., Jung, H., and Kim, Y.-H. (2024). Lever-
aging large language models to power chatbots for
collecting user self-reported data. Proc. ACM on
Human-Computer Interaction, 8(CSCW1):1–35. doi:
10.1145/3637364.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan,
K., and Cao, Y. (2023). REACT: Synergizing reason-
ing and acting in language models. In Proc. 11th In-
ternational Conference on Learning Representations
(ICLR 2023).
Zhang, J., Qian, K., Liu, Z., Heinecke, S., Meng, R., Liu,
Y., Yu, Z., Wang, H., Savarese, S., and Xiong, C.
(2024a). DialogStudio: Towards richest and most
diverse unified dataset collection for conversational
ai. In Findings of the Association for Computational
Linguistics: EACL 2024, pages 2299–2315, St. Ju-
lian’s, Malta. Association for Computational Linguis-
tics. https://aclanthology.org/2024.findings-eacl.152/.
Zhang, Y., Henkel, J., Floratou, A., Cahoon, J., Deep, S.,
and Patel, J. M. (2024b). ReAcTable: Enhancing react
for table question answering. Proc. VLDB Endow.,
17(8):1981–1994. doi: 10.14778/3659437.3659452.
Zheng, H., Liu, Y., Ge, Y., Awadallah, A. H., and Gao, J.
(2023). Judging LLM-as-a-judge: Evaluating large
language models for summarization evaluation. In
Proc. 61st Annual Meeting of the Association for
Computational Linguistics (ACL 2023), pages 11034–
11056. https://aclanthology.org/2023.acl-long.616.