Automated Test Generation Using LLM Based on BDD: A Comparative Study

Shexmo Richarlison Ribeiro dos Santos¹ᵃ, Luiz Felipe Cirqueira dos Santos¹ᵇ, Marcus Vinicius Santana Silva¹ᶜ,
Marcos Cesar Barbosa dos Santos¹ᵈ, Mariano Florencio Mendonça¹ᵉ, Marcos Venicius Santos¹ᶠ,
Marckson Fábio da Silva Santos¹ᵍ, Alberto Luciano de Souza Bastos¹ʰ, Sabrina Marczak²ⁱ,
Michel S. Soares¹ʲ and Fabio Gomes Rocha³ᵏ

¹Federal University of Sergipe, São Cristóvão, Sergipe, Brazil
²School of Technology, PUCRS, Porto Alegre, Rio Grande do Sul, Brazil
³ISH (SafeLabs), Vitória, Espírito Santo, Brazil
Keywords:
Software Quality, Behavior-Driven Development (BDD), Large Language Models (LLM), Automatic Test
Code Generator, Experiment.
Abstract:
In Software Engineering, seeking methods that save time in product development and improve delivery quality
is essential. BDD (Behavior-Driven Development) offers an approach that, through creating user stories and
acceptance criteria in collaboration with stakeholders, aims to ensure quality through test automation, allowing
the validation of criteria for product acceptance. The lack of test automation poses a problem, requiring man-
ual work to validate acceptance. To solve the issue of test automation in BDD, we conducted an experiment
using standardized prompts based on user stories and acceptance criteria written in Gherkin syntax, automat-
ically generating tests in four Large Language Models (ChatGPT, Gemini, Grok, and GitHub Copilot). The
experiment compared the following aspects: response similarity, test coverage concerning acceptance criteria,
accuracy, efficiency in the time required to generate the tests, and clarity. The results showed that the LLMs
have significant differences in their responses, even with similar prompts. We observed variations in test cov-
erage and accuracy, with ChatGPT standing out in both cases. In terms of efficiency, related to time, Grok
was the fastest while Gemini was the slowest. Finally, regarding the clarity of the responses, ChatGPT and
GitHub Copilot were similar and more effective than the others. The results show that the LLMs adopted in
the study can interpret user stories and generate automated tests accurately. However, they do not eliminate
the need for human assessment; rather, they serve as support to speed up the automation process.
ᵃ https://orcid.org/0000-0003-0287-8055
ᵇ https://orcid.org/0000-0003-4538-5410
ᶜ https://orcid.org/0009-0000-9211-5259
ᵈ https://orcid.org/0000-0002-7929-3904
ᵉ https://orcid.org/0000-0003-0732-3980
ᶠ https://orcid.org/0009-0006-1645-6127
ᵍ https://orcid.org/0009-0001-6479-1900
ʰ https://orcid.org/0009-0002-3911-9757
ⁱ https://orcid.org/0000-0001-9631-8969
ʲ https://orcid.org/0000-0002-7193-5087
ᵏ https://orcid.org/0000-0002-0512-5406
1 INTRODUCTION
Behavior-Driven Development (BDD) is a framework often used in software development. Frameworks are
standardized methods used in the software development process that make the logical and sequential steps
involved explicit, helping developers to understand the process as a whole.
BDD was created in 2003 by Dan North (North,
2006) to mitigate the issues arising from Test-Driven
Development (TDD). While TDD focuses on testing
the software, BDD aims to determine the behaviour
the software needs to exhibit when executing a par-
ticular functionality. Thus, BDD is used throughout
the system lifecycle, from requirements elicitation to
automation and validation (Bruschi et al., 2019).
Requirements elicitation with BDD follows the “given, when, then” pattern, employing an easily
understandable language known as Gherkin (Smart, 2014) to enhance stakeholder communication and
understanding. Through this pattern, the requirements of the software components are elicited, focusing
on outlining the expected behaviour of the software.
The use of generative Artificial Intelligence (AI)
in software development has been evident in various
tools that enhance productivity, provided they are ad-
equately supervised (Sauvola et al., 2024). Therefore,
it is natural to employ generative AI tools for testing
and documentation purposes to optimize BDD pro-
cesses.
Thus, the main goal of this research is to “analyze the efficiency of user stories using BDD,
aiming at automatic generation of test code, in comparison to Large Language Models (LLMs), from an
academic perspective, in the context of software development”. To achieve this goal, we posed the following
research questions.
RQ1. What is the similarity of responses among
the different AIs when generating automated tests?
RQ2. What is the coverage of acceptance criteria
by the tests generated by each AI?
RQ3. What is the accuracy of the generated tests
compared to a reference test set?
RQ4. How much time is required to generate the
tests by each LLM?
RQ5. What is the clarity of responses among dif-
ferent executions of each AI?
To answer them, we conducted an experiment in which a software tool was created to read user stories and
their respective acceptance criteria and to generate test code using different AIs. Using the prompt we
created, we submitted each story and its criteria to the selected AIs and observed their responses. In this
way, we were able to analyze whether the use of AIs is effective in this context.
Our paper brings two main contributions: the demonstration that LLMs can be used to generate
automated tests based on user stories and acceptance scenarios, and the comparison of the performance
of the selected LLMs (ChatGPT, Gemini, Grok, and GitHub Copilot) for each of the investigated as-
pects.
The remainder of this paper is organized as fol-
lows: Section 2 explains the concepts inherent to
the use of BDD and AI. Section 3 presents related
work. Section 4 details our experiment procedures
and methods. Section 5 reports our study results and
Section 6 discusses them in light of the literature. Sec-
tion 7 presents the threats to the validity of our study.
Section 8 concludes the paper by highlighting once
again its contributions and presenting proposed future
work.
2 BACKGROUND
Next, concepts inherent to Behavior-Driven Develop-
ment (BDD) and Large Language Model (LLM) are
discussed, including the presentation of the frame-
works used in this research for the automatic gener-
ation of test code.
2.1 Behavior-Driven Development
The Agile Movement (Beck et al., 2001) gathered
computing experts to seek improvements in software
quality and in software delivery time. From
this point, changes occurred, and frameworks were
created to enable faster deliveries. Created after this
Movement, Behavior-Driven Development (BDD) is
an agile framework used throughout the software de-
velopment cycle.
Created by Dan North (North, 2006), BDD aims
to improve communication among those involved in
software development, especially communication with
stakeholders who are often not familiar with technical
language, enhancing the quality of software delivery
as a consequence. Thus, some benefits of BDD in-
clude time optimization and enhanced quality in re-
quirements elicitation (Pereira et al., 2018).
BDD is characterized by the use of the Gherkin
language (Smart, 2014), written in a clear, direct,
and assertive natural language, contributing to high-
quality requirements elicitation to find the expected
behaviour of software, i.e., the stakeholder’s needs.
BDD divides the writing of requirements into two
parts: the first being the functionality or expected be-
haviour of the system, and the second part being one
or more acceptance criteria related to the validation
of this behaviour (North et al., 2019), as illustrated in
Figure 1.
BDD uses the terms “given, when, then” as a stan-
dard writing format, with each user story having its
respective acceptance criteria. Thus, the scenarios or
criteria are written to be testable (Silva and Fitzger-
ald, 2021), aiding in the specification and verification
of requirements (Guerra-Garcia et al., 2023).
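To illustrate how a Gherkin scenario becomes testable, the sketch below binds “given, when, then” steps to executable Python step definitions using the behave framework; the login feature, step texts, and helper logic are hypothetical and are not taken from the study's artifacts.

```python
# steps/login_steps.py: step definitions for a hypothetical Gherkin scenario:
#   Given a registered user
#   When the user submits valid credentials
#   Then access to the dashboard is granted
from behave import given, when, then

@given("a registered user")
def step_registered_user(context):
    context.user = {"email": "user@example.com", "password": "secret"}

@when("the user submits valid credentials")
def step_submit_credentials(context):
    # Hypothetical system under test; a real project would call the application here.
    context.access_granted = context.user["password"] == "secret"

@then("access to the dashboard is granted")
def step_assert_access(context):
    assert context.access_granted is True
```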
Since functionalities are written concisely, user
stories are easily understood by all involved, improv-
ing communication among stakeholders (Couto et al.,
2022; Pereira et al., 2018; Bruschi et al., 2019).
Figure 1: User story example.
By enhancing communication, the Gherkin language
adopted by BDD ensures assertiveness in the software
behaviour, making all parties aware of what a partic-
ular user story refers to, for example.
However, BDD presents specific challenges,
among which the gap related to the misalignment be-
tween acceptance criteria and automation stands out.
Implementing tests is a manual task, requiring a sys-
tematic process for efficient integration (Zameni et al.,
2023). The lack of precision and coverage of the tests
concerning the acceptance criteria can result in in-
complete validation (Zameni et al., 2023). In addi-
tion, the manual creation of tests increases the work-
load, emphasising the need for systematic flows and
support tools to effectively integrate tests in the BDD
context (Ma et al., 2023).
Although BDD has been used successfully in var-
ious software engineering processes, studies still need
to explore the enhancement of BDD through emerg-
ing technologies such as machine learning (Bina-
mungu and Maro, 2023).
2.2 Large Language Model
Language Models refer to any system trained to pre-
dict a series of characters, whether letters, words, or
sentences, sequentially or not, given some previous or
adjacent context (Bender and Koller, 2020). The de-
velopment of Language Models follows two main ap-
proaches: transformer models and word embeddings.
Word embeddings improve the results of various
performance tests while reducing the labelled data
needed for multiple supervised tasks. In contrast,
transformer models have continuously benefited from
larger architectures and datasets, with their capac-
ity subsequently enhanced for specific tasks. Some
of these models have redefined the concept of their
classification, making it more accurate to character-
ize them as Large Language Models (LLMs) (Ben-
der et al., 2021), such as GPT (OpenAI), Gemini
(Google), Grok (xAI), and GitHub Copilot.
Recent studies indicate that using LLMs can re-
duce the time required to write and maintain BDD
tests and improve the quality and coverage of tests
(Zhang et al., 2023). This approach lowers the
barriers between requirements gathering and techni-
cal implementation, allowing for more effective col-
laboration between stakeholders and the development
team, and promoting a more agile and integrated soft-
ware development cycle.
3 RELATED WORK
Large language models (LLMs) are increasingly used
in Software Engineering (SE) for different tasks, like
generating code, designing software, and automating
test cases. In this study, we highlight the related
work summarized in Table 1 and present it next.
The principle of using natural language to cre-
ate high-level code with AI is addressed by Lee et
al. (Lee et al., 2023), using the OpenCV tool for a UI test in an object-oriented analysis combined with
BDD for test automation. They experiment with natural language-based templates and then apply them to a
UI test simulator. The results were analyzed using a BDD and an OOBDD (object-oriented behavior-driven
design) approach. Their results show how generative AI can work efficiently for human-language test
automation of complex software.
In contrast, Takerngsaksiri et al. (Takerngsaksiri et al., 2024) share the results of using PyTester,
a Text-to-Testcase tool, in a TDD scenario. They compare results from PyTester directly with state-of-the-art
models (fine-tuned CodeT5-large, InCoder, StarCoder, and GPT-3.5). The results reinforce that AI-automated
tools based on natural language are efficient, with a low percentage of errors or
inconsistencies.
Mock et al. (Mock et al., 2024) analyse the interaction between a development team working without AI
assistance and the test code generated by AI, identifying which results are the most promising. With the aim
of automating TDD processes with artificial intelligence, the study compared the generated code with that
of five developers, concluding that the automa-
Table 1: Related Work.
Reference | Abstract
(Lee et al., 2023) | Demonstrates an approach using generative AI to translate human language into a high-level programming language.
(Mock et al., 2024) | Introduces an approach that proposes the automation of TDD through Generative AI.
(Karpurapu et al., 2024) | Evaluates a detailed approach focusing on enhancing BDD practices through LLMs to generate acceptance tests automatically.
(Takerngsaksiri et al., 2024) | Presents and evaluates PyTester as a tool for generating formal test cases from natural language.
tion of the TDD process can indeed be used efficiently
but with due supervision due to the quality of the code
that is produced in this way.
Regarding a multi-AI analysis, Karpurapu et al. (Karpurapu et al., 2024) conclude that BDD acceptance
tests generated by LLMs are beneficial, as these tests involve considerable complexity.
The use of Large Language Models (LLMs) for
automated code generation following Test-Driven De-
velopment (TDD) and Behavior-Driven Development
(BDD) processes has shown satisfactory and promis-
ing results, warranting further exploration and expan-
sion. Our article builds on this foundation by using code generated by generative AI. However, we take
it a step further by introducing a novel approach: comparing the outputs of the various LLMs given the
same prompt and evaluating them through specific metrics, which are presented throughout the article.
4 METHODOLOGY
Experiments are an empirical method that aids in the
evaluation and validation of research results (Wohlin
et al., 2003). In Software Engineering, experiments
aim to identify the outcomes of certain situations and
seek to benefit the field with potential discoveries.
For this experiment, a library provided by a com-
pany containing 34 user stories was used. Of these,
30 stories had three acceptance criteria, and four had
only one criterion, totaling 94 acceptance criteria. The
stories were written in Gherkin language using the
BDD framework in the native language of the re-
searchers (Brazilian Portuguese). Based on these, a
prompt was created for each scenario in the selected
Large Language Models. The stories and their respec-
tive scenarios and prompts can be viewed at the link¹.
The objective at this point is to compare the effectiveness of the LLMs Grok, Gemini, ChatGPT, and
GitHub Copilot in generating automated tests based on user stories and acceptance criteria, following
BDD in the Gherkin standard, against automated tests produced by a development team. This
objective can be divided into five sub-objectives; each is related to one of the research questions and has
its own metric and hypotheses, detailed as follows:
Objective 1: Measure the similarity of responses
from different LLMs
RQ1 - Question: What is the similarity of re-
sponses among the different AIs when generating
automated tests?
Metric: Similarity coefficient (Cosine Similarity).
Null Hypothesis (H0.1): There is no significant
difference in the similarity of responses among
the different AIs.
Alternative Hypothesis (H1.1): There is a sig-
nificant difference in the similarity of responses
among the different AIs.
Objective 2: Validate whether the results gener-
ated by the LLMs cover the acceptance criteria.
RQ2 - Question: What is the coverage of accep-
tance criteria by the tests generated by each AI?
Metric: Acceptance criteria coverage (percentage
of acceptance criteria covered by the generated
tests).
Null Hypothesis (H0.2): There is no significant
difference in the coverage of acceptance criteria
among the different AIs.
Alternative Hypothesis (H1.2): There is a signifi-
cant difference in the coverage of acceptance cri-
teria among the different AIs.
Objective 3: Evaluate the accuracy of the tests
generated by the different LLMs
¹ https://doi.org/10.5281/zenodo.13155965
RQ3 - Question: What is the accuracy of the gen-
erated tests compared to a reference test set?
Metric: Test accuracy (percentage of correspon-
dence between the generated tests and the refer-
ence test set).
Null Hypothesis (H0.3): There is no significant
difference in the accuracy of the tests generated
among the different AIs.
Alternative Hypothesis (H1.3): There is a signifi-
cant difference in the accuracy of the tests gener-
ated among the different AIs.
Objective 4: Evaluate the efficiency, in terms of
time, for generating the tests
RQ4 - Question: How much time is required to
generate the tests by each LLM?
Metric: Test generation time (average time re-
quired to generate tests).
Null Hypothesis (H0.4): There is no significant
difference in the test generation time among the
different AIs.
Alternative Hypothesis (H1.4): There is a signifi-
cant difference in the test generation time among
the different AIs.
Objective 5: Evaluate the clarity of responses
among different executions of each AI
RQ5 - Question: What is the clarity of responses
among different executions of each AI?
Metric: Clarity of responses (evaluated by subjec-
tive criteria such as readability, comprehensibility,
and adherence to acceptance criteria).
Null Hypothesis (H0.5): There is no significant
difference in the clarity of responses among the
different AIs.
Alternative Hypothesis (H1.5): There is a signif-
icant difference in the clarity of responses among
the different AIs.
4.1 Experiment Execution
The methodological steps taken for executing this experiment are outlined next; a sketch of the resulting
pipeline follows the list:
1. Submit each user story and their respective ac-
ceptance criteria to the LLMs Grok, Gemini,
ChatGPT, and GitHub Copilot using a standard
prompt;
2. Generate and document the test code returned by
each source;
3. Execute the generated tests and record the results;
4. Statistically evaluate the results.
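The sketch below outlines how such a pipeline could be wired together, under the assumption that the stories are available as structured text; submit_prompt is a hypothetical wrapper around each provider's own API rather than a real client call, and the elapsed time per request feeds the efficiency metric. Test execution and statistical analysis (steps 3 and 4) would then run over the recorded results.

```python
# Sketch of the experiment pipeline (steps 1 and 2, plus timing for the efficiency metric).
# "submit_prompt" is a hypothetical wrapper around each LLM's own API; it is not a
# real client-library call and must be implemented per provider.
import csv
import time

LLMS = ["Grok", "Gemini", "ChatGPT", "GitHub Copilot"]

def submit_prompt(llm: str, prompt: str) -> str:
    """Hypothetical call returning the test code generated by the given LLM."""
    raise NotImplementedError

def build_prompt(story: str, criteria: list[str]) -> str:
    # Standardized prompt: the user story followed by its Gherkin acceptance criteria.
    return f"User story:\n{story}\n\nAcceptance criteria:\n" + "\n".join(criteria)

def run_experiment(stories: list[dict], output_path: str = "results.csv") -> None:
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["story_id", "llm", "seconds", "generated_code"])
        for story in stories:
            prompt = build_prompt(story["text"], story["criteria"])
            for llm in LLMS:
                start = time.perf_counter()
                code = submit_prompt(llm, prompt)        # step 1: submit the prompt
                elapsed = time.perf_counter() - start    # timing for the efficiency metric
                writer.writerow([story["id"], llm, f"{elapsed:.2f}", code])  # step 2: document
```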
Table 2: Evaluated Metrics.
Metric | Definition | Data Collection
Accuracy | The proportion of tests that passed (correct results) among all executed tests. | After executing the generated tests, the number of tests that passed and failed was recorded.
Coverage | The proportion of requirements or acceptance criteria covered by the generated tests. | The number of acceptance criteria covered by the generated tests was checked for each user story.
Clarity | The readability and comprehension of the generated tests; it can be qualitatively evaluated by a group of developers or through automatic readability metrics. | Developers could assign a score from 1 to 5 for each generated test. Alternatively, readability metrics such as Flesch Reading Ease could be used.
Efficiency | In the context of this paper, the time required to generate the tests. | The time from the test request to its generation and recording was measured.
4.2 Evaluated Metrics
As shown in Table 2, the metrics adopted in this study
are partially related to the need to improve alignment
between acceptance criteria and test automation. The
acceptance criteria stage must have a positive result
for the automation of its respective tests to be carried
out. Thus, they must be in line with the behavior ex-
pected by the software. The metrics used were se-
lected with the aim of achieving the objectives pro-
posed in this study. In addition, the efficiency mea-
sured is associated with the time needed to create the
tests, thus reducing the manual workload, a gap identified in the literature review.
These metrics make the analysis more assertive, increasing the reliability of the results and providing
clarity and effectiveness to the conclusions of this research.
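As a concrete illustration of the coverage and accuracy metrics from Table 2, both reduce to simple proportions over the recorded runs; the sketch below uses hypothetical per-story counts, not the study's data.

```python
# Hypothetical per-story records for one LLM: acceptance criteria covered vs. total,
# and tests passed vs. executed (the real study used 34 stories and 94 criteria).
records = [
    {"criteria_total": 3, "criteria_covered": 2, "tests_run": 3, "tests_passed": 2},
    {"criteria_total": 3, "criteria_covered": 3, "tests_run": 3, "tests_passed": 3},
    {"criteria_total": 1, "criteria_covered": 1, "tests_run": 1, "tests_passed": 0},
]

coverage = sum(r["criteria_covered"] for r in records) / sum(r["criteria_total"] for r in records)
accuracy = sum(r["tests_passed"] for r in records) / sum(r["tests_run"] for r in records)
print(f"Coverage: {coverage:.2%}  Accuracy: {accuracy:.2%}")
```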
5 RESULTS
The following are the results related to each of the sub-objectives outlined in Section 4, with the aim of
answering their respective RQs posed in Section 1. Figures 2 and 3 present the similarity results.
Figure 2: Similarity Matrix.
In Figure 3, “A” refers to LLM Gemini, “B” refers
to ChatGPT, “C” refers to Grok, and “D” refers to
GitHub Copilot.
Figure 3: Distribution of similarities.
5.1 Objective 1
To measure the similarity of responses from different
LLMs, the Kruskal-Wallis test was employed. This
non-parametric test is appropriate for comparing in-
dependent distributions when the assumptions of nor-
mality are not met.
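A minimal sketch of how such an analysis could be computed is shown below: the generated tests are vectorized with TF-IDF, pairwise cosine similarities are grouped per LLM, and the groups are compared with SciPy's Kruskal-Wallis test. The toy responses are illustrative, and the exact vectorization and grouping used in the study may differ.

```python
# Sketch of the similarity analysis; the "responses" are illustrative stand-ins
# for the test code generated by each LLM for the same user story.
from itertools import combinations
from scipy.stats import kruskal
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

responses = {
    "Gemini": "def test_login(): assert login('user', 'pwd') is True",
    "ChatGPT": "def test_login_success(): assert login('user', 'pwd') is True",
    "Grok": "def test_user_login(): assert do_login('user') == 'ok'",
    "Copilot": "def test_login_valid_credentials(): assert login('user', 'pwd')",
}

# Vectorize the generated code as text and compute the pairwise cosine similarity matrix.
names = list(responses)
tfidf = TfidfVectorizer().fit_transform([responses[n] for n in names])
matrix = cosine_similarity(tfidf)

# Group the pairwise similarities by LLM and compare the distributions.
groups = {n: [] for n in names}
for i, j in combinations(range(len(names)), 2):
    groups[names[i]].append(matrix[i, j])
    groups[names[j]].append(matrix[i, j])

stat, p_value = kruskal(*groups.values())
print(f"Kruskal-Wallis H = {stat:.4f}, p = {p_value:.4f}")
```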
Results of the Kruskal-Wallis Test:
Statistic: 36.2464
p-Value: 0.0000
The results indicated a Kruskal-Wallis statistic of
36.2464 and a p-value of 0.0000. The extremely low
p-value (less than 0.05) suggests that there is a sta-
tistically significant difference in the similarity of re-
sponses among the different LLMs.
Hypotheses:
Null Hypothesis (H0.1): There is no significant
difference in the similarity of responses among
the different AIs.
Alternative Hypothesis (H1.1): There is a sig-
nificant difference in the similarity of responses
among the different AIs.
Answering RQ1: Therefore, since the p-value is
less than 0.05, we reject the null hypothesis (H0.1).
Thus, we support the alternative hypothesis (H1.1),
which asserts that there is a significant difference in
the similarity of responses among the different LLMs.
This implies that the LLMs Grok, Gemini, ChatGPT,
and GitHub Copilot produce responses with statis-
tically significant varying levels of similarity when
generating automated tests based on user stories and
acceptance criteria.
5.2 Objective 2
To validate whether the results generated by the LLMs cover the acceptance criteria, a coverage analysis
and an ANOVA test were conducted, followed by a post-hoc Tukey HSD test.
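A sketch of this statistical procedure, assuming per-scenario coverage values recorded for each LLM, is shown below; the numbers are hypothetical, and the same pattern applies to the accuracy, efficiency, and clarity analyses in the following subsections.

```python
# One-way ANOVA followed by Tukey HSD over hypothetical per-scenario coverage values.
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

data = pd.DataFrame({
    "llm": ["Grok", "Gemini", "ChatGPT", "Copilot"] * 3,
    "coverage": [0.33, 0.67, 1.00, 0.67,
                 0.33, 0.67, 0.67, 1.00,
                 0.67, 0.33, 1.00, 0.67],
})

# One-way ANOVA across the four LLMs.
groups = [g["coverage"].to_numpy() for _, g in data.groupby("llm")]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.4f}, p = {p_value:.4g}")

# Post-hoc Tukey HSD for the pairwise comparisons.
tukey = pairwise_tukeyhsd(endog=data["coverage"], groups=data["llm"], alpha=0.05)
print(tukey.summary())
```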
Coverage Means:
Grok: 0.4054
Gemini: 0.5943
ChatGPT: 0.7670
GitHub Copilot: 0.7315
Coverage ANOVA:
Sum of Squares (Model): 7.419452
Sum of Squares (Residual): 13.830678
F-Value: 65.089134
p-Value: 1.025147e-33
The ANOVA results indicated an F-value of
65.089134 and a p-value of 1.025147e-33, which is
extremely low (less than 0.05). This suggests that
there is a statistically significant difference in the
coverage of acceptance criteria among the different
LLMs.
Tukey HSD Test for Coverage:
ChatGPT vs. GitHub Copilot: p = 0.6063 (not
significant)
ChatGPT vs. Gemini: p < 0.001 (significant)
ChatGPT vs. Grok: p < 0.001 (significant)
GitHub Copilot vs. Gemini: p < 0.001 (significant)
GitHub Copilot vs. Grok: p < 0.001 (significant)
Gemini vs. Grok: p < 0.001 (significant)
The results of the post-hoc Tukey HSD test
showed that the differences in coverage of accep-
tance criteria are significant among most LLMs, ex-
cept for ChatGPT and GitHub Copilot, whose differ-
ences were not significant (p = 0.6063).
Hypotheses:
Null Hypothesis (H0.2): There is no significant
difference in the coverage of acceptance criteria
among the different AIs.
Alternative Hypothesis (H1.2): There is a signifi-
cant difference in the coverage of acceptance cri-
teria among the different AIs.
Answering RQ2: Therefore, since the p-value of
the ANOVA is less than 0.05, we reject the null hy-
pothesis (H0.2). Thus, we support the alternative hy-
pothesis (H1.2), which states that there is a signifi-
cant difference in the coverage of acceptance crite-
ria among the different AIs. The post-hoc tests indi-
cate that, although ChatGPT and GitHub Copilot do
not show significant differences between each other,
all other comparisons between the LLMs are signifi-
cantly different.
5.3 Objective 3
To assess the accuracy of tests generated by different
LLMs, precision means were calculated and an anal-
ysis of variance (ANOVA) was conducted, followed
by the Tukey HSD post-hoc test.
Precision Means:
Grok: 0.3391
Gemini: 0.5373
ChatGPT: 0.7670
GitHub Copilot: 0.7239
ANOVA of Precision:
Sum of Squares (Model): 10.57519
Sum of Squares (Residual): 12.50989
F-Value: 102.56869
P-Value: 3.882251e-48
The results of the ANOVA indicated an F-value
of 102.56869 and a p-value of 3.882251e-48, which
is extremely low (less than 0.05). This suggests that
there is a statistically significant difference in the ac-
curacy of tests generated by the different LLMs.
Tukey HSD Test of Precision:
ChatGPT vs. GitHub Copilot: p = 0.3944 (not
significant)
ChatGPT vs. Gemini: p < 0.001 (significant)
ChatGPT vs. Grok: p < 0.001 (significant)
GitHub Copilot vs. Gemini: p < 0.001 (significant)
GitHub Copilot vs. Grok: p < 0.001 (significant)
Gemini vs. Grok: p < 0.001 (significant)
The results of the Tukey HSD post-hoc test showed that the differences in test accuracy are significant
for all pairs involving Gemini or Grok. There was no significant difference in accuracy between
GitHub Copilot and ChatGPT (p = 0.3944).
Hypotheses:
Null Hypothesis (H0.3): There is no signifi-
cant difference in the accuracy of tests generated
among the different AIs.
Alternative Hypothesis (H1.3): There is a signifi-
cant difference in the accuracy of tests generated
among the different AIs.
Answering RQ3: Given that the p-value of the
ANOVA is less than 0.05, we reject the null hypothe-
sis (H0.3). Therefore, we support the alternative hy-
pothesis (H1.3), which states a significant difference
in the accuracy of tests generated by the different
LLMs. The results of the post-hoc test indicate that
ChatGPT has a significantly different accuracy com-
pared to the other LLMs tested, except for GitHub
Copilot. At the same time, the differences between
GitHub Copilot, Gemini, and Grok are also statisti-
cally significant.
5.4 Objective 4
To assess the efficiency of the different LLMs in test
generation, the mean generation times were calcu-
lated and an analysis of variance (ANOVA) was con-
ducted, followed by the Tukey HSD post-hoc test.
ANOVA of Efficiency:
Sum of Squares (Model): 0.712254
Sum of Squares (Residual): 0.028181
F-Value: 3066.558824
P-Value: 6.633292e-258
The ANOVA results indicate a significant differ-
ence between the groups, as the p-value is extremely
low (6.633292e-258). This means that at least one of
the models has a significantly different efficiency per-
formance compared to the others.
Tukey HSD Test of Efficiency:
ChatGPT vs. GitHub Copilot: p = 1.0 (not significant)
ChatGPT vs. Gemini: p < 0.001 (significant)
ChatGPT vs. Grok: p = 0.1858 (not significant)
GitHub Copilot vs. Gemini: p < 0.001 (significant)
GitHub Copilot vs. Grok: p = 0.1858 (not significant)
Gemini vs. Grok: p < 0.001 (significant)
Significant Differences:
ChatGPT vs. Gemini: The difference is signifi-
cant, indicating that the generation time of Chat-
GPT is significantly shorter than that of Gemini.
GitHub Copilot vs. Gemini: The difference is
significant, indicating that the generation time of
GitHub Copilot is significantly shorter than that
of Gemini.
Gemini vs. Grok: The difference is significant,
indicating that the generation time of Gemini is
significantly longer than that of Grok.
Non-Significant Differences:
ChatGPT vs. GitHub Copilot: There is no sig-
nificant difference, suggesting that the generation
time of ChatGPT is similar to that of GitHub
Copilot.
ChatGPT vs. Grok: There is no significant differ-
ence, indicating that the generation time of Chat-
GPT is similar to that of Grok.
GitHub Copilot vs. Grok: There is no significant
difference, indicating that the generation time of
GitHub Copilot is similar to that of Grok.
Interpretation:
ChatGPT and GitHub Copilot: Both have compa-
rable and efficient test generation times, with no
significant differences between them.
Gemini: The Gemini model exhibits significantly
longer generation times compared to all other
models, indicating inefficiency in its test gener-
ation process.
Grok: The Grok model performs efficiently, sim-
ilar to ChatGPT and GitHub Copilot, and signifi-
cantly better than Gemini.
The results suggest that, in terms of test generation
time efficiency, both ChatGPT and GitHub Copilot
are effective and comparable. However, the Gemini
model is significantly slower, indicating that it may
not be the best choice when generation time is a crit-
ical factor. The Grok model also demonstrates ef-
ficiency and is comparable to ChatGPT and GitHub
Copilot.
Hypotheses:
Null Hypothesis (H0.4): There is no significant
difference in the test generation time among the
different AIs.
Alternative Hypothesis (H1.4): There is a signifi-
cant difference in the test generation time among
the different AIs.
Answering RQ4: Given that the p-value of the
ANOVA is less than 0.05, we reject the null hypothe-
sis (H0.4). Therefore, we support the alternative hy-
pothesis (H1.4), which states that there is a significant
difference in the test generation time among the dif-
ferent AIs.
5.5 Objective 5
To assess the clarity of responses across different exe-
cutions of each AI, clarity means were calculated and
an analysis of variance (ANOVA) was conducted, fol-
lowed by the Tukey HSD post-hoc test.
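Table 2 lists Flesch Reading Ease as one possible readability measure for clarity; the sketch below shows how such a score could be obtained with the textstat package on a generated test, under the assumption that clarity is scored on the raw generated text. The example test is hypothetical.

```python
# Scoring the readability of a generated test with Flesch Reading Ease,
# assuming the "textstat" package; the test below is hypothetical.
import textstat

generated_test = """
# Given a registered user, when valid credentials are submitted,
# then access to the dashboard is granted.
def test_login_with_valid_credentials():
    assert login("user@example.com", "secret") is True
"""

score = textstat.flesch_reading_ease(generated_test)
print(f"Flesch Reading Ease: {score:.2f}")  # higher scores indicate easier reading
```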
ANOVA of Clarity:
Sum of Squares (Model): 145030.766421
Sum of Squares (Residual): 124106.926135
F-Value: 141.789559
P-Value: 7.339233e-61
The results of the ANOVA indicate a significant
difference in the clarity of responses among the differ-
ent AIs, as the p-value is extremely low (7.339233e-
61). This means that at least one of the AIs has sig-
nificantly different clarity than the others.
Means of Clarity:
Grok: 44.6511
Gemini: 67.0604
ChatGPT: 92.2609
GitHub Copilot: 92.2609
Tukey HSD Test of Clarity:
ChatGPT vs. GitHub Copilot: p = 1.0 (not significant)
ChatGPT vs. Gemini: p < 0.001 (significant)
ChatGPT vs. Grok: p < 0.001 (significant)
GitHub Copilot vs. Gemini: p < 0.001 (significant)
GitHub Copilot vs. Grok: p < 0.001 (significant)
Gemini vs. Grok: p < 0.001 (significant)
Significant Differences:
ChatGPT vs. Gemini: The difference is signifi-
cant, indicating that the clarity of responses from
ChatGPT is significantly higher than that of Gem-
ini.
ChatGPT vs. Grok: The difference is significant,
indicating that the clarity of responses from Chat-
GPT is significantly higher than that of Grok.
GitHub Copilot vs. Gemini: The difference is sig-
nificant, indicating that the clarity of responses
from GitHub Copilot is significantly higher than
that of Gemini.
GitHub Copilot vs. Grok: The difference is sig-
nificant, indicating that the clarity of responses
from GitHub Copilot is significantly higher than
that of Grok.
Gemini vs. Grok: The difference is significant, in-
dicating that the clarity of responses from Gemini
is significantly higher than that of Grok.
Non-Significant Differences:
ChatGPT vs. GitHub Copilot: There is no sig-
nificant difference, suggesting that the clarity of
responses from ChatGPT is similar to that of
GitHub Copilot.
Interpretation:
ChatGPT and GitHub Copilot: Both AIs have re-
sponses with comparable clarity and significantly
higher than the other AIs evaluated.
Gemini: The Gemini model exhibits intermedi-
ate response clarity, being significantly better than
Grok but worse than ChatGPT and GitHub Copi-
lot.
Grok: The Grok model has the lowest response
clarity among all evaluated AIs.
Thus, the results suggest that ChatGPT and
GitHub Copilot are adequate and comparable in terms
of response clarity. The Gemini model is intermedi-
ate, exhibiting significantly higher clarity than Grok
but lower than ChatGPT and GitHub Copilot. The
Grok model shows the lowest clarity among all AIs.
Hypotheses:
Null Hypothesis (H0.5): There is no significant
difference in the clarity of responses among the
different AIs.
Alternative Hypothesis (H1.5): There is a signif-
icant difference in the clarity of responses among
the different AIs.
Answering RQ5: Since the p-value of the
ANOVA is less than 0.05, we reject the null hypothe-
sis (H0.5). Therefore, we support the alternative hy-
pothesis (H1.5), which states a significant difference
in the clarity of responses among the different AIs.
6 DISCUSSION
Figure 4 was created using the model developed by
Rajbhoj et al. (Rajbhoj et al., 2024). The adaptation
created for this study made it possible to follow a step-by-step method for executing the automatic
test generator created here.
As can be seen, there is initially a stakeholder who desires a particular behaviour from the software.
From this, the user story emerges, which, in turn, leads to one or more acceptance criteria: testable
scenarios that can guarantee the desired return from the software. From this point on, the programming
language and the testing framework are selected so that the standard prompt to be used in the LLMs can
be created, and finally, the code can be generated automatically.
Furthermore, this study allowed reviewing an integrated cycle for using LLMs in association with
BDD. The LLMs are incorporated as feedback
points after formalizing the user story and its accep-
tance criteria. During development, LLMs may be
asked to review and improve the source code during
the TDD refactoring stage. This cycle is illustrated in
Figure 5.
In addition to the previous cycle, the work be-
gan with the formalization of user stories and their
respective acceptance scenarios provided by the com-
pany. These stories and scenarios were integrated into
the prompt presented in Figure 6, making it
possible to carry out the necessary tests in the LLMs
Grok, Gemini, ChatGPT and GitHub Copilot, using
the Python programming language to observe the sim-
ilarity resulting from each one. Therefore, during the
development cycle, after test generation, we imple-
ment a new interaction that can benefit the develop-
ment process by using LLM, allowing LLM feedback
during the TDD refactoring stage.
In the simplest case, the model generates the code directly. Otherwise, as TDD and BDD themselves
highlight refactoring as a crucial factor, a cycle involving new requests to the LLM may be necessary,
automating the development, execution and refactoring process with the support of the LLM.
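A minimal sketch of such an execute-and-refactor cycle is shown below; request_tests and request_refactor are hypothetical helpers standing in for calls to an LLM, and pytest is assumed as the test runner for the generated Python tests.

```python
# Sketch of an execute-and-refactor cycle with LLM feedback during refactoring.
import pathlib
import subprocess

def request_tests(prompt: str) -> str:
    """Hypothetical call returning generated pytest code for a prompt."""
    raise NotImplementedError

def request_refactor(code: str, failure_log: str) -> str:
    """Hypothetical call asking the LLM to refactor failing test code."""
    raise NotImplementedError

def generate_with_refactoring(prompt: str, max_rounds: int = 3) -> str:
    code = request_tests(prompt)
    for _ in range(max_rounds):
        pathlib.Path("test_generated.py").write_text(code)
        # Execute the generated tests and capture the output.
        run = subprocess.run(["pytest", "-q", "test_generated.py"],
                             capture_output=True, text=True)
        if run.returncode == 0:
            return code  # tests pass: accept the generated code
        # Otherwise, feed the failure log back to the LLM and retry.
        code = request_refactor(code, run.stdout + run.stderr)
    return code
```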
Regarding accuracy, ChatGPT and GitHub Copi-
lot performed the best, being very close to each other.
This result is because GitHub Copilot uses parts of
the ChatGPT model. On the other hand, Gemini and
Grok had significantly lower accuracies, suggesting that the other models were more effective, in part
due to the use of the free versions but also because,
in general, Gemini has difficulty delivering tests for
three scenarios in one single command. Therefore,
we suggest submitting one scenario at a time during
refactoring.
Figure 4: Prompt model for BDD test automation.
Figure 5: Automatic code generation.
As for clarity, Grok had the worst performance,
mainly because it often generates results in English,
which is in line with the platform’s main focus being
the English language. Another point to consider is
that Grok focuses more on conversation and research
based on data from Twitter, not having code genera-
tion as its primary objective. In the free version, Gem-
ini presents good clarity when analyzing a single ac-
ceptance criterion but has difficulty generating code
for multiple scenarios. There is a need to evaluate the
advanced version to see if this issue is resolved.
The results showed that ChatGPT and the devel-
opment team were effective and comparable in terms
of test generation time efficiency. The Gemini model,
however, was significantly slower, indicating that it
may not be the best choice when generation time is a
critical factor. The Grok model proved efficient and
comparable to ChatGPT and the development team.
We observed that low-quality stories and scenar-
ios negatively impact automatic code generation. This
occurs due to ambiguities or unclear texts, making it
difficult for LLMs to read and causing confusion in
delivering the expected text. BDD was developed to
be objective and concise; however, if a scenario is pre-
pared with poor-quality writing, the return will cer-
tainly not be as expected.
The correct use of BDD and an LLM can bene-
fit software development by helping developers auto-
mate test code. However, it is essential to emphasize
that using this technology does not mean replacing
the professional with the machine but instead taking
advantage of existing technologies to assist in the nec-
essary work.
We contributed to the advancement of the studies
presented in Section 3 by carrying out a comparative
experiment using the LLMs Gemini, Grok, ChatGPT
and GitHub Copilot together with the BDD frame-
work, demonstrating the effectiveness of well-written
user stories for automatically generating test code through AI.
7 THREATS TO VALIDITY
Threats to validity are understood as circumstances encountered during the study’s execution, which need
to be explained, along with how they were mitigated, to bring reliability to the research (Runeson and Höst, 2009).
Below, the threats encountered in this study are described according to Zhou et al. (Zhou et al., 2016).
7.1 Construct Validity
To conduct this study, a theoretical grounding in BDD was necessary to understand how the framework
functions, as well as information about the LLMs used in this research. These details are provided in Section 2, where
two authors addressed the critical concepts related to
the themes of this study, synthesizing essential infor-
mation for understanding the achieved results.
7.2 Internal Validity
Seven researchers conducted the study: three con-
ducted the relevant theoretical research on the topic,
one created the test codes, one performed the statis-
tical analyses and supervised the paper, and two re-
viewed the study for improvements in the quality of
Figure 6: Prompt for the Tests Generation.
the final product. All authors read and approved the
final version of the article. There were no objections
from any of the authors.
7.3 External Validity
To ensure the reliability of the study, user stories, their
respective acceptance criteria, and prompts were cre-
ated to be used with the selected LLMs, resulting in
the generation of automated test code. These data can
be viewed at this link².
7.4 Conclusion Validity
The step-by-step process of this research was de-
scribed in Section 4, and the findings are presented
in Section 5. This study is replicable if the method-
ological steps used are followed.
8 CONCLUSION
This work adopted an experimental research strategy
using statistical evaluation based on 34 user stories
and a total of 94 acceptance scenarios to analyze the
similarity between the responses, the coverage of the generated tests with respect to the indicated scenarios,
the accuracy of the tests, the efficiency in terms of generation time and,
finally, the clarity of responses.
Creating automated tests using Large Language
Models (LLMs) through Behavior-Driven Develop-
ment (BDD) has proven to be a relevant approach in
software development. However, we have identified
that faster LLMs currently do not provide satisfactory
results in clarity and accuracy, which suggests that
speed should not be the main criterion when choos-
ing an LLM.
The LLMs used in this study could understand
and generate natural language text with precision and
quality based on well-described user stories and ac-
ceptance scenarios. This aspect allows software engi-
neers and quality assurance teams to automate the cre-
ation of their tests based on BDD acceptance scenar-
² https://doi.org/10.5281/zenodo.13155965
ios, using natural language descriptions, speeding up
development and ensuring that tests more accurately
reflect the precise requirements of the business.
Although the results have shown promise, we are still far from complete automation that would allow
human evaluation to be dispensed with. LLMs speed up the initial creation steps, improving quality and
saving time. Still, the scenarios must be written with high quality so that the resulting code achieves the
expected software behaviour. Good scenario writing is crucial to ensuring that automated test codes are
effective.
The comparative tests brought reliability to this study: since the software was created to interpret the
stories with the LLMs, the tests could show whether the automated code was being generated in a meaningful
way, based on the hypotheses presented.
One way to improve LLMs’ responses for fu-
ture work could be to implement the checklist pre-
sented by Oliveira, Marczak, and Moralles (Oliveira
et al., 2019) in creating user stories with BDD.
This could contribute to delivering more accurate test
codes through the AutoDevSuite tool developed in
this study.
ACKNOWLEDGMENTS
Sabrina Marczak would like to thank CNPq for the
financial support (Productivity Scholarship, process
no. 313181/2021-7). Shexmo Santos would like to
thank CAPES/Brazil for the financial support (Mas-
ter’s Scholarship, process no. 88887.888613/2023-
00).
REFERENCES
Beck, K., Beedle, M., Van Bennekum, A., Cockburn, A.,
Cunningham, W., Fowler, M., Grenning, J., High-
smith, J., Hunt, A., Jeffries, R., et al. (2001). The
agile manifesto.
Bender, E., Gebru, T., McMillan-Major, A., and Shmitchell,
S. (2021). On the dangers of stochastic parrots: Can
language models be too big? In Proceedings of the
2021 ACM Conference on Fairness, Accountability,
and Transparency, FAccT ’21, pages 610–623.
ACM.
Bender, E. and Koller, A. (2020). Climbing towards NLU: On meaning, form, and understanding in the age of
data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages
5185–5198. ACL.
Binamungu, L. P. and Maro, S. (2023). Behaviour driven
development: a systematic mapping study. Journal of
Systems and Software, 203:111749.
Bruschi, S., Xiao, L., Kavatkar, M., et al. (2019). Be-
havior Driven-Development (BDD): a case study in
healthtech. In Pacific NW Software Quality Confer-
ence.
Couto, T., dos Santos Marczak, S., Callegari, D. A., Móra, M., and Rocha, F. (2022). On the Characterization
of Behavior-Driven Development Adoption Benefits: A Multiple Case Study of Novice Software Teams.
In Anais do XXI Simpósio Brasileiro de Qualidade de Software, Brazil.
Guerra-Garcia, C., Nikiforova, A., Jiménez, S., Perez-Gonzalez, H. G., Ramírez-Torres, M. T., and
Ontañon-García, L. (2023). ISO/IEC 25012-based methodology for managing data quality requirements
in the development of information systems: Towards data quality by design. Data and Knowledge Engineering,
145:102152.
Karpurapu, S., Myneni, S., Nettur, U., Gajja, L. S., Burke,
D., Stiehm, T., and Payne, J. (2024). Comprehensive
evaluation and insights into the use of large language
models in the automation of behavior-driven develop-
ment acceptance test formulation. IEEE Access.
Lee, E., Gong, J., and Cao, Q. (2023). Object oriented bdd
and executable human-language module specification.
In 2023 26th ACIS International Winter Conference
on Software Engineering, Artificial Intelligence, Net-
working and Parallel/Distributed Computing (SNPD-
Winter), pages 127–133. IEEE.
Ma, S.-P., Chen, Y.-A., Guo, Y.-J., and Su, Y.-S. (2023).
Semi-automated behavior-driven testing for the web
front-ends. In 2023 IEEE International Conference
on e-Business Engineering (ICEBE), pages 225–230.
IEEE.
Mock, M., Melegati, J., and Russo, B. (2024). Generative
ai for test driven development: Preliminary results.
arXiv preprint arXiv:2405.10849.
North, D. (2006). Introducing BDD.
https://dannorth.net/introducing-bdd/.
North, D. et al. (2019). What’s in a story? Available at: https://dannorth.net/whats-in-a-story/ [Accessed: 4.5.2016].
Oliveira, G., Marczak, S., and Moralles, C. (2019). How
to evaluate bdd scenarios’ quality? In Proceedings
of the XXXIII Brazilian Symposium on Software Engi-
neering, pages 481–490.
Pereira, L., Sharp, H., de Souza, C., Oliveira, G., Marczak,
S., and Bastos, R. (2018). Behavior-Driven Develop-
ment benefits and challenges: reports from an indus-
trial study. In Proceedings of the 19th International
Conference on Agile Software Development: Com-
panion, pages 1–4.
Rajbhoj, A., Somase, A., Kulkarni, P., and Kulkarni, V.
(2024). Accelerating software development using
generative ai: Chatgpt case study. In Proceedings of
the 17th Innovations in Software Engineering Confer-
ence, pages 1–11.
Runeson, P. and Höst, M. (2009). Guidelines for conducting and reporting case study research in software
engineering. Empirical Software Engineering, 14:131–164.
Sauvola, J., Tarkoma, S., Klemettinen, M., Riekki, J., and
Doermann, D. (2024). Future of software develop-
ment with generative ai. Automated Software Engi-
neering, 31(1):26.
Silva, T. R. and Fitzgerald, B. (2021). Empirical findings on
BDD story parsing to support consistency assurance
between requirements and artifacts. In Evaluation and
Assessment in Software Engineering, pages 266–271.
Smart, J. (2014). BDD in Action: Behavior-Driven Devel-
opment for the Whole Software Lifecycle. Manning
Publications, Shelter Island, NY.
Takerngsaksiri, W., Charakorn, R., Tantithamthavorn, C.,
and Li, Y.-F. (2024). Tdd without tears: Towards test
case generation from requirements through deep rein-
forcement learning. arXiv preprint arXiv:2401.07576.
Wohlin, C., Höst, M., and Henningsson, K. (2003). Empirical research methods in software engineering.
In Empirical Methods and Studies in Software Engineering: Experiences from ESERNET, pages 7–23. Springer.
Zameni, T., van Den Bos, P., Tretmans, J., Foederer, J., and
Rensink, A. (2023). From bdd scenarios to test case
generation. In 2023 IEEE International Conference
on Software Testing, Verification and Validation Work-
shops (ICSTW), pages 36–44. IEEE.
Zhang, L., Wang, Y., and Li, X. (2023). Enhancing bdd
test generation with large language models. Journal
of Software Engineering Research and Development,
11(2):75–90.
Zhou, X., Jin, Y., Zhang, H., Li, S., and Huang, X. (2016).
A map of threats to validity of Systematic Literature
Reviews in Software Engineering. In 2016 23rd Asia-
Pacific Software Engineering Conference (APSEC),
pages 153–160. IEEE.