Comparative Analysis of Large Language Models for Automated Use Case Diagram Generation
Nikhil Krishnan G S (https://orcid.org/0009-0001-8538-2372), Ambadi S (https://orcid.org/0009-0003-3839-5490) and M G Thushara (https://orcid.org/0000-0002-8325-1491)
Department of Computer Science and Applications, Amrita School of Computing, Amritapuri, India
Keywords: UML, Use Case Diagram, LLM, Automating UML, Automation.
Abstract: This study explores the use of Large Language Models (LLMs) in automating the generation of use case diagrams from software requirements written in natural language, ensuring both syntactic correctness and semantic accuracy. The proposed methodology involves selecting a few prominent LLMs, preparing standardized inputs, and assessing outputs based on syntactic correctness, semantic accuracy, and relationship mapping. The models were compared using a rigorous error analysis framework to identify their strengths and limitations. Among the tested models, gemma2 achieved the best average performance. This research contributes to advancing automated requirements processing, offering a scalable solution for software engineering workflows.
1 INTRODUCTION
Requirements engineering is the very first step in software development; it defines both the functional and non-functional specifications of a software system. These specifications are usually documented in a Software Requirements Specification (SRS) document, which translates stakeholder expectations into the technical implementation of the software and guides the development, testing, and validation processes. Within these requirements, use case diagrams play an important role. They visually represent the interaction between users (actors) and the system (use cases), making the requirements clear to both technical and non-technical stakeholders and conveying the software system's expected behavior across development teams.
Manually creating these use case diagrams is a time-consuming task, particularly for complex software systems with multiple actors and use cases (Elallaoui et al., 2018). This often leads to inconsistency issues, as high skill and attention to detail are required to maintain uniformity in notations and structures across all the diagrams (Nair and Thushara, 2024a). In addition, manually created diagrams are prone to human errors, such as neglected relationships or
misinterpreted use cases, which affect the quality of the SRS document (Nair and Thushara, 2024b). Updating the diagrams to reflect any new change in requirements is a labor-intensive task and can introduce further inconsistencies, leading to scalability issues. Due to these drawbacks of manual diagram creation, automating the process is gaining significant attention as a way to streamline development and reduce errors (Elallaoui et al., 2018).
Large Language Models (LLMs) are a promising tool for automating the generation of use case diagrams from natural language specifications (Alessio et al., 2024; Jeong, 2024). The PlantUML compiler allows UML diagrams to be generated from PlantUML code, but writing PlantUML code manually is a time-consuming and error-prone process (Alessio et al., 2024). When trained on specific datasets containing both text and code, LLMs become capable of generating PlantUML syntax from functional requirements (Soudani et al., 2024). However, current LLMs, especially general-purpose ones, often produce output with syntactic or semantic errors, necessitating manual corrections to obtain precise, executable code (Xie et al., 2023).
This study is motivated by the need for automated and accurate generation of use case diagrams from functional requirements with minimal manual intervention. By evaluating multiple base LLMs, this research aims to identify which models perform best
in generating error-free PlantUML code for use case diagrams. This analysis will focus on error classification, particularly syntactic and semantic errors, and determine the suitability of each model for further fine-tuning. Identifying an optimal base model with fewer initial errors will reduce the effort required for post-generation correction and enhance the viability of automated UML generation in both educational and software development contexts.
The paper is structured as follows: Section 2 reviews related work in requirements engineering and automated UML generation, focusing on advances in LLM-driven code generation. Section 3 details the methodology, including environment setup, model selection, input data preparation, and evaluation metrics. Section 4 presents the results, analyzing model performance based on error rates and code quality. Section 5 explains how the models were evaluated and the criteria that led to determining the best LLM model. Finally, Section 6 concludes with insights from this evaluation and discusses potential directions for future research, including model fine-tuning for improved accuracy in UML generation.
2 RELATED WORKS
2.1 Automated Requirements
Processing
The process of automating the generation of UML diagrams and other structural formats (Vemuri et al., 2017) from software requirements has been explored before. Previous works utilized rule-based and NLP-based methods to interpret the requirements (Veena et al., 2018; Veena et al., 2019). These approaches relied on syntactic and semantic parsing to identify the actors and use cases in the functional requirements, but the generated results often require manual corrections (Nair and Thushara, 2024b; Nair and Thushara, 2024a). More recent approaches leverage machine learning models, but these require substantial domain-specific training data to generate accurate results (Ahmed et al., 2022). Transforming UML diagrams from one type to another has also been investigated, such as the work on deriving activity diagrams from Java execution traces (Devi Sree and Swaminathan, 2018) and transforming sequence diagrams into activity diagrams (Kulkarni and Srinivasa, 2021).
2.2 Use of Large Language Models in
Software Engineering
Studies (Alessio et al., 2024) have shown that LLMs can translate natural language descriptions into code snippets and other meaningful structural representations. Current LLMs like GPT and Codex struggle with syntax accuracy and specific domain requirements, often generating inconsistent results (Xie et al., 2023). Research in this domain shows a need for fine-tuning LLMs to generate reliable results containing fewer errors (Jeong, 2024). Additionally, frameworks leveraging retrieval-augmented generation (RAG)-based approaches have been applied to teach UML diagram generation effectively (Ardimento et al., 2024). Recent technical advancements also highlight efficient LLM models that are capable of functioning on edge devices (Abdin et al., 2024).
2.3 Error Analysis in LLM-Generated
Code
Identifying and correcting errors is a critical step in evaluating the outputs generated by LLMs for software engineering applications. Earlier studies have identified some common error types, such as inconsistent semantics and misinterpreted requirements. Post-generation correction and domain-specific fine-tuning are two techniques that have been proposed to address these issues. Recent research demonstrates how graph transformation approaches (Anjali et al., 2019) can model and analyze UML diagrams effectively (Hachichi, 2022). Furthermore, the effectiveness of LLMs in generating UML diagrams has been explored, with notable efforts in class diagram generation for comparative analysis (De Bari, 2024).

There is limited research comparing LLMs specifically for PlantUML syntax generation. This gap stresses the need for comprehensive error analysis to determine the most suitable model for automated use case diagram generation.

ROUGE and BLEU are two conventional reference-based metrics for evaluating LLMs. These metrics are not preferred in this research as they have relatively low correlation with human judgements, especially in open-ended generation tasks. While these metrics are good for evaluating short and generic outputs, they fail to assess coherence and relevance, which are critical for human-like judgement. One solution to overcome these shortcomings is to use an LLM as a judge for evaluation. G-Eval
(Liu et al., 2023) is a framework that uses LLMs with chain-of-thought reasoning to evaluate the quality of LLM-generated results. Taking a natural language instruction that defines the evaluation criteria as a prompt, G-Eval uses an LLM to generate a chain of detailed evaluation steps. The prompt, along with the generated chain-of-thought, is then used to evaluate the results.
2.4 Summary and Research Gap
While prior work has explored LLMs in requirements processing and code generation, few studies have systematically compared models for use case diagram generation with PlantUML syntax (Alessio et al., 2024). Our study addresses this gap by evaluating multiple lightweight LLMs (low-parameter LLMs) installed on local machines to identify those with the lowest error rates, aiming to reduce the need for manual corrections and enhance the automation potential of UML generation tools in future work.
3 METHODOLOGY
This methodology aims to systematically evaluate and compare the effectiveness of different Large Language Models (LLMs) in generating accurate PlantUML syntax directly from natural language requirements. Given the increasing use of LLMs in automating coding and diagramming tasks, understanding their strengths and limitations in this specific context is essential. This analysis seeks to identify which models produce syntactically correct and semantically meaningful use case diagrams with minimal errors, thus reducing the need for manual corrections. By conducting a comparative analysis, we aim to determine the most suitable LLM for automating use case diagram generation, particularly in terms of accuracy, efficiency, and reliability.

The methodology is structured into key stages, as shown in Figure 1: model selection, data preparation, input specification, error classification, and evaluation and analysis. Each stage plays a vital role in ensuring a comprehensive and fair comparison of the models. This structured approach helps in identifying the best-performing model and highlights common areas where improvement is required.
3.1 Model Selection
The very first step in the methodology was to select a set of Large Language Models for evaluation. We chose the models based on several factors, including their capability in generating structured code outputs, understanding natural language inputs, and accessibility. The focus was on selecting models that were small in size, had fewer parameters, and were resource-efficient, so that the models could be run on local machines. Each of the selected models has varied parameters and architectures, which allows us to observe the differences in code quality, syntax accuracy, and reliability across diverse configurations. This selection process aims to ensure a well-rounded comparison of lightweight base-model LLMs, to determine which could best translate natural language requirements into accurate PlantUML syntax.
3.2 Data Preparation
To assess each model's performance, a functional requirements dataset that could efficiently test each model's capabilities had to be prepared. We prepared a set of specifications based on common use case scenarios, including data processing, user administration, and e-commerce. The requirements were annotated wherever possible to emphasize specific components such as actors, actions, and system interactions, giving the models a structured input. We pre-processed the data before inputting it into the models to make the wording more uniform and to eliminate ambiguities caused by any technical jargon present in the data, which the models could potentially misinterpret. This uniform formatting made it easier to evaluate the correctness of each model's answers.
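The paper does not reproduce its cleaning rules, so the following is only a hypothetical sketch of the kind of pre-processing described here: normalising whitespace and replacing technical jargon with plainer wording via a small glossary. Both the glossary entries and the example sentence are invented for illustration.

```python
import re

# Hypothetical pre-processing pass (illustrative only; the study's actual rules
# and glossary are not published). It normalises whitespace and expands jargon.
JARGON_GLOSSARY = {
    "SKU": "product",
    "2FA": "two-factor authentication",
}

def preprocess_requirement(text: str) -> str:
    text = re.sub(r"\s+", " ", text).strip()        # uniform spacing
    for term, plain in JARGON_GLOSSARY.items():     # replace domain jargon
        text = re.sub(rf"\b{re.escape(term)}\b", plain, text)
    return text

print(preprocess_requirement("The Customer adds a SKU  to the  cart."))
# -> "The Customer adds a product to the cart."
```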
3.3 Input Specification
The LLM models receive a CSV file as input containing ten software functional requirements, each representing a software system. These functional requirements are added as plain text with a standardized prompt designed to guide the model in generating PlantUML code for use case diagrams. The prompt specifies the desired output format, including the proper semantics of "include" and "extend" relationships, and states that the output should contain only PlantUML code with valid syntax, without any additional explanations or comments that could cause syntax errors. The same prompt is used for every LLM model and input set to guarantee uniformity in input interpretation.
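A minimal sketch of this input step is given below. It assumes the quantized models are served locally through Ollama and queried from Python, and it invents a CSV column name and prompt wording, since neither is reproduced in the paper; the rules in the prompt mirror the constraints described above.

```python
import csv

import ollama  # assumption: the quantized models (q6_K tags) are served locally via Ollama

# Standardized prompt skeleton. The exact wording used in the study is not
# reproduced in the paper; this only mirrors the constraints described above.
PROMPT_TEMPLATE = """You are given a software functional requirement.
Generate a PlantUML use case diagram for it.
Rules:
- Output only valid PlantUML code between @startuml and @enduml.
- Do not add any comments, explanations, or other text.
- Use <<include>> and <<extend>> relationships with their proper UML semantics.

Functional requirement:
{requirement}
"""

def generate_plantuml(model: str, requirement: str) -> str:
    # The response is dict-like; the generated text is in the "response" field.
    response = ollama.generate(model=model, prompt=PROMPT_TEMPLATE.format(requirement=requirement))
    return response["response"]

# Column name "functional_requirement" is assumed for illustration.
with open("functional_requirements.csv", newline="", encoding="utf-8") as f:
    requirements = [row["functional_requirement"] for row in csv.DictReader(f)]

for req in requirements:
    print(generate_plantuml("gemma2:9b-instruct-q6_K", req))
```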
3.4 Evaluation Metrics
The answer correctness metric is used for evaluation. It is a numerical score between 0 and 1, with 0 being the least correct and 1 being the most correct answer.
Figure 1: Structural Flow of the Proposed Model.
Table 1: Overview of Selected LLM Models for Use Case Diagram Generation.
Model Parameters Description
Qwen 2.5-coder 7B Qwen2.5-Coder (Hui et al., 2024) is an advanced version of the Qwen series of large language
models (formerly CodeQwen), offering notable improvements in code generation, reasoning,
and bug fixing over its predecessor, CodeQwen1.5. Built on the Qwen2.5 foundation, it is
trained on a vast dataset of 5.5 trillion tokens, which includes source code, text-code grounding,
and synthetic data. It also supports long-context inputs of up to 128K tokens.
Gemma 2 9B Google’s lightweight Gemma 2 (Team et al., 2024) models build on Gemini research, with en-
hanced capabilities in language understanding, suitable for tasks like question-answering, sum-
marization, and reasoning.
Llama 3.1 8B Meta’s Llama 3 (Dubey et al., 2024) improves over Llama 2 in coherence and contextual under-
standing, excelling in creative writing, coding, and conversational AI. Llama 3.1 is an upgraded
version that improves reasoning and context size.
Code Llama 7B Code Llama (Roziere et al., 2024) is a code generation model built on Llama 2, designed to
enhance developer workflows and assist in learning programming. It can generate both code
and natural language explanations, supporting popular programming languages like Python,
C++, Java, PHP, Typescript, C#, and Bash.
Mistral NeMo 12B A 12B model by Mistral AI with NVIDIA, NeMo offers a 128k token context window, support-
ing multilingual applications in English, French, German, and more. The model is optimized
for function calling and is similar to Mistral 7B.
Aya Expanse 8B Aya Expanse, developed by Cohere For AI, is a family of multilingual large language mod-
els designed to close language gaps and improve global communication. It supports over 23
languages, including Arabic, Chinese, English, Hindi, and Spanish. The models use advanced
techniques like supervised fine-tuning, multilingual preference training, and model merging to
enhance performance across various linguistic tasks.
The correctness metric involves comparing the LLM's result with the expected result based on custom evaluation criteria defined by the user. The evaluation criteria involve:
Syntax Correctness: This criterion checks whether the generated PlantUML code has syntax errors and whether it will compile successfully.
Comments and Descriptions: Any additional comments or descriptions included with the PlantUML code will cause a syntax error. This criterion checks whether any comments or descriptions are present in the result.
Actor Count: Confirms that the expected and actual outputs contain the same number of actors.
Use Case Correspondence: Ensures that the generated output matches the expected output.
Extend and Include Relationships: Confirms that the relationships are accurately implemented and positioned in the diagram.
Hallucination Detection: Checks for any extra generated content that does not fit the specified functional requirements.
These metrics provided a balanced assessment framework, helping to capture both the precision and usability of each model's output.
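As a concrete illustration of how such a rubric can be expressed as an LLM-as-judge correctness metric, the sketch below uses the open-source deepeval implementation of G-Eval; this library choice, the criteria wording, and the placeholder test case are assumptions, as the paper does not name its evaluation tooling.

```python
# Sketch only: assumes the deepeval library implements the G-Eval answer-correctness
# metric used here; the paper does not state which implementation it relied on.
# deepeval also needs a judge LLM configured (e.g. an OpenAI API key) before running.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Answer Correctness",
    criteria=(
        "Compare the actual PlantUML output with the expected output. Check that "
        "(1) the code compiles without syntax errors, (2) no extra comments or "
        "descriptions are present, (3) the actor counts match, (4) the use cases "
        "correspond, (5) <<include>> and <<extend>> relationships are used and "
        "placed correctly, and (6) no hallucinated content appears."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="The Customer shall place an order; placing an order includes payment.",
    actual_output="@startuml\nactor Customer\n...\n@enduml",    # model output (placeholder)
    expected_output="@startuml\nactor Customer\n...\n@enduml",  # reference diagram (placeholder)
)
correctness.measure(test_case)
print(correctness.score, correctness.reason)  # numeric score in [0, 1] plus a justification
```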
4 RESULTS
Figure 2: Evaluation results.
The identical set of carefully chosen requirements was fed into each model for testing, and the PlantUML output produced was compared against our predetermined metrics. We ensured that every output complied with PlantUML requirements by using automated methods to verify syntax accuracy. To confirm semantic accuracy, we also carried out a manual examination in which we evaluated whether the relationships and elements were appropriately represented by comparing the created diagrams with the original criteria.
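One straightforward way to automate the syntax check is to run PlantUML's -checkonly mode over each generated file; the sketch below assumes a local PlantUML jar and Java runtime, which the paper does not specify.

```python
import subprocess
import tempfile
from pathlib import Path

def plantuml_syntax_ok(plantuml_code: str, plantuml_jar: str = "plantuml.jar") -> bool:
    """Return True if PlantUML accepts the code without syntax errors.

    Sketch only: assumes a local PlantUML jar and a Java runtime; -checkonly
    validates the diagram without rendering an image, and the exit code is
    treated as the pass/fail signal.
    """
    with tempfile.TemporaryDirectory() as tmp:
        puml = Path(tmp) / "diagram.puml"
        puml.write_text(plantuml_code, encoding="utf-8")
        result = subprocess.run(
            ["java", "-jar", plantuml_jar, "-checkonly", str(puml)],
            capture_output=True,
            text=True,
        )
        return result.returncode == 0  # non-zero exit signals a syntax problem

example = "@startuml\nactor Customer\nCustomer --> (Place Order)\n@enduml"
print(plantuml_syntax_ok(example))
```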
After testing, we ranked the models according to how well they performed across all measures. The inputs, actual outputs, expected outputs, correctness scores, and correctness reasons for each LLM can be viewed at https://github.com/Mandalorian-way/Comparative-Analysis-of-Large-Language-Models-Result. As seen in Table 2, models with frequent errors or ambiguous outputs were scored worse, whereas those with higher syntactic and semantic accuracy and lower error rates were ranked favorably. This comparative analysis allowed us to determine each model's advantages and disadvantages, as well as which LLMs are best at producing precise UML diagrams straight from requirements.
5 EVALUATION
The G-Eval correctness score is calculated using a scoring function that combines the prompt, a chain-of-thought, and the input context, along with the generated PlantUML code. The evaluation process involves generating a set of scores based on the evaluation criteria we provide. The LLM evaluates the results and produces a probability for each score, and the final score is calculated as a weighted sum of the probabilities and their corresponding score values. This approach normalizes the scores and provides a more continuous, fine-grained score.
The G-Eval correctness score is calculated as:

$$\text{score} = \sum_{i=1}^{n} p(s_i) \times s_i$$

where:
$p(s_i)$ is the probability of the score $s_i$ being assigned by the LLM.
$s_i$ is one of the possible discrete scores in the predefined set of scores for the evaluation criteria.
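Once the judge's probability for each discrete rubric score is known, the weighted sum can be computed directly; the probabilities in the following illustration are invented purely to show the arithmetic.

```python
# Illustrative computation of the G-Eval score as a probability-weighted sum of
# the discrete rubric scores. The probabilities below are made up for illustration.
def g_eval_score(score_probs: dict[float, float]) -> float:
    """score = sum_i p(s_i) * s_i, where p(s_i) is the judge's probability for score s_i."""
    return sum(p * s for s, p in score_probs.items())

# Example: rubric scores 0.0 to 1.0 in steps of 0.25, with hypothetical probabilities.
probs = {0.0: 0.05, 0.25: 0.10, 0.5: 0.20, 0.75: 0.40, 1.0: 0.25}
print(round(g_eval_score(probs), 3))  # -> 0.675
```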
On examining the evaluation results, it is observed that Gemma2:9b-instruct-q6_K performed the best, with Qwen2.5-coder:7b-instruct-q6_K and Mistral-Nemo:12b-instruct-2407-q6_K close behind with comparable scores.

Gemma2:9b-instruct-q6_K performed well in modeling use cases, but some relationships were mapped incorrectly, some functional requirements were missed, and some use case names were mismatched. Aya-Expanse:8b-q6_K received its score due to the misuse of <<include>> and <<extend>> relationships, logical inconsistencies in connecting actors and use cases, and the inclusion of unnecessary text and comments that compromise syntax. Llama3:8b-instruct-q6_K correctly modeled most use cases but used relationships incorrectly, added unnecessary comments, and omitted or misrepresented requirements, with occasional hallucinated content. Mistral-Nemo:12b-instruct-2407-q6_K followed basic syntax well but had issues with <<include>> and <<extend>> relationships, missing or misrepresented requirements, and logical errors in mapping relationships. Qwen2.5-coder:7b-instruct-q6_K did well in modeling use cases but faced similar issues with incorrect relationship usage, missing connections, and redundant labels in relationships. Codellama:7b-instruct-q6_K struggled with similar issues, including incorrect use of relationships, logical errors in actor-to-use-case connections, and extra text before or after the PlantUML code.
6 CONCLUSION AND FUTURE
WORK
Automating the generation of UML diagrams holds significant importance in reducing manual effort and improving the accuracy of the diagrams.
Table 2: Answer correctness score.
Functional Requirements aya-expanse:8b codellama:7b gemma2:9b llama3:8b mistral-nemo:12b qwen2.5-coder:7b
Application 1 0.524 0.314 0.864 0.003 0.577 0.722
Application 2 0.379 0.77 0.76 0.617 0.74 0.747
Application 3 0.656 0.647 0.448 0.655 0.732 0.624
Application 4 0.644 0.786 0.627 0.78 0.665 0.699
Application 5 0.387 0.308 0.47 0.517 0.469 0.646
Application 6 0.461 0.136 0.447 0.54 0.665 0.575
Application 7 0.465 0.271 0.406 0.397 0.366 0.625
Application 8 0.436 0.36 0.653 0.373 0.763 0.405
Application 9 0.479 0.382 0.865 0.484 0.532 0.665
Application 10 0.495 0.351 0.889 0.602 0.455 0.475
Using Large Language Models for code generation is a growing area of research in recent times. In this research, we have explored the capability of LLMs to create these UML diagrams. The objective of this research was to compare a few base LLMs to find out which of them performs best at generating PlantUML code for use case diagrams.
From this research, we have identified that, on average, gemma2 outperformed the other models in generating PlantUML code with the fewest errors. However, from our observations, we see that even though gemma2 has the best average score, it did not perform the best for every input. The results produced still have a few syntactic and semantic inaccuracies. This is because base LLMs do not have any understanding of the domain and will generate inconsistent results.
To improve the performance of these models in this domain, extensive fine-tuning is required. A set of diverse functional requirements can be used for fine-tuning to create a robust model capable of handling complex functional requirements. Retrieval-Augmented Generation can also be used to provide context on UML syntax to the LLM. Future work can use human feedback loops to iteratively refine the results and improve the semantic quality of the generated output. Using complex inputs for testing, including ambiguous and incomplete inputs, can help assess the models' adaptability to different patterns of inputs. Such a refined model can reduce the human effort required in the generation of UML diagrams. The final version of this improved model can be integrated into software development pipelines to save the time and effort of software analysts and architects.
REFERENCES
Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan,
A. A., Bach, N., Bahree, A., Bakhtiari, A., Bao, J.,
Behl, H., et al. (2024). Phi-3 technical report: A
highly capable language model locally on your phone.
arXiv preprint arXiv:2404.14219.
Ahmed, S., Ahmed, A., and Eisty, N. U. (2022). Auto-
matic transformation of natural to unified modeling
language: A systematic review. In 2022 IEEE/ACIS
20th International Conference on Software Engineer-
ing Research, Management and Applications (SERA),
pages 112–119. IEEE.
Alessio, F., Sallam, A., and Chetan, A. (2024). Model gen-
eration from requirements with llms: an exploratory
study-replication package.
Anjali, S., Meera, N. M., and Thushara, M. (2019). A
graph based approach for keyword extraction from
documents. In 2019 Second International Confer-
ence on Advanced Computational and Communica-
tion Paradigms (ICACCP), pages 1–4. IEEE.
Ardimento, P., Bernardi, M. L., and Cimitile, M. (2024).
Teaching uml using a rag-based llm. In 2024 Interna-
tional Joint Conference on Neural Networks (IJCNN),
pages 1–8. IEEE.
De Bari, D. (2024). Evaluating large language models
in software design: A comparative analysis of uml
class diagram generation. PhD thesis, Politecnico di
Torino.
Devi Sree, R. and Swaminathan, J. (2018). Construction of
activity diagrams from java execution traces. In Am-
bient Communications and Computer Systems: RAC-
CCS 2017, pages 641–655. Springer.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle,
A., Letman, A., Mathur, A., Schelten, A., Yang, A.,
Fan, A., et al. (2024). The llama 3 herd of models.
arXiv preprint arXiv:2407.21783.
Elallaoui, M., Nafil, K., and Touahni, R. (2018). Automatic
transformation of user stories into uml use case dia-
grams using nlp techniques. Procedia computer sci-
ence, 130:42–49.
Hachichi, H. (2022). A graph transformation approach
for modeling uml diagrams. International Journal of
Systems and Service-Oriented Engineering (IJSSOE),
12(1):1–17.
Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L.,
Liu, T., Zhang, J., Yu, B., Lu, K., et al. (2024).
Qwen2.5-coder technical report. arXiv preprint
arXiv:2409.12186.
Jeong, C. (2024). Fine-tuning and utilization meth-
ods of domain-specific llms. arXiv preprint
arXiv:2401.02981.
Kulkarni, D. R. and Srinivasa, C. (2021). Novel approach
to transform uml sequence diagram to activity dia-
gram. Journal of University of Shanghai for Science
and Technology, 23(07):1247–1255.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C.
(2023). G-eval: Nlg evaluation using gpt-4 with better
human alignment. arXiv preprint arXiv:2303.16634.
Nair, R. P. and Thushara, M. (2024a). Hybrid actor-action
relation extraction: A machine learning approach.
Procedia Computer Science, 233:401–410.
Nair, R. P. and Thushara, M. (2024b). Investigating nat-
ural language techniques for accurate noun and verb
extraction. Procedia Computer Science, 235:2876–
2885.
Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I.,
Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T.,
et al. (2024). Code llama: Open foundation models
for code.
Soudani, H., Kanoulas, E., and Hasibi, F. (2024). Fine tun-
ing vs. retrieval augmented generation for less popular
knowledge. In Proceedings of the 2024 Annual In-
ternational ACM SIGIR Conference on Research and
Development in Information Retrieval in the Asia Pa-
cific Region, pages 12–22.
Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C.,
Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahri-
ari, B., Ramé, A., et al. (2024). Gemma 2: Improv-
ing open language models at a practical size. arXiv
preprint arXiv:2408.00118.
Veena, G., Gupta, D., Lakshmi, S., and Jacob, J. T. (2018).
Named entity recognition in text documents using a
modified conditional random field. In Recent Find-
ings in Intelligent Computing Techniques: Proceed-
ings of the 5th ICACNI 2017, Volume 3, pages 31–41.
Springer.
Veena, G., Hemanth, R., and Hareesh, J. (2019). Relation
extraction in clinical text using nlp based regular ex-
pressions. In 2019 2nd International Conference on
Intelligent Computing, Instrumentation and Control
Technologies (ICICICT), volume 1, pages 1278–1282.
IEEE.
Vemuri, S., Chala, S., and Fathi, M. (2017). Automated
use case diagram generation from textual user re-
quirement documents. In 2017 IEEE 30th Canadian
Conference on Electrical and Computer Engineering
(CCECE), pages 1–4. IEEE.
Xie, D., Yoo, B., Jiang, N., Kim, M., Tan, L., Zhang, X.,
and Lee, J. S. (2023). Impact of large language models
on generating software specifications. arXiv preprint
arXiv:2306.03324.