Comparative Evaluation of NLP Approaches for Requirements Formalisation

Shekoufeh Kolahdouz Rahimi¹, Kevin Lano², Sobhan Yassipour Tehrani³, Chenghua Lin⁴, Yiqi Liu⁴ and Muhammad Aminu Umar²

¹School of Arts, University of Roehampton, London, U.K.
²Department of Informatics, King's College London, London, U.K.
³Department of Computer Science, University College London, U.K.
⁴Department of Computer Science, University of Sheffield, U.K.
Keywords:
Requirements Formalisation, Model-Driven Engineering, NLP.
Abstract:
Many approaches have been proposed for the automated formalisation of software requirements from semi-formal or informal requirements documents. However, this research field lacks established case studies upon which different approaches can be compared, and there is also a lack of accepted criteria for comparing the results of formalisation approaches. As a consequence, it is difficult to determine which approaches are more appropriate for different kinds of formalisation task. In this paper we define benchmark case studies and a framework for the comparative evaluation of requirements formalisation approaches, thus contributing to improving the rigour of this research field. We apply the approach to compare four example requirements formalisation methods.
1 INTRODUCTION
Automated requirements formalisation (RF) has sig-
nificant potential as a means of reducing software de-
velopment costs, accelerating development processes,
and increasing the rigour of requirements engineer-
ing processes. Typically, RF involves producing a
software model in a language such as UML, or in
a domain-specific language (DSL), from natural lan-
guage requirements documents in text format. The
RF process or result can also provide useful analysis information about the requirements statement, e.g., detecting duplicated or invalid requirements.
Many approaches have been proposed for the for-
malisation of software requirements, typically involv-
ing some form of natural language processing (NLP) or
machine learning (ML) (Otter et al., 2023). However,
the research field lacks widely-recognised benchmark
case studies to support comparative evaluation of dif-
ferent approaches on the same requirements cases, and
the main evaluation technique, based on estimating
precision/recall and F-measure accuracy, is subjective.
F-measure gives the degree to which a proposed formalised model correctly expresses the model elements implied by the source text, and has the definition

F = 2 * (precision * recall) / (precision + recall)

precision = (correctly identified elements) / (total identified elements)

recall = (correctly identified elements) / (total correct elements)
The judgement as to the correctness of the model elements has a subjective aspect: for example, it may be based on agreement with a 'gold standard' model produced by a human expert. However, different modellers may produce significantly different gold standard models. Our view is that evaluation of RF approaches should use objective measures where possible, and also take into account the software engineering context of use of formalised models. In general, software
developers should be able to effectively use the for-
malised models for further development stages, and
should be able to relate the models to the original
requirements statement. Thus the quality and internal consistency of the formalised model are important
(e.g., there should not be duplicated class or use case
names in formalised models, and names should adhere
to name style conventions for the respective model
kinds). There should be a high degree of traceability
between the formalised requirements and the source
text.
To effectively compare different RF approaches,
objective measures of accuracy are needed, which are
aligned to the SE context of use of the formalised
model. Thus we propose the evaluation model of
Figure 1, which involves three aspects: (i) an ob-
jective measure of similarity between the formalised
model and a rigorously produced ‘gold standard’ ref-
erence model; (ii) a measure of completeness of the
formalised model with respect to the source requirements document; (iii) a measure of internal quality of the model.
The reference model could be produced by agree-
ment between two or more human experts, with an
independent review by a further expert, to improve its
appropriateness as a correctness standard.
Figure 1: Evaluations of requirements formalisation.
1.1 Natural Language Processing (NLP)
NLP is a collection of techniques for the processing of
natural language text, including part-of-speech (POS)
tagging/classification, tokenisation and segmentation,
lemmatisation, parsing, chunking, dependency anal-
ysis and reference correlation. NLP tools include
Stanford NLP (Stanford University, 2020), Apache
OpenNLP (Apache, 2021), iOS NLP Framework,
Python’s NLTK, and WordNet (Princeton University,
2021).
The standard parts of speech include (Santorini, 1990): Determiners – DT, e.g., "a", "the"; Nouns – NN for singular nouns and NNS for plural; Proper nouns – NNP and NNPS; Adjectives – JJ for general adjectives, JJR for comparatives, JJS for superlatives; Modal verbs – MD, such as "should", "must"; Verbs – VB for the base form of a verb, VBP for present tense except 3rd person singular, VBZ for present tense 3rd person singular, VBG for gerunds, VBD for past tense; Adverbs – RB; Prepositions/subordinating conjunctions – IN.
For specialised purposes, additional parts of speech
may be defined, as in (Xu et al., 2020). NLP has
been a key element of automated RE approaches (Otter
et al., 2023). However, the trained models available
for POS-tagging and parsing with the existing NLP
tools are usually oriented towards general English text,
which differs significantly from the subset of English
typically used in software requirements statements.
The existing POS-tagger models therefore sometimes misclassify words in requirements statements (e.g., "stores" may be misclassified as a noun, even when used as a verb, and "existing" used as an adjective may be misclassified as a gerund).
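For illustration, the following minimal sketch (assuming NLTK and its default English tokeniser and tagger models are installed) shows how such a misclassification can arise; the tags actually produced depend on the tagger model used:

import nltk

# Tokenise and POS-tag a short requirements sentence.
sentence = "The system stores the existing files"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# A general-English model may tag 'stores' as NNS (plural noun) here,
# even though it is used as a verb.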
1.2 Machine Learning (ML)
Machine learning covers a wide range of techniques
by which knowledge about patterns and relationships
between data is gained and represented as implicit
or explicit rules in a software system. ML can be
used for classification, translation or prediction. In
particular, ML is used to create the part-of-speech
and other models used in NLP tools. ML techniques
include K-nearest neighbours (KNN), decision trees,
inductive logic programming (ILP) and neural nets.
A key distinction can be made between techniques
such as decision trees and ILP where explicit rules
are learnt from data, and techniques such as neural
nets where the learned knowledge is in an implicit
form (consisting of the weights of connections in the
trained network). In recent years there have been sub-
stantial advances in neural networks (recurrent neural
networks or RNNs) which possess a ‘memory’ of a se-
quence of inputs, enabling them to perform prediction
tasks and process data (such as natural language texts
or programs) that consist of a connected sequence of
elements (sentences) (Kolahdouz-Rahimi et al., 2023).
The increasing power of ML approaches based on
large language models (LLMs) such as BERT (Guo
et al., 2021), Codex (Chen et al., 2021) and GPT (Zhao
et al., 2023) has already led to innovative software
assistants such as Copilot (GitHub.com, 2022) and
program translators such as CodeT5. The appropriate
pre-training and fine-tuning of LLMs to support soft-
ware engineering tasks is an area of active research.
There have been few works on using LLMs for require-
ments formalisation and existing LLM-based tools are
limited in their capabilities in this area (Camara et al.,
2023).
Toolsets for ML include Google ML Kit, TensorFlow, Keras, scikit-learn and Theano.
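As a small illustration of explicit rule learning (not any of the surveyed tools), the following sketch trains a scikit-learn decision tree to separate user-story sentences from data-requirement sentences, in the spirit of the sentence classifier used by AgileUML (Section 2.3); the training data here is invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Tiny invented training set with two sentence categories.
sentences = [
    "As a doctor, I wish to view the patient's EHR",
    "As a nurse, I want to update the ward roster",
    "A patient has a name, an address and a date of birth",
    "Each ward consists of a number of beds",
]
labels = ["user_story", "user_story", "data_requirement", "data_requirement"]

# Bag-of-words features and an explicit-rule classifier.
vec = CountVectorizer()
clf = DecisionTreeClassifier().fit(vec.fit_transform(sentences), labels)

print(clf.predict(vec.transform(["As an admin, I want to delete a record"])))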
2 COMPARED APPROACHES
In this study we illustrate our proposed evaluation
framework by comparing four alternative approaches
for formalising behavioural system requirements ex-
pressed in unstructured English text or as semi-
structured user stories.
The approaches are:

Hamza and Hammad (Hamza and Hammad, 2019): based on segmentation, POS-tagging, chunking and grammar patterns.

Elallaoui, Nafil and Touahni (Elallaoui et al., 2018): based on POS-tagging.

AgileUML (Lano et al., 2021): based on POS-tagging, chunking, semantic analysis and word-similarity matching.

Simple heuristic: as a baseline for comparison, a simple heuristic approach based on POS-tagging and chunking is defined and evaluated.
User stories are widely used in agile methods to express functional requirements. They have the semi-structured format
As a/an [actor], I [wish/want/...] to [action],
[purpose]
where the stakeholder requiring the functionality is
identified in the first part, then the action in the second
part, and an optional purpose is described in the third
part.
For example:
As a doctor, I wish to view the patient’s EHR
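One possible way to recognise this format is a regular expression over the sentence text; the following sketch (the pattern and group names are illustrative, not taken from any of the compared tools) splits a user story into its actor, action and optional purpose parts:

import re

# Hypothetical pattern for the 'As a/an [actor], I ... to [action], [purpose]' format.
USER_STORY = re.compile(
    r"As an? (?P<actor>[^,]+), I (?:wish|want|would like) to "
    r"(?P<action>[^,]+)(?:, (?P<purpose>.+))?")

m = USER_STORY.match("As a doctor, I wish to view the patient's EHR")
if m:
    print(m.group("actor"), "|", m.group("action"))
    # doctor | view the patient's EHR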
2.1 Hamza and Hammad Approach
(Hamza and Hammad, 2019)
The approach starts from a textual specification of requirements in English; this is spell-checked to eliminate erroneous text, then segmented into sentences. POS-tagging is applied, and this information is used to chunk the text into sequences of closely-associated words. For example, the sequence of adjectives associated with a noun is grouped with it: this is the chunk pattern JJ* NN and related chunk patterns. Stemming is used to identify the root form of words in the text. Grammar knowledge patterns (GKP) are used to recognise expected sentence structures, and provide a corrective to semantic errors arising from incorrect POS-tagging.
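A minimal sketch of this chunking step, using NLTK's RegexpParser (the grammar below is our reading of the JJ* NN pattern, not the authors' exact implementation):

import nltk

# POS-tag a sentence, then group adjective sequences with their noun.
tagged = nltk.pos_tag(nltk.word_tokenize("The new medical record is stored"))
grammar = "NP: {<JJ>*<NN>}"   # adjectives followed by a singular noun
tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)
# e.g. ('new', 'JJ') ('medical', 'JJ') ('record', 'NN') form one NP chunk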
To identify actors and actions of a use case, differ-
ent rules are applied to handle variation in the way that
these can be expressed in unstructured text.
The approach is evaluated on four case studies of requirements statements, of small size (each case has between 11 and 23 functional requirements). It is not clear if these are real systems or artificial examples. Precision and recall are evaluated; however, it is unclear how the correctness of the formalisation is determined, i.e., how the correct reference model was constructed. They find an average recall of 69% and precision of 72%, which indicates that the approach both produces incorrect formalisations and fails to identify some correct ones. The reasons for these errors are mainly linguistic variability and ambiguity/incompleteness in the requirements statements. Another
factor is that the requirements statements also express
constraints, such as “all fields of an edited asset can
be modified except Ids", which do not correspond to
use cases. The approach could be extended by adding
recognition of these different forms of requirement,
e.g., by classifying requirements statements.
On the example sentence, the approach produces
the result:
class doctor { }
usecase viewpatient’ehr
{ actor = doctor; }
The use case is correct, but the name of the use
case contains an invalid character.
2.2 Elallaoui, Nafil and Touahni
Approach (Elallaoui et al., 2018)
In contrast to the preceding approach, this approach
takes as input semi-structured user stories, and applies
POS-tagging as its main NLP technique. The input
text sentences are POS-tagged and the words of each
sentence are filtered to remove adjectives and auxil-
iary words, retaining only nouns and verbs. The first
noun/compound noun in each sentence is assumed to
be the actor of the use case. The following verbs and
nouns then make up the action of the use case.
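The following sketch is our reconstruction of this filter-and-split idea using NLTK (rather than the authors' own toolchain); it keeps only nouns and verbs, takes the first retained word as the actor, and concatenates the rest into the action name:

import nltk

tagged = nltk.pos_tag(nltk.word_tokenize(
    "As a doctor, I wish to view the patient's EHR"))
# Keep only nouns (NN*) and verbs (VB*), discarding other words.
kept = [(w, t) for (w, t) in tagged if t.startswith(("NN", "VB"))]
actor = kept[0][0]                        # first noun: 'doctor'
action = "".join(w for w, _ in kept[1:])  # e.g. 'wishviewpatientEHR'
print(actor, action)                      # output depends on the tagger model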
The evaluation uses a single case, but of large size
(168 user stories). Recall and precision compared to a
manually-constructed use case model are calculated,
with high precision and recall values for actors (p =
98%, r = 98%), use cases (p = 87%, r = 85%) and their
relationships (p = 87%, r = 85%).
On the example sentence, the approach produces
the result:
class doctor { }
usecase wishviewpatientEHR
{ actor = doctor; }
This is a valid formalisation.
2.3 AgileUML (Lano et al., 2021)
This approach operates on either unstructured or semi-
structured behaviour specifications. It performs seg-
mentation into sentences and POS-tagging, and uses a
decision tree classifier to distinguish sentences that ex-
press user stories from those that express data require-
ments or general constraints. Both class definitions
and use cases are derived from the classified sentences,
using heuristics to recognise the class names, attribute
names, actors and actions in the text. A thesaurus/glos-
sary is used to classify words/phrases. Approximate
matching using text edit distance (Levenshtein et al.,
1966) is used in order to allow for variation in word
form.
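A minimal sketch of such approximate matching (the threshold and normalisation below are our assumptions, not AgileUML's exact settings):

# Classic Levenshtein edit distance via dynamic programming.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Two names 'match' if their distance is small relative to their length.
def similar(x: str, y: str, ratio: float = 0.3) -> bool:
    return levenshtein(x.lower(), y.lower()) <= ratio * max(len(x), len(y))

print(similar("patients", "patient"))   # True: distance 1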
The evaluation is performed on 27 cases, including 24 real-world cases. There are 10 large cases (over 75 user stories), 2 small cases and 15 of medium size (25 to 74 user stories). The average F-measure is 94%, based on comparison of the automatically formalised models with manually-derived models.
On the example sentence, the approach produces
the result:
class Doctor {
  stereotype originator="1";
}

class Patient {
  stereotype originator="1";
}

usecase viewThePatient : void {
  parameter doctorx : Doctor;
  parameter patientx : Patient;
  stereotype originator="1";
  stereotype actor="Doctor";
  stereotype read;
  ::
  true => patientx->display();
}
Here, tracing information is embedded into the
model using the originator tag. Executable behaviour
is produced for use cases where possible, so that they
can be immediately used for prototyping. However the
key noun ‘EHR’ is missing from the use case name.
The authors find that a major cause of poor formali-
sation results is incorrect tagging and incorrect parsing
by the NLP tools used (Stanford NLP and Apache
OpenNLP). Thus the formalisation algorithms fail be-
cause the input they are given is semantically incorrect
(e.g., ‘existing’ mis-classified as a verb in a phrase ‘the
existing files’).
2.4 Simple Heuristic Approach
This approach operates on semi-structured user stories as input. It tokenises the input sentences and applies POS-tagging. For each sentence it attempts to recognise the entities (classes) referenced in the sentence as those noun phrases matching the pattern DT? JJ* NN+ which have a noun in a predefined glossary of 'entity' nouns. Use cases are recognised from those sentences which contain both a verb and a modal verb. Chunking of the sentence according to the pattern

[^VB]* (VB|MD) [^VB]* VB [^VB]*

is performed, where the first block of non-verbs is used to form the actor name, and the block starting with the first verb following the modal verb is taken as the use case action. Finally, the category of this verb is used to classify the kind of use case as 'create', 'edit', 'read', 'delete' or 'other'.
For example, "As a doctor, I wish to view the patient's EHR" would be chunked as [As, a, doctor, I], [wish], [to], [view, the, patient's, EHR]. This would produce the use case
usecase view_the_patient_EHR {
stereotype actor="doctor";
stereotype "read";
}
Doctor and Patient would become classes.
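The final classification step can be sketched as a lookup of the action's main verb in a small keyword glossary (the glossary contents here are illustrative):

# Hypothetical CRUD verb glossary for classifying use cases.
CRUD_VERBS = {
    "create": {"create", "add", "register", "insert"},
    "read":   {"view", "display", "show", "list"},
    "edit":   {"edit", "update", "modify", "change"},
    "delete": {"delete", "remove", "cancel"},
}

def classify_use_case(verb: str) -> str:
    v = verb.lower()
    for kind, words in CRUD_VERBS.items():
        if v in words:
            return kind
    return "other"

print(classify_use_case("view"))   # 'read', as in the example above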
2.5 Summary
The approaches all use POS-tagging and segmenta-
tion of the text into sentences as initial steps. They
differ in their strategies for extracting use case ele-
ments from the resulting text, although some form of
chunking based on the expected form of behavioural
requirements is an essential part of each strategy. The
evaluations of each approach all use the concepts of
accuracy based on recall/precision and F-measure, but
the number, scale and provenance of evaluation exam-
ples differ, as does the basis for computing accuracy.
This approach to computing accuracy also depends upon the subjective judgement of the evaluator as to whether a formalised element is correct or not.
3 EVALUATION FRAMEWORK
In order to provide a platform for consistently compar-
ing different requirements formalisation approaches,
we define an evaluation framework which consists of:
- A domain-specific language (DSL) for expressing NLP pipelines;
- an instantiation of the DSL in Python, using the Python NLTK NLP library, together with a specific library of utility functions for requirements formalisation;
- a set of evaluation examples taken from real-world requirements documents, together with manually-derived 'gold standard' formalised models produced by a rigorous process;
- evaluation tools, to perform a threefold evaluation of the models created by each RF approach (Figure 1).
This framework has the benefit of a high degree of
automation: the evaluations can be performed without
human subjectivity entering into the assessment, as
could occur if precision/recall figures are estimated
based on manual comparison of two models.
The DSL includes statement constructs for loading
datasets, filtering and transforming datasets, and per-
forming analysis operations on them, and for saving
datasets. The syntax is based on SQL. A novel facility
is the ability to specify chunking transformations by
regular expressions. Thus a regular expression formed
from POS names or generalised POS names can be
written in order to specify that a POS-tagged text is to
be split into chunks that match the expression.
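The underlying idea can be illustrated in plain Python (we do not reproduce the DSL's own syntax here): the POS tags of a sentence are joined into a string, a regular expression over POS names is matched against it, and the matched span is mapped back to the corresponding words:

import re

tagged = [("view", "VB"), ("the", "DT"), ("patient", "NN"), ("EHR", "NNP")]
tags = " ".join(t for _, t in tagged)      # 'VB DT NN NNP'

# Chunk pattern: a determiner followed by one or more nouns.
m = re.search(r"\bDT( NNP?S?)+", tags)
if m:
    start = tags[:m.start()].count(" ")    # token index of the chunk start
    end = start + m.group().count(" ") + 1
    print([w for w, _ in tagged[start:end]])   # ['the', 'patient', 'EHR']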
The evaluation examples for user stories include one large (FABSucs) and one medium-sized (k3ucs) user story specification, taken from different sources ((Kaggle, 2021) and (Mendeley, 2021)) and written using different styles. There are also evaluation examples for unstructured data requirements, including a real-world requirements statement case.
The evaluation tools are as follows:

- checkModelNames.py checks the names of model elements, i.e., whether attributes, use cases and classes have valid names, including a check for duplicate names and a check that class names should be singular nouns. This check helps to ensure that the models are suitable for use as the specification for an application (an illustrative sketch of such a check is given below).
- compareModel2Source.py checks the percentages of source document nouns and verbs which also appear in the generated models. This helps to ensure that all information from the source document has been represented in the formalised model.
- compareModels.bat compares the reference 'gold standard' model for a case to the model produced by a formalisation approach. This generalises the usual precision/recall estimates by (i) comparing classes and attributes in the two models, in addition to use cases; (ii) allowing partial matches between names of elements, based on string edit distance.
The tools may be accessed at (Lano, 2023).
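An illustrative sketch of a name check in the style of checkModelNames.py (not the actual tool; the plural test is a deliberate simplification):

from collections import Counter

def check_names(class_names, usecase_names):
    flaws = []
    # Duplicate names across classes and use cases.
    for name, n in Counter(class_names + usecase_names).items():
        if n > 1:
            flaws.append("duplicate name: " + name)
    # Naive plural test for class names, which should be singular nouns.
    for name in class_names:
        if name.endswith("s") and not name.endswith("ss"):
            flaws.append("class name may be plural: " + name)
    return flaws

print(check_names(["Doctor", "Patients"], ["viewEHR", "viewEHR"]))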
We also include the F-measure accuracy estima-
tion, based on the standard definition but with a 0.5
correctness score for an element which is identified
but with significant differences to the correct version
(e.g., its name includes several superfluous words or
omits necessary words).
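A minimal sketch of this modified measure (the element counts are assumed to have already been determined by the model comparison):

def modified_f_measure(exact, partial, identified, correct_total):
    # Exact matches score 1, partial matches score 0.5.
    score = exact + 0.5 * partial
    precision = score / identified if identified else 0.0
    recall = score / correct_total if correct_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 8 exact and 2 partial matches, 12 elements identified, 11 in the reference:
print(round(modified_f_measure(8, 2, 12, 11), 2))   # approximately 0.78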
We do not evaluate the efficiency/time performance
of approaches because these are implemented on dif-
ferent platforms and hence comparing execution times
would not give information about the intrinsic effi-
ciency of the approaches.
4 EVALUATION OF APPROACHES
We apply the three comparisons of Figure 1 to each of
the approaches of Section 2, for the k3ucs and FAB-
Sucs requirements cases. We also estimate our mod-
ified F-measure accuracy for recognising use cases.
All artefacts and results of these evaluations may be
accessed at (Lano, 2023).
Tables 1 and 2 show the model quality scores for
each approach and the two evaluation cases.
Table 1: Formalised model quality: k3ucs.

Approach          Class validity  Attribute validity  Use case validity  Flaws
Hamza/Hammad      0               0                   1                  0
Elallaoui et al.  0.33            0                   0.42               17
AgileUML          0.88            1                   1                  3
Simple heuristic  0.95            0                   1                  1
Table 2: Formalised model quality: FABSucs.

Approach          Class validity  Attribute validity  Use case validity  Flaws
Hamza/Hammad      0               0                   1                  0
Elallaoui et al.  0.57            0                   1                  6
AgileUML          0.96            1                   1                  3
Simple heuristic  0.96            0                   1                  2
Tables 3 and 4 show the model completeness scores
for each approach and the two evaluation cases.
Tables 5 and 6 show the model accuracy scores for
each approach and the two evaluation cases.
Table 7 shows the modified F-measure for use case
recognition for each approach and both cases.
Table 3: Formalised model completeness: k3ucs.

Approach          Noun completeness  Verb completeness
Hamza/Hammad      0                  0.4
Elallaoui et al.  0.08               1.0
AgileUML          0.51               1.0
Simple heuristic  0.45               0.93
Table 4: Formalised model completeness: FABSucs.

Approach          Noun completeness  Verb completeness
Hamza/Hammad      0                  0.32
Elallaoui et al.  0.09               1.0
AgileUML          0.41               0.86
Simple heuristic  0.36               1.0
Table 5: Formalised model accuracy: k3ucs.

Approach          Class similarity  Attribute similarity  Use case similarity
Hamza/Hammad      0                 0                     0.47
Elallaoui et al.  0                 0                     0
AgileUML          0.04              0.002                 0.68
Simple heuristic  0.007             0                     0.57
4.1 Discussion
The accuracy of all the approaches, as estimated by
model similarity (Tables 5 and 6), is below 70%, which
indicates that they need to be used in conjunction with
human expertise to refine or correct their results, and
that they do not provide a fully automated solution.
The formalisation completeness results for nouns are quite low (Tables 3 and 4), indicating that information on data elements is being ignored or lost by the formalisation processes. The completeness for verbs
is generally high except for the Hamza/Hammad ap-
proach. On inspection the low results for this approach
are due to verb stemming, so that the original version
of the verb is lost and only the stem retained in the
resulting model. On the other hand, the AgileUML
approach adds the purpose of the user story into the
use case name, resulting in excessively long and com-
plex names. The simple heuristic approach sometimes adds the actor part of the user story into the use case name: in cases such as "The system should allow a staff member to ...", the actor is wrongly assigned as 'The system'. The Elallaoui et al. approach has the same flaw. It should be a user-configurable choice whether terms such as 'System' or 'Application' can be accepted as actors.
Although the Hamza/Hammad and AgileUML approaches aim to recognise a range of different textual formats for user stories, there are still some cases which they fail to process correctly.
Table 6: Formalised model accuracy: FABSucs.

Approach          Class similarity  Attribute similarity  Use case similarity
Hamza/Hammad      0                 0                     0.6
Elallaoui et al.  0.01              0                     0.56
AgileUML          0.02              0                     0.75
Simple heuristic  0                 0                     0.57
Table 7: Modified F-measure for use case recognition.

Approach          k3ucs  FABSucs
Hamza/Hammad      0.23   0.15
Elallaoui et al.  0.35   0.23
AgileUML          0.69   0.81
Simple heuristic  0.71   0.55
None of the approaches are able to create use cases involving two or more actors; instead, the AgileUML approach creates use cases which can be linked to several entities via data usage relationships.
All of the approaches use heuristic rules to recog-
nise use case elements. The only explicit use of ma-
chine learning (ML) is the decision tree classifier used
by AgileUML to distinguish different categories of
requirements. It may be difficult to use ML to learn
the derivation from user stories to use cases because
of the relatively small amounts of data available for
training.
The use of tools such as WordNet or Word2Vec (Mikolov et al., 2013) to compute word similarity scores could also help to improve the approaches, as sketched below.
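A minimal sketch of word-similarity scoring with WordNet via NLTK (assuming the WordNet corpus is downloaded); Wu-Palmer similarity is computed between the first senses of the two words, which is a simplification:

from nltk.corpus import wordnet as wn

def wup(word1: str, word2: str) -> float:
    s1, s2 = wn.synsets(word1), wn.synsets(word2)
    if not s1 or not s2:
        return 0.0
    # Wu-Palmer similarity of the first senses; None is treated as 0.
    return s1[0].wup_similarity(s2[0]) or 0.0

print(wup("doctor", "physician"))   # high similarity expected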
5 RELATED WORK
Since 2018 there has been a noticeable increase in the
number of papers that apply NLP and DL techniques
for automatic generation of UML diagrams from re-
quirements.
The automation of natural language analysis
for extraction of UML diagrams is emphasized in
(Maatuk and Abdelnabi, 2021). The main focus of this work is to apply NLP techniques and heuristic rules to generate use case and activity diagrams. The Stanford CoreNLP tool is used to perform NLP tasks. Following these, heuristic rules are applied individually for the generation of activity and use case diagrams. In
(Xu et al., 2020) a utilities permitting system based on
an NLP algorithm is designed to formalise the require-
ments of road agencies in UML and OCL formats. The
input requirements are in textual format. The NLP process includes a pre-processing step, which tokenises the text and splits it into sentences; in this step words are classified into parts of speech and labelled with POS tags. Then occurrences of the terms in the sentences are recognised, and in the third step, by applying a
chunking technique, the sentence structure is analysed and represented as tree structures. Finally, five rules are applied to generate target information from the tree structures. The system is validated in terms of performance and applicability using random cases. The
Requirement Transformation (RETRANS) approach is presented in (Kamarudin et al., 2015); this generates use case and activity diagrams from requirements in text format by applying model transformation and
NLP techniques. An NLP algorithm is designed in
(Alashqar, 2021) to generate sequence and class dia-
grams from scenario-based requirements. A software
tool called automatic generation of UML (AGUML) is
presented to perform all the tasks automatically. Exper-
imental results are reported to show the applicability
and performance of the approach. In (Sanyal et al.,
2018) an automatic approach for generation of class
diagrams from semi-structured text inputs is presented.
Some keywords are used to structure the input. NLP
techniques and heuristic rules are used to generate the
result by applying the procedure in four steps. In the
first step classes are extracted followed by generation
of attributes in the next step. Following that methods
and relations are extracted. The last three steps depend
on the first step in this procedure.
An ML approach for extraction of classes and attributes from unstructured plain text is introduced in
(Elmasry et al., 2021). Two classifiers are used to classify each word as a class or an attribute. To relate appropriate attributes to classes, dependency parsing is used. A public requirements dataset is used throughout
this research (Ferrari et al., 2017). NLP techniques are
used in the pre-processing and POS-tagging phases to prepare data for the ML tasks. Then the machine learning algorithms Support Vector Machine (SVM) and Naive Bayes (NB) are applied to extract classes and attributes. A text-to-model transformation framework
for mapping textual requirements to UML models is
presented in (Sedrakyan et al., 2022). The authors emphasize integrating machine learning methods, word embedding, heuristic rules, and statistical and linguistic knowledge to increase the quality of the outcome. A web application for the generation of use case and class diagrams from English text is presented in (Narawita et al., 2017). In this research NLP and ML techniques are used: tokenization, POS tagging, chunking and splitting are the NLP techniques applied in this process. Finally, the visual representation of the diagrams is provided using Visual Studio.
RF using LLMs is investigated with ChatGPT in (Camara et al., 2023). The results show that some basic requirements formalisation ability is present in ChatGPT; however, specific fine-tuning of a suitable LLM (pre-trained with datasets including software models) by instruction training for the formalisation task would be necessary to improve this capability.
Although various works have investigated RF by applying NLP and DL techniques, the field remains at an experimental stage. Most works in this domain apply heuristic approaches; however, these have not been evaluated on a broad range of input cases, so it is not possible to determine the applicability of such approaches in specific domains. Only a few approaches have applied DL in the RF domain; this limited use is likely to be due to the limited quantity of available appropriate training data (i.e., relating requirements text to formalised models), and in general most of the ML approaches do not exploit the full potential of DL in the domain. Furthermore, there is no standard benchmark or evaluation criteria for comparing different RF approaches. Therefore, in this research we provide a set of requirements statements to compare the effectiveness of RF approaches according to standard criteria.
6 FUTURE WORK
We intend to enlarge the set of evaluation cases to in-
clude a wider range of requirements documents, and to
expand the evaluation to more approaches, including
data-oriented formalisation (creating class diagrams).
Formalisation approaches for unstructured and struc-
tured requirements documents will also be evaluated.
Further evaluation criteria could be added, for ex-
ample, some measure of how configurable and adapt-
able an approach is: to what extent it permits users to
modify any parameters, strategies or knowledge bases
used in its formalisation process. Another form of
comparison could be a ‘blindfolded taste test’ where a
group of independent software engineers evaluate and
rank alternative formalisations of a requirements state-
ment, without knowing the identity of the approach
which produced the formalisation.
7 CONCLUSIONS
We have provided the first framework for systematic
and objective comparison of requirements formalisa-
tion approaches, and demonstrated the application of
this framework to compare four alternative approaches
to behavioural requirements formalisation. The results
provided useful insights into the issues and factors
which such approaches need to address to produce
useful formalisations.
REFERENCES
Alashqar, A. M. (2021). Automatic generation of UML diagrams from scenario-based user requirements. Jordanian Journal of Computers and Information Technology, 7(2), June 2021.
Apache (2021). Apache OpenNLP toolkit. https://opennlp.apache.org. Accessed: 2021.
Camara, J., Troya, J., Burgueno, L., and Vallecillo, A. (2023).
On the assessment of generative ai in modeling tasks.
SoSyM, 22.
Chen, M. et al. (2021). Evaluating large language models
trained on code. arXiv preprint, 2107:03374v2.
Elallaoui, M., Nafil, K., and Touahni, R. (2018). Automatic transformation of user stories into UML use case diagrams using NLP techniques. Procedia Computer Science, 130:42–49.
Elmasry, I., Wassif, K., and Bayomi, H. (2021). Extract-
ing software design from text: A machine learning
approach. In 2021 Tenth International Conference on
Intelligent Computing and Information Systems (ICI-
CIS), pages 486–492. IEEE.
Ferrari, A., Spagnolo, G. O., and Gnesi, S. (2017). Pure:
A dataset of public requirements documents. In 2017
IEEE 25th International Requirements Engineering
Conference (RE), pages 502–505. IEEE.
GitHub.com (2022). GitHub Copilot. https://copilot.github.com/.
Guo, D. et al. (2021). GraphCodeBERT: Pre-training code
representations with dataflow. In ICLR 2021.
Hamza, Z. A. and Hammad, M. (2019). Generating UML use case models from software requirements using natural language processing. In 2019 8th International Conference on Modeling Simulation and Applied Optimization (ICMSAO), pages 1–6. IEEE.
Kaggle (2021). Kaggle software requirements dataset. https://www.kaggle.com/iamsouvik/software-requirements-dataset. Accessed: 2021.
Kamarudin, N. J., Sani, N. F. M., and Atan, R. (2015). Auto-
mated transformation approach from user requirement
to behavior design. Journal of Theoretical and Applied
Information Technology, 81(1):73.
Kolahdouz-Rahimi, S., Lano, K., and Lin, C. (2023). Requirement formalisation using natural language processing and machine learning: A systematic review. In International Conference on Model-Based Software and Systems Engineering.
Lano, K. (2023). Requirements formalisation repository. https://github.com/kevinlano/RequirementsFormalisation. Accessed: 2023.
Lano, K., Yassipour-Tehrani, S., and Umar, M. (2021). Automated requirements formalisation for agile MDE. In 2021 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), pages 173–180. IEEE.
Levenshtein, V. I. et al. (1966). Binary codes capable of
correcting deletions, insertions, and reversals. Soviet
physics doklady, 10(8):707–710.
Maatuk, A. M. and Abdelnabi, E. A. (2021). Generating UML use case and activity diagrams using NLP techniques and heuristics rules. In International Conference on Data Science, E-learning and Information Systems 2021, pages 271–277.
Mendeley (2021). Mendeley user story dataset. https://data.mendeley.com/datasets/bw9md35c29/1. Accessed: 2021.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
Narawita, C. R. et al. (2017). UML generator – use case and class diagram generation from text requirements. The International Journal on Advances in ICT for Emerging Regions, 10(1).
Otter, D. W., Medina, J. R., and Kalita, J. K. (2023). Require-
ment formalisation using natural language processing
and machine learning: A systematic review. Model-
sward 2023.
Princeton University (2021). WordNet. https://wordnet.princeton.edu. Accessed: 2021.
Santorini, B. (1990). Part-of-speech tagging guidelines for
the penn treebank project. University of Pennsylvania,
School of Engineering and Applied Science.
Sanyal, R., Ghoshal, B., et al. (2018). Automatic extrac-
tion of structural model from semi structured software
requirement specification. In 2018 IEEE/ACIS 17th In-
ternational Conference on Computer and Information
Science (ICIS), pages 543–58. IEEE.
Sedrakyan, G., Abdi, A., Van Den Berg, S. M., Veldkamp,
B., and Van Hillegersberg, J. (2022). Text-to-model
(tetomo) transformation framework to support require-
ments analysis and modeling. In 10th International
Conference on Model-Driven Engineering and Soft-
ware Development, MODELSWARD 2022, pages 129–
136. SCITEPRESS.
Stanford University (2020). Stanford NLP. https://nlp.stanford.edu/software/. Accessed: 2020.
Xu, X., Chen, K., and Cai, H. (2020). Automating utility permitting within highway right-of-way via a generic UML/OCL model and natural language processing. Journal of Construction Engineering and Management, 146(12):04020135.
Zhao, W. et al. (2023). A survey of large language models.
arXiv, 2303.18223v10.