Harnessing LLM Conversations for Goal Model Generation

from User Reviews

Shuaicai Ren, Hiroyuki Nakagawa and Tatsuhiro Tsuchiya

Graduate School of Information Science and Technology, Osaka University, Suita, Japan

Keywords:

LLM, Goal Modeling, User Reviews.

Abstract:

User reviews are a valuable resource for developers, as the reviews contain requests for new features and bug

reports. By conducting the requirements analysis of user reviews, developers can gain timely insights for the

application, which is crucial for continuously enhancing user satisfaction. The goal model is a commonly

used model during requirements analysis. Utilizing reviews to generate goal models can assist developers in

understanding user requirements comprehensively. However, given the vast number of reviews, manually col-

lecting reviews and creating goal models is a signiﬁcant challenge. A method for clustering user reviews and

automatically generating goal models has been proposed. Nevertheless, the accuracy of the goal models gen-

erated by this method is limited. To address these limitations of the existing method and enhance precision of

goal model generation, we propose a goal-generation process based on Large Language Models (LLMs). This

process does not directly generate goal models from user reviews; instead, it treats goal model generation as a

clustering problem, allowing for the visualization of the relationship between reviews and goals. Experiments

demonstrate that compared to the existing method, our LLM-based goal model generation process enhance

the precision of goal model generation.

1 INTRODUCTION

In modern society, mobile applications (Apps) are

playing an increasingly important role in our daily

lives. Mobile application platforms, represented

by the App Store and Google Play, not only al-

low users to download apps but also offer a plat-

form for interaction between users and developers.

On these platforms, users draft reviews, which in-

clude new feature requirements and bug reports (Oriol

et al., 2018) (Maalej and Pagano, 2011) (Seyff et al.,

2010) (Ma et al., 2015). User reviews are a valu-

able resource for developers, as reviews offer invalu-

able insights (Pagano and Maalej, 2013) (Hofmann

and Lehner, 2001) (Zowghi, 2018). By conducting

the requirements analysis of user reviews, develop-

ers deepen their understanding of user requirements,

thereby offering version updates that better match

user requirements. In the ﬁeld of requirements analy-

sis, the goal model is one of the most commonly used

models, and it can be employed to analyze require-

ments from user reviews.

The goal model is a basic model in the ﬁeld

of requirements engineering, providing a structured

framework to describe what functions a system needs

and how to implement these functions. Within the

goal model, goals are arranged in a hierarchical struc-

ture, where the root goal is reﬁned into sub-goals,

ultimately forming a comprehensive goal structure.

This layered structure helps in deeply understand-

ing the interdependencies among goals. The goal

model contains multiple elements, such as conﬂicts

and soft goals. Conﬂicts refer to situations where

achieving one goal may obstruct the realization of an-

other, while soft goals aim to capture non-functional

requirements. The primary advantage of the goal

model is that it allows developers to deﬁne and under-

stand requirements with clarity. When goals conﬂict,

the goal model can support crucial decision-making.

Compared to directly analyzing user reviews, making

goal models ensures that user reviews match the app’s

goals. By connecting reviews with goals, it is easier to

ﬁgure out what goals users care about and if the new

features they want might conﬂict with current goals.

This helps developers know which user requirements

are most urgent and helps them make better updates.

While there are numerous advantages to employ-

ing a goal model for the analysis of user reviews, the

manual construction of a goal model presents a signif-

icant challenge. This challenge mainly arises from the

Ren, S., Nakagawa, H. and Tsuchiya, T.

Harnessing LLM Conversations for Goal Model Generation from User Reviews.

DOI: 10.5220/0012352200003636

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 16th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2024) - Volume 3, pages 385-392

ISBN: 978-989-758-680-4; ISSN: 2184-433X

385

vast number of reviews and only a small proportion

of them contain requirements or bug reports (Licorish

et al., 2015) (Pagano and Maalej, 2013) (Chen et al.,

2014). Consequently, the process of manually reading

and summarizing reviews becomes a labor-intensive

and time-consuming task. To automate the process of

utilizing user reviews for goal model generation, we

proposed a method for clustering reviews (Ren et al.,

2022). This clustering method consists of two com-

ponents: the Latent Dirichlet Allocation (LDA) topic

model (Blei et al., 2003) and a distance-based clus-

tering algorithm. The method deﬁnes the root goal as

containing all user reviews, and the topics generated

by the LDA topic model are treated as sub-goals under

the root goal. To reﬁne sub-goals from the generated

topics, a distance-based clustering algorithm is intro-

duced. This method simpliﬁes the process of gen-

erating goal models from user reviews, signiﬁcantly

reducing the required manpower. However, it must

be acknowledged that this method has certain limita-

tions. For instance, its accuracy is suboptimal, and

the generated goals do not have a one-to-one corre-

spondence with the requirements. To enhance the pre-

cision of goal model generation from user reviews,

we explore the potential of leveraging large language

models (LLMs) for goal model generation. By har-

nessing the capabilities of LLMs, our objective is to

enhance the precision and efﬁcacy of the goal model-

ing process, providing a more accurate representation

of user requirements and preferences.

This technology holds the potential to make con-

tributions to agents. The application of LLMs in goal

model generation has several impacts. LLMs help

agents better understand and pull key information

from a wide range of user reviews. These reviews,

coming from many users, provide valuable informa-

tion and preferences that, when processed effectively,

can signiﬁcantly contribute to agent development. By

grouping and reﬁning user reviews, agents can get a

deeper understanding of what users really want and

their main concerns. This deeper understanding al-

lows them to give more relevant and personalized re-

sponses, making the user experience better.

LLMs represent an advanced class of natural lan-

guage processing (NLP) models, notable for their ex-

tensive size and parametric complexity. In this re-

search, we harnessed GPT-4 (OpenAI, 2023), recog-

nized as a state-of-the-art LLM, for the purpose of

clustering user reviews and generating goal models.

The goal model generated by GPT-4 has higher ac-

curacy than the goal model generated by the exist-

ing method. However, using GPT-4 to directly gener-

ate goal models comes with certain drawbacks, such

as producing repetitive or incorrect goals and failing

to describe the relationships between generated goals

and reviews. To address these issues, we propose a

novel goal model generation process. This process

does not involve the direct use of GPT-4 to gener-

ate goal models; instead, it begins with clustering

and analyzing user reviews before generating the goal

models. The experimental results demonstrate that,

compared to the existing method, the use of the pro-

posed process with the GPT-4 method gains higher-

precision goal models. This not only saves time but

also enhances developers’ understanding of the im-

portance of each goal.

The contributions of this study are as follows:

First, we introduce a goal model generation method

based on LLMs, offering an alternative method for

the automation of goal model generation. Second,

the proposed method treats goal model generation as

a clustering problem, allowing developers to under-

stand the relationship between user reviews and goals.

Third, compared to the existing method, the proposed

method enhances the precision of goal model genera-

tion.

The following sections of this paper are organized

as: Section 2 introduces related work relevant to our

research. Section 3 presents the existing method for

goal model generation. Section 4 introduces GPT-4

and the proposed process for generating goal mod-

els utilizing GPT-4. Section 5 showcases compara-

tive experiments between different goal model gener-

ation methods. Section 6 evaluates the proposed goal

model generation process. Finally, in Section 7, we

summarize this study and outline future work.

2 RELATED STUDIES

In a recent development, Jiang et al. (Jiang et al.,

2019) introduced SAFER, a novel approach that en-

ables the automatic extraction of features from ap-

plication descriptions and the identiﬁcation of anal-

ogous applications based on API names and extracted

features. SAFER further undertakes the aggregation

and recommendation of features from identiﬁed anal-

ogous applications. On a related note, Dkabrowski

et al. (D ˛abrowski et al., 2023) conducted empirical

research into three distinct opinion mining methods:

GuMa (Guzman and Maalej, 2014), SAFE (Johann

et al., 2017), and ReUS (Dragoni et al., 2019). These

methods underwent evaluation through two distinct

tasks encompassing feature extraction and sentiment

analysis. The research outcomes proffered valuable

insights by suggesting that the efﬁcacy of these meth-

ods might be lower than originally reported. More-

over, Malik et al. (Malik et al., 2020) proposed a

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

386

comprehensive approach for the extraction of opin-

ions from user reviews. Their approach is particu-

larly geared towards assisting developers and users in

the automated extraction and comparison of features

across a spectrum of mobile applications. It is im-

portant to emphasize that the aforementioned studies

primarily pivot around the domain of review analy-

sis, and their primary focus does not lie in goal model

identiﬁcation.

It is important to note that several researchers

believe that LLMs have the potential to revolution-

ize existing software development processes. This

has led to the proposal of numerous methods that

leverage LLMs for software modeling. Nakagawa

et al. (Nakagawa and Honiden, 2023) introduced a

semi-automated process for goal model generation

that employs generative AI founded on the MAPE-

K loop mechanism. Their two case studies demon-

strate that this process, based on the MAPE-K loop

mechanism, is efﬁcacious in goal model construction

without omitting any goal descriptions. Additionally,

Cámara et al. (Cámara et al., 2023) conducted a com-

prehensive investigation into GPT-4’s performance in

modeling tasks and its utility to modeling personnel,

while simultaneously identifying its principal limi-

tations. Their research ﬁndings underscore that the

current iteration of GPT-4 exhibits limited efﬁcacy

in software modeling, especially when compared to

its capabilities in code generation. It exhibits var-

iegated syntax and semantic defects, lacks response

consistency, and faces scalability challenges. Ding

et al. (Ding and Ito, 2023) introduced the ’Self-

Agreement’ framework, aimed at autonomously seek-

ing consensus among diverse opinions using data gen-

erated by large language models (LLMs), without the

need for extensive manual annotation. They utilized

GPT-3 to generate multiple opinions for each ques-

tion in a question dataset and subsequently employed

a BERT model to evaluate the consistency of each

opinion, selecting the most consistent one. Their re-

search focused on ﬁnding consensus among diverse

opinions, whereas our method centers on analyzing

the consistency of reviews and generating goal mod-

els.

Chen et al. (Chen et al., 2023) reported the pre-

liminary experimental results of goal model genera-

tion using GPT-4. They ﬁrst explored GPT-4’s under-

standing of the Goal-oriented Requirement Language

(GRL) and then employed four prompt combinations

to guide the generation of GRL models in two case

studies. One case was a well-documented topic in the

goal modeling domain, while the other was the op-

posite. The experimental results indicate that GPT-4

possesses extensive knowledge related to goal mod-

els and that the generated goal models are valuable.

Notably, all three methods employ LLMs to generate

goal models. However, it is crucial to point out that,

unlike our method, they do not leverage user reviews

in the goal model generation process.

3 EXISTING METHOD

We proposed a method for creating a goal model

by clustering user reviews (Ren et al., 2022). This

method involves two key components: the LDA topic

model and a distance-based clustering algorithm. The

LDA topic model is responsible for generating goals

from all reviews, while the distance-based clustering

algorithm reﬁnes these goals. The LDA model is

a widely utilized probabilistic topic modeling tech-

nique for the analysis of extensive unstructured tex-

tual data in academic research (Papadimitriou et al.,

2000) (Blei et al., 2003). The goal representing all

reviews is regarded as the root goal, while the topics

generated by the LDA model are considered as sub-

goals of the root goal. While it’s possible to reﬁne and

create the goal model further by applying LDA mod-

eling to reviews within each topic, this method may

not be reliable when dealing with a limited number of

reviews (Hajjem and Latiri, 2017).

Given the limitations of LDA topic modeling in

such scenarios, a distance-based clustering algorithm

is proposed to facilitate further reﬁnement. For each

topic, the reviews are vectorized, and Ward’s method

is employed to calculate distances between vectors,

resulting in the creation of compact, evenly sized

clusters (Szmrecsanyi, 2012). These clusters are visu-

ally represented by a dendrogram. Clusters with simi-

lar distance values are assigned as sub-goals under the

same parent goal. This method follows a top-down

approach, creating boundary lines. Clusters above the

boundary line become parent goals, while those be-

low it become sub-goals. The boundary line’s value

is determined by the cluster distances and manually

selected parameters. This automated method of gen-

erating goal models from user reviews aids in a deeper

understanding of user requirements. By combining

LDA and the distance-based clustering algorithm, this

clustering method addresses the challenges of ana-

lyzing numerous reviews and automatically identiﬁes

main topics and their hierarchical relationships, sim-

plifying the goal model generation process. Never-

theless, this method still faces speciﬁc limitations,

primarily accuracy issues. Current methods employ

Ward’s method for review clustering, where reviews

are ﬁrst converted into vectors, and review similar-

ity is determined based on vector distance. Neverthe-

Harnessing LLM Conversations for Goal Model Generation from User Reviews

387

User reviews

Review clustering

(Prompt 1)

Cluster 1

Cluster 2

Cluster 3

Clusters

Cluster

selected

Cluster 1

Cluster 3

Clusters requiring

refinement

Refining clusters

and generating goals

(Prompt 2)

Goal model

Figure 1: Overview of the proposed process for generating goal models using GPT-4.

Root

Account

transfer and

device

upgrades

Chat

history

and data

backup

Multi-

device

support

Account

security

and

recovery

Chat and

media

management

User-

friendly

experience

Sticker

suggestions

and user

experience

…

… …

…

Backup

issues

Data

recovery

Cross-

platform

issues

"chat log transfer absolutely terribleit sense keep upload drive

icloud old new phone auto transfer since account "

"lose phone lock account everytime log tell account transfer phone

lose completely stupid system even send report without verify

phone lose "

…

Figure 2: Goal model generated by the proposed process.

less, research indicates that vector similarity does not

always reﬂect the similarity of the requirements de-

scribed in the reviews (Devine et al., 2022). Within

the same cluster, reviews may share common aspects,

but these shared aspects do not necessarily indicate

identical requirements.

To enhance goal model generation precision and

improve developers’ understanding of the generated

goal model, we explore the possibility of utilizing

GPT-4 to cluster reviews and generate goal models.

4 LLM-BASED GOAL MODEL

GENERATION METHOD

GPT-4, which stands for "Generative Pre-trained

Transformer 4," is a state-of-the-art language model

developed by OpenAI. GPT-4’s primary objective is

to facilitate interactive conversations with users, of-

fering responses that are contextually coherent across

a wide spectrum of prompts and inquiries. While

there are many advantages to generating goal mod-

els directly using GPT-4, several challenges also ex-

ist. For example, the relationship between user re-

views and the goals cannot be visualized. Sometimes,

the information developers obtain from the goals is

insufﬁcient to fully understand user requirements. In

cases like this, if the goal model is generated using

review clustering methods, developers can check re-

views related to the goals to gather additional insights.

However, in goal models generated using GPT-4, re-

views related to the goals are not presented compre-

hensively. Even if you ask GPT-4 to display all the re-

views, it provides only a limited set of reviews, mak-

ing it challenging for developers to understand user

requirements. Additionally, the prioritization of goals

is based on the frequency of corresponding require-

ments mentioned in the reviews. The lack of visual-

ization for the relationship between goals and the as-

sociated reviews decreases the credibility of goal pri-

oritization.

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

388

To address the issue of the lack of visualized re-

lationships between goals and reviews in the goal

model directly generated using GPT-4, we propose a

novel process for generating goal models using GPT-

4. This process frames the task of generating goal

models as a clustering problem, where clusters repre-

sent the goals. GPT-4 provides the cluster to which

each review belongs, thus visualizing the relationship

between goals and reviews. Subsequently, developers

can reﬁne some or all of the goals based on factors

such as the number of reviews or the content of re-

views. Figure 1 shows the overview of the proposed

process.

This process consists of two steps. The ﬁrst step

involves clustering reviews using GPT-4, in this step,

the generated clusters are regarded as goals. The

prompt for this step are as follows:

Prompt 1: Can you cluster the following re-

views?

The second step occurs after developers analyze the

generated goals and decide which goals to reﬁne.

GPT-4 is then employed to reﬁne the selected goals,

with prompts for this step as follows:

Prompt 2: Can the ﬁrst category be reﬁned, and

if so, what would the relevant reviews in the

subdivided categories look like? By "relevant

reviews," I mean the reviews I provided earlier.

There is no need to generate relevant reviews;

all reviews should belong to the ﬁrst category

classiﬁed earlier. Each comment should belong

to only one subcategory, and each subcategory

should be akin to a goal in the goal model in the

requirement model.

It is crucial to include the statement "Cannot generate

reviews" in the prompt for the second step to prevent

generation errors by GPT-4. Additionally, it should be

speciﬁed that each review can belong to only one goal

to ensure that redundant reviews do not inﬂuence de-

velopers’ assessments of goal importance. By adopt-

ing the proposed process, developers can maintain the

advantages of using GPT-4 to generate goal models

while visualizing the relationships between reviews

and goals. Furthermore, it provides a more detailed

and time-efﬁcient approach for developers to analyze

and reﬁne goals.

In summary, utilizing the proposed process for

generating goal models from user reviews offers nu-

merous beneﬁts, including high goal coverage, selec-

tive reﬁnement of goals, and streamlined data pro-

cessing. These advantages position the proposed

process as a valuable tool for extracting and com-

prehending user requirements, facilitating effective

decision-making, and enhancing the development of

user-centric applications.

5 EXPERIMENT

The purpose of the experiment is to assess which of

the two, the existing method and the GPT-4 method

with the proposed process, is more similar to the man-

ually created goal model, including both structural

and content similarities. We have collected 150 user

reviews from the App Store. These reviews are from

Line, Google Docs, and YouTube, with each app con-

tributing 50 reviews. For each set of user reviews

for these apps, we used three different methods for

goal model generation: the existing method, the GPT-

4 method with the proposed process, and the man-

ual method. Figure 2 illustrates a portion of the goal

model generated using the GPT-4 with the proposed

process, with user reviews sourced from Line. To cre-

ate the manual model, we manually examined each

review and determined which goal it should belong

to.

We initially focus on evaluating the structure of

the generated goal models. Since goal models have

a tree-like structure, we utilize Tree-Edit-Distance-

based Similarity (TEDS) (Zhong et al., 2020) to as-

sess the similarity between the models generated by

the two methods and the manually created model.

TEDS is a normalized variant of the Tree-Edit-

Distance (TED), and its calculation is as follows:

TEDS(G,G_m) = 100 − (

EditDist(G,G_m))

max(|G|,|G_m|)

× 100),

(1)

where EditDist(G,G_m) is computed as the minimum

number of operations, comprising both Move and Join

operations, necessary to transform the generated goal

model G into the goal model G_m, which is manually

created. The value of EditDist(G,G_m) is determined

through manual computation. max(|G|,|G_m|) repre-

sents the maximum number of goals present within

goal models G and G_m. Consequently, the higher

the degree of similarity between the goal model G and

the goal model G_m, the larger the resulting TEDS

value. In the case of complete equivalence between

G and G_m, TEDS returns a value of 100. Table 1

demonstrates the TEDS values for three apps. The av-

erage TEDS value of the GPT-4 method is 26 points

higher than the existing method, indicating that the

goal model generated by GPT-4 is more similar to the

manually created goal model.

In terms of evaluating the content of the gener-

ated goals, we employ precision and recall to assess

both the existing method and the GPT-4 method. For

Harnessing LLM Conversations for Goal Model Generation from User Reviews

389

Table 1: Results of TEDS.

Line YouTube Google Docs Avg

Existing method 70 56 67 62

GPT-4 with proposed process 100 89 75 88

a goal A generated by the existing method or the GPT-

4 method, we found a goal A’ identiﬁed by the man-

ual goal model that allows the best precision to be

achieved for the goal A. Table 2 lists the precision and

recall for each goal.

6 DISCUSSION

From Table 2, we observe that the goal model gen-

erated by the GPT-4 method exhibits higher precision

and recall. In contrast to the LDA topic model, GPT-4

does not require the prior speciﬁcation of the number

of topics and eliminates the need for extensive prepro-

cessing steps. Even when dealing with a limited num-

ber of reviews, the GPT-4 method does not experience

a decline in accuracy. For the LDA method, having

too many uninformative reviews or too few reviews

can reduce the clustering accuracy. Another com-

ponent of the existing method is the distance-based

clustering method, where distances are computed us-

ing Ward’s method. This method involves ﬁrst trans-

forming reviews into vectors and subsequently deter-

mining the similarity between reviews based on the

vector distances. Nevertheless, studies have demon-

strated that the similarity of vectors may not always

correspond to the likeness of requirements articulated

within the reviews (Devine et al., 2022). While re-

views grouped within the same cluster may exhibit

shared elements, the presence of these commonal-

ities does not invariably signify congruent require-

ments. Owing to its robust performance, GPT-4 can

clearly understand the requirements described in user

reviews. GPT-4 underwent extensive pre-training on

a vast corpus of textual data. As a result, GPT-4

can comprehend intricate content within user reviews,

which often contain colloquialisms, domain-speciﬁc

jargon, and various forms of expression. Even if user

reviews are incomplete, use slang, or unconventional

punctuation, GPT-4 can ﬁlter potential intent and in-

formation from the noise.

By comparing the direct generation of goal mod-

els using GPT-4 with the goal models generated

through our proposed process, we identify several ad-

vantages of the proposed process, which include:

• By clustering reviews, the workload of GPT-4’s

generation process can be signiﬁcantly reduced.

Developers have the option to reﬁne only the most

critical goals, instead of diving into the ﬁne de-

tails of every single objective. This method be-

comes especially advantageous when dealing with

a large volume of user reviews, as clustering aids

in efﬁciently managing and analyzing extensive

datasets, thereby reducing the complexity associ-

ated with processing substantial amounts of data.

For developers dealing with signiﬁcant quantities

of user-generated content, such as product or ser-

vice reviews, this reduction in complexity is in-

valuable. It not only accelerates the goal model

generation process but also aids in maintaining the

quality of the analysis. As a result, developers can

gain meaningful insights from a vast dataset with-

out being overwhelmed by its scale.

• The proposed process can enhance the consis-

tency of the generated goal models. Cluster-

ing can aid in ensuring that the generated goal

model remains consistent within similar clusters

of reviews, thereby enhancing user experience and

comprehension. A noteworthy advantage of this

clustering process is its ability to scrutinize the

generated goals alongside their related reviews.

This critical examination phase serves as a quality

control mechanism, allowing for the identiﬁcation

and correction of ambiguous or erroneous goals.

Such corrections signiﬁcantly reduce the likeli-

hood of potential errors during the goal model

generation phase.

• Clustering can serve as an intermediary step,

greatly facilitating developers in iteratively im-

proving the generated goal model. Based on the

outcomes of clustering, developers can ﬁne-tune

and optimize the generated goal model progres-

sively to enhance its quality. By examining the

clustered goals and the feedback derived from

these clusters, developers can make data-driven

decisions to prioritize certain requirements over

others. This iterative approach fosters a respon-

sive development environment where the gener-

ated goal model evolves alongside user feedback.

As a result, the goal model becomes increasingly

aligned with the users’ expectations, ensuring that

the ﬁnal product or service is more user-centric

and attuned to their needs.

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

390

Table 2: Precision and recall of generated goals. E_Goals are generated by the existing method, and G_Goals are generated

by GPT-4 with the proposed process. Avg means the average value of the precision or recall for generated goals.

App Line

Goal E_Goal 1 E_Goal 2 E_Goal 3 E_Goal 4 E_Goal 5 E_Goal 6 E_Goal 7 E_Goal 8 E_Avg

Precision 0.56 1 0.33 0.67 0.5 0.2 0.5 0.25 0.5

Recall 0.45 0.89 0.2 0.25 0.57 0.33 0.17 0.67 0.44

Goal G_Goal 1 G_Goal 2 G_Goal 3 G_Goal 4 G_Goal 5 G_Goal 6 G_Avg

Precision 0.8 1 0.67 0.83 0.5 0.88 0.78

Recall 1 0.9 0.75 0.86 0.33 0.71 0.76

App YouTube

Goal E_Goal 1 E_Goal 2 E_Goal 3 E_Goal 4 E_Goal 5 E_Avg

Precision 0.5 0.5 0.33 0.5 0.6 0.49

Recall 0.25 0.33 0.17 0.67 0.43 0.37

Goal G_Goal 1 G_Goal 2 G_Goal 3 G_Goal 4 G_Goal 5 G_Goal 6 G_Goal 7 G_Avg

Precision 0.8 0.67 0.67 0.5 0.8 0.67 0.65 0.68

Recall 0.75 1 0.5 0.85 0.86 0.75 0.45 0.74

App GoogleDocs

Goal E_Goal 1 E_Goal 2 E_Goal 3 E_Goal 4 E_Goal 5 E_Goal 6 E_Goal 7 E_Goal 8 E_Avg

Precision 0.43 1 0.5 0.42 0.67 0.25 0.5 0.4 0.52

Recall 0.75 0.66 0.18 0.6 0.4 0.13 0.33 0.25 0.41

Goal G_Goal 1 G_Goal 2 G_Goal 3 G_Goal 4 G_Goal 5 G_Avg

Precision 0.71 0.57 0.6 0.75 1 0.73

Recall 0.63 0.67 0.43 0.71 0.2 0.53

7 CONCLUSIONS

In this study, we explored the potential of utilizing

GPT-4 to generate goal models and proposed a novel

goal model generation process. To improve the gener-

ation accuracy of the goal model, we employed GPT-4

to generate goal models. However, a limitation of the

method that directly generates goal models from user

reviews is that the relationship between goals and re-

views is not visualized. To address this limitation, we

introduced a new process that treats goal model gen-

eration as a clustering problem. This process signiﬁ-

cantly saves developers’ time and enhances their un-

derstanding of the goal content. Experimental results

indicate that the accuracy of generating goal mod-

els using the proposed process is higher than that of

the existing method. Regarding future research direc-

tions, we have outlined the following objectives:

Enhancing stability in goal model generation:

When utilizing GPT-4 for the classiﬁcation of user re-

views, it is important to consider that the outcomes

generated may not always be consistent. Speciﬁcally,

the results might exhibit variations, such as a classiﬁ-

cation not based on requirements but rather inﬂuenced

by emotional content. This phenomenon highlights an

essential aspect of working with AI-based language

models, where several factors contribute to the unpre-

dictability of the results. GPT-4 does not possess the

ability to discern the "correct" method of classiﬁca-

tion a priori. Its responses are determined by patterns

and information gleaned from its training data. As

a result, the quality of the responses depends on the

quality and speciﬁcity of the training data and the for-

mulation of the user’s query. To enhance the stabil-

ity and reliability of user review classiﬁcation using

GPT-4, strategies such as ﬁne-tuning the model on

domain-speciﬁc data, providing clear instructions to

the model, or post-processing its outputs may be con-

sidered. These strategies can help align the model’s

responses more closely with the speciﬁc goals of the

task, reducing the variability in outcomes, and mini-

mizing the impact of emotional content on classiﬁca-

tion results.

Time complexity reduction: Although GPT-4 ex-

hibits the capability to generate goal models from user

reviews, it is essential to acknowledge the substantial

time investments linked to the current implementation

of this process. Even when dealing with a relatively

modest review dataset, comprising fewer than 100 re-

views, a noteworthy amount of time and computa-

tional resources is imperative. This temporal over-

head may potentially impede the practical applicabil-

ity of the approach. In response to this concern, our

Harnessing LLM Conversations for Goal Model Generation from User Reviews

391

future research initiatives are dedicated to the reﬁne-

ment of GPT-4’s capabilities, encompassing strategies

such as ﬁne-tuning and few-shot learning. Our ulti-

mate goal is the reduction of time complexity while

upholding the precision of the generated goal models.

REFERENCES

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent

dirichlet allocation. Journal of machine Learning re-

search, 3(Jan):993–1022.

Cámara, J., Troya, J., Burgueño, L., and Vallecillo, A.

(2023). On the assessment of generative ai in model-

ing tasks: an experience report with chatgpt and uml.

Software and Systems Modeling, pages 1–13.

Chen, B., Chen, K., Hassani, S., Yang, Y., Amyot, D.,

Lessard, L., Mussbacher, G., Sabetzadeh, M., and

Varró, D. (2023). On the use of GPT-4 for creating

goal models: An exploratory study. In 2023 IEEE

31st International Requirements Engineering Confer-

ence Workshops (REW), pages 262–271. IEEE.

Chen, N., Lin, J., Hoi, S. C., Xiao, X., and Zhang, B.

(2014). AR-miner: mining informative reviews for

developers from mobile app marketplace. In Proc. of

the 36th International Conference on Software Engi-

neering, pages 767–778. ACM.

D ˛abrowski, J., Letier, E., Perini, A., and Susi, A. (2023).

Mining and searching app reviews for requirements

engineering: Evaluation and replication studies. In-

formation Systems, page 102181.

Devine, P., Tizard, J., Wang, H., Koh, Y. S., and Blin-

coe, K. (2022). What’s inside a cluster of software

user feedback: A study of characterisation methods.

In 2022 IEEE 30th International Requirements Engi-

neering Conference (RE), pages 189–200. IEEE.

Ding, S. and Ito, T. (2023). Self-agreement: A frame-

work for ﬁne-tuning language models to ﬁnd agree-

ment among diverse opinions. arXiv preprint

arXiv:2305.11460.

Dragoni, M., Federici, M., and Rexha, A. (2019). An un-

supervised aspect extraction strategy for monitoring

real-time reviews stream. Information processing &

management, 56(3):1103–1118.

Guzman, E. and Maalej, W. (2014). How do users like this

feature? a ﬁne grained sentiment analysis of app re-

views. In 2014 IEEE 22nd international requirements

engineering conference (RE), pages 153–162.

Hajjem, M. and Latiri, C. (2017). Combining ir and lda

topic modeling for ﬁltering microblogs. Procedia

Computer Science, 112:761–770.

Hofmann, H. F. and Lehner, F. (2001). Requirements engi-

neering as a success factor in software projects. IEEE

software, 18(4):58–66.

Jiang, H., Zhang, J., Li, X., Ren, Z., Lo, D., Wu, X.,

and Luo, Z. (2019). Recommending new features

from mobile app descriptions. ACM Transactions

on Software Engineering and Methodology (TOSEM),

28(4):1–29.

Johann, T., Stanik, C., Maalej, W., et al. (2017). SAFE: A

simple approach for feature extraction from app de-

scriptions and app reviews. In Proc. of the 2017 IEEE

25th international requirements engineering confer-

ence (RE), pages 21–30. IEEE.

Licorish, S. A., Tahir, A., Bosu, M. F., and MacDonell,

S. G. (2015). On satisfying the android os commu-

nity: User feedback still central to developers’ portfo-

lios. In Proc. of the 2015 24th Australasian Software

Engineering Conference, pages 78–87. IEEE.

Ma, S., Wang, S., Lo, D., Deng, R. H., and Sun, C. (2015).

Active semi-supervised approach for checking app be-

havior against its description. In Proc. of the 2015

IEEE 39Th annual computer software and applica-

tions conference, volume 2, pages 179–184. IEEE.

Maalej, W. and Pagano, D. (2011). On the socialness of

software. In Proc. of the 2011 IEEE Ninth Interna-

tional Conference on Dependable, Autonomic and Se-

cure Computing, pages 864–871. IEEE.

Malik, H., Shakshuki, E. M., and Yoo, W.-S. (2020). Com-

paring mobile apps by identifying ‘hot’features. Fu-

ture Generation Computer Systems, 107:659–669.

Nakagawa, H. and Honiden, S. (2023). MAPE-K loop-

based goal model generation using generative ai. In

13th International Workshop on Model-Driven Re-

quirements Engineering (MoDRE). IEEE.

OpenAI (2023). Chatgpt. https://chat.openai.com/

chat.

Oriol, M., Stade, M., Fotrousi, F., Nadal, S., Varga, J., Seyff,

N., Abello, A., Franch, X., Marco, J., and Schmidt, O.

(2018). FAME: supporting continuous requirements

elicitation by combining user feedback and monitor-

ing. In Proc. of the 2018 ieee 26th international re-

quirements engineering conference (re), pages 217–

227. IEEE.

Pagano, D. and Maalej, W. (2013). User feedback in

the appstore: An empirical study. In Proc. of the

2013 21st IEEE international requirements engineer-

ing conference (RE), pages 125–134. IEEE.

Papadimitriou, C. H., Raghavan, P., Tamaki, H., and Vem-

pala, S. (2000). Latent semantic indexing: A prob-

abilistic analysis. Journal of Computer and System

Sciences, 61(2):217–235.

Ren, S., Nakagawa, H., and Tsuchiya, T. (2022). Goal

model structuring based on semantic correlation of

user reviews. Intelligent Decision Technologies,

16(4):737–748.

Seyff, N., Graf, F., and Maiden, N. (2010). Using mobile

re tools to give end-users their own voice. In Proc. of

the 2010 18th IEEE International Requirements Engi-

neering Conference, pages 37–46. IEEE.

Szmrecsanyi, B. (2012). Grammatical variation in British

English dialects: A study in corpus-based dialectom-

etry. Cambridge University Press.

Zhong, X., ShaﬁeiBavani, E., and Jimeno Yepes, A. (2020).

Image-based table recognition: data, model, and eval-

uation. In Proc. of the European Conference on Com-

puter Vision, pages 564–580. Springer.

Zowghi, D. (2018). "Affects" of user involvement in soft-

ware development. In Proc. of the 2018 1st Interna-

tional Workshop on Affective Computing for Require-

ments Engineering (AffectRE), pages 13–13. IEEE.

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence

392