Conversational Assistants for Software Development:
Integration, Traceability and Coordination
Albert Contreras^a, Esther Guerra^b and Juan de Lara^c
Computer Science Department, Universidad Autónoma de Madrid, Spain
Keywords:
Software Development, Development Assistant, Large Language Model, Conversational Agent, Chatbot,
IDE, Eclipse, Java, Method Renaming.
Abstract:
The recent advances in generative artificial intelligence are revolutionising our daily lives. Large language
models (LLMs) – the technology underlying conversational agents like ChatGPT – can produce sensible text
in response to user prompts, and so, they are being used to solve tasks in many disciplines like marketing, law,
human resources or media content creation. Software development is also following this trend, with recent
proposals for conversational assistants tailored for this domain. However, there is still a need to understand
the possibilities of integrating these assistants within integrated development environments (IDEs), coordinat-
ing multiple assistants, and tracing their contributions to the software project under development. This paper
tackles this gap by exploring alternatives for assistant integration within IDEs, and proposing a general archi-
tecture for conversational assistance in software development that comprises a rich traceability model of the
user-assistant interaction, and a multi-assistant coordination model. We have realised our proposal building an
assistant (named CARET) for Java development within Eclipse. The assistant supports tasks like code comple-
tion, documentation, maintenance, code comprehension and testing. We present an evaluation for one specific
development task (method renaming), showing promising results.
1 INTRODUCTION
Software development has been striving for higher
levels of productivity and quality since its inception.
This goal has been pursued by several strategies,
such as the use of higher-level development
languages (Wasowski and Berger, 2023), automa-
tion techniques (Brambilla et al., 2017), power-
ful integrated development environments (IDEs like
Eclipse^1, Visual Studio Code^2, or IntelliJ IDEA^3),
knowledge bases and FAQs documenting develop-
ment expertise (Abdalkareem et al., 2017), catalogues
of design patterns (Gamma et al., 1994), or recom-
menders and development assistants (Rich and Wa-
ters, 1988; Savary-Leblanc et al., 2023). In this paper,
we are interested in the latter approaches.
The recent advances in deep learning, natural lan-
guage processing and generative artificial intelligence
^a https://orcid.org/0009-0006-6887-9826
^b https://orcid.org/0000-0002-2818-2278
^c https://orcid.org/0000-0001-9425-6362
^1 https://www.eclipse.org/
^2 https://code.visualstudio.com/
^3 https://www.jetbrains.com/idea/
have triggered the appearance of open-domain con-
versational agents able to produce sensible responses
upon arbitrary user prompts. These agents, also called
chatbots^4, are currently being explored to solve all
sorts of tasks in domains like marketing, law, hu-
man resources, media content creation, and software
development, among many others. They are pow-
ered by large language models (LLMs), which are
transformer-based networks trained on vast amounts
of text data. Some specific LLMs for code exist (Xu
et al., 2022a), such as Codex^5, Code Llama^6, and
StarCoder (Li et al., 2023). Some of them are even
integrated into IDEs, such as GitHub Copilot^7. Still,
assistant-based development is in its infancy, with
many problems to solve and assistance strategies to
assess (Ozkaya, 2023a). This way, researchers work-
ing on development assistance may wonder: What are
the possible ways to integrate assistants into IDEs?
^4 In this paper, we use the terms conversational agent and chatbot interchangeably.
^5 https://openai.com/blog/openai-codex
^6 https://ai.meta.com/blog/code-llama-large-language-model-coding/
^7 https://github.com/features/copilot
Contreras, A., Guerra, E. and de Lara, J.
Conversational Assistants for Software Development: Integration, Traceability and Coordination.
DOI: 10.5220/0012561600003687
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 19th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2024), pages 27-38
ISBN: 978-989-758-696-5; ISSN: 2184-4895
Proceedings Copyright © 2024 by SCITEPRESS Science and Technology Publications, Lda.
Is it possible to retrieve past developer-assistant in-
teractions? How can multiple assistants be coordi-
nated? How can the assistant effectiveness be as-
sessed?
This paper aims to answer the previous questions.
For this purpose, we first present a taxonomy of the
possibilities for integrating conversational assistance
into IDEs, in the form of a feature diagram (Kang
et al., 1990). Then, we propose a general and ex-
tensible architecture for conversational assistance in
software development, able to coordinate the recom-
mendations of several chatbots, not necessarily built
using LLMs. The architecture reifies and persists the
interactions between the developers and the assistant
as a traceability model. This allows tracking the decisions
made, as well as supporting queries about which
parts of the code the assistant generated, why, when,
and who invoked the assistant.
To validate these ideas, we present a specific
conversational assistant for Java development within
Eclipse called CARET (Conversational Assistant for
softwaRE developmenT). The assistant helps in a
wide range of development tasks, including code
completion, documentation, maintenance, program
comprehension and unit test generation. It features
a bidirectional traceability model from reified user-
agent interactions to code, and vice versa, via code
annotations. We present an evaluation of the suitabil-
ity of one of the development tasks (method renam-
ing), which yields very promising results.
In the remainder of this paper, Section 2 provides
background on chatbots and analyses the state of the
art. Next, Section 3 describes the three main ingredi-
ents of our approach: the analysis of the assistant-IDE
integration possibilities, the traceability model, and
the coordination of multiple conversational agents
into a unified assistant. Section 4 introduces CARET,
our Java/Eclipse assistant. Then, Section 5 reports on
the evaluation. Finally, Section 6 finishes with the
conclusions and prospects for future work.
2 STATE OF THE ART
Next, we provide some background and state of the
art on conversational agents (Section 2.1) and their
use for software development tasks (Section 2.2).
2.1 Conversational Agents
Conversational agents (or chatbots) are being increas-
ingly used to access software services using natural
language. Their popularity has risen because they
lower the entry barrier to services like customer support,
banking or shopping, and can be easily embedded into
social networks (e.g., Telegram), websites or intelli-
gent speakers. These chatbots are called task-oriented
as they help users in performing a specific task.
Many technologies to build task-oriented chatbots
exist (Pérez-Soler et al., 2021), like Google’s
Dialogflow^8, Amazon Lex^9, Microsoft’s Bot Framework^10,
the IBM Watson Assistant^11 or Rasa^12. They
allow defining the user intents that a chatbot aims
at recognising (e.g., ordering a pizza, setting an ap-
pointment with a technician). Intents declare training
phrases, which are used to train a natural language
understanding engine. This way, when the user inputs
an utterance, the engine selects the most likely intent
with a certain confidence. If the confidence is below
a threshold, then a fallback intent is selected, if avail-
able. Fallbacks are an indication of user requests that
the chatbot cannot handle.
Intents may have parameters, which are pieces
of information required from the user (e.g., type of
pizza, appointment date), and whose value is ex-
tracted from the user utterance. When the chatbot de-
tects an intent, it performs the actions associated with
the intent, usually accessing an external information
system and composing a response. Finally, the user-
chatbot conversation flows are explicitly designed by
setting paths of intents that a user may follow to per-
form a task.
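As an illustration, the intent-selection step described above (pick the most likely intent, fall back when the confidence is too low) could be sketched as follows. This is a minimal sketch of ours: the `Intent` type, the `selectIntent` method and the 0.6 threshold are illustrative and not tied to any particular chatbot framework.

```java
import java.util.List;

// Minimal sketch of intent selection with a confidence threshold and a
// fallback, as task-oriented chatbot platforms do internally.
public class IntentMatcher {
    // An intent candidate scored by the natural language understanding engine.
    record Intent(String name, double confidence) {}

    static final double THRESHOLD = 0.6; // illustrative value

    // Pick the highest-confidence intent; fall back when none is confident enough.
    static String selectIntent(List<Intent> scored) {
        return scored.stream()
                .max((a, b) -> Double.compare(a.confidence(), b.confidence()))
                .filter(best -> best.confidence() >= THRESHOLD)
                .map(Intent::name)
                .orElse("fallback");
    }
}
```

For example, `selectIntent(List.of(new Intent("order_pizza", 0.82), new Intent("set_appointment", 0.11)))` yields `"order_pizza"`, while a list with no candidate above the threshold yields `"fallback"`.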
Different from task-oriented chatbots, the recent
advances in generative artificial intelligence have pro-
moted the appearance of open-domain chatbots based
on LLMs, like OpenAI’s ChatGPT^13 or Google’s
Gemini^14 (formerly known as Bard). LLMs are large
neural networks with a transformer-based architecture
that are trained on vast amounts of textual data (Xu
et al., 2022a). They are able to provide a sensible text
output upon arbitrary user prompts without the need
to predefine admissible user intents.
Rather than being task-specific, LLMs are typi-
cally open-domain, although some of them have been
fine-tuned on specialised data, like code (Xu et al.,
2022a). Fine-tuning enables repurposing an LLM
pretrained on generic text data for specific down-
stream tasks (e.g., question-answering), or domains
(e.g., programming). However, since LLMs have no
fallbacks, it can be difficult to assess the accuracy of
the produced output, or to tell when an LLM does
^8 https://dialogflow.com/
^9 https://aws.amazon.com/en/lex/
^10 https://dev.botframework.com/
^11 https://www.ibm.com/cloud/watson-assistant/
^12 https://rasa.com/
^13 https://openai.com/chatgpt
^14 https://gemini.google.com/
not know the answer, leading to so-called hallucina-
tions (Chen et al., 2023) (i.e., inaccurate or nonsensi-
cal answers presented as fact). Related to this issue,
the temperature hyperparameter of LLMs is used to
regulate their unpredictability. This way, the higher
the temperature, the less predictable the LLM’s out-
puts become upon the same input.
2.2 Conversational Assistance for
Software Development
The idea of development assistants can be traced back
to the programmer’s apprentice (Rich and Waters,
1988) in the 1980s. This system used a symbolic,
frame-based knowledge representation to describe
and reason about programs with the help of design
clichés (a.k.a. design patterns (Gamma et al., 1994)).
Today, the focus of artificial intelligence has
shifted to machine learning. In particular, deep learn-
ing is being increasingly used to help software de-
velopers in tasks related to requirements, software
design and modelling, coding, testing, and mainte-
nance (Yang et al., 2022).
Recently, the advent of LLMs (Zhao et al., 2023)
has prompted their use also for software engineer-
ing. Several LLM-based programming assistants have
been proposed. One of the first ones was GitHub
Copilot, originally built on Codex, an LLM based
on GPT-3 and fine-tuned on code. Copilot is
integrated in several IDEs, such as Visual Studio Code
and the JetBrains IDEs, and offers autocompletion assistance
as the developer types. Although Copilot was ini-
tially free, currently it is a paid feature. Its code com-
pletion capabilities have been recently integrated into
the Eclipse IDE as a plugin^15. This plugin provides
autocompletion for several languages using GitHub
Copilot as the backend, and hence requires a subscription.
While these assistants are valuable for developers,
a deeper integration with the IDE beyond code
completion would be desirable. They also lack
traceability information to understand which parts of
the code the assistant created and why. Moreover, future
assistant-enabled IDEs may need to coordinate sev-
eral agents.
The programmer’s assistant (Ross et al., 2023) is
a conversational assistant for Python based on Codex.
Users interact with it through conversation, and can
provide context by selecting code. A user study
revealed the utility and good acceptance of this as-
sistant by developers. However, the assistant is not
integrated into a fully-fledged IDE, so it does not take
^15 https://www.genuitec.com/products/copilot4eclipse/
advantage of the possibilities of integration via com-
mands, and traceability mechanisms are missing.
In (Xu et al., 2022b), the authors present two sys-
tems for code generation and retrieval from natural
language, both integrated in the IDE PyCharm^16 for
Python programming. They evaluated the systems for
improved efficiency and quality with mixed results,
but developers declared enjoying the experience.
Barke et al. used grounded theory to study how
programmers interact with GitHub Copilot (Barke
et al., 2023). They detected two main usages of the
assistant: for the acceleration of known tasks (i.e.,
autocompletion) and for the exploration of options
that may be used as the starting point to reach a so-
lution. We claim that exploration can be improved by
the availability of several agents, and that the assistant
contributions should be properly traced.
Robe and Kuttal explored design options for Pair-
Buddy, a conversational assistant for pair program-
ming, with a 3D embodiment (Robe and Kuttal,
2022). They used a Wizard of Oz methodology, where
a human controls the assistant. The work is justified
by the fact that interaction with development assis-
tants is still in its infancy, and so different design op-
tions need to be explored. We agree with this, but
in addition, we propose including traceability support
and the possibility to coordinate multiple agents.
Devy is a voice-based assistant for development
tasks related to version control (Bradley et al., 2018).
Devy is an intent-based chatbot, and so, it maps high-
level user intents into low-level commands. Intents
may have parameters modelling required information,
and Devy asks for their value when absent. In our
work, we also found that intent-based agents are suit-
able to map user intentions into complex IDE com-
mands, but in addition, we can combine LLM- and
intent-based agents. Other types of assistants have
been included into IDEs, such as a recommender for
commands within Eclipse (Gasparic and Ricci, 2020).
Section 5 will evaluate our proposed assistant on
one particular task: method renaming. Different ap-
proaches have been proposed for this task. For ex-
ample, Liu et al. report on a classifier based on a
deep learning architecture that first identifies method
names that are not consistent with the code, and then
proposes a new name for them (Liu et al., 2019).
Instead, Zhang et al. use the code history to train
a random forest classifier to state whether a method
needs renaming, and if so, produce a name sugges-
tion (Zhang et al., 2023). Our assistant uses LLMs to
suggest new method names, but needs to be explic-
itly invoked by the user, i.e., it lacks a classifier that
detects the need for renaming.
^16 https://www.jetbrains.com/pycharm/
Figure 1: Dimensions of conversational assistance for software development. (Feature diagram with four main features: Assistance (tasks such as code completion, documentation, unit testing, error detection/correction, code optimisation, code comprehension and (re)naming; multi-language support), IDE integration (proactive/reactive activation; user-to-assistant interaction via IDE commands, text or voice; assistant responses as messages or actions; traceability), Assistant (generative, intent-based or other technology; adaptivity), and Multi-agent coordination (homogeneous or heterogeneous). Legend: mandatory, optional, alternative (=1) and or (>1) features.)
Overall, we find different proposals of conversa-
tional assistants for software engineering with a good
acceptance among developers. However, we identify
the following gaps in the state of the art. Firstly, the
integration of the assistants in IDEs, where it exists,
is ad hoc. Beyond autocompletion, the assistants’
responses are most often messages and do not trigger
IDE commands or modify existing artefacts. Sec-
ondly, the assistant contributions, their provenance
and their rationale are not persisted, to the detriment
of project monitoring. Lastly, to our knowledge,
no assistant combines or coordinates the contribu-
tions of several LLM- and intent-based conversational
agents, in order to exploit the benefits of each of them.
In the remainder of the paper, we present our proposal
to address these issues.
3 DESIGNING
CONVERSATIONAL
DEVELOPMENT ASSISTANTS
Next, we present the main concepts in our approach.
Section 3.1 presents a feature model with the dimen-
sions of conversational assistance, Section 3.2 pro-
poses a traceability model for assistance-based devel-
opment, and Section 3.3 describes an execution and
coordination model for development conversational
assistants.
3.1 Dimensions of Assistance
Figure 1 shows the dimensions relevant to conversa-
tional assistance for software development. It was
elicited from an analysis of the literature and our
own experience, and comprises the following four main
features:
Assistance. The assistance may be for one or
more development tasks (feature Task in the fig-
ure), and be available for one or several program-
ming languages (feature Multi-language). The fea-
ture model classifies tasks into coding tasks (code
completion, documentation), validation & veri-
fication tasks (unit testing, (semantic) error de-
tection, error correction), and maintenance tasks
(code optimisation, code comprehension, renam-
ing of methods, classes or attributes). This list of
tasks is not meant to be exhaustive, but it is rep-
resentative of the task types a conversational as-
sistant can help with. For instance, we leave out
tasks not directly related to programming, like
assistance for modelling (Pérez-Soler et al., 2017)
or versioning (Bradley et al., 2018).
IDE integration. The integration of a conversa-
tional assistant into an IDE must consider several
aspects. First, the Activation of the assistant may
be Reactive (the developer explicitly asks for as-
sistance) or Proactive (the assistant monitors the
developer activity and provides assistance when
it sees fit). Both styles are not mutually exclusive.
In addition, the Interaction of the developer with
the assistant (feature User-to-assistant) can be done
through IDE commands (e.g., menus or buttons),
Text in natural language (e.g., code comments like
in GitHub Copilot, or through dedicated views),
or Voice (Bradley et al., 2018). In the case of com-
mands, the IDE needs to produce a textual prompt
in natural language to send to the conversational
assistant, together with the context of the assis-
tance request (e.g., code fragment selected on the
editor). For text and voice, the IDE may need to
extend the developer prompt with additional con-
text information. The response of the assistant
(feature Assistant response) can be a message, or it
may involve actions that modify development arti-
facts (e.g., inserting new code or comments into a
file, refactoring a code snippet). In the latter case,
the assistant takes an active role, while in the for-
mer case, it acts as an informer or recommender
of information. Finally, the developer-assistant in-
teraction may be optionally traced (e.g., storing
the query of the developer, the answer by the as-
sistant, whether the recommendation was applied)
and the IDE may mark the code fragments added
or modified by the assistant.
Assistant. The underlying Technology of the con-
versational assistant can be an agent based on gen-
erative artificial intelligence (LLMs), an intent-based
chatbot, or other technologies (e.g., rule-based
natural language processing as in (Pérez-Soler
et al., 2017)). In addition, some assistants
may be Adaptive to the context of use, e.g., check-
ing or enforcing coding standards and norms used
within a company, or learning from previous in-
teractions with the developer.
Multi-agent coordination. An assistant may in-
tegrate several conversational agents that help in
different Heterogeneous tasks (e.g., coding and
testing) or provide alternative solutions for the
same task (feature Homogeneous, e.g., several
agents that use different LLMs and prompts to
suggest distinct code completions that the devel-
oper may choose from). If an assistant integrates
several agents, then mechanisms for their coordi-
nation are required.
As we will describe in Section 4, our assistant
CARET supports all the tasks of the feature diagram.
It is not multi-language, as it specifically targets Java.
Its activation is reactive, its interaction is through
both natural language text and IDE commands, its
responses comprise both text and IDE actions (e.g.,
creating new files, inserting code into files), and it
offers traceability of the developer-assistant interac-
tion. CARET internally uses generative and intent-
based technologies, is not adaptive, and coordinates
multiple heterogeneous conversational agents.
3.2 Tracing Assistant Contributions
Keeping a trace of the interactions with the assis-
tant can be useful for project management. The trace
would record the contributions of the assistant to the
code, along with the developers’ requests that origi-
nated that code. This can be exploited for code re-
porting and analysis purposes, making it possible to
see what the assistant did, where, when and why.
It would also be possible to undo/redo the assistant
contributions for exploratory purposes. Besides, the
assistant-produced code may require more thorough
testing than human-written code, so tracking it
would make it easier to identify and subsequently
test.
Figure 2: User-assistant interaction traceability model. (Class diagram: a TraceabilityModel holds DevelopmentSessions (sessionId), ConversationalAgents (name, technology, isLLM) and, per session, Interactions (interactionId, timestamp, isTextual, userText, a Role, a Task, an optional Context and Responses). A Context may refer to a CodeFragment (startLine, endLine, code, length, offset) and its Resource (fileName, fullPath, projectName, projectRelativePath). A Response (text, used) is produced by an agent and may record the resource it was inserted in. Enumerations: Role {USER, AGENT, SYSTEM} and Task {RENAME_METHOD, CREATE_SUBCLASS, CREATE_TEST, FIND_ERROR, FIX_ERROR, ..., UNKNOWN}.)
Our approach to tracing the assistant contributions
comprises two elements: a traceability model to store
the interactions, and a set of code annotations that
identify the code fragments introduced by the assis-
tant. This enables bidirectional traceability: from
past developer-assistant interactions into the modified
code, and from the code to the originating interaction.
Figure 2 shows the traceability model. It records,
for each DevelopmentSession, the Interactions between
the user and the assistant that take place during the
session. The interactions have an identifier, a times-
tamp, the role of the interacting participant (user,
agent or system), whether the interaction was started
by a text message or an IDE command (attribute isTex-
tual), the text entered by the user, the development task
resulting from the interaction (e.g., rename method,
create subclass), and a Context that depends on the
particular task. In more detail, the context may include
any code fragment used to formulate the request to
the assistant, in which case, the context stores both
the CodeFragment and its container Resource. For ex-
ample, this would be the context information stored
for a request such as “document the behaviour of this
method”. Alternatively, the context can be a file (e.g.,
for requests like “create a class implementing this in-
terface”), a folder (e.g., for requests like “create a new
sub-package called util”), or empty (e.g., for requests
like “create a new Java project”).
The Response to the interaction comprises the text
answered by the assistant (which may combine both
code and textual explanations), the agent producing
it, and whether the developer actually used the sug-
gested code. In the latter case, the model records the
resource in which the code was inserted, and the po-
sition of the code in the resource. For each agent, the
model stores its name, its technology, and whether it
is based on LLMs. The latter information is relevant
for coordinating multiple agents, as the next subsec-
tion will show.
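For illustration, the core of the traceability model of Figure 2 could be captured with plain Java records as below. This is a simplified sketch of ours that omits contexts, resources and code fragments; the names follow the figure, but this is not CARET's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the traceability model of Figure 2 (contexts,
// resources and code fragments omitted).
public class TraceModel {
    enum Role { USER, AGENT, SYSTEM }

    // An agent's answer and whether the developer actually used it.
    record Response(String text, boolean used, String agentName) {}

    // One user-assistant interaction within a development session.
    record Interaction(String interactionId, long timestamp, Role role,
                       boolean isTextual, String userText, String task,
                       List<Response> responses) {}

    record DevelopmentSession(String sessionId, List<Interaction> interactions) {}

    // Example query: which interactions produced code the developer applied?
    static List<Interaction> usedContributions(DevelopmentSession s) {
        List<Interaction> out = new ArrayList<>();
        for (Interaction i : s.interactions())
            if (i.responses().stream().anyMatch(Response::used))
                out.add(i);
        return out;
    }
}
```

A query like `usedContributions` is the kind of project-monitoring question the model enables: what the assistant did, where, when and why.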
To trace the assistant-produced code from the program,
we propose the use of code annotations (Guerra
et al., 2010). Whenever an assistant introduces a code
snippet, the outer enclosing code block is automati-
cally annotated to mark the interaction causing it (us-
ing the interaction identifier). In particular, if the as-
sistant adds a method, this becomes annotated; if it
adds a code fragment within a method, the enclosing
method is annotated; and if it adds a class, interface or
enumeration, these receive the annotation. In addition
to the interaction identifier, the annotations carry ad-
ditional meta-data, such as the task being solved and
the agent that suggested the code.
In Section 4.2, we will describe the Java annota-
tion we have created for the contributions of CARET.
3.3 Orchestrating Conversational
Agents
As Section 3.1 discussed, a conversational assistant
may integrate multiple conversational agents for the
same or different tasks, built with different technolo-
gies, and invoked either by IDE commands (e.g.,
menus) or natural language text/voice requests. To co-
ordinate the agents, we propose the scheme displayed
in Figure 3.
Figure 3: Orchestrating conversational agents. (Diagram summary: agents register for the tasks they support (1); the user issues a natural language message (2a) or an IDE command (2b); messages are classified into a task by an LLM agent (3a); a prompt is built with context injection (4) and sent to the registered agents in the assistant pool (5); the user selects among the returned answers (6), and the selected answer is processed (7) and injected into the IDE (8).)
Agents are registered for the tasks they are able to
manage (label 1 in Figure 3). Users can make assis-
tance requests via messages in natural language (la-
bel 2a) or commands of the IDE (label 2b). When
this happens, the first step is to select the agents that
know how to handle the user request. The case of IDE
commands is direct, since the command (e.g., create
a class implementing an interface) is linked to a con-
crete task, from which the set of suitable (registered)
agents can be retrieved. The case of natural language
messages is more complex, as it requires the classi-
fication of the message into a task. To carry out this
classification, we propose using an LLM agent (label
3a) with a prompt like:
You are a code assistant that helps software de-
velopers in programming tasks. Please classify
into one of the next categories:
1. task-1
2. task-2
...
n. none of the above
the following request: “user-message”.
If the LLM agent responds “none of the above”,
the user is informed that the assistant is not config-
ured to assist with the requested task, and the current
interaction is considered finished. Otherwise, if the
LLM agent returns one of the listed tasks, the agents
registered for that task shall be selected.
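The classification prompt above can be assembled mechanically from the names of the registered tasks. The sketch below illustrates this; the class and method names are ours, for illustration only.

```java
import java.util.List;

// Sketch of building the task-classification prompt of Section 3.3 from the
// list of registered task names. The wording follows the prompt shown above.
public class TaskClassifierPrompt {
    static String build(List<String> tasks, String userMessage) {
        StringBuilder sb = new StringBuilder(
            "You are a code assistant that helps software developers in "
            + "programming tasks. Please classify into one of the next categories:\n");
        for (int i = 0; i < tasks.size(); i++)               // numbered task list
            sb.append(i + 1).append(". ").append(tasks.get(i)).append('\n');
        sb.append(tasks.size() + 1).append(". none of the above\n");
        sb.append("the following request: \"").append(userMessage).append('"');
        return sb.toString();
    }
}
```

The "none of the above" category is what gives the LLM-based classifier a fallback analogous to that of intent-based chatbots.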
Next (label 4), a prompt is constructed using the
user message or IDE command plus the context infor-
mation (e.g., selected code in the IDE). This prompt
is sent to the selected agents (label 5). Then, the user
is presented with the answers provided by the agents, and
can select one of them (label 6). This answer needs to
be processed (label 7) to extract the code and inject it
as required into the software project, using the API of
the IDE programmatically (label 8).
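The registration and agent-selection steps (labels 1, 3a and 5 of Figure 3) boil down to a map from tasks to registered agents. A minimal sketch, with illustrative names of our own:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the agent registry behind Figure 3: agents register for tasks,
// and a request resolved to a task is dispatched to every registered agent.
public class AgentRegistry {
    private final Map<String, List<String>> agentsByTask = new HashMap<>();

    // Label 1: an agent registers for the tasks it can handle.
    void register(String agent, String... tasks) {
        for (String t : tasks)
            agentsByTask.computeIfAbsent(t, k -> new ArrayList<>()).add(agent);
    }

    // Labels 3a/5: agents able to handle the task; empty if unsupported.
    List<String> agentsFor(String task) {
        return agentsByTask.getOrDefault(task, List.of());
    }
}
```

An empty result for a task corresponds to the "none of the above" case, where the user is told the assistant is not configured for the request.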
4 CARET
Next, we introduce CARET, a Java programming as-
sistant we have built following the principles de-
scribed in Section 3. Section 4.1 presents its archi-
tecture, and Section 4.2 describes its functionalities
and showcases usage examples.
4.1 Architecture of CARET
CARET is a plugin for the Eclipse IDE that assists Java
programmers in software development tasks. Figure 4
shows its architecture. The assistant integrates con-
versational agents of different technologies to pro-
cess task requests, like OpenAI’s GPT-3.5 LLMs,
Dialogflow and Rasa. In addition, it can be ex-
tended with other technologies by means of an exten-
sion point named AgentTechnology. Extension points
are the mechanism that Eclipse provides to allow
adding functionality to a system externally (i.e., with-
out changing its internal code).
Figure 4: Architecture of CARET. (Diagram summary: users interact through a ChatView or IDE menus (via the org.eclipse.ui.menus and org.eclipse.ui.commands extension points); a Task Classifier classifies textual requests; the Request Processor delegates to per-task command classes (e.g., RenameMethodCommand), obtaining agents from the Agent Orchestrator; agents plug in through the AgentTechnology extension point (OpenAI GPT LLMs, Dialogflow, Rasa); an Annotation Injector annotates the code, the traceability model is updated and queried, and the project is modified through the Eclipse JDT.)
Users can request assistance in two ways: selecting
task-specific menus in the IDE, or writing
a text request on a Chat View. In the former case, the
selected menu determines unambiguously the task to
perform. In the latter case, a Task Classifier tries to
find the task that best fits the text request by sending
the prompt drafted in Section 3.3 to a GPT-3.5 LLM.
The Request Processor coordinates the accom-
plishment of the tasks. It delegates the task execution
to the command class that implements the task be-
haviour (there is one command class per task), pass-
ing the agents that will handle the task as parameters.
The Request Processor obtains the agents from the
Agent Orchestrator. Currently, the set of registered
agents is predefined, and comprises one agent of each
supported technology.
The above-mentioned command classes send to
the agents a prompt tailored to the target task, which
includes the user request and the necessary context in-
formation. The response from the agents is processed
through an interface implemented by all agents con-
forming to the AgentTechnology extension point. The
response includes text, the matching intent (if any),
context information, and code suggestions. The Re-
quest Processor displays the response in the Chat
View, and asks the user for confirmation to apply the
suggestion. If the answer is positive, the project code
is modified using the Eclipse JDT^17, the interaction is
traced (and can be saved/retrieved in JSON format),
and the modified code is annotated.
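As a sketch of what a JSON-serialised trace entry might look like, the snippet below emits the core fields of Figure 2 using only the standard library. The field names mirror the traceability model, but the actual on-disk format of CARET is not documented here, so treat them as assumptions.

```java
// Sketch of serialising one traced interaction to JSON without a JSON
// library. Field names follow Figure 2; CARET's real format may differ.
public class TraceJson {
    static String toJson(String id, long timestamp, String task,
                         String userText, boolean isTextual) {
        return String.format(
            "{\"interactionId\":\"%s\",\"timestamp\":%d,\"task\":\"%s\","
            + "\"isTextual\":%b,\"userText\":\"%s\"}",
            id, timestamp, task, isTextual,
            userText.replace("\"", "\\\"")); // escape embedded quotes
    }
}
```

In practice a JSON library would be preferable; the point is only that each interaction reduces to a small, persistable record.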
^17 https://projects.eclipse.org/projects/eclipse.jdt
4.2 Tool Support
The user can interact with the assistant by sending a
message through the Chat View or using contextual
menus that appear when right-clicking on the project
files or a selected code fragment. Currently, CARET
assists with the following tasks:
Code completion: CARET is able to create a new
project with the given name, a new class or inter-
face with the given name in the current project, a
class implementing a given interface, or a subclass
of a given abstract class. It can also generate the
code of a method, for which the user must provide
either a description of the method, or the method
name and its parameters.
Documentation: It generates the Javadoc com-
ments for a complete Java file. If the user does not
provide a file but a code fragment, it can gener-
ate either Javadoc comments or line-by-line com-
ments for the code.
Unit testing: It creates a JUnit test for a given
class.
Error detection and correction: It can help detect
simple semantic errors and propose corrections.
Both functionalities rely solely on GPT-3.5 (i.e.,
the assistant does not integrate analysis or error
detection/fix methods developed ad-hoc for Java).
Code optimisation: CARET provides four optimi-
sation options for a selected code fragment: ef-
ficiency improvement, readability improvement,
complexity reduction, or general optimisation.
Code comprehension: It produces an explanation
in natural language of a selected piece of code.
Method (re)naming: It renames a method to re-
flect its behaviour. Section 5 will evaluate the suit-
ability of such renaming suggestions.
After processing the user request, the code of the
suggested solution is displayed in a pop-up window,
so that the user can decide to apply it or not. As an ex-
ample, Figure 5 shows the response of CARET when
the user selects the code of method “power” in the
Java editor, and clicks on the menu option “Improve
efficiency”. The suggested code improvement is dis-
played in a popup window. If the user accepts the
suggestion, the suggested code is replaced in the Java
editor, and the Chat View shows both the new code
and its explanation.
Figure 5: Screenshot of interaction with CARET. The popup window shows the code suggestion for improving the efficiency of method “power”.
In addition, accepting the assistant suggestion automatically adds a code annotation @Generated to the modified method, class or interface. This annotation, which we have designed for tracing CARET contributions, allows keeping track of the assistant-generated code. It has four parameters: the name of the agent that produced the code, the performed task, the identifier of the interaction, and the timestamp (cf. Figure 2).
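The annotation could be declared along the following lines. The member names (agent, task, interaction, timestamp) are our guesses from the four parameters listed above; CARET's actual declaration may differ.

```java
import java.lang.annotation.*;

// Illustrative re-creation of CARET's tracing annotation, based on the four
// parameters described in the text; member names are our assumption.
public class GeneratedDemo {

    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.METHOD, ElementType.TYPE})
    @interface Generated {
        String agent();        // name of the agent that produced the code
        String task();         // performed task, e.g. "reduce complexity"
        String interaction();  // identifier of the interaction
        String timestamp();    // when the suggestion was applied
    }

    @Generated(agent = "GPT-3.5", task = "reduce complexity",
               interaction = "42", timestamp = "2024-01-15T10:00:00")
    static long factorial(int n) {
        long result = 1;
        for (int i = 2; i <= n; i++) result *= i;
        return result;
    }

    public static void main(String[] args) throws Exception {
        // With runtime retention, the trace can be read back via reflection.
        Generated g = GeneratedDemo.class
            .getDeclaredMethod("factorial", int.class)
            .getAnnotation(Generated.class);
        System.out.println(g.agent() + " / " + g.task());
    }
}
```

Runtime retention lets tooling (e.g., the Chat View's “Go to” button) recover the provenance of a method directly from the compiled code.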
As an illustration, Figure 6 shows the code anno-
tation added to the “factorial” method (lines 63–64).
Its parameters indicate that the GPT-3.5 agent modi-
fied the method to reduce its complexity. For conve-
nience, the Chat View at the bottom displays the intro-
duced code, a “Copy code” shortcut button, and a “Go
to” button which opens the file with the modified code
and positions the cursor in the modified code. The lat-
ter information (modified resource and code fragment
objects) is retrieved from the traceability model that
stores the user-assistant interactions, as explained in
Section 3.2.
5 EVALUATION
This section evaluates the suitability of the assistance provided by CARET. Given the diverse range of tasks that CARET supports, we select a representative one, method renaming, for evaluation, and leave the evaluation of the remaining tasks for future work.
Figure 6: Screenshot of applied code suggestion for reducing the complexity of method “factorial”, and generated code annotation.
Method renaming is a common task during coding and maintenance. It seeks to align the method name with its implementation. Good method names are important to make the code comprehensible: “if you have a good name for a method, you don’t need to look at the body” (Fowler, 1999). Conversely, inconsistent method names make the code difficult to understand and maintain (Liu et al., 2019). As reviewed in Section 2.2, many different approaches have been investigated for this task. Our goal is to assess whether the LLM-based agents of CARET are fit for it. Thus, our evaluation aims to answer the following research question (RQ):
Can CARET help to improve method names?
Next, Section 5.1 characterises the experimental
setup, Section 5.2 describes the evaluation protocol,
Section 5.3 analyses the results and answers the RQ,
and Section 5.4 discusses the potential threats to va-
lidity.
The experiment results are available at https://
github.com/caretpro/experiment.
5.1 Experiment Setup
The evaluation considers four Java projects. Table 1
shows a summary of them, detailing the number of
compilation units (i.e., classes, interfaces, enums), the
number of methods, and the lines of code (LoC).
Table 1: Summary of selected projects.
Name # Units # Methods # LoC
Tutorial-compiler 11 66 2216
JVector 96 646 5221
Log4J-detector 19 117 3008
Ramen 78 362 5114
Total 204 1191 15559
The first three projects in Table 1 were taken from
GitHub public repositories using the following query:
created:>2021-10-01 stars:>100 size:<3500
path:**/.project language:Java
ENASE 2024 - 19th International Conference on Evaluation of Novel Approaches to Software Engineering
The goal of this query was to find popular Java repositories (with more than 100 stars), of medium size (less than 3500 KB), created after the training data cut-off of GPT-3.5 (September 2021). Thus, from the top of the
list of retrieved projects, we discarded those either too
small or hard to build due to their numerous depen-
dencies. The fourth project is a student project from a
programming course at our university, stored in a pri-
vate repository. Overall, the domains of the selected projects are diverse, comprising compilers¹⁸, embedded vector search engines¹⁹, vulnerability detection due to the use of Log4J²⁰, and a social network with a Swing graphical user interface.
5.2 Experiment Design
To evaluate the suitability of the method names sug-
gested by CARET, we have performed a user study that
follows the scheme depicted in Figure 7.
Figure 7: Scheme of the experiment design. For each of the 4 projects, 2 methods are selected; each method receives 4 name suggestions (original name, suggestion with the original, suggestion without the original, baseline) presented in random order; each case is rated by 3 participants on a 5-point Likert scale.
We first selected four Java projects as explained in
Section 5.1. Then, we prepared a questionnaire with
two parts: one collecting demographic data about the
participants, and the other evaluating name sugges-
tions for eight methods (two of each project). The
method selection criterion was to have less than 20
LoC (to prevent participants from getting tired and to
facilitate their understanding of the aim of the code)
but not be trivial (e.g., getters and setters were ex-
cluded). For each method, the questionnaire pre-
sented its body and parameters, and suggested four
names that participants had to rate using a 5-point
Likert scale. The suggestions included the original
method name, a baseline name made of the con-
catenation class-name+“Method”, and two names
suggested by CARET using the GPT-3.5 agent with
two variants of the prompt. The prompt of the first variant included the body and the original name of the method, while the second one included the body but not the method name. The GPT-3.5 agent used GPT-3.5-turbo with the parameter temperature set to 0.7. As an example, the next four names were presented for the same method: insertNotDiverse (original), concurrentNeighborSetMethod (baseline), insertNonDiverseNode (CARET variant 1), and updateNeighbors (CARET variant 2).
18 https://github.com/wangjs96/A-tutorial-compiler-written-in-Java
19 https://github.com/jbellis/jvector
20 https://github.com/mergebase/log4j-detector
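A minimal sketch of how the two prompt variants might be assembled is shown below. The exact wording CARET sends to the model is not given in the paper, so these templates are hypothetical; they only capture the stated difference (presence or absence of the original method name).

```java
// Hypothetical reconstruction of the two prompt variants used in the
// experiment; the real prompt wording is not disclosed in the paper.
public class RenamePrompts {

    // Variant 1: the prompt includes the method body AND its original name.
    static String withOriginalName(String name, String body) {
        return "Suggest a better Java method name for the method '" + name
             + "' with this body:\n" + body;
    }

    // Variant 2: the prompt includes only the method body.
    static String withoutOriginalName(String body) {
        return "Suggest a Java method name for a method with this body:\n" + body;
    }

    public static void main(String[] args) {
        String body = "return a * b;";
        System.out.println(withOriginalName("calc", body));
        System.out.println(withoutOriginalName(body));
    }
}
```

Omitting the original name (variant 2) prevents the model from being anchored to a possibly misleading identifier, which is the rationale behind testing both variants.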
Each evaluation case comprised 8 methods (2
from each project) and was evaluated by 3 partici-
pants. To avoid any bias, participants did not know
how each name suggestion was generated, and the or-
der of presentation of the methods and name sugges-
tions was randomised. We recruited 12 participants in total, who rated 32 distinct methods (3 participants per method), yielding 96 method evaluations overall. The evaluation was conducted
offline. Participants received the questionnaires by
email and were given 5 days to submit their responses.
The questionnaires used are available at: https://
github.com/caretpro/experiment.
5.3 Results and Answer to RQ
Demographics of Participants. The age of the par-
ticipants ranged from 21 to 41 (31.9 years on aver-
age). Figure 8 summarises their demographic data.
Overall, 83% of participants were men and 17% were women. In terms of educational level, 50% had a
PhD, 34% had a master’s degree, 8% had a bachelor’s
degree, and 8% were undergraduate students.
Figure 8: Demographic data of participants. (a) Gender: 83% male, 17% female. (b) Studies: 50% PhD, 34% master’s degree, 8% bachelor’s degree, 8% undergraduate students.
As Table 2 shows, the participants had an average
of 9.75 years of experience in software development,
and 4.75 years in Java development. They rated their
knowledge of Java from 1 (none) to 5 (expert), ob-
taining an average of 3.42. Hence, overall, the partic-
ipants declared to have good experience in software
and Java development, and a fair knowledge of Java.
Table 2: Years of experience in development.
Experience Average (years)
Software development 9.75
Java development 4.75
Evaluation Results. Before analysing the responses to the questionnaires, it is worth noting that all method names generated by CARET are valid (e.g., they do not start with a number or special symbol) and follow the Java naming convention of lower camel-case.
Figure 9: Distribution of scores of the suggested method names. (a) Distribution of the averages of the three scores received by each method. (b) Distribution of all scores (i.e., without averaging per method).
With regard to the questionnaires, the box plots
in Figure 9 depict the distribution of scores that
each method renaming strategy received. In the box
plots, new1 corresponds to the assistant suggestions
produced with the prompt that includes the original
method name, and new2 to those produced omitting
the original method name. As Section 5.2 explained,
3 participants evaluated each method. Thus, Fig-
ure 9(a) shows the distribution of the average score
values of each method (i.e., 32 data points per se-
ries), and Figure 9(b) shows the distribution of all
scores without averaging per method (i.e., 32×3=96
data points per series).
We can see that the average score (marked with a
cross in the box plots) is 3.05 (out of 5) for the orig-
inal method names, 3.95 for strategy new1, 3.74 for
strategy new2, and 1.51 for the baseline names. As
expected, the baseline names were the lowest rated.
Furthermore, in Figure 9(b), the median of the scores for the original method names is 3, while for the two assistant-generated strategies it is 4.
Figure 10 shows the results disaggregated by
project, for the average scores (as in Figure 9(a)).
Across all projects, the average and median of both
new1 and new2 are higher than those of original, and
baseline is consistently the worst.
Now, we delve into the difference in score be-
tween the original method names and those suggested
by the assistant. The left bar of Figure 11 shows the
percentage of method names for which the average
score of both new1 and new2 is higher than the average
score of original. Overall, both new1 and new2 scored
higher in more than half of the methods. The bar on
the right shows the percentage of methods where ei-
ther new1 or new2 is ranked higher than original. In this
case, either new1 or new2 was ranked higher than the
original name for more than 93% of the methods.
Figure 10: Distribution of scores disaggregated by project.
Figure 11: Comparison of scores between the original method names and the assistant suggestions: 56.25% of the methods for the left bar (both new1 and new2 above original) and 93.75% for the right bar (either new1 or new2 above original).
Finally, to analyse whether the differences in scores of new1, new2 and original are statistically significant, we use the Wilcoxon signed-rank test (Wilcoxon, 1945) to compare the sample groups by paired ratings.
First, we define the null hypothesis H0 as “there is no difference between the median scores of the original names and the new1 suggestions”. This test results in W = 616, Z(cal) = -4.917, α = 0.05, Z(α/2) = 1.96, and p-value = 0.0000008. Since p-value < α, we reject H0 and state with 95% confidence that there is a significant difference between the median scores of the original names and the new1 suggestions.
Second, we set H0 to “there is no difference between the median scores of the original names and the new2 suggestions”. This test results in W = 797.5, Z(cal) = -3.356, α = 0.05, Z(α/2) = 1.96, and p-value = 0.0007897. Since p-value < α, we reject H0 and state with 95% confidence that there is a significant difference between the median scores of the original names and the new2 suggestions.
Finally, we set H0 to “there is no difference between the median scores of the new1 and new2 names”. This results in W = 1264, Z(cal) = -1.227, α = 0.05, Z(α/2) = 1.96, and p-value = 0.2196459. Since p-value > α, we cannot reject H0: the data show no significant difference between the median scores of the new1 and new2 names.
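For reference, the W statistic reported in these tests can be computed as sketched below: the smaller of the positive and negative rank sums over the non-zero paired differences, with tied absolute differences receiving average ranks. This is our own illustration of the test statistic, not the authors' analysis script.

```java
import java.util.Arrays;

// Sketch of the Wilcoxon signed-rank W statistic for paired samples.
public class Wilcoxon {

    static double wStatistic(double[] x, double[] y) {
        int n = x.length;
        double[] d = new double[n];
        int m = 0;
        for (int i = 0; i < n; i++) {            // paired differences, drop zeros
            double diff = x[i] - y[i];
            if (diff != 0) d[m++] = diff;
        }
        d = Arrays.copyOf(d, m);
        final Double[] abs = new Double[m];
        Integer[] idx = new Integer[m];
        for (int i = 0; i < m; i++) { abs[i] = Math.abs(d[i]); idx[i] = i; }
        Arrays.sort(idx, (i, j) -> Double.compare(abs[i], abs[j]));
        double[] rank = new double[m];
        for (int i = 0; i < m; ) {               // average ranks for tied |d|
            int j = i;
            while (j < m && abs[idx[j]].equals(abs[idx[i]])) j++;
            double avg = (i + j + 1) / 2.0;      // ranks are 1-based
            for (int k = i; k < j; k++) rank[idx[k]] = avg;
            i = j;
        }
        double wPlus = 0, wMinus = 0;            // split rank sums by sign
        for (int i = 0; i < m; i++) {
            if (d[i] > 0) wPlus += rank[i]; else wMinus += rank[i];
        }
        return Math.min(wPlus, wMinus);
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5}, b = {2, 4, 1, 6, 5};
        System.out.println(wStatistic(a, b));    // prints 3.0
    }
}
```

For the sample sizes above, W is then converted to the reported Z(cal) via the usual normal approximation, Z = (W - n(n+1)/4) / sqrt(n(n+1)(2n+1)/24).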
Answering the RQ. For the dataset used, the participants perceived the original method names as less appropriate than the new1 and new2 suggestions produced by CARET. Hence, we can answer that CARET's suggestions could have helped to improve the method names in this study.
5.4 Threats to Validity
Internal validity refers to the extent to which there is a causal relationship between the conducted experiment and the resulting conclusions. We attempted to avoid bias in the data by selecting Java projects developed by third parties and unlikely to be present in GPT-3.5's training data. We also tried to prevent bias
in the experiment by randomising the order of the
projects and method names in the questionnaires, and
by not revealing to the participants which mechanism
was used for each presented method name.
External validity concerns the generalisability of the
results. The study involved 12 participants who produced 384 ratings of name suggestions, covering 96 method evaluations (32 distinct methods, 4 suggestions each) from 4 projects.
While this is a fair amount of data, more evidence
would be obtained with larger sets of participants and
methods. Moreover, the participants rated methods
with less than 20 LoC, so the results may differ for
longer methods. Our study used GPT-3.5 with a tem-
perature value of 0.7, but we cannot claim that this is
the best value for solving the method renaming task.
In the future, we will experiment with other tempera-
ture values to assess the quality of the output.
Construct validity is the extent to which an experi-
ment accurately measures the concept it intends to
evaluate. Since our evaluation is based on a subjec-
tive assessment of the appropriateness of the method
names, we compiled 3 evaluations per method and av-
eraged the scores. We did not consult the original
project developers (e.g., via pull requests as in (Liu
et al., 2019)), but 12 independent developers evalu-
ated the method names. To validate that the opinion
of the participant developers was aligned and there
were no outliers, we measured the inter-rater reliabil-
ity using Fleiss’ kappa (Fleiss, 1971). The level of
agreement between the participants was between 0.2 and 0.4 in all projects, which, according to (Landis and Koch, 1977), can be considered fair.
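The inter-rater agreement figure above can be reproduced with Fleiss' kappa over an N×k matrix of category counts (N rated items, each rated by the same number n of raters, k rating categories). The sketch below illustrates the measure itself; it is not the authors' computation script.

```java
// Sketch of Fleiss' kappa (Fleiss, 1971) for an N x k count matrix, where
// counts[i][j] is how many of the n raters assigned category j to item i.
public class FleissKappa {

    static double kappa(int[][] counts) {
        int N = counts.length, k = counts[0].length;
        int n = 0;                               // raters per item
        for (int c : counts[0]) n += c;
        double pBar = 0;                         // mean per-item agreement
        double[] pj = new double[k];             // category proportions
        for (int[] row : counts) {
            double s = 0;
            for (int j = 0; j < k; j++) { s += (double) row[j] * row[j]; pj[j] += row[j]; }
            pBar += (s - n) / (n * (n - 1.0));
        }
        pBar /= N;
        double pe = 0;                           // agreement expected by chance
        for (int j = 0; j < k; j++) { pj[j] /= (double) N * n; pe += pj[j] * pj[j]; }
        return (pBar - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        // Three raters, two categories: perfect agreement yields kappa = 1.
        System.out.println(kappa(new int[][]{{3, 0}, {0, 3}}));  // prints 1.0
    }
}
```

Values in the 0.21–0.40 band correspond to "fair" agreement in the Landis and Koch scale cited above.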
6 CONCLUSIONS AND FUTURE
WORK
Intelligent conversational assistants will soon become
an integral part of most development processes and
environments (Ozkaya, 2023b). With this expecta-
tion, we have explored the option space for their in-
tegration within IDEs, and proposed a traceability
model and a coordination scheme for multiple con-
versational agents. We have realised our proposal in
CARET, a Java assistant for Eclipse that helps in tasks
such as code completion, documentation, code opti-
misation, and unit test generation. Finally, we have
conducted a user study of the method renaming task
supported in CARET, with very promising results.
We are currently improving CARET to allow regis-
tering any number of conversational agents that would
be orchestrated in combination with the predefined
ones. We would like to add further homogeneous
agents, e.g., focused on autocompletion using LLMs
for code, such as GitHub Copilot. Our goal is to
automate the creation of conversational assistants for
other programming languages (e.g., Python or C++),
Eclipse plugins, testing frameworks (e.g., Cucumber),
or model-driven development (e.g., the Eclipse Mod-
eling Framework (Steinberg et al., 2008)). We also
plan to explore the possibility of injecting additional context into the assistants via prompts that include, e.g., the user expertise or company-specific coding standards and guidelines. Finally, as with method renaming, we intend to evaluate the other tasks supported
by CARET, taking as a basis works on evaluation of
LLMs for code (Chen et al., 2021).
ACKNOWLEDGEMENTS
Work funded by the Spanish MICINN with projects
SATORI-UAM (TED2021-129381B-C21), FINESSE
(PID2021-122270OB-I00), and RED2022-134647-T.
REFERENCES
Abdalkareem, R., Shihab, E., and Rilling, J. (2017). What
do developers use the crowd for? A study using stack
overflow. IEEE Softw., 34(2):53–60.
Barke, S., James, M. B., and Polikarpova, N. (2023).
Grounded Copilot: How programmers interact with
code-generating models. Proc. ACM Program. Lang.,
7(OOPSLA1):85–111.
Bradley, N. C., Fritz, T., and Holmes, R. (2018). Context-
aware conversational developer assistants. In ICSE,
pages 993–1003. ACM.
Brambilla, M., Cabot, J., and Wimmer, M. (2017). Model-
driven software engineering in practice, 2nd edition.
Synthesis Lectures on Software Engineering. Morgan
& Claypool Publishers.
Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto,
H. P., Kaplan, J., et al. (2021). Evaluating
large language models trained on code. CoRR,
abs/2107.03374.
Chen, Y., Fu, Q., Yuan, Y., Wen, Z., Fan, G., Liu, D., Zhang,
D., Li, Z., and Xiao, Y. (2023). Hallucination de-
tection: Robustly discerning reliable answers in large
language models. In CIKM, pages 245–255. ACM.
Fleiss, J. L. (1971). Measuring nominal scale agreement
among many raters. Psychological Bulletin, 76:378–
382.
Fowler, M. (1999). Refactoring - Improving the Design of
Existing Code. Addison Wesley object technology se-
ries. Addison-Wesley.
Gamma, E., Helm, R., Johnson, R., and Vlissides, J. M.
(1994). Design Patterns: Elements of Reusable
Object-Oriented Software. Addison-Wesley Profes-
sional, 1 edition.
Gasparic, M. and Ricci, F. (2020). IDE interaction support
with command recommender systems. IEEE Access,
8:19256–19270.
Guerra, E. M., Cardoso, M., Silva, J. O., and Fernandes,
C. T. (2010). Idioms for code annotations in the Java
language. In SugarLoafPLoP, pages 7:1–7:14. ACM.
Kang, K., Cohen, S., Hess, J., Novak, W., and Peterson,
A. (1990). Feature-oriented domain analysis (FODA)
feasibility study. Technical Report CMU/SEI-90-TR-
021, Software Engineering Institute, Carnegie Mellon
University, Pittsburgh, PA.
Landis, J. R. and Koch, G. G. (1977). The measurement of
observer agreement for categorical data. Biometrics,
33:159–174.
Li, R. et al. (2023). StarCoder: May the source be with you! CoRR, abs/2305.06161. See also https://huggingface.co/blog/starcoder.
Liu, K., Kim, D., Bissyandé, T. F., Kim, T., Kim, K., Koyuncu, A., Kim, S., and Traon, Y. L. (2019). Learning to spot and refactor inconsistent method names. In ICSE, pages 1–12. IEEE / ACM.
Ozkaya, I. (2023a). Application of large language models to
software engineering tasks: Opportunities, risks, and
implications. IEEE Softw., 40(3):4–8.
Ozkaya, I. (2023b). The next frontier in software devel-
opment: AI-augmented software development pro-
cesses. IEEE Softw., 40(4):4–9.
Pérez-Soler, S., Guerra, E., de Lara, J., and Jurado, F. (2017). The rise of the (modelling) bots: Towards assisted modelling via social networks. In ASE, pages 723–728. IEEE Computer Society.
Pérez-Soler, S., Juárez-Puerta, S., Guerra, E., and de Lara, J. (2021). Choosing a chatbot development tool. IEEE Softw., 38(4):94–103.
Rich, C. and Waters, R. C. (1988). The programmer’s ap-
prentice: A research overview. Computer, 21(11):10–
25.
Robe, P. and Kuttal, S. K. (2022). Designing PairBuddy: A conversational agent for pair programming. ACM Trans. Comput.-Hum. Interact., 29(4).
Ross, S. I., Martinez, F., Houde, S., Muller, M., and Weisz,
J. D. (2023). The programmer’s assistant: Conver-
sational interaction with a large language model for
software development. In IUI, pages 491–514. ACM.
Savary-Leblanc, M., Burgueño, L., Cabot, J., Pallec, X. L., and Gérard, S. (2023). Software assistants in software engineering: A systematic mapping study. Softw. Pract. Exp., 53(3):856–892.
Steinberg, D., Budinsky, F., Merks, E., and Paternostro, M.
(2008). EMF: Eclipse Modeling Framework, 2nd edi-
tion. Pearson Education.
Wasowski, A. and Berger, T. (2023). Domain-specific lan-
guages - Effective modeling, automation, and reuse.
Springer.
Wilcoxon, F. (1945). Individual comparisons by ranking
methods. Biometrics, 1:196–202.
Xu, F. F., Alon, U., Neubig, G., and Hellendoorn, V. J.
(2022a). A systematic evaluation of large language
models of code. In MAPS@PLDI, pages 1–10. ACM.
Xu, F. F., Vasilescu, B., and Neubig, G. (2022b). In-IDE
code generation from natural language: Promise and
challenges. ACM Trans. Softw. Eng. Methodol., 31(2).
Yang, Y., Xia, X., Lo, D., and Grundy, J. C. (2022). A sur-
vey on deep learning for software engineering. ACM
Comput. Surv., 54(10s):206:1–206:73.
Zhang, J., Luo, J., Liang, J., Gong, L., and Huang, Z.
(2023). An accurate identifier renaming prediction
and suggestion approach. ACM Trans. Softw. Eng.
Methodol., 32(6):148:1–148:51.
Zhao, W. X. et al. (2023). A survey of large language mod-
els. https://arxiv.org/abs/2303.18223.