Dynamic Mitigation of RESTful Service Failures Using LLMs
Sébastien Salva and Jarod Sue
LIMOS - UMR CNRS 6158, Clermont Auvergne University, UCA, France
Keywords:
RESTful APIs, Security, Software Healing, LLM.
Abstract:
This paper presents a novel self-healing approach for RESTful services, leveraging the capabilities of large language models (LLMs) to generate source code that implements fine-grained mitigations. The proposed solution introduces 18 healing operators tailored for RESTful services, accommodating both grey-box and black-box perspectives. These operators implement a dual-mitigation strategy. The first mitigation employs encapsulation techniques, enabling dynamic service adaptation by generating supplementary source code without modifying the original implementation. If the primary mitigation fails, a fallback mitigation is applied to maintain service continuity. We investigate the potential of LLMs to perform the first mitigation of these healing operators by means of chains of prompts we specifically designed for these tasks. Furthermore, we introduce a novel metric that integrates test-passing correctness and LLM confidence, providing a rigorous evaluation framework for the effectiveness of the mitigations performed by LLMs. Preliminary experiments using four healing operators on 15 RESTful services with multiple and varied vulnerabilities demonstrate the approach's feasibility and adaptability across both grey-box and black-box perspectives.
1 INTRODUCTION
The rise of service-oriented architectures and the
widespread adoption of RESTful services have rev-
olutionised modern software system design, offering
scalability, modularity, and ease of integration. How-
ever, ensuring the reliability and security of these ser-
vices remains a persistent challenge, as the complex-
ity of service compositions increases. This challenge
is exacerbated by the growing reliance on large lan-
guage models (LLMs) for generating and maintaining
service code, which introduces new risks of vulnera-
bilities and unpredictable behaviours. In such envi-
ronments, the need for robust techniques to address
failures dynamically is more critical than ever. To
address this need, (self) dynamic repairing or heal-
ing approaches may be considered: repairing localises
and patches bugs in the source code, while healing
applies mitigations to maintain software availability
whenever possible. Repairing approaches are promis-
ing, but at the moment, they are not always accurate,
as they may generate code that contains bugs (Jesse
et al., 2023) or code with vulnerabilities (Schuster
et al., 2021).
Existing software healing approaches (Tosi et al.,
2007; Dinkel et al., 2007; Subramanian et al., 2008;
Vizcarrondo et al., 2012; Wang, 2019; Rajagopalan
and Jamjoom, 2015a; Magableh and Almiani, 2020)
offer valuable strategies to mitigate the impact of bugs
or vulnerabilities by maintaining the availability and
functionality of services during failures. These tech-
niques, however, often rely on relatively basic solu-
tions, such as service restarts, rollbacks (Subramanian
et al., 2008; Wang, 2019; Rajagopalan and Jamjoom,
2015b), or manual interventions (Tosi et al., 2007;
Dinkel et al., 2007; Subramanian et al., 2008; Vizcar-
rondo et al., 2012; Rajagopalan and Jamjoom, 2015a),
which can be labour-intensive and are not always scal-
able in dynamic and highly interconnected systems.
Given the increasing complexity of RESTful service
architectures, there is a pressing need for innovative,
fine-grained healing methods that can dynamically
adapt to diverse failure scenarios while minimising
service disruptions.
This paper proposes a healing approach to quickly enhance the reliability and security of RESTful services deployed in production environments. Compared to previous healing approaches, we propose algorithms and healing operators aimed at performing fine-grained mitigations. To this end, our healing operators rely on two kinds of mitigations. The first aims to heal a faulty service precisely using encapsulation techniques and source code generation. This mitigation generates source code, but unlike repair approaches that change the original source code to fix an error, it produces additional source code that implements an encapsulation technique to dynamically adapt the service behaviour with precision, considering either a grey-box or a black-box perspective. If the first mitigation fails, a simpler fallback mitigation, such as one of those listed previously, is applied. The choice between black-box and grey-box perspectives is primarily driven by the availability of source code. With the black-box perspective, the service is treated as an opaque entity, enabling mitigation without requiring access to internal structures, making it suitable for proprietary or legacy systems. In contrast, a grey-box perspective leverages partial knowledge of the system, allowing for more precise mitigations that may also be more resource-efficient.
Depending on the nature of the observed errors and the structure of the original service source code, implementing mitigations using encapsulation techniques can be complex and time-consuming. In this paper, we also explore the potential of leveraging generative AI, specifically LLMs, to simplify this task and generate the source code that implements mitigations. Recent studies indeed show that LLMs are effective at generating source code for tasks of easy to medium difficulty (Austin et al., 2021; Helander et al., 2024). The first mitigation of a healing operator is hence performed by an LLM called with chains of prompts, which break down a healing process into successive tasks. These prompt chains were developed following a strict protocol to ensure that healing operators are both reproducible and effective. We also provide a metric to evaluate the success of a mitigation, based upon two main concepts: test passing correctness and LLM confidence. The former assesses how well test suites capture defects; the latter quantifies the model's internal confidence in the correctness of its output and helps detect hallucinations. Furthermore, this paper presents extensive experiments on 4 RESTful service compositions (15 services) to demonstrate the practical benefits and limitations of this approach. The results show that our approach achieves an 85% success rate in healing services with LLM-based mitigations. In summary, the major contributions of this paper are:
- the definition of 18 healing operators specialised for RESTful services and composed of two mitigations. The first one, denoted m1, uses an encapsulation technique considering either a black-box or a grey-box perspective. The second mitigation, m2, based upon older techniques, is considered as a fallback solution;
- the design of two algorithms that materialise the testing and healing processes of RESTful services;
- the study of the capabilities of LLMs, regarded as general generators of source code, to perform m1. We provide generic chains of prompts, i.e. prompts composed of variables, to perform the mitigations;
- the definition of a metric to evaluate the successful application of m1. This metric is based upon test passing correctness and LLM confidence;
- the evaluation of 4 healing operators on 4 RESTful service compositions having 4 different vulnerabilities, using two LLMs, considering both grey-box and black-box perspectives. We evaluate the effectiveness of our operators in healing service methods having one vulnerability, and service methods having several vulnerabilities at the same time. We also investigate how LLM confidence aligns with test passing correctness.
The paper is organised as follows: Section 2 re-
views the related work. In Section 3, we present the
context of this study and introduce key definitions.
Section 4 details the algorithms for vulnerability de-
tection and service healing. The design of mitigations
using LLMs, along with the evaluation metric, are
covered in Section 5. Section 6 presents our evalu-
ation results. Finally, Section 7 summarises our con-
tributions and outlines some perspectives for future
work.
2 RELATED WORK
Several surveys have been proposed in the literature to classify the features, evaluate the performance and list the limitations of self-healing systems. Frei et al. proposed a terminology and taxonomy for self-healing and self-repair, considering all areas of engineering (Frei et al., 2013). Schneider et al. focused on healing systems and proposed a comprehensive overview, presented as a table showing properties of healing systems such as the detection criteria or the recovery techniques (Schneider et al., 2015). Ghahremani et al. evaluated the performance of self-healing systems, provided another classification of the different input types for such systems, and analysed the limitations of each input type (Ghahremani and Giese, 2020).
A significant portion of healing systems, such as those in (Tosi et al., 2007; Dinkel et al., 2007; Subramanian et al., 2008; Vizcarrondo et al., 2012), focuses on web services and service compositions, which are ubiquitous in the industry. Given their prevalence and the critical need for resilience, it is unsurprising that numerous software healing solutions specifically targeting web services have been proposed in the literature. In the previous surveys, the techniques listed to heal services are limited in number and mostly correspond to: reconfiguring the available resources (isolation, rerouting messages to another instance, service restart, data restoration), replacing the service with an older version, adding Quality of Service (QoS) metadata, limiting access to some URLs with rule policies, reorganising the service composition if the service descriptions are provided, or performing manual intervention. The surveys concluded that the characteristics of the recovery techniques are not sophisticated, and thus the validity of the outcome is often not clear. This is mostly due to a limitation of the healing approaches: they do not change the system source code. Some more recent papers about service or micro-service healing presented more sophisticated techniques, such as machine learning to detect anomalies, but the chosen recovery techniques still come down to managing the versions of services without human intervention (Wang, 2019; Rajagopalan and Jamjoom, 2015a; Magableh and Almiani, 2020).
Compared to the previous approaches, we study the possibility of using LLMs to improve the healing process. Overall, AI has been used in various software engineering activities for decades. Allamanis et al. proposed a survey of machine learning methods applied to source code in (Allamanis et al., 2018). Many learning methods have been proposed for program repair, e.g., (Fan et al., 2023; Jiang et al., 2023; Yasunaga and Liang, 2020; Chen et al., 2021; Maniatis and Tarlow, 2023). These approaches offer promising results, but they may still generate code that contains bugs (Jesse et al., 2023) or code with known and risky vulnerabilities (Schuster et al., 2021). Recently, LLMs have been increasingly used for programming. For example, Kanade et al. proposed CuBERT, a specialised transformer model for code understanding (Kanade et al., 2020). Other papers showed that LLMs such as ChatGPT can solve easy and medium programming problems (Austin et al., 2021; Helander et al., 2024). They can also correctly audit applications to detect vulnerabilities and explain the results (Ma et al., 2024).
This paper studies and leverages the ability of LLMs, called with specialised chains of prompts, to generate source code for healing services with fine granularity, i.e. at the method level instead of the service level as in the previous healing approaches. If the result provided by the LLM fails to mitigate a vulnerability, our algorithms apply one of the classical recovery techniques cited previously as a fallback mitigation.
3 CONTEXT AND ASSUMPTIONS
Let S be a countable set of RESTful services, and T be a countable set of test cases. A service s ∈ S is deployed in an environment that supports testing from two perspectives: black-box and grey-box testing. With the former, the service s is accessible exclusively through HTTP requests and responses, referred to as communication events. To determine the success or failure of a test, these events can be read (assuming no encryption). The events also include parameter assignments, which allow for the identification of their source and destination. In the case of grey-box testing, the service s can still be exercised using events, but the project's source files are also accessible.
To simplify fault localisation, which is not the primary focus of this paper, we assume that a service s ∈ S can be tested in isolation. This assumption offers several practical benefits. First, fewer resources are required, as each test case is executed on a single service instance at a time. Testing s in isolation removes dependencies, preventing test execution from being blocked by faults in dependent services and speeding up the process. Additionally, testing in isolation reduces verdict inconsistencies caused by environmental factors, such as network traffic or container management. Finally, isolation testing explicitly simplifies fault localisation.
We consider having a test suite denoted T(s) ⊆ T for each service s ∈ S. T(s) contains test cases encoding functional scenarios, which will be used here for regression testing. A service s passes a test case t ∈ T(s), denoted s passes t, if and only if all possible test executions result in a pass verdict. This implies that the service may need to be tested multiple times to cover all possible scenarios encoded in a single test case and to explore any nondeterministic behaviours of the service. The notation s fails t indicates that s does not pass the test case.
We also assume the existence of additional test suites to assess non-functional aspects of the services. In this work, we focus on security and robustness, although other aspects, such as performance, could be explored in future research. We denote V as the countable set of vulnerabilities, and Tv(s) ⊆ T as the test suite used to detect whether the service s has the vulnerability v ∈ V. Tv(s) includes test cases that simulate various attack scenarios, or exercise s with unexpected events for robustness. Given a test case t ∈ Tv(s), if s fails t, then we say that s has the vulnerability v. At this point, we consider having the test suites Tv1(s), ..., Tvm(s) to check whether s has the vulnerabilities v1, ..., vm.
However, testing services in isolation requires writing specific test cases that interact with mocks, which simulate the real requisite services called by s. Additionally, writing test cases and mocks that simulate attacks is often considered a challenging task. To alleviate this difficulty, we proposed test case and mock generation algorithms, along with supporting tools, in (Salva and Blot, 2020; Salva and Sue, 2023; Salva and Sue, 2024).
4 VULNERABILITY DETECTION
AND SERVICE HEALING
We propose in this paper a proactive solution for detecting and subsequently healing vulnerabilities in services. This approach involves periodically evaluating a Quality of Service (QoS) measure and, when the QoS is low, healing services. The QoS can be measured at each service deployment, at specific time intervals, or when server errors are detected (e.g., receipt of events with an HTTP status 500). We define the QoS measurement as the ratio of passing tests. We present a first algorithm, Algorithm 1, to exercise each service, detect vulnerabilities and localise them on the service interfaces. If vulnerabilities are detected, a second, healing algorithm is then invoked. Both algorithms are presented subsequently.
4.1 Vulnerability Detection and
Localisation
In our context of service healing, a vulnerability localisation refers to a URL (in black-box mode) or a method (in grey-box mode) that invokes specific source code containing a vulnerability. For a service s and a test case t ∈ Tv(s) such that s fails t, Algorithm 1 infers the localisation, denoted l(s), of the vulnerability v, which is then passed to the healing algorithm.
Definition 1. Let X be a variable set and V a value set. We write x := v for the assignment of the variable x to an arbitrary element v of V.
A vulnerability localisation l(s) for the service s ∈ S is the tuple (URL, Verb, Method, Package) composed of variables in X, which may be assigned values in V. L stands for the set of vulnerability localisations.
The variables of a localisation refer to static information extracted from the source code or the interface of a RESTful service. Verb represents an HTTP verb (e.g., GET), Method refers to a method name, and Package is assigned a string that defines a namespace organising classes. It is noteworthy that the variables of a localisation could easily be expanded to support additional vulnerability types or faults. The variables of l(s) should be assigned values just after the failure of a test case, either directly (extraction of the URL and HTTP verb from the test case) or collected from the source code of the service s if the grey-box mode is used. This extraction from the source code can be performed with automatic tools, including LLMs. We also denote T(l(s)) ⊆ T(s) as the test suite that exercises the service s at the localisation l(s). For RESTful services, this subset can be determined by scanning the URLs and verbs found in the test cases of T(s).
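To make these definitions concrete, a localisation can be modelled as a small value type. The following sketch, reused by the later sketches, assumes Java; all names are illustrative:

    // Minimal sketches, assuming Java; all names are illustrative.
    public enum Perspective { BLACK_BOX, GREY_BOX }

    // Definition 1: a tuple of variables that may be assigned values;
    // null models a variable that is not yet assigned. In black-box
    // mode, method and pkg typically remain unassigned.
    public record Localisation(String url, String verb, String method, String pkg) {}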
Algorithm 1: Service Testing and Vulnerability Localisation.
    input : Service s, T(s), Tv1(s), ..., Tvk(s), p ∈ {black-box, grey-box}
    output: Vl
    1  Vl := ∅;
    2  foreach t ∈ T(s) do
    3      if s fails t then
    4          Alert;
    5          Replace s with an older and trusted version;
    6          Deploy s;
    7  foreach t ∈ Tvi(s) with 1 ≤ i ≤ k do
    8      if s fails t then
    9          Locate the vulnerability vi and infer l(s);
    10         {l1(s), ..., ln(s)} := complete(l(s), p);
    11         Vl := Vl ∪ {(vi, l1(s)), ..., (vi, ln(s))};
Algorithm 1 takes a service s ∈ S, some test suites T(s), Tv1(s), ..., Tvk(s) and a testing perspective p. It returns a set Vl composed of pairs of the form (vulnerability, localisation). The service s is first exercised with T(s) to detect potential regressions. If the service does not pass a test case of T(s), an alert is raised; furthermore, the service is directly healed by replacing it with an older and trusted version. Other actions might be possible, e.g., a reorganisation of the service composition (Subramanian et al., 2008) or a manual intervention. If the service does not pass a test case t ∈ Tvi(s), a localisation l(s) is inferred either from t or from the source code project of the service. Subsequently, the procedure complete is invoked to assign values to the variables of l(s). We consider that the URL variable is always assigned a value, as a failing test case always refers to a URL called to exercise a RESTful service. If the Verb variable is not assigned a value, then 4 localisations are built, one for each of the HTTP verbs GET, POST, PUT and DELETE. With the grey-box perspective only, the procedure scans the project source files to get additional information (package, method names); a URL and HTTP verb can only be linked to one method, given in one package. The resulting pairs (vi, lj(s)), expressing that s has the vulnerability vi tied to a fault location lj(s), are added to the set Vl.
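As an illustration, Algorithm 1 might be realised as follows. This Java sketch reuses the hypothetical Localisation and Perspective types above; Service and TestCase are assumed types, and fails, alert, rollback, deploy, locate and complete are assumed helpers standing in for the testing and deployment machinery:

    import java.util.*;

    // A minimal sketch of Algorithm 1; every helper it calls is hypothetical.
    class VulnerabilityDetector {
        static Set<Map.Entry<String, Localisation>> detectAndLocalise(
                Service s, List<TestCase> regressionSuite,
                Map<String, List<TestCase>> vulnSuites, Perspective p) {
            Set<Map.Entry<String, Localisation>> vl = new HashSet<>();
            // Regression testing first: heal immediately by rollback on any failure.
            for (TestCase t : regressionSuite) {
                if (fails(s, t)) { alert(); s = rollback(s); deploy(s); }
            }
            // One security suite Tv(s) per vulnerability v.
            for (Map.Entry<String, List<TestCase>> suite : vulnSuites.entrySet()) {
                for (TestCase t : suite.getValue()) {
                    if (fails(s, t)) {
                        Localisation l = locate(s, t, p);      // infer l(s) from t or code
                        for (Localisation li : complete(l, p)) // assign the variables
                            vl.add(Map.entry(suite.getKey(), li));
                    }
                }
            }
            return vl;
        }
    }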
4.2 Service Recovery
A service s ∈ S having a vulnerability v is healed with an operator denoted Hv. This healing operator is composed of two mitigation functions m1 and m2 that aim at returning a new service s′. The first mitigation attempts to produce a new service s′ by minimising the effects of the vulnerability v or by eliminating v from s at the localisation l(s). This mitigation, which uses an encapsulation technique, is performed by calling an LLM with a chain of prompts denoted pv.
(a) Black-box perspective. (b) Grey-box perspective.
Figure 1: Healing process by encapsulation: in black-box, a new service s′ calls the faulty service s to modify its behaviours; in grey-box, new classes are included in the original source code of s. With both perspectives, the original source code remains unmodified.
Figure 1 illustrates the intuition behind the encapsulation techniques. With the black-box perspective, m1 builds a new service s′ that calls/encapsulates the original service s, filtering both incoming and outgoing messages. With the grey-box perspective, m1 enhances the service functionality without modifying the original source code by introducing new classes. These classes implement various encapsulation techniques, such as introspection, HTTP filters, or aspect-oriented programming, to heal the service s. This approach allows engineers or automated methods to repair the service later by modifying the original source code. If the mitigation function m1 fails to heal the service s, then m2 is invoked. This function implements a fallback mitigation, which is simpler to apply.
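For instance, with the grey-box perspective, the code generated for the verb-tampering operator (CAPEC-274, cf. Figure 3) could take the form of an HTTP filter added alongside the original classes. The following hand-written sketch assumes a Java servlet stack; the route and the set of allowed verbs are hypothetical, and in our approach such code is produced by the LLM rather than written by hand:

    import java.io.IOException;
    import java.util.Set;
    import javax.servlet.*;
    import javax.servlet.annotation.WebFilter;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // An additional class: the original service code remains unmodified.
    @WebFilter("/accounts/*") // hypothetical vulnerable route
    public class VerbTamperingFilter implements Filter {
        private static final Set<String> ALLOWED = Set.of("GET", "POST");

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest http = (HttpServletRequest) req;
            if (!ALLOWED.contains(http.getMethod())) {
                // Reject tampered verbs (e.g., PUT, DELETE) before they reach the service.
                ((HttpServletResponse) res).sendError(HttpServletResponse.SC_METHOD_NOT_ALLOWED);
                return;
            }
            chain.doFilter(req, res); // unchanged behaviour for legitimate verbs
        }
    }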
Definition 2 (Healing Operator Hv). A healing operator Hv for the vulnerability v is the tuple (m1, m2) such that:
- m1 : S × L × P → S is a mitigation function that takes a service s, a localisation l(s), a perspective p (black-box or grey-box) and produces a service s′ = m1(s, l(s), p);
- m2 : S → S is a mitigation function that produces a service s′ = m2(s).
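Definition 2 translates naturally into an interface; the sketch below reuses the hypothetical Service, Localisation and Perspective types from the earlier sketches:

    // A minimal sketch of Definition 2; all names are illustrative.
    interface HealingOperator {
        // m1: LLM-driven encapsulation mitigation, s' = m1(s, l(s), p).
        Service mitigate1(Service s, Localisation l, Perspective p);

        // m2: classical fallback mitigation, s' = m2(s),
        // e.g., reverting to a previous and trusted version.
        Service mitigate2(Service s);
    }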
We designed healing operators to heal 18 vulnerabilities. The operator list is presented in (Sue and Salva, 2025) in a tabular form that gives: the general weakness, the vulnerability targeted by a given CAPEC (https://capec.mitre.org/) attack, and the generic mitigations m1 and m2 for both testing perspectives. m1 is materialised by prompt chains, while m2 refers to a classical mitigation action.
To ensure that the mitigation m2 can effectively heal the functionality of RESTful services, we applied the solutions proposed and evaluated in the prior papers (Tosi et al., 2007; Dinkel et al., 2007; Subramanian et al., 2008; Vizcarrondo et al., 2012). As discussed in Section 2, for most healing operators, the mitigation m2 comes down to 'reverting to a previous and trusted version'. For certain operators, m2 corresponds to 'sanitising the data', which entails applying an input validator to filter out potentially malicious inputs. Figure 3 illustrates an example for the operator designed to heal a service vulnerable to the attack CAPEC-274 (HTTP verb tampering). For readability reasons, we only show the two generic prompt chains used for the mitigation m1. The prompt chain construction is detailed in the next section.
Algorithm 2 implements a simple iterative healing process. It takes the service s ∈ S, the two test suites T(s) ⊆ T and Tv(s) ⊆ T, along with the subset {(v, l1(s)), ..., (v, ln(s))} returned by Algorithm 1. Algorithm 2 covers each localisation and tries to heal the service with the mitigation m1 of the healing operator Hv. If the mitigation is applied with success, it returns a new service s′, which replaces s. m1 is made up of a generic chain of prompts specifically written to mitigate v. m1 calls a prompt generator to instantiate the generic prompts, i.e. to assign every one of its variables to a concrete value. This instantiation is performed in two ways.
Algorithm 2: Service Healing.
    input : Service s, T(s), Tv(s), {(v, l1(s)), ..., (v, ln(s))} ⊆ Vl,
            perspective p ∈ {grey-box, black-box}
    output: New service s′
    1  i := 1;
    2  while i <= n do
    3      s′ := m1(s, li(s), p) with Hv = (m1, m2);
           // m1 calls a prompt generator and builds a prompt chain pv
           // m1 calls an LLM with pv to get source code
           // m1 aggregates files, if several files with the same name are available
    4      if success(m1(s, li(s), p), T(l(s)), Tv(l(s))) then
    5          Replace s by s′;
    6      else
    7          s′′ := m2(s);
    8          Replace s by s′′;
    9      i := i + 1;
    10 Deploy s;
Either a variable also belongs to a localisation l(s) given as input to m1, in which case the assignment is directly performed; or, in the grey-box perspective, further information is collected from the service source code project. Then an LLM is called with the prompt chain pv to generate source code and to build a new service. The success of the mitigation is evaluated with the function success, which takes the new service and the test suites (line 4). If the application of m1 at a localisation is unsuccessful, the mitigation function m2 is then called (line 7). At the end of this algorithm, the service s is replaced by another service s′, which is finally deployed.
The design of prompt chains, the execution of the mitigation function m1 and the concept of success are detailed in the next section.
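Reusing the hypothetical HealingOperator interface sketched after Definition 2, the loop of Algorithm 2 might look as follows in Java; success is the predicate detailed in Section 5.2, while testSuiteAt, securitySuiteAt and deploy are assumed helpers:

    import java.util.List;

    // A minimal sketch of Algorithm 2; helpers and types are hypothetical.
    class Healer {
        static Service heal(Service s, HealingOperator hv,
                            List<Localisation> locs, Perspective p) {
            for (Localisation l : locs) {
                Service healed = hv.mitigate1(s, l, p);   // mitigation m1
                if (success(healed, testSuiteAt(l), securitySuiteAt(l))) {
                    s = healed;                           // keep the healed service
                } else {
                    s = hv.mitigate2(s);                  // fallback mitigation m2
                }
            }
            deploy(s);                                    // redeploy the final service
            return s;
        }
    }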
5 MITIGATION FUNCTIONS
WITH LLM
A mitigation function m1 generates source code by means of an LLM queried with a chain of prompts. This section describes how prompt chains have been designed, the application of the mitigation function, and the evaluation of its success.
5.1 Writing of the Mitigation Function m1

We have written prompt chains for the mitigations m1 of the healing operators by means of a strict protocol based upon prompt engineering techniques, to optimise the generation of correct healing corrections in a reproducible manner. We followed these supervised and incremental steps to write prompt chains specialised to mitigate security or robustness vulnerabilities:
1. construction of a first prompt defining a context and a list of actions required to heal a service for a given vulnerability v, and asking for 5 different solutions (no generation of code here);
2. writing of a prompt chain for calling an LLM to generate source code: we manually evaluated the previous solutions to order them from correct to incorrect using the CAPEC base; from the three correct and best solutions, we wrote a first prompt chain;
3. cycle of refinement on the prompt chain: we applied the prompt chain to case studies and incrementally optimised it by manually evaluating the correctness of the generated source code;
4. cycle of refinement to increase reproducibility: we applied the prompt chain multiple times to the same case study and assessed reproducibility by means of a source code similarity measure applied to the generated source code. The prompt chain is kept when the similarity score exceeds 70%;
5. writing of the generic prompt chain pv, made up of variables that replace the specific information related to the case studies.
1) Act as an expert in software security.
   [Service information]
   Task:
   [constraint 1]
   [constraint 2]
   Format: Don't include any explanations in your responses. Give me the code wrapped in <code> tags. Give me 3 different responses.
2) Give me an agreement score on the 3 responses expressing that the code is error free and 'meets the demand', between 0 and 100, with 0 the code is incorrect and 100 the code is error free and meets the demand.
3) Give me all the modifications required to make my project work for response X.

Figure 2: Generic form of a chain of prompts.

Figure 2 summarises the three initial generic forms of prompts we use to write prompt chains at step 2. These are completed in accordance with a vulnerability v and may be split into several prompts during the refinement process. The reason for asking for three different responses is related to LLM confidence assessment, which is discussed below.
A generic prompt, which contains variables, is instantiated by assigning every one of its variables to a value by means of a prompt generator. This instantiation can occur in two ways. If a variable also belongs to a localisation l(s) provided as input to m1, the value is directly retrieved from l(s). Alternatively, with the grey-box perspective, additional information is extracted from the service's source code project, such as the project name, class names, and other relevant metadata. In the example shown in Figure 3, variables are represented by words in uppercase. The variables VERB and ROUTE are assigned values retrieved from l(s). The variables TYPE and VERSION, which indicate the framework type, are either hardcoded or extracted from the source code project. Similarly, the variable TECH, which denotes the encapsulation approach (e.g., filter, aspect programming, etc.), is also either hardcoded or determined based on project-specific information.

Figure 3: Prompt examples to heal the vulnerability targeted by CAPEC-274.
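To illustrate the instantiation step, the following sketch shows a naive prompt generator; the template syntax (uppercase variable names such as VERB and ROUTE) follows Figure 3, while the class, method and example values are hypothetical:

    import java.util.Map;

    // A minimal sketch: each uppercase variable of a generic prompt is
    // replaced by its concrete value, taken from l(s) or from the project.
    public class PromptGenerator {
        public static String instantiate(String template, Map<String, String> values) {
            String prompt = template;
            for (Map.Entry<String, String> e : values.entrySet()) {
                prompt = prompt.replace(e.getKey(), e.getValue());
            }
            return prompt;
        }
    }

    // Usage, e.g.:
    // instantiate("Write a TECH blocking the verb VERB on the route ROUTE",
    //             Map.of("TECH", "filter", "VERB", "PUT", "ROUTE", "/accounts"));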
5.2 Reliability Evaluation
Algorithm 2 uses the boolean expression success : S × T × Tv × P → {true, false} to evaluate whether the mitigation function m1 has healed a service s correctly. This evaluation is performed using both test passing correctness and LLM confidence to compute a reliability score, denoted (corr1, corr2, conf).
Test passing correctness, which exclusively relies on test suites, is measured by two indicators, corr1 and corr2: the ratio of passing tests in T(l(s)) and the ratio of passing tests in Tv(l(s)), respectively. Both are obtained after the experimentation of the new service s′ generated by the mitigation function m1. corr1 helps identify potential regressions in the new service s′, and corr2 allows checking whether the vulnerability v has been mitigated at the localisation l(s).
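Both indicators are plain pass ratios; a minimal Java sketch, assuming a hypothetical TestCase.passesOn predicate, follows:

    import java.util.List;

    class TestPassing {
        // corr1 = passRatio(healed, T(l(s)))  -- regression check at l(s)
        // corr2 = passRatio(healed, Tv(l(s))) -- vulnerability check at l(s)
        static double passRatio(Service healed, List<TestCase> suite) {
            long passing = suite.stream().filter(t -> t.passesOn(healed)).count();
            return suite.isEmpty() ? 1.0 : (double) passing / suite.size();
        }
    }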
The LLM confidence indicator conf measures how much trust we should put in the source code the LLM has generated. A higher confidence score should mean a higher likelihood of being correct (Austin et al., 2021). This indicator can be obtained by using intrinsic measures based on conditional probability, i.e. the probability derived from the response generated by the LLM, or by using self-reflective measures, i.e. metrics that evaluate the LLM's own confidence in its responses. Recent works showed that the second approach often returns better-calibrated confidence scores (Tian et al., 2023).
The LLM confidence evaluation (and the reduction of overconfidence) can be improved by using different kinds of techniques (Xiong et al., 2024), e.g., asking for several responses, asking different LLMs, or asking for several explanations and evaluating them. This is why we request three source code generations from the LLM in our prompt chains. Hence, every time we apply a mitigation function m1, we obtain 3 responses and 3 tuples (corr1, corr2, conf).
Next, we apply Reciprocal Rank Fusion (RRF), a rank aggregation approach that combines rankings from multiple sources into a single ranking. The results of our experimentations suggest applying RRF only to the test-passing scores corr1 and corr2. If the best tuple (corr1, corr2, conf) has scores higher than 90%, then success returns true, else false. This 90% threshold will be analysed during the evaluation.
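As an illustration, the selection logic might be implemented as follows; the RRF constant k = 60 is the customary default and an assumption on our side, as is the exact form of the threshold check:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // One tuple (corr1, corr2, conf) per LLM response.
    record Reliability(double corr1, double corr2, double conf) {}

    class BestResponse {
        // 1-based rank of a response under one indicator (best score first).
        static int rank(List<Reliability> all, Reliability r, Comparator<Reliability> c) {
            List<Reliability> sorted = new ArrayList<>(all);
            sorted.sort(c.reversed());
            return sorted.indexOf(r) + 1;
        }

        // Fuse only the corr1 and corr2 rankings, as suggested by our experiments.
        static double rrf(List<Reliability> all, Reliability r) {
            int k = 60; // assumption: the paper does not state its RRF parameters
            return 1.0 / (k + rank(all, r, Comparator.comparingDouble(Reliability::corr1)))
                 + 1.0 / (k + rank(all, r, Comparator.comparingDouble(Reliability::corr2)));
        }

        // success: true when the best-fused tuple exceeds the 90% threshold
        // (we read "scores higher than 90%" as covering all three components).
        static boolean success(List<Reliability> responses) {
            Reliability best = responses.stream()
                    .max(Comparator.comparingDouble(r -> rrf(responses, r)))
                    .orElseThrow();
            return best.corr1() > 0.9 && best.corr2() > 0.9 && best.conf() > 0.9;
        }
    }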
6 EXPERIMENTAL RESULTS
We investigated the capabilities of our algorithms
through the following questions:
- RQ1: What is the effectiveness of the mitigations m1 with service methods having one vulnerability?
- RQ2: How does effectiveness evolve when service methods have several vulnerabilities?
- RQ3: How well does LLM confidence align with test passing correctness?
This study was conducted with 4 healing operators, denoted Hverb, Hxss, Hpath and Htoken, which aim at healing services vulnerable to Verb Tampering (CAPEC-274), XSS Injection (CAPEC-86), Path Traversal (CAPEC-126) and Token Impersonation (CAPEC-633). For readability, the vulnerabilities are denoted vverb, vxss, vpath, vtoken. We considered the following RESTful service compositions:
- C1: Piggy Metrics (https://github.com/sqshq/piggymetrics) is a financial advisor application composed of 3 micro-services specialised in account management, statistics generation and notification management;
- C2: eShopOnContainers (https://github.com/dotnet/eShop), implementing an e-commerce web site using a services-based architecture (5 services);
- C3: a loan approval process implemented with 4 services developed by third-year computer science undergraduate students;
- C4: a composition of 3 services used to implement an online shop (stock and client management, purchase, etc.), developed by students.
For each RESTful service, we developed between 7 and 60 conformance test cases and checked that the services successfully passed them. Since all services exhibit deterministic behaviour, repeated test execution was not required. We then generated security tests with our approach presented in (Salva and Sue, 2024), which builds test cases to try to detect the four vulnerabilities listed previously. We obtained between 6 and 75 security test cases per service. We then intentionally injected vulnerabilities into the 15 RESTful services, as non-vulnerable services cannot be healed and are therefore not relevant for this evaluation. We generated mutants from the RESTful service source code by applying the tool PITest (https://github.com/hcoles/pitest), supplemented by our own source code mutations injecting modifications. We kept the 200 mutants that have at least one failing security test case.
To perform this evaluation, we implemented a prototype tool that applies the mitigation m1 of the 4 healing operators on RESTful services by calling LLMs. We considered two open-source LLMs: Mistral's Codestral-22B, an open-weight small language model explicitly designed for code generation tasks, and Meta's Llama3.1-70B, a larger model capable of generating code, and natural language about code, from both code and natural language prompts. This prototype tool heals services with our base of prompt chains; it compiles and redeploys the healed services, and exercises them with the test suites. It also returns a reliability evaluation score as presented in Section 5.2. All the services, tools and prompt chains are available in (Sue and Salva, 2025).

Table 1: Precision@1 detailed for both perspectives and both LLMs.

    Perspective   LLM             Precision@1
    Black-box     Codestral-22B   0.87
    Black-box     Llama3.1-70B    0.87
    Grey-box      Codestral-22B   0.78
    Grey-box      Llama3.1-70B    0.87
6.1 RQ1: What Is the Effectiveness of the Mitigation m1 with Service Methods Having One Vulnerability?
Setup: to investigate RQ1, we applied the mitigation m1 of the healing operators on the mutants and computed the reliability evaluation scores (corr1, corr2, conf) for all healed services, with both black-box and grey-box perspectives. In this analysis, we considered the three source codes generated by the LLMs for each vulnerability localisation, not just the best one. From the tuples (corr1, corr2, conf), we also calculated Precision@1, which represents the precision of the best solution returned by the LLMs. Precision@1 reflects the proportion of corrections that were truly effective in healing the mutants. To determine Precision@1, we manually reviewed the source code of the best solutions provided by the LLMs to verify whether the vulnerability was truly fixed.
Results: Figure 4 depicts violin diagrams showing, for every vulnerability, the distributions, medians and quartiles, along with the densities, of the scores corr1, corr2 and conf obtained for all the healed services. Additionally, Table 1 gives Precision@1 for each perspective and each LLM.
The violin diagrams show that the ratios of passing tests corr1 and corr2 are most of the time at 100% for all the vulnerabilities and all the obtained healed services, not only the best solution (92% of the ratios of passing tests are above 90% for corr1, and 82% are above 90% for corr2). This shows that our approach is effective in repairing these vulnerabilities with our prompt chains and LLMs. The diagrams also indicate, through corr1, that there are some differences among healing operators and vulnerabilities. For Hpath and Hverb, the ratio of passing tests is sometimes below 100% and can drop as low as 20%. These results suggest that the healing process in these cases compromises the initial functional behaviours of the RESTful services. This highlights the importance of computing corr1 to assess the impact of healing on service functionalities. In contrast, the ratios of passing security tests, represented by corr2, are consistently either 100% or 0%, which shows whether the targeted vulnerability has been repaired or not.

Figure 4: Violin diagrams showing the distributions, medians and quartiles, along with densities, of the scores corr1, corr2 and conf obtained after applying the healing operators Hverb (4a), Hxss (4b), Hpath (4c) and Htoken (4d) on all the mutants.
The effectiveness of the approach is manually verified using Precision@1, which achieves an overall value of 85%. This indicates that 85% of the top-ranked healed services are really healed. Table 1 shows that there is no significant difference between the two perspectives when using Llama3.1-70B, whereas Codestral shows reduced effectiveness under the grey-box perspective. When comparing the two LLMs, their effectiveness is comparable within the black-box perspective and lower for Codestral otherwise. This finding is noteworthy, as it highlights that a locally executable LLM can perform competitively compared to a larger LLM. The observations made on the violin diagrams and with Precision@1 suggest that the 90% threshold used in the boolean expression success, to state whether a healed service is truly healed, could be slightly raised to increase Precision@1. Further experimentation may be needed to establish a general threshold.
Figure 4 also highlights the high confidence levels, given by conf, that the LLMs assign to their generated source code, with values predominantly ranging between 80% and 100%. Confidence is slightly higher for Hxss and Htoken. The violin diagrams show that LLM confidence does not contradict test passing correctness. However, whether LLM confidence alone can be reliably trusted by users will be addressed in RQ3.
6.2 RQ2: How Does Effectiveness
Evolve when Service Methods Have
Several Vulnerabilities?
Setup: RQ2 studies whether the healing operators and their mitigations m1 can be effectively applied several times to the same service method having several vulnerabilities. We took back the composition C1, which includes 3 services and a total of 12 methods, and manually injected the 4 vulnerabilities considered in this evaluation into each method. We then considered the 1152 code projects generated by iteratively applying permutations of the four healing operators, querying two LLMs (Llama3.1 and Codestral), under either a black-box or grey-box perspective. With the latter perspective, the generated source code of the same class was incrementally pasted, without manual intervention. Following the application of every mitigation of the healing operators, we calculated the percentage of healed services achieving a test-passing success rate above 90% (corr1 > 90% ∧ corr2 > 90%).
Figure 5: Ratio of projects with a test-passing success rate
above 90% after the iterative application of mitigations.
Results: Figure 5 illustrates two curves corresponding to the effectiveness of the healing process for both perspectives. In the black-box approach, the healing process, defined as generating a new service that encapsulates the vulnerable one, maintains a consistent success ratio throughout. This result aligns with our expectations, since this healing approach does not introduce an iterative increase in difficulty for the LLMs: each newly generated service addresses a single vulnerability and calls a vulnerable older service. However, this result masks a significant limitation: this solution cannot be applied indefinitely, due to its reliance on creating and deploying an increasing number of new services, which becomes resource-intensive. With the grey-box perspective, we observe a noticeable decline in the ratio of healed services achieving a test-passing success rate above 90% as the number of mitigations increases. This ratio drops by up to 30%. A key contributing factor is that the resulting source codes often include overlapping references to the same objects or filters. These redundancies prevent the seamless integration of multiple mitigations, necessitating manual intervention. These issues could potentially be resolved in the near future with more efficient LLMs.

Table 2: Spearman's rank correlation coefficient between corr1, corr2 and conf for each correction type.

                 Codestral-22B            Llama3.1-70B
                 Black-box   Grey-box     Black-box   Grey-box
    corr1/conf   -0.006      -0.171       -0.001      0.070
    corr2/conf   -0.033      -0.163        0.044      0.071
6.3 RQ3: How Well Does the Confidence Align with Test Passing Correctness?
Setup: We investigated RQ3 by taking back the tuples (corr1, corr2, conf) for all the healed services obtained after applying the healing operators to the mutants with black-box and grey-box perspectives. We calculated Spearman's rank correlation coefficients between test-passing correctness and LLM confidence, segmented by vulnerability, LLM and testing perspective. Spearman's rank correlation coefficient measures the strength and direction of a monotonic relationship between two variables and returns a value between -1 and 1: the coefficient is 0 when both variables are uncorrelated, 1 when there is a perfect positive monotonic correlation (as one variable increases, the other always increases), and -1 when there is a perfect negative monotonic correlation.
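For reference, a minimal Java sketch of the coefficient, assuming no tied scores (ties would require average ranks), is:

    class Spearman {
        // rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d the rank difference.
        static double spearman(double[] x, double[] y) {
            int n = x.length;
            double[] rx = ranks(x), ry = ranks(y);
            double d2 = 0;
            for (int i = 0; i < n; i++) {
                double d = rx[i] - ry[i];
                d2 += d * d;
            }
            return 1 - 6 * d2 / (n * ((double) n * n - 1));
        }

        // Rank of each value: the number of strictly smaller values (0-based).
        static double[] ranks(double[] v) {
            double[] r = new double[v.length];
            for (int i = 0; i < v.length; i++)
                for (double w : v)
                    if (w < v[i]) r[i]++;
            return r;
        }
    }

In our setting, x would be the corr1 (or corr2) values and y the conf values collected over all healed services.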
Results: Table 2 presents the correlation computations between corr1 and conf, as well as between corr2 and conf, for both perspectives and both LLMs. In all cases, the results indicate that the correlations are negligible, as the monotonic relationships are too weak. We observed multiple times in our results that the main reason for this weak alignment is the tendency of LLMs to assign overconfident scores to incorrect or incomplete source code (which should be completed by the user).
Consequently, LLM confidence cannot be used as a reliable indicator of healing effectiveness, and testing stages hence remain essential. Nevertheless, a low confidence score can be a useful indicator for quickly identifying potential issues in code generation. A low score may indicate insufficient knowledge of the appropriate mitigation to apply, an ambiguous prompt, or the high complexity of the vulnerability being addressed. In any case, the resulting service should be rejected.
6.4 Threats to Validity
We now address potential external and internal threats
to the validity of our evaluation. We identified 3 ex-
ternal and 2 internal threats.
Regarding external threats, the first one concerns the choice of our case studies. To limit the impact of this threat, we evaluated our tool on RESTful services covering different domains: two of these case studies were implemented by beginner developers, while the other two involved microservice compositions that have been extensively evaluated in previous research. While the inclusion of varied case studies strengthens our evaluation, these case studies do not represent all possible types of service compositions. As a result, we avoid drawing overly broad conclusions from our results.
sions from our results. Another external threat relates
to the test suites used during the evaluation. We gen-
erated conformance test cases to ensure that every ser-
vice method was tested by at least two test cases, and
we utilised the security testing approach from (Salva
and Sue, 2024) to create security-specific test cases.
The latter demonstrated an increase in branch cover-
age by 20% and consistently detected vulnerabilities
in the same services used in the experiments. How-
ever, the test suites may still lack comprehensive cov-
erage for all possible service behaviours or vulner-
abilities, limiting their general applicability to other
contexts. Additionally, we used mutation of service
methods to inject vulnerabilities. To mitigate the risk
that these mutations do not reflect realistic vulnerabil-
ities, we employed the well-known, open-source tool
PITest, supplemented with custom mutations to ensure
every method contained at least one vulnerability. We
also conducted manual checks on the source code to
verify the accuracy of the mutations. Despite these
measures, the generated mutations may not capture
the full range of vulnerabilities encountered in real-
world scenarios.
Other factors that may threaten internal valid-
ity relate to the design and execution of the prompt
chains used in our experiments. We followed a strict
protocol when designing the prompt chains and ap-
plied them consistently across several services. How-
ever, a prompt chain may be interpreted differently
by another LLM. In our experiments, we queried LLMs using chains of prompts while keeping additional model parameters, such as temperature, top-k and top-p, at their default values. These parameters could have been fine-tuned for each LLM to potentially improve code generation performance.
7 CONCLUSION
This paper has proposed the design and evaluation of a RESTful service healing approach, which introduces fine-grained mitigation techniques to enhance the reliability and security of RESTful services. We propose 18 healing operators, each incorporating two mitigation strategies: m1 utilises encapsulation techniques under grey-box or black-box perspectives, generating additional source code for dynamic service adaptation without altering the original source code; if m1 fails, a fallback mitigation m2 is applied. The study also explores the capabilities of LLMs as general-purpose source code generators for m1, and defines a novel metric to evaluate their success. An evaluation with 4 operators on RESTful services having different vulnerabilities demonstrates the effectiveness of m1 in both grey-box and black-box contexts, addressing single and multiple vulnerabilities.
This evaluation underscores the challenge of
aligning LLM confidence with test-passing success,
demonstrating that LLMs alone cannot be fully
trusted for automated healing of RESTful services.
However, when sufficiently large test suites are avail-
able, our test-passing-based metric proves effective
for assessing healing success, achieving an overall
Precision@1 of 85%, though there is room for im-
provement. A compelling research direction would
be to explore methods for evaluating healing success
with minimal or no test suites. This could involve
developing alternative metrics or leveraging LLMs to
generate test cases, which would then require their
own rigorous evaluation.
REFERENCES
Allamanis, M., Barr, E. T., Devanbu, P., and Sutton, C.
(2018). A survey of machine learning for big code
and naturalness. ACM Comput. Surv., 51(4).
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski,
H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q.,
and Sutton, C. (2021). Program synthesis with large
language models.
Chen, Z., Kommrusch, S., Tufano, M., Pouchet, L., Poshy-
vanyk, D., and Monperrus, M. (2021). Sequencer:
Sequence-to-sequence learning for end-to-end pro-
gram repair. IEEE Transactions on Software Engi-
neering, 47(09):1943–1959.
Dinkel, M., Stesny, S., and Baumgarten, U. (2007). Inter-
active self-healing for black-box components in dis-
tributed embedded environments. In Communication
in Distributed Systems - 15. ITG/GI Symposium, pages
1–12.
Fan, Z., Gao, X., Mirchev, M., Roychoudhury, A., and Tan,
S. H. (2023). Automated repair of programs from
large language models.
Frei, R., McWilliam, R., Derrick, B., Purvis, A., Tiwari,
A., and Di Marzo Serugendo, G. (2013). Self-healing
and self-repairing technologies. International Journal
of Advanced Manufacturing Technology, 69(5):1033–
1061.
Ghahremani, S. and Giese, H. (2020). Evaluation of self-
healing systems: An analysis of the state-of-the-art
and required improvements. Computers, 9(1).
Helander, V., Ekedahl, H., Bucaioni, A., and Nguyen, T. P.
(2024). Programming with ChatGPT: How far can we
go? Machine Learning with Applications, 1:1–34.
Jesse, K., Ahmed, T., Devanbu, P. T., and Morgan, E.
(2023). Large language models and simple, stupid
bugs. In 2023 IEEE/ACM 20th International Confer-
ence on Mining Software Repositories (MSR), pages
563–575.
Jiang, N., Liu, K., Lutellier, T., and Tan, L. (2023). Im-
pact of code language models on automated program
repair.
Kanade, A., Maniatis, P., Balakrishnan, G., and Shi, K.
(2020). Learning and evaluating contextual embed-
ding of source code. In III, H. D. and Singh, A.,
editors, Proceedings of the 37th International Con-
ference on Machine Learning, volume 119 of Pro-
ceedings of Machine Learning Research, pages 5110–
5121. PMLR.
Ma, W., Wu, D., Sun, Y., Wang, T., Liu, S., Zhang, J., Xue,
Y., and Liu, Y. (2024). Combining fine-tuning and
llm-based agents for intuitive smart contract auditing
with justifications.
Magableh, B. and Almiani, M. (2020). A self healing
microservices architecture: A case study in docker
swarm cluster. In Barolli, L., Takizawa, M., Xhafa,
F., and Enokido, T., editors, Advanced Information
Networking and Applications, pages 846–858, Cham.
Springer International Publishing.
Maniatis, P. and Tarlow, D. (2023). Large sequence models
for software development activities.
Rajagopalan, S. and Jamjoom, H. (2015a). App–Bisect:
Autonomous healing for Microservice-Based apps. In
7th USENIX Workshop on Hot Topics in Cloud Com-
puting (HotCloud 15), Santa Clara, CA. USENIX As-
sociation.
Rajagopalan, S. and Jamjoom, H. (2015b). App–Bisect:
Autonomous healing for Microservice-Based apps. In
7th USENIX Workshop on Hot Topics in Cloud Com-
puting (HotCloud 15), Santa Clara, CA. USENIX As-
sociation.
Salva, S. and Blot, E. (2020). Cktail: Model learn-
ing of communicating systems. In Proceedings of
the 15th International Conference on Evaluation of
Novel Approaches to Software Engineering - Volume
1: ENASE,, pages 27–38. INSTICC, SciTePress.
Salva, S. and Sue, J. (2023). Automated test case gen-
eration for service composition from event logs. In
38th IEEE/ACM International Conference on Auto-
mated Software Engineering, ASE 2023 - Workshops,
Luxembourg, September 11-15, 2023, pages 127–134.
IEEE.
Salva, S. and Sue, J. (2024). Security testing of restful
apis with test case mutation. In Kaindl, H., Mannion,
M., and Maciaszek, L. A., editors, Proceedings of the
19th International Conference on Evaluation of Novel
Approaches to Software Engineering, ENASE 2024,
Angers, France, April 28-29, 2024, pages 582–589.
SCITEPRESS.
Schneider, C., Barker, A., and Dobson, S. (2015). A survey
of self-healing systems frameworks. Software: Prac-
tice and Experience, 45(10):1375–1398.
Schuster, R., Song, C., Tromer, E., and Shmatikov, V.
(2021). You autocomplete me: Poisoning vulnerabili-
ties in neural code completion. In 30th USENIX Secu-
rity Symposium (USENIX Security 21), pages 1559–
1575. USENIX Association.
Subramanian, S., Thiran, P., Narendra, N. C., Mostefaoui,
G. K., and Maamar, Z. (2008). On the enhancement of
bpel engines for self-healing composite web services.
In 2008 International Symposium on Applications and
the Internet, pages 33–39.
Sue, J. and Salva, S. (2025). Dynamic mitigation of RESTful service failures using LLMs, companion site. https://github.com/JarodSue/Fine-grained-Restful-Service-Healing-using-LLMs.
Tian, K., Mitchell, E., Zhou, A., Sharma, A., Rafailov, R.,
Yao, H., Finn, C., and Manning, C. D. (2023). Just ask
for calibration: Strategies for eliciting calibrated con-
fidence scores from language models fine-tuned with
human feedback.
Tosi, D., Denaro, G., and Pezze, M. (2007). Shiws: A self-
healing integrator for web services. In 29th Interna-
tional Conference on Software Engineering (ICSE07
Companion), pages 55–56, Los Alamitos, CA, USA.
IEEE Computer Society.
Vizcarrondo, J., Aguilar, J., Exposito, E., and Subias, A.
(2012). Armiscom: Autonomic reflective middleware
for management service composition. In 2012 Global
Information Infrastructure and Networking Sympo-
sium (GIIS), pages 1–8.
Wang, Y. (2019). Towards service discovery and autonomic
version management in self-healing microservices ar-
chitecture. In Proceedings of the 13th European Con-
ference on Software Architecture - Volume 2, ECSA
’19, page 63–66, New York, NY, USA. Association
for Computing Machinery.
Xiong, M., Hu, Z., Lu, X., LI, Y., Fu, J., He, J., and Hooi, B.
(2024). Can LLMs express their uncertainty? an em-
pirical evaluation of confidence elicitation in LLMs.
In The Twelfth International Conference on Learning
Representations.
Yasunaga, M. and Liang, P. (2020). Graph-based, self-
supervised program repair from diagnostic feedback.
In Proceedings of the 37th International Conference
on Machine Learning, ICML’20. JMLR.org.