Bridging the Explanation Gap in AI Security: A Task-Driven Approach
to XAI Methods Evaluation
Ondrej Lukas (https://orcid.org/0000-0002-7922-8301) and Sebastian Garcia (https://orcid.org/0000-0001-6238-9910)
Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic
Keywords:
Explainable AI, Functional Metrics, Explanation Evaluation, Network Security.
Abstract:
Deciding which XAI technique is best depends not only on the domain, but also on the given task, the dataset
used, the model being explained, and the target goal of that model. We argue that the evaluation of XAI
methods has not been thoroughly analyzed in the network security domain, which presents a unique type of
challenge. While there are XAI methods applied in network security, there is still a large gap between the needs
of security stakeholders and the selection of the optimal method. We propose to approach the problem by first
defining the stakeholders in security and their prototypical tasks. Each task defines inputs and specific needs
for explanations. Based on these explanation needs (e.g. understanding the performance, or stealing a model),
we created five XAI evaluation techniques that are used to compare and select which XAI method is best
for each task (dataset, model, and goal). Our proposed approach was evaluated by running experiments for
different security stakeholders, machine learning models, and XAI methods. Results were compared with the
AutoXAI technique and random selection. Results show that our proposal to evaluate and select XAI methods
for network security is well-grounded and that it can help AI security practitioners find better explanations for
their given tasks.
1 INTRODUCTION
Explaining the different parts of the machine learning pipeline (XAI) is critical for many tasks, such as deciding whether a model can be trusted or knowing which parts of a model to modify to improve it (Arrieta et al., 2019). The explanation methods themselves should also be evaluated to understand their properties.
The evaluation methods for XAI (e.g. stability and robustness) help understand if the XAI method is aligned with the decision task. While some work has been done on such evaluation in other domains, existing approaches seem insufficient for the network security domain.
The network security domain presents unique
characteristics that make adapting current XAI eval-
uation methods problematic (Jacobs et al., 2022).
In particular, this domain suffers from three prob-
lems: (i) lack of good datasets, (ii) imbalance (Sacher,
2020), and (iii) high costs of errors (Amit et al., 2019).
With some datasets having almost a 1:100,000,000 ratio of attack-to-benign packets, even well-performing detectors need false positive rates in the order of 0.001% to
be usable. The majority of available datasets are synthetic or represent networks that are too small. Additionally, privacy concerns and legal issues prevent many datasets from being published or verified.
Evaluating XAI methods, in general, is difficult
because there is no unique correct result (van der Waa
et al., 2021; Sokol and Flach, 2020). The majority of
existing work therefore focuses on the evaluation of a
single property of the explanation. OpenXAI (Agar-
wal et al., 2022) proposes several metrics for XAI
method comparison while AutoXAI (Cugny et al.,
2022) aims to automate both the method selection
process and hyperparameter tuning. However, user input is still required in the form of the desired properties of the explanation. Neither of the two methods includes any security-related dataset in its comparison.
In the cybersecurity domain, XAI also depends on the security tasks (Warnecke et al., 2020), which are representative of typical cybersecurity roles. When XAI is applied to these security tasks, the traditional evaluation methods of XAI may fall short of the desired cybersecurity needs.
Given the impact of machine learning (ML) mod-
els in network security (Aledhari et al., 2021), there is
an increasing need for verified XAI explanations to
enhance transparency, comprehension, trust, ethics,
and cooperation between humans and AI systems in
network security. Therefore, a correct understanding
of the properties of network security, XAI, and their
evaluations is beneficial for the community.
This paper proposes a method to select the best-
suited XAI method for each specific network secu-
rity task according to the conditions and restrictions
of these tasks, the datasets, and the ML model used.
Our proposal understands the most common tasks for
the network security stakeholders, then identifies the
exact evaluation properties they need in their explana-
tions, and uses these evaluation techniques in state-of-
the-art XAI methods. Finally, our proposed method
chooses the XAI that best explains the selected se-
curity task to the stakeholder. A key part of our technique is the exploration of how both the evaluation and the XAI methods depend on the dataset-model-goal triplet.
Our method first identifies four tasks that repre-
sent archetypal security stakeholders: model extrac-
tion, model improvement, model understanding, and
single sample understanding. Each task has a spe-
cific security goal and restrictions. The tasks differ
not only in the goal but also in the access to the model
and both training and testing data. Lastly, we pro-
pose a scoring technique to evaluate and compare the
performance of XAI algorithms in an absolute way to
answer the question: 'How good was this XAI method at explaining this security task?'. Finally, our method chooses
the best-suited XAI algorithm.
In our experiments, we use four ML models (Ran-
dom Forest, Gradient Boosting Trees, Multi-layer
Perceptron, and Support Vector Machines) and four
XAI methods (SHAP, LIME, Anchors, and RISE) for
each of the proposed tasks.
The performance of an ML model generally depends on the data it has been trained on and on the goal it has been optimized for (Sarker, 2021); therefore, the performance of an XAI algorithm also depends at least on the model, the data, and the optimized goal. For this reason, we include a random explanation baseline in our evaluation to understand
if the differences in XAI methods are significant and
to what extent.
Results show that selecting the best XAI method
highly depends on the task, the model to explain, and
the dataset chosen, confirming previous work and hy-
potheses. Given the variability of the models and data, we did not find any XAI method that consistently outperforms the others. We conclude that XAI techniques should always be evaluated with respect to the task-model-data triplet because their explanatory power should be aligned with the needs of the stakeholders of those tasks. Also, the specific evaluation technique plays a crucial role in selecting XAI methods, and stakeholders should consider its capabilities when choosing one for their domain.
The contributions of this paper are:
- An identification of four main tasks that capture the needs of network security stakeholders.
- A new evaluation technique of XAI methods for scoring and comparing XAI algorithms based on adversarial approaches.
- An evaluation of four ML algorithms and four XAI algorithms on a cybersecurity task.
- The use of a real-world labeled network security dataset for XAI evaluation.
2 RELATED WORK
The widely accepted approach to the evaluation of explanations, originally proposed in (Doshi-Velez and Kim, 2017), splits evaluation approaches into categories according to the level of human involvement in the evaluation process.
The variety of needs of XAI recipients has been examined from the point of view of the goal (Arrieta et al., 2020), the audience type (stakeholders) (Nadeem et al., 2022; Doshi-Velez and Kim, 2017), the specific needs, concerns, and goals of the recipients of the explanations (Jin et al., 2023), and their domain knowledge (Nguyen et al., 2020).
Among the most widely used functional metrics for assessing the quality of explanations are Faithfulness (Alvarez-Melis and Jaakkola, 2018), also referred to as fidelity (Yeh et al., 2019), which measures how well the explanation approximates the prediction of the model being explained, and Stability (Krishna et al., 2022) or Robustness (Alvarez-Melis and Jaakkola, 2018), which evaluate the simple assumption that similar inputs should give rise to similar explanations. Both metrics are computed by measuring how much the explanations change when there are small perturbations to the input, model representation, or output of the underlying predictor. Compactness of the explanation is another desirable property (Mothilal et al., 2021). In cases such as rule or example sets, we measure the cardinality of those sets. Other approaches limit the size or length of the explanation and measure how well it covers the space of input samples (Schwalbe and Finzel, 2023).
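As a rough illustration, a stability-style score can be computed by comparing the explanation of a sample with the explanations of slightly perturbed copies. The following Python sketch assumes a generic `explain` callable that returns a feature-importance vector; the function name, noise scale, and distance choice are illustrative and not taken from any specific library.

```python
import numpy as np

def stability_score(explain, x, n_perturbations=20, sigma=0.01, seed=None):
    """Average L2 distance between the explanation of x and the explanations
    of slightly perturbed copies of x. Lower values mean more stable
    explanations. `explain` maps a sample to a feature-importance vector."""
    rng = np.random.default_rng(seed)
    e_x = np.asarray(explain(x), dtype=float)
    distances = []
    for _ in range(n_perturbations):
        x_perturbed = x + rng.normal(0.0, sigma, size=x.shape)
        e_p = np.asarray(explain(x_perturbed), dtype=float)
        distances.append(np.linalg.norm(e_x - e_p))
    return float(np.mean(distances))
```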
Explainable Security (Vigano and Magazzeni,
2020; Parmar, 2021) establishes the field of explain-
able security, proposing stakeholder differentiation
and important questions for building and evaluating
XAI for security.
Nadeem et al. (Nadeem et al., 2022) provide a summary of notable work on both XAI methods in security and their evaluation. Their survey shows that only 1-in-5 applications of XAI in security were accompanied by any kind of evaluation of the explanations. A combination of general and security-related criteria is proposed by Warnecke et al. (Warnecke et al., 2020). Additionally, the authors show that some XAI methods, most notably GuidedBackProp and GradCAM (GradCAM++), are unsuitable for general usage in security.
The landscape of the explainable methods is di-
verse and creating a meaningful taxonomy is not a
simple task. We adopt the categorization first proposed by Molnar (Molnar, 2022) and extended in (Schwalbe and Finzel, 2023), which introduces a comprehensive taxonomy of XAI based on several criteria such as input, output, form of presentation, or interactivity of the methods.
Currently, the most adopted explanation methods
are (i) interpretable surrogates (decision trees, lin-
ear models), (ii) rules, and (iii) model-agnostic meth-
ods. Among these, methods focusing on the importance of input features, such as SHAP (Lundberg and Lee, 2017), LIME (Ribeiro et al., 2016), or RISE (Petsiuk et al., 2018), are widely used. Another
approach is explaining models by examples - proto-
types (Ribeiro et al., 2018; Kim et al., 2016), or coun-
terfactuals (Wachter et al., 2017).
Historically, the majority of the XAI methods
were created in other domains (such as computer vi-
sion). While adapting such methods to network security is somewhat possible, as shown in (Nadeem et al., 2022), the specifics of the target domain are not considered by most of the XAI methods.
3 METHODOLOGY
Our work aims to select the best XAI method to ex-
plain a model in a specific task based on a particu-
lar goal of the task, the ML model to be explained,
and the dataset. In the context of this paper, the goal
is represented by a task for which the XAI method
can be used, such as Model output understanding
or Model Improvement. The suitability of an XAI
method for a given task and ML model is measured
by the level of fulfillment of the task. The metrics differ for each task and are described in the subsections of Section 3.1.
Our methodology consists of (i) defining four security tasks and corresponding metrics for measuring the level of fulfillment of each task; (ii) training four ML models on a dataset of a real network security problem (detection of attacks looking only at encrypted TLS traffic); (iii) applying the selected XAI methods (SHAP, LIME, Anchors, and RISE) to the defined tasks and computing the metrics; and (iv) selecting the best XAI method for the task-model-data triplet.
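The selection logic implied by steps (iii) and (iv) can be sketched as a simple loop over tasks, models, and XAI methods. This is a minimal illustration only; the attributes `name`, `score`, and `higher_is_better` are placeholders and are not part of the released implementation.

```python
def select_best_xai(tasks, models, xai_methods, dataset):
    """Score every XAI method with each task-specific metric for each model
    and keep the best one per (task, model) pair."""
    selection = {}
    for task in tasks:
        for model in models:
            scores = {xai.name: task.score(model, xai, dataset)
                      for xai in xai_methods}
            pick = max if task.higher_is_better else min
            selection[(task.name, model.name)] = pick(scores, key=scores.get)
    return selection
```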
Unlike previously proposed methods, our evaluation does not require an ad hoc selection of metrics and their importance for a given task, which requires a deep understanding on the side of the user. Functional metrics such as robustness or fidelity evaluate purely the relation between the data point, the model output, and the explanation, not the impact of the explanation on the goal being pursued with the XAI method.
3.1 Network Security Tasks
We identified four common network security stake-
holders: the final user, the model creator, the model
evaluator, and the adversarial attacker. Each of these
stakeholders has different needs, domain knowledge,
and available assets. The end-user commonly does
not have full access to the ML model and only has
a few data points or even a single one. The primary
goal for this stakeholder is to explain why the data
point is classified to the given class (single sample de-
cision understanding). The model creator is building
the ML model, has plenty of data, and needs to under-
stand how to improve the model performance (model
improvement). The model evaluator has an ML model
and a lot of data and needs to explain why all the data
had these results (model decision understanding). The
adversarial attacker has an ML model and a lot of data, and wants to use explanations to extract (steal) the model (model extraction).
While the stakeholders’ needs and requirements
can benefit the XAI method selection, how to link
these requirements to the evaluation criteria remains
unsolved. The main problem is that the description of the stakeholder groups does not sufficiently capture the desired properties of the XAI methods. To overcome this issue, we propose, for each stakeholder, the previously mentioned high-level security tasks with identified goals and an XAI method for achieving them. Thus, by evaluating the goal fulfillment, we evaluate the influence the XAI method has on the particular task.
3.1.1 Model Decision Understanding
A better understanding of the model is one of the main
motivations for XAI methods in domains in which the
ML models often serve as support tools and in which
the user (human) makes the final decision based on
the understanding of the model output.
With the increasing size of the modern ML mod-
els, it becomes more challenging to use them locally,
and as a result, access to the model itself is limited.
Therefore, we assume a black-box scenario for the
model understanding task. Secondly, access to the
training dataset and model parameters is not assumed.
The core premise of the model understanding task is
the following: The output of the XAI method should
be as short as possible while capturing the critical
components for the provided ML model output.
We propose to evaluate how well the explanation captures the merit of the model's decision-making given a sample $x$ by using the explanation to transform $x$ into an adversarial example. Given the trained ML model $f(x) \rightarrow \hat{y}$, the explanation system is defined as $e(f, x, \hat{y}) \in \mathbb{R}^n$, which, given an ML model, a data point, and a label, produces the importance scores for the parts of the data point $x$. Next, we assume a perturbation mechanism $I(x, \mathbb{R}^k) \rightarrow x'$ which modifies the data point based on the given vector of importance scores. Lastly, we assume a collection of testing data points $D_{test} = \{x_1, x_2, \ldots, x_k\}$.

We measure the average amount of perturbation steps required until the data point is adversarial:

$$\frac{1}{|D|} \sum_{i=1}^{|D|} \left\llbracket f\big(I(x_{i-1},\, e(f, x_0, \hat{y}_0))\big) = f(x_0) \right\rrbracket \qquad (1)$$

where $x_0$ is the original sample from $D_{test}$ being gradually more modified by the perturbation mechanism and $\llbracket \cdot \rrbracket$ represents the logical value of the expression inside the brackets.
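A minimal Python sketch of this metric is shown below. It assumes the explanation is a feature-importance vector and that the perturbation mechanism replaces the most important features with a masking value (a constant outlier such as -1, or the per-feature medians, matching the two settings evaluated later). The function names are illustrative, not taken from the released implementation.

```python
import numpy as np

def steps_to_adversarial(model, explain, x, mask_value=-1.0):
    """Number of most-important features (per the explanation) that must be
    masked before the model's prediction changes. `mask_value` can be a
    scalar (outlier setting) or a per-feature vector (e.g. column medians)."""
    e_x = np.asarray(explain(x), dtype=float)
    order = np.argsort(-np.abs(e_x))                  # most important first
    mask = np.broadcast_to(mask_value, x.shape)
    y_0 = model.predict(x.reshape(1, -1))[0]
    x_mod = np.array(x, dtype=float)
    for step, feature in enumerate(order, start=1):
        x_mod[feature] = mask[feature]                # perturbation mechanism I
        if model.predict(x_mod.reshape(1, -1))[0] != y_0:
            return step                               # sample became adversarial
    return len(order)                                 # prediction never flipped

def average_steps(model, explain, X_test, mask_value=-1.0):
    """Average number of perturbation steps over the test set."""
    return float(np.mean([steps_to_adversarial(model, explain, x, mask_value)
                          for x in X_test]))
```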
3.1.2 Single Sample Decision Understanding
The single sample decision understanding is a special variant of the model understanding task described in Section 3.1.1 in which $|D_{test}| = 1$. This task refers to the problem of having only one data point (instead of a whole dataset) and therefore having very little information to make a decision. It is implemented by randomly selecting one data point from the dataset and running the XAI method only on this data point. The previously proposed selection mechanism for the XAI method is still applicable but is prone to high variance due to the lack of data.
3.1.3 Model Improvement
The analysis of model errors and weak points is an-
other area of ML where XAI methods are often used.
It differs significantly from the previous task, as such analysis is commonly performed either by the model developer or by an auditor who has access to the model, its parameters and architecture, as well as the labeled dataset. The use of XAI for model improvement includes augmenting the training data (Weber et al., 2023; Teso and Kersting, 2019).
For the model improvement task, we propose to apply the XAI method for the augmentation of the training data by identifying the most influential features. We assume labeled data $X = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, a model $f: X \rightarrow y$, and an XAI method $E(f(x), x, \hat{y}) \rightarrow e_x$. We split $X$ into three parts $X_{train}$, $X_{validation}$, $X_{test}$, and use $X_{train}$ for the training of the model $f$.

Then, $E(f(x), x, \hat{y})$ is applied to every $x \in X_{validation}$ and the corresponding model prediction $\hat{y}$, resulting in a set of explanations $\{e^{val}_{x_1}, e^{val}_{x_2}, \ldots, e^{val}_{x_m}\}$. Each $e^{val}_{x}$ ranks the features of $x$ according to their importance. The augmented training dataset $X'_{train}$ is created by removing the $k$ features with the lowest average ranking and is used for re-training the model with the same hyperparameters.

The relative change in the model performance quantifies the impact of the XAI method on the training process. In our experiments, we use the $F_1$ score to measure the model performance.
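This procedure can be sketched as follows, assuming a scikit-learn style estimator already fitted on X_train, binary 0/1 labels, and an `explain` callable returning a feature-importance vector; the helper name and the default value of k are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import f1_score

def improvement_score(model, explain, X_train, y_train, X_val, X_test, y_test, k=5):
    """Relative change in F1 after removing the k features with the lowest
    average importance rank (computed on the validation set) and retraining
    with the same hyperparameters."""
    baseline = f1_score(y_test, model.predict(X_test))

    # Rank features per validation sample (rank 0 = least important).
    ranks = np.array([np.argsort(np.argsort(np.abs(explain(x)))) for x in X_val])
    lowest = np.argsort(ranks.mean(axis=0))[:k]        # k least important features
    keep = np.setdiff1d(np.arange(X_train.shape[1]), lowest)

    retrained = clone(model).fit(X_train[:, keep], y_train)  # same hyperparameters
    augmented = f1_score(y_test, retrained.predict(X_test[:, keep]))
    return (augmented - baseline) / baseline
```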
3.1.4 Model Extraction
While the XAI methods provide insight into the
model’s internal representation and decisions, they
can be used in attacks such as model inversion and
model extraction (Yan et al., 2023; Kuppa and Le-
Khac, 2021). For the evaluation of the applicability
of the explanation systems for model attacks, we pro-
pose the task of model extraction guided by the expla-
nation.
We assume the target of the attack to be an arbitrary black-box model $f: X \rightarrow y$ which can be queried by the attacker. The goal of the attacker is to construct a surrogate model $f': X \rightarrow y$ which mimics the behavior of the target model $f$. The fulfillment of the task is measured by the agreement between the models with respect to a testing dataset $X_{test}$. The basic model extraction attack consists of the following steps:

1. Generation of a set of unlabeled data $X' = \{x'_1, x'_2, \ldots, x'_n\}$ using the original dataset $X$ of limited size.

2. Querying the target model $f$ with each $x' \in X'$ and obtaining the labels $\hat{y}$, resulting in the training dataset $D' = \{(x'_1, y_1), (x'_2, y_2), \ldots, (x'_n, y_n)\}$.

3. Training of the surrogate $f'$ using $D'$.

4. Evaluation of the agreement between the models $f$ and $f'$.
The role of the XAI method in this task lies in the generation of the training data $X'$ for the surrogate. The explanator output is used to guide the sampling of the dataset $X'$. For each $x \in X$, we use the explanation method $E(f(x), x, \hat{y}) \rightarrow e_x$ to obtain the explanation $e_x$.
The form of the explanation can vary. In the case of feature importance methods such as SHAP or LIME, the explanation consists of a vector of real numbers in $\mathbb{R}^{\|x\|}$ which identifies the importance score for each of the components of $x$. In the case of the Anchor method, the explanation is a sequence of IF-THEN rules that not only identify the feature for which the rule is applicable but also the exact value.

For each $x \in X$, $k$ perturbed data points are added to the dataset $X'$. The perturbation is as follows:

$$x' = x + \varepsilon$$

where $\varepsilon$ represents a zero-mean random vector:

$$\varepsilon \sim \mathcal{N}(0, e_x)$$

Such a perturbation strategy allows for better sampling in the parts of the feature space which are identified by the explanation method as important for the model's decision making.
We measure the fulfillment of the model extraction task as the agreement between the target model $f$ and the surrogate $f'$. In our experiments, we use the F1 score for the evaluation. The primary reason is the imbalance of labels in favor of the 'Benign' class, which is a common issue in cybersecurity datasets.
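The whole explanation-guided extraction loop can be sketched as follows. The sketch assumes scikit-learn style models and binary 0/1 labels; the function name and the sampling count k are illustrative, not part of the released implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import f1_score

def extract_model(target, surrogate, explain, X_seed, X_test, k=5, seed=None):
    """Explanation-guided model extraction: sample k perturbed points per seed
    sample with noise scaled by the explanation scores, label them by querying
    the target, train the surrogate, and report the F1 agreement on X_test."""
    rng = np.random.default_rng(seed)
    X_new = []
    for x in X_seed:
        e_x = np.abs(np.asarray(explain(x), dtype=float))  # importance scores
        for _ in range(k):
            X_new.append(x + rng.normal(0.0, e_x))          # eps ~ N(0, e_x)
    X_new = np.array(X_new)
    y_new = target.predict(X_new)                           # query the target model
    fitted = clone(surrogate).fit(X_new, y_new)
    agreement = f1_score(target.predict(X_test), fitted.predict(X_test))
    return fitted, agreement
```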
3.2 Dataset
The CTU-50 dataset (Stratosphere, 2015), containing malicious and benign traffic collected from 10 hosts over 5 days, is used. Various malware samples were used to avoid overfitting to a single malware family. The benign traffic captures real users' activity over five consecutive days.
Each data point in the dataset aggregates one-hour time windows for each 4-tuple (source IP, destination IP, destination port, and protocol). This 4-tuple identifies all flows related to a specific service. Features related to the TLS traffic of each 4-tuple are aggregated into numerical values, resulting in a feature vector of 38 features used for model training. In total, there are 36,168 Benign samples (both normal and background traffic) and 1,729 Malicious samples.
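For illustration, this aggregation into one-hour windows per 4-tuple could be expressed with a pandas group-by; the column names below are assumptions and do not reflect the actual CTU-50 schema.

```python
import pandas as pd

# Illustrative only: flows are grouped by the 4-tuple and a one-hour time
# window, then numerical TLS-related features are aggregated per group.
flows = pd.read_csv("flows.csv", parse_dates=["timestamp"])

group_key = ["src_ip", "dst_ip", "dst_port", "proto",
             pd.Grouper(key="timestamp", freq="1H")]

features = flows.groupby(group_key).agg(
    n_flows=("timestamp", "size"),
    total_bytes=("bytes", "sum"),
    mean_duration=("duration", "mean"),
    # ... further aggregates up to the 38 features used for model training
).reset_index()
```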
For the comparison with AutoXAI, the smaller
variant of the dataset is included, in which all cate-
gorical features as well as highly correlated features
are removed to accommodate the limitations of the available AutoXAI implementation.
3.3 Experiment Setup
For the experiment, we use four ML models that are
commonly used in cybersecurity applications: Ran-
dom Forest (RF), Gradient Boosting Tree (GBT),
Support Vector Machine (SVM), and Multi-Layer
Perceptron (MLP). Due to the model variety, we chose to use model-agnostic XAI methods. LIME (https://pypi.org/project/lime/), SHAP (https://pypi.org/project/shap/), Anchors (https://pypi.org/project/anchor-exp/), and RISE are used in all experiments. Moreover, a random or uniform baseline for each experiment is included. The implementation can be found in the project repository (https://github.com/stratosphereips/sec-xai).
For the AutoXAI comparison, we select fidelity, robustness, and conciseness as target properties, with weights of 1, 1, and 0.5, respectively.
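For reference, a minimal example of invoking two of the model-agnostic explainers on tabular data is shown below (only LIME and SHAP; Anchors and RISE follow a similar pattern). The data and model here are synthetic placeholders standing in for the 38-feature TLS dataset.

```python
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder data standing in for the 38-feature TLS dataset.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 38)), rng.integers(0, 2, 500)
X_test = rng.normal(size=(50, 38))
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# LIME: local surrogate explanation for a single sample.
lime_explainer = LimeTabularExplainer(X_train, mode="classification",
                                      class_names=["Benign", "Malicious"])
lime_exp = lime_explainer.explain_instance(X_test[0], model.predict_proba,
                                           num_features=10)

# KernelSHAP with a small background sample to keep the runtime manageable.
background = shap.sample(X_train, 50)
shap_explainer = shap.KernelExplainer(model.predict_proba, background)
shap_values = shap_explainer.shap_values(X_test[:5])
```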
4 RESULTS
The results of our experiments can be seen in Ta-
ble 1 for the model understanding task when the re-
placement of the features is done by using outlier
values. The values for evaluating XAI methods are better when they are smaller, meaning fewer iterations were necessary to create adversarial examples because the features selected were very important.
Table 2 shows the same results but for replacing
features using the median value of that feature. This
corresponds to the idea of hiding the value of the fea-
ture among the data in order to decrease its impor-
tance for the XAI.
Table 3 shows the results for the adversarial at-
tacker task of extracting the models. Table 4 shows
the results of the task of model improvement.
Table 1: Model understanding task - outliers. The perfor-
mance of XAI methods is measured as the smallest amount
of input data point perturbation required for the creation of
an adversarial sample. The important parts of the input data
point (as identified by the explanation) are set to the outlier
value -1. Lower is better, with 1 being the optimal value.
LIME SHAP Anchor RISE Random
Random Forest 1.505 5.180 9.718 7.535 8.660
Gradient Boosting Tree 24.196 35.047 35.704 26.396 27.486
MLP 8.161 7.174 7.066 13.841 16.719
SVM 1.162 1.052 1.263 1.028 1.330
Table 2: Model understanding task - median. The perfor-
mance of XAI methods is measured as the smallest amount
of input data point perturbation required for the creation of
an adversarial sample. The important parts of the input data
point (as identified by the explanation) are set to the median.
Lower is better, with 1 being the optimal value.
LIME SHAP Anchor RISE Random
Random Forest 1.302 1.048 1.338 1.006 1.526
Gradient Boosting Tree 1.346 1.004 1.316 1.005 1.574
MLP 2.136 2.156 2.64 2.85 4.024
SVM 1.196 1.332 1.35 1.288 1.407
Figure 1: Model understanding experiment - MLP. Boxplot representation of the results shown in the third row of Table 1.
Table 3: Model extraction. A comparison of the influ-
ence of XAI methods on the surrogate model creation. The
model agreement is measured as the F1 score of the output
of the target and surrogate models. Higher is better.
target model surrogate model LIME SHAP Anchor RISE Uniform
RF RF 0.6857 0.6479 0.6427 0.5646 0.5544
GBT RF 0.7053 0.6550 0.5936 0.6106 0.5682
MLP RF 0.6179 0.6402 0.7588 0.6449 0.6457
SVM RF 0.7372 0.7418 0.7712 0.0 0.0
RF GBT 0.6321 0.6501 0.6012 0.6108 0.5038
GBT GBT 0.6991 0.6882 0.7044 0.8284 0.6413
MLP GBT 0.6670 0.7358 0.6568 0.6940 0.6250
SVM GBT 0.7136 0.7378 0.6967 0.3302 -
Table 4: Model Improvement. Relative change in the
model F1 score after retraining on the dataset augmented
by the respective XAI method. Higher is better.
LIME SHAP Anchor RISE Random
Random Forest -0.10% -0.48% 0.36% 0.62% -3.11%
Gradient Boosting Tree -1.17 % -0.95% -0.22% -0.11% -4.41%
MLP 7.80% 17.52% -24.90% 2.58% -67.83%
SVM -3.07% -20.99% -4.37% -19.07% -5.21%
5 ANALYSIS
The results of the experiment show that none of the XAI methods outperforms the others in all the proposed tasks. A high variance of the results can
be observed between the tasks and also between the
target ML models.
In the model understanding task, we evaluated the
proposed task and metrics in two experiments that dif-
fered in the value used for the masking-out of the fea-
tures identified by the XAI methods.
In Table 1, the masking value is set to a constant,
which is not present in the original feature values. In
Table 2 the feature values were replaced by a median
of the corresponding column in the dataset. The comparison of the results shows that using the median resulted in a lower variance of the method scores.
In the experiment where features were masked with the median, RISE and SHAP explanations resulted in
Table 5: Overview of best-performing XAI method per task
and target ML model.
Task Model Best XAI method
Model understanding (outliers) RF LIME
Model understanding (outliers) GBT LIME
Model understanding (outliers) MLP Anchors
Model understanding (outliers) SVM RISE
Model understanding (median) RF RISE
Model understanding (median) GBT LIME
Model understanding (median) MLP LIME
Model understanding (median) SVM LIME
Model extraction (RF) RF LIME
Model extraction (RF) GBT LIME
Model extraction (RF) MLP Anchors
Model extraction (RF) SVM Anchors
Model extraction (GBT) RF SHAP
Model extraction (GBT) GBT RISE
Model extraction (GBT) MLP SHAP
Model extraction (GBT) SVM SHAP
Model improvement RF RISE
Model improvement GBT RISE
Model improvement MLP SHAP
Model improvement SVM LIME
Table 6: AutoXAI results for smaller dataset.
target model Suggested XAI method
RF LIME
GBT LIME
MLP SHAP
SVM LIME
the fastest creation of adversarial samples for the ensemble methods (Random Forest, Gradient Boosting Trees). At the same time, LIME showed better performance for the neural network and the SVM. All XAI methods consistently outperformed the baseline.
During the creation of adversarial samples by re-
placing features with outlier values (Table 1), LIME
performed best in all models except for the MLP. Fig-
ure 1 shows the boxplot representation of the third
row of Table 1. It shows that, while on average LIME required one step more than SHAP to create adversarial samples for the MLP, SHAP performed better for most data points.
The results in Table 4 show that in the training data
augmentation, LIME was the most consistent XAI
method across all of the ML models.
The XAI methods were used to create the surro-
gate training data for the model extraction task. From
the experiment results in Table 3, we can see that
LIME performed the best when creating RF surro-
gates for models of similar types. At the same time,
Anchors performed best when the target model ar-
chitectures were completely different (MLP, SVM),
achieving over 70% of agreement with the original
model measured by the F1 score. When the GBT was
used as a surrogate, SHAP was the most consistent
XAI method for the task.
A comparison with the AutoXAI results shown in Table 6 shows that, while for some model types both methods suggest the same XAI method, the task separation allows for more precise matching of the explanation system to the desired goal. It is important to note that a further hyperparameter search for AutoXAI could provide higher granularity in the results.
5.1 Method Limitations
There are some limitations of our proposed method, and some future work would help evaluate it better. Currently, the technique strongly relies on feature-importance explanations, which limits the evaluation of more generic XAI methods. Secondly, the evaluation depends on the model and the data, which makes the results not directly transferable and the evaluation itself time-consuming. Lastly, the perturbation scheme cannot accommodate non-tabular data.
6 CONCLUSION
This paper proposes an approach to evaluating model-
agnostic explanation methods focusing on the net-
work security domain.
Our method evaluates the explanation systems in the context of the tasks in which the explanations are used. We formulated and evaluated three high-level tasks that represent typical applications of XAI methods in the development and usage of ML models in security, each represented by a triplet of dataset, ML model, and goal. In the experimental evaluation, we compared four model-agnostic XAI methods (LIME, SHAP, RISE, and Anchors) with a baseline in each proposed task. The evaluation included four model types (Random Forest, Gradient Boosting Trees, SVM, and MLP).
The experimental evaluation showed that the formulated tasks are diverse and that none of the XAI methods outperformed the others. This supports the hypothesis of the importance of the relationship between the XAI method and the underlying ML model. In particular, in most of the experiments, LIME showed the best results when the underlying model was a Random Forest classifier, whereas for the neural network and the SVM, SHAP and Anchors performed better.
We have further shown that the proposed tasks can
be used for the evaluation of the XAI methods for net-
work security.
6.1 Future Work
Adapting the method to other types of XAI methods is planned to broaden the usability of the proposed evaluation framework. Such adaptation includes support for other data types such as time series or graphs. A second extension includes evaluating a wider variety of security datasets, adapting the approach to other domains, and incorporating our evaluation into the AutoXAI framework.
ACKNOWLEDGEMENTS
The authors acknowledge support from the Strategic
Support for the Development of Security Research
in the Czech Republic 2019–2025 (IMPAKT 1) pro-
gram, by the Ministry of the Interior of the Czech Re-
public under No. VJ02010020.
REFERENCES
Agarwal, C., Krishna, S., Saxena, E., Pawelczyk, M., John-
son, N., Puri, I., Zitnik, M., and Lakkaraju, H. (2022).
Openxai: Towards a transparent evaluation of model
explanations. In Koyejo, S., Mohamed, S., Agar-
wal, A., Belgrave, D., Cho, K., and Oh, A., editors,
Advances in Neural Information Processing Systems,
volume 35, pages 15784–15799. Curran Associates,
Inc.
Aledhari, M., Razzak, R., and Parizi, R. M. (2021). Ma-
chine learning for network application security: Em-
pirical evaluation and optimization. Computers &
Electrical Engineering, 91:107052.
Alvarez-Melis, D. and Jaakkola, T. S. (2018). On
the Robustness of Interpretability Methods.
arXiv:1806.08049 [cs, stat].
Amit, I., Matherly, J., Hewlett, W., Xu, Z., Meshi, Y.,
and Weinberger, Y. (2019). Machine Learning in
Cyber-Security - Problems, Challenges and Data Sets.
arXiv:1812.07858 [cs, stat]. arXiv: 1812.07858.
Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., Gil-López, S., Molina, D., Benjamins, R., Chatila, R., and Herrera, F. (2019). Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. arXiv:1910.10045 [cs].
Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., Gil-López, S., Molina, D., Benjamins, R., et al. (2020). Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115.
Cugny, R., Aligon, J., Chevalier, M., Roman Jimenez, G.,
and Teste, O. (2022). Autoxai: A framework to auto-
matically select the most adapted xai solution. In Pro-
ceedings of the 31st ACM International Conference on
Information & Knowledge Management, CIKM ’22,
page 315–324, New York, NY, USA. Association for
Computing Machinery.
Doshi-Velez, F. and Kim, B. (2017). Towards A Rig-
orous Science of Interpretable Machine Learning.
arXiv:1702.08608 [cs, stat].
Jacobs, A. S., Beltiukov, R., Willinger, W., Ferreira, R. A.,
Gupta, A., and Granville, L. Z. (2022). AI/ML for
Network Security: The Emperor has no Clothes. In
Proceedings of the 2022 ACM SIGSAC Conference
on Computer and Communications Security, pages
1537–1551, Los Angeles CA USA. ACM.
Jin, W., Fan, J., Gromala, D., Pasquier, P., and Hamarneh,
G. (2023). Invisible Users: Uncovering End-Users’
Requirements for Explainable AI via Explanation
Forms and Goals. arXiv:2302.06609 [cs].
Kim, B., Khanna, R., and Koyejo, O. O. (2016). Exam-
ples are not enough, learn to criticize! Criticism for
Interpretability. In Lee, D., Sugiyama, M., Luxburg,
U., Guyon, I., and Garnett, R., editors, Advances in
Neural Information Processing Systems, volume 29.
Curran Associates, Inc.
Krishna, S., Han, T., Gu, A., Pombra, J., Jabbari, S., Wu, S.,
and Lakkaraju, H. (2022). The Disagreement Problem
in Explainable Machine Learning: A Practitioner’s
Perspective. arXiv:2202.01602 [cs].
Kuppa, A. and Le-Khac, N.-A. (2021). Adversarial xai
methods in cybersecurity. IEEE transactions on in-
formation forensics and security, 16:4924–4938.
Lundberg, S. and Lee, S.-I. (2017). A Unified Approach
to Interpreting Model Predictions. arXiv:1705.07874
[cs, stat].
Molnar, C. (2022). Interpretable Machine Learning.
Christoph Molnar, 2 edition.
Mothilal, R. K., Mahajan, D., Tan, C., and Sharma, A.
(2021). Towards Unifying Feature Attribution and
Counterfactual Explanations: Different Means to the
Same End. In Proceedings of the 2021 AAAI/ACM
Conference on AI, Ethics, and Society, pages 652–
663. arXiv:2011.04917 [cs].
Nadeem, A., Vos, D., Cao, C., Pajola, L., Dieck, S., Baum-
gartner, R., and Verwer, S. (2022). SoK: Explain-
able Machine Learning for Computer Security Appli-
cations. arXiv:2208.10605 [cs].
Nguyen, H. D., Do, N. V., Tran, N. P., Pham, X. H., Pham,
V. T., and Minutolo, A. (2020). Some criteria of the
knowledge representation method for an intelligent
problem solver in stem education. Appl. Comp. Intell.
Soft Comput., 2020.
Parmar, M. (2021). XAISec - Explainable AI Security: An early discussion paper on new multidisciplinary subfield in pursuit of building trust in security of AI systems.
Petsiuk, V., Das, A., and Saenko, K. (2018). Rise: Ran-
domized input sampling for explanation of black-box
models.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). ”Why
Should I Trust You?”: Explaining the Predictions of
Any Classifier. arXiv:1602.04938 [cs, stat].
Ribeiro, M. T., Singh, S., and Guestrin, C. (2018). An-
chors: High-Precision Model-Agnostic Explanations.
Proceedings of the AAAI Conference on Artificial In-
telligence, 32(1). Number: 1.
Sacher, D. (2020). Fingerpointing False Positives: How to
Better Integrate Continuous Improvement into Secu-
rity Monitoring. Digital Threats: Research and Prac-
tice, 1(1):1–7.
Sarker, I. H. (2021). Machine Learning: Algorithms, Real-
World Applications and Research Directions. SN
Computer Science, 2(3):160.
Schwalbe, G. and Finzel, B. (2023). A Comprehen-
sive Taxonomy for Explainable Artificial Intelligence:
A Systematic Survey of Surveys on Methods and
Concepts. Data Mining and Knowledge Discovery.
arXiv:2105.07190 [cs].
Sokol, K. and Flach, P. (2020). Explainability fact sheets: A
framework for systematic assessment of explainable
approaches. In Proceedings of the 2020 Conference
on Fairness, Accountability, and Transparency, FAT*
’20, page 56–67, New York, NY, USA. Association
for Computing Machinery.
Stratosphere (2015). Stratosphere laboratory datasets. Retrieved March 13, 2020, from https://www.stratosphereips.org/datasets-overview.
Teso, S. and Kersting, K. (2019). Explanatory interac-
tive machine learning. In Proceedings of the 2019
AAAI/ACM Conference on AI, Ethics, and Society,
AIES ’19, page 239–245, New York, NY, USA. As-
sociation for Computing Machinery.
van der Waa, J., Nieuwburg, E., Cremers, A., and Neer-
incx, M. (2021). Evaluating XAI: A comparison of
rule-based and example-based explanations. Artificial
Intelligence, 291:103404.
Vigano, L. and Magazzeni, D. (2020). Explainable Secu-
rity. In 2020 IEEE European Symposium on Secu-
rity and Privacy Workshops (EuroS&PW), pages 293–
300, Genoa, Italy. IEEE.
Wachter, S., Mittelstadt, B., and Russell, C. (2017). Coun-
terfactual Explanations Without Opening the Black
Box: Automated Decisions and the GDPR. SSRN
Electronic Journal.
Warnecke, A., Arp, D., Wressnegger, C., and Rieck, K.
(2020). Evaluating Explanation Methods for Deep
Learning in Security. In 2020 IEEE European Sympo-
sium on Security and Privacy (EuroS&P), pages 158–
174.
Weber, L., Lapuschkin, S., Binder, A., and Samek, W.
(2023). Beyond explaining: Opportunities and chal-
lenges of xai-based model improvement. Information
Fusion, 92:154–176.
Yan, A., Huang, T., Ke, L., Liu, X., Chen, Q., and Dong, C.
(2023). Explanation leaks: Explanation-guided model
extraction attacks. Information Sciences, 632:269–
284.
Yeh, C.-K., Hsieh, C.-Y., Suggala, A. S., Inouye, D. I., and
Ravikumar, P. (2019). On the (In)fidelity and Sensi-
tivity for Explanations. arXiv:1901.09392 [cs, stat].