On Handling Concept Drift, Calibration and Explainability in
Non-Stationary Environments and Resources Limited Contexts
Sara Kebir (https://orcid.org/0000-0002-4471-9119) and Karim Tabia (https://orcid.org/0000-0002-8632-3980)
Univ. Artois, CNRS, CRIL, F-62300 Lens, France
Keywords:
Concept Drift, Lightweight Incremental Learning, Calibration, XAI, Feature Attribution.
Abstract:
In many real-world applications, we face two important challenges: on the one hand, shifts in data distribution and concept drift; on the other hand, the constraints of limited computational resources, particularly in the field of IoT and edge AI. Although both challenges have been well studied separately, they are rarely tackled together. In this paper, we place ourselves in a context of limited resources and address the problem of concept and distribution shift, not only to ensure a good level of accuracy over time, but also to study the impact this can have on two complementary aspects: the confidence/calibration of the model and the explainability of its predictions. We first propose a global framework for this problem based on incremental learning, model calibration and lightweight explainability. In particular, we propose a solution for providing feature attributions in a context of limited resources. Finally, we empirically study the impact of incremental learning on model calibration and the quality of explanations.
1 INTRODUCTION
In some applications, data properties are not stationary (they may change over time), and shifts in the statistical properties of some classes may occur, negatively impacting the performance of the machine learning (ML) models used. This is a well-known problem called concept drift (Lu et al., 2019), and its treatment consists in detecting such drifts and then updating the models with recent available data. In modern applications, it is no longer enough to have an accurate ML model: to have complete confidence in these systems, it is also necessary to have well-calibrated models (providing good estimates of their predictive uncertainty) and explainable predictions. These problems are relatively well studied in the literature. However, their combination in a context of limited resources is very little explored. Indeed, if we consider the problem of anomaly detection in a smart home where several sensors are used and where detection is done in an edge AI fashion (locally, close to the data collection sites), it is essential to take into account the changes and shifts that may occur over time (e.g., because people's habits change, the context too, etc.). It is also essential to provide a precise estimate of the models' confidence and to have explanations when an alert is raised.
In an era where the Internet of Things (IoT) has
rapidly permeated our lives, the promise of intercon-
nected devices ushering in a new age of convenience
and efficiency is undeniable. From smart homes that
adjust lighting and temperature to intrusion detec-
tion systems keeping us safe, the applications of IoT
streaming data are diverse. Nevertheless, the resource
constraints in this context are a harsh reality. IoT devices are designed to be power-efficient, often
equipped with minimal processing power, memory,
and energy resources (Cook et al., 2020). This inher-
ent limitation forces us to consider innovative strate-
gies for processing, analyzing, and acting upon the
data they generate. One of the main challenges accompanying this context is the dynamic and evolving nature of the data analyzed in ever-changing, non-stationary real environments. As the AI systems used rely on historical data and previously trained models, they struggle to maintain their accuracy and effectiveness when facing shifts in data distributions, new patterns, or changing user habits, causing model degradation and detection failures.
In the realm of addressing this challenge of concept
drift, numerous models and techniques have emerged,
showcasing the pressing demand for adaptive ma-
chine learning strategies. These methodologies span
a spectrum of domains, including incremental learn-
ing algorithms like Hoeffding Trees (HTs) (Lu et al.,
2019) and robust concept drift detection mechanisms
such as ADWIN (Bifet, 2009). These models aim
to accommodate the evolving nature of data distribu-
tions, allowing machine learning systems to maintain
their predictive accuracy over time. Although efforts
are numerous around this issue, there are few works
that address the impact of concept drift and adaptive
strategies on confidence, calibration and explainabil-
ity in non-stationary contexts. In this work, we focus
on problems where resources are limited and the con-
text is non-stationary while trying to shed light on the
trustworthiness of evolving AI systems. The main contributions of the paper are:

1. We first propose a framework to treat the concept drift problem on stream data using a lightweight windowing ensemble model that consumes less time and memory than state-of-the-art adaptive methods;

2. We then provide first preliminary results on explainability in a lightweight context. The proposed scheme is designed to use very few resources while providing explanations close to those of a standard explanation method such as SHAP, which is very resource-demanding;

3. We finally draw attention to the impact of concept drift and incremental learning on the calibration and the quality of explainability of the model used over time, and open up new perspectives.
2 RELATED WORKS
Before diving into the issue of handling concept drift in resource-constrained environments, this section reviews work related to concept drift detection and adaptation, and to the potential impact of drift on the calibration and explainability of the models used over time.
2.1 Concept Drift Detection
Concept drift problems frequently arise in IoT data due to its non-stationary nature and the dynamic environments in which IoT systems operate, which can lead to a deterioration in the performance of ML models. Detecting concept drift in IoT data presents two primary challenges: the presence of numerous potential causal factors and the existence of multiple types of drift (Agrahari and Singh, 2022; Lu et al., 2019). The most common type is sudden drift, where the data distribution changes abruptly, often due to external factors such as a shift in user behavior or a change in the environment. Gradual drift, on the other hand, involves a slower, consistent change in which a new concept gradually replaces an old one over a period of time, making it challenging to detect. Incremental drift occurs when the drift happens in small, incremental steps. Finally, recurring drift involves periodic changes, often influenced by seasonal or cyclical patterns in the data. Detecting and adapting to these various types of concept drift is essential for maintaining the accuracy and reliability of machine learning models.
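To make these drift patterns concrete, the following minimal sketch (our own illustration; names such as drifting_stream are hypothetical) generates a one-dimensional toy stream whose concept is the mean of a Gaussian, switching either abruptly (sudden drift) or through increasingly frequent draws from the new concept (gradual drift):

```python
import numpy as np

def drifting_stream(n=10_000, drift_at=5_000, kind="sudden",
                    width=1_000, seed=0):
    """Toy 1-D stream whose 'concept' is the mean of a Gaussian.

    kind='sudden'  : the mean switches abruptly at drift_at;
    kind='gradual' : over `width` samples, values are drawn from the
                     new concept with increasing probability.
    """
    rng = np.random.default_rng(seed)
    for t in range(n):
        if kind == "sudden":
            mean = 0.0 if t < drift_at else 3.0
        else:  # gradual mixing of old and new concepts
            p_new = min(max((t - drift_at) / width, 0.0), 1.0)
            mean = 3.0 if rng.random() < p_new else 0.0
        yield rng.normal(mean, 1.0)
```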
To effectively address the issue of concept
drift, various detection techniques can be employed.
Among the commonly used approaches for this pur-
pose are distribution-based and performance-based
methods (Yang and Shami, 2023).
Distribution-based methods rely on the use of data
buffers, which can either be fixed-sized sliding win-
dows or adaptive windows, to monitor different con-
cepts. These methods are specifically designed to de-
tect concept drift by observing changes within these
windows. ADWIN, a well-known distribution-based
approach, utilizes an adaptable sliding window to
identify concept drift (Bifet, 2009). It does so by com-
paring key characteristic values of old and new data
distributions, like mean and variance values. A signif-
icant alteration in these characteristic values over time
serves as an indicator of a drift occurrence. ADWIN
is particularly effective at dealing with gradual drifts
and long-term changes. However, it can sometimes
generate false alarms and unnecessary model updates.
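As a rough illustration of ADWIN's core test, the sketch below (a simplified version, not the optimized exponential-histogram implementation of (Bifet, 2009)) keeps a window of recent values and drops its oldest part whenever two sub-windows have significantly different means according to a Hoeffding-style bound:

```python
import math
from collections import deque

class SimpleADWIN:
    """Simplified sketch of ADWIN: drift is signaled when some split of
    the window yields two sub-windows with significantly different means."""

    def __init__(self, delta=0.002, max_len=1000):
        self.delta = delta
        self.window = deque(maxlen=max_len)

    def update(self, x):
        self.window.append(x)
        w = list(self.window)
        n = len(w)
        for split in range(5, n - 5):            # skip tiny sub-windows
            n0, n1 = split, n - split
            mu0 = sum(w[:split]) / n0
            mu1 = sum(w[split:]) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)      # harmonic mean of sizes
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(mu0 - mu1) > eps:
                for _ in range(split):           # drop the stale sub-window
                    self.window.popleft()
                return True                      # drift detected
        return False
```

Note that this naive version rechecks every split at each update; the real ADWIN uses bucketed summaries to keep the cost logarithmic.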
Performance-based methods, on the other hand,
assess model performance over time to recognize con-
cept drift. These methods gauge the rate of degra-
dation in model performance. Early Drift Detec-
tion Method (EDDM) (Baena-García et al., 2006),
a widely-used performance-based technique, tracks
changes in model performance based on the rate of
change in a learning model’s error rate and standard
deviation. By employing drift and warning thresh-
olds, EDDM can effectively identify instances of
model performance degradation and the occurrence
of concept drift, particularly sudden drift. However,
it may not be as proficient as distribution-based meth-
ods in detecting gradual drift. Performance-based
methods can effectively detect the drifts that cause
model degradation, but they require the availability
of ground-truth labels (Yang and Shami, 2021).
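The sketch below illustrates EDDM's principle under the same caveat (a simplified, illustrative version): it tracks the distance, in samples, between consecutive errors and signals a warning or a drift when the current level mean + 2*std falls well below the best level observed so far:

```python
class SimpleEDDM:
    """Sketch of EDDM's idea: shrinking distances between consecutive
    classification errors signal performance degradation, hence drift."""

    def __init__(self, warn=0.95, drift=0.90):
        self.warn, self.drift = warn, drift
        self.i = 0              # sample index
        self.last_err = None    # index of the previous error
        self.n = 0              # number of error distances seen
        self.mean = 0.0         # running mean of distances
        self.m2 = 0.0           # running sum of squared deviations
        self.best = 0.0         # max of mean + 2*std seen so far

    def update(self, is_error):
        self.i += 1
        if not is_error:
            return "ok"
        if self.last_err is not None:
            d = self.i - self.last_err
            self.n += 1
            delta = d - self.mean           # Welford's online update
            self.mean += delta / self.n
            self.m2 += delta * (d - self.mean)
            std = (self.m2 / self.n) ** 0.5
            level = self.mean + 2.0 * std
            self.best = max(self.best, level)
            if self.n > 30 and self.best > 0:   # small burn-in period
                ratio = level / self.best
                if ratio < self.drift:
                    return "drift"
                if ratio < self.warn:
                    return "warning"
        self.last_err = self.i
        return "ok"
```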
2.2 Concept Drift Adaptation
Once concept drift is detected, it is crucial to effec-
tively address it to enable the learning model to adapt
to the new data patterns. Several techniques can be
used to handle concept drift.
Incremental learning methods involve the partial
updating of the learning model whenever new sam-
ples arrive or concept drift is identified. This al-
lows the model to adapt incrementally to changing
data. Hoeffding Trees (HTs) (Lu et al., 2019) represent a fundamental incremental learning technique
that utilizes the Hoeffding inequality to calculate the
minimum required number of data samples for each
split node. This allows it to update nodes and adapt
to new data samples. Extremely Fast Decision Tree
(EFDT) (Manapragada et al., 2018) is a cutting-edge
incremental learning approach and an enhanced ver-
sion of HTs. It promptly selects and deploys node
splits as soon as they attain a confidence value indi-
cating their utility. EFDT excels at adapting to con-
cept drift more accurately and efficiently compared
to HTs. Online Passive-Aggressive (OPA) (Cram-
mer et al., 2006) is another incremental learning al-
gorithm that treats drifts by passively reacting to cor-
rect predictions and aggressively responding to errors.
Numerous incremental approaches have been devel-
oped by leveraging conventional machine learning al-
gorithms. For instance, K-Nearest Neighbors with the
ADWIN drift detector (KNN-ADWIN) (Losing et al.,
2016) represents an enhanced iteration of the traditional KNN model designed for real-time data analysis. KNN-ADWIN incorporates an ADWIN drift
detector into the conventional KNN model and em-
ploys a dynamic window to determine which samples
should be retained for model updates.
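For concreteness, the Hoeffding bound used by HTs can be computed as follows (a minimal sketch; the gains and thresholds are illustrative values, not from the paper):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """eps = sqrt(R^2 * ln(1/delta) / (2n)): with probability 1 - delta,
    the true mean of a variable with range R lies within eps of the mean
    observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# A Hoeffding Tree splits a leaf once the gain gap between the two best
# attributes exceeds eps; for information gain on a binary problem the
# range is R = log2(2) = 1.
eps = hoeffding_bound(value_range=1.0, delta=1e-7, n=600)
best_gain, second_gain = 0.32, 0.20   # illustrative values
gap = best_gain - second_gain
print("split" if gap > eps else "wait for more samples",
      f"(gap={gap:.3f}, eps={eps:.3f})")
```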
Ensemble online learning models represent ad-
vanced methods for adapting to concept drift, inte-
grating the predictions of multiple base learners to
enhance performance. Leverage Bagging (LB) (Bifet et al., 2010) is a fundamental ensemble technique
that creates and combines multiple base learners, such
as Hoeffding Trees (HTs), using bootstrap samples
and a majority voting strategy. Adaptive Random For-
est (ARF) (Gomes et al., 2017) and Streaming Ran-
dom Patches (SRP) (Gomes et al., 2019) are two so-
phisticated ensemble online learning approaches that
utilize multiple HTs as base models and incorporate a
drift detector, such as ADWIN, for each HT to handle
concept drift. ARF employs the local subspace ran-
domization strategy to construct trees, while SRP uti-
lizes global subspace randomization to generate ran-
dom feature subsets for model learning. The use of
global subspace randomization enhances SRP’s learn-
ing performance but comes at the cost of increased
model complexity and longer learning times. Multi-
Stage Automated Online Network Data Stream An-
alytics (MSANA) is a framework proposed by Yang and Shami (2023) for IIoT systems, which uses a window-based strategy and selects lightweight base learners with greater computational speed to build the ensemble model. Their frame-
work consists of dynamic data pre-processing, drift-
based dynamic feature selection, dynamic model se-
lection, and online model ensemble using a novel W-
PWPAE approach (Yang et al., 2021). While ensem-
ble online learning techniques generally surpass in-
cremental learning methods in the realm of dynamic
data stream analysis, they often come with a signifi-
cant computational cost (Yang and Shami, 2023).
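As an illustration of the resampling at the heart of such ensembles, the sketch below shows one leveraging-bagging-style update step, where each base learner sees the incoming sample k ~ Poisson(lambda) times (lambda = 1 gives classic online bagging; Leverage Bagging uses a larger lambda to increase diversity). The learn_one incremental-fit method is an assumed, river-style API:

```python
import numpy as np

rng = np.random.default_rng(1)

def online_bagging_update(learners, x, y, lam=6.0):
    """One leveraging-bagging-style training step: each base learner sees
    the incoming sample k times, with k ~ Poisson(lam)."""
    for learner in learners:
        for _ in range(rng.poisson(lam)):
            learner.learn_one(x, y)   # assumed river-style incremental-fit API
```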
2.3 Classifier Calibration and
Explainability
In this context of adaptive learning, where models are
continuously updated with new data, concept drift can
have a profound impact on the model’s predictive per-
formance. Some of the crucial aspects that may also
be affected by concept drift are the confidence, quality
of calibration and explainability of the used models.
Calibration measures how well a model’s pre-
dicted probabilities align with the actual likelihood of
an event occurring (Vaicenavicius et al., 2019). When
concept drift occurs and the model is not regularly
updated to account for it, the calibration may dete-
riorate over time, leading to unreliable and mislead-
ing probability estimates. To maintain the quality of
calibration in incremental learning, models must be
regularly monitored and adapted to evolving data dis-
tributions, ensuring that predictions remain trustwor-
thy and valuable for decision-making. As with the
measurement of classifier efficiency, there are vari-
ous metrics for measuring calibration. Some com-
monly used measures are Expected Calibration Error
(ECE), Average Calibration Error (ACE) and Maxi-
mum Calibration Error (MCE). Miscalibration measures assess errors by binning samples according to their confidence level and then evaluating the accuracy within each bin. For example, ECE is the weighted average difference between the classifier's confidence and the observed accuracy (on a test set) over the bins, while MCE gives the maximum such deviation (Naeini et al., 2015). Negative log-likelihood (NLL) can also be used to indirectly measure model calibration, since it penalizes high probability scores assigned to incorrect labels and low probability scores assigned to correct labels (Ashukha et al., 2020). Calibration quality improves as these metrics decrease.
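These metrics are straightforward to compute from a model's predicted probabilities; the sketch below (our own illustration) bins samples by confidence for ECE/MCE and computes the mean NLL:

```python
import numpy as np

def calibration_errors(conf, correct, n_bins=10):
    """ECE / MCE from confidences (max predicted prob) and 0/1 correctness.
    Each bin contributes the gap |accuracy - mean confidence|, weighted by
    bin size for ECE and maximized over bins for MCE."""
    conf, correct = np.asarray(conf), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce = 0.0, 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap
            mce = max(mce, gap)
    return ece, mce

def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true labels, given a
    (n_samples, n_classes) array of predicted probabilities."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    p = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(-np.log(p).mean())
```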
Explainability in machine learning is crucial for
understanding model decisions. XAI post-hoc expla-
nation methods like SHAP (Lundberg and Lee, 2017)
and LIME (Ribeiro et al., 2016) are well-known fea-
ture attribution methods. Although proficient at of-
fering insights into the used models’ predictions, they
are often resource-intensive, especially when dealing
with complex models or high-dimensional datasets
(Van den Broeck et al., 2021). This computational
intensity can pose considerable limitations in applica-
tions where real-time decision-making is crucial, such
as the detection of emergency situations affecting the
elderly in smart homes. In addition, their high com-
putational demands can lead to increased energy con-
sumption and costs, which may prove unsustainable
in resource-constrained environments. There is there-
fore a compelling need for lightweight XAI systems
that strike a balance between explainability and real-
time feasibility.
3 PROPOSED FRAMEWORK
In response to the possible fluctuations in data dis-
tribution over time driven by concept drift, it be-
comes crucial to adopt an incremental learning strat-
egy. This involves continually training and updating
the ML model as new data becomes available. The
process of labeling data in such cases can be quite de-
manding and resource-intensive, often requiring the
involvement of domain experts in real-world applica-
tions. Our framework, illustrated in Figure 1 and de-
scribed in this section, summarizes the different steps of the process, from the initial offline training of the model to its deployment in a resource-constrained environment and its update over time as streaming data arrives. This is followed by the description of the proposed lightweight explanation approach.
3.1 Global Overview of the Solution
As illustrated on Figure 1, the proposed framework is
composed of three basic building blocks each ensur-
ing an important functionality, namely, i) incremental
learning to adapt to the concept drift, ii) model cali-
bration to provide a good estimate of the model’s con-
fidence, and finally, iii) the explanation of the model
predictions (to verify and trust the predictions). Of
course, this solution is specially designed for constrained environments with limited computational resources.
When deployed in a resource-constrained environ-
ment and when new data becomes available, this sys-
tem makes predictions by combining the votes from
the base classifiers within the ensemble model to de-
termine both the target value and the level of con-
fidence. After this prediction phase, an explanation
is generated. This explanation, when coupled with
the predicted class label and the machine learning
model’s confidence level, enhances trust in the predic-
tion and aids the expert in assigning the most precise
label. In the following, we focus on each building block.
3.2 Model Offline / Adaptive Learning
To meet the constraints of limited resources, we chose
an incremental learning scheme based on windowing
ensemble models composed of a few lightweight base classifiers, as they offer a good compromise between adaptation to concept drift and low resource consumption.
First, we initiate the training of the windowing
ensemble model during an offline phase, utilizing a
training dataset and ensuring that all its base classifiers are initially trained on the same data at this stage.
Since the combination of initial training data and
the incoming data stream is effectively endless, mak-
ing online approaches inefficient in terms of time and
computational resources, we opt for the use of a win-
dow to store the most recent incoming data until the
next training iteration. The window size can be dy-
namic or fixed in advance depending on the nature of
the data. Pre-selecting the data to be saved in this
window can also be adopted to limit recurring occur-
rences that would not have much impact on retrain-
ing. Finally, the incremental and adaptive re-training
is exclusively applied on the least efficient base clas-
sifier during each iteration to effectively handle po-
tential concept drift in a less resource-intensive way,
all while preserving the accumulated knowledge to
maintain the continuous performance of the ensemble
model. To further meet the challenge of limited re-
sources, this incremental re-training can be triggered
in two ways. The first is by setting up a concept drift
detector based on either the data contained in the win-
dow, or the model’s performance, or a combination of
both. The second is to perform this re-training peri-
odically when the length of the window storing new
data reaches its threshold.
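The following minimal sketch (hypothetical class and method names, assuming scikit-learn-style base classifiers) summarizes the prediction-by-voting and periodic weakest-learner retraining just described:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class WindowingEnsemble:
    """Sketch of the framework's update loop: a fixed-size window stores
    recent labeled samples; when full, only the weakest base classifier
    is retrained, preserving the knowledge of the others."""

    def __init__(self, n_trees=3, window_size=2000):
        self.models = [DecisionTreeClassifier(max_depth=10)
                       for _ in range(n_trees)]
        self.window_X, self.window_y = [], []
        self.window_size = window_size

    def fit_offline(self, X, y):
        for m in self.models:     # all base classifiers start from the same data
            m.fit(X, y)

    def predict_with_confidence(self, x):
        votes = [m.predict([x])[0] for m in self.models]
        label = max(set(votes), key=votes.count)
        confidence = votes.count(label) / len(votes)   # vote agreement
        return label, confidence

    def observe(self, x, y):
        """Store a labeled sample; retrain the weakest learner when the
        window is full (periodic trigger; a drift detector could be used
        instead)."""
        self.window_X.append(x)
        self.window_y.append(y)
        if len(self.window_X) >= self.window_size:
            X, y = np.array(self.window_X), np.array(self.window_y)
            accs = [m.score(X, y) for m in self.models]
            weakest = int(np.argmin(accs))
            self.models[weakest].fit(X, y)   # others keep their knowledge
            self.window_X, self.window_y = [], []
```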
3.3 Lightweight Prediction Explanation
To ensure more efficient and lightweight XAI sys-
tems, we propose a shift from traditional, resource-
Figure 1: Concept drift adaptation proposed framework. (Diagram blocks: initial training data → offline learning → model; streaming data → storage window → adaptive learning; model → online prediction and XAI; outputs to the user: target value, calibration rate, explanation; the user provides labeling.)
intensive XAI models to the use of lightweight ML models (i.e., models with low complexity, which determines model size, and low prediction time), such as linear regressors and decision trees. Our fundamental idea is to reduce the computational overhead linked to generating explanations by replacing the explainer with regressions using ML predictive models, which are much lighter in terms of prediction time and memory consumption. Fig. 2 illustrates how we generate lightweight explanations.
Instead of using an explainer E that is overly resource-greedy, such as SHAP, to explain a prediction y for a data sample x, we propose to replace (more precisely, to approximate) the explainer E by a set of regression models $f_{x_1}, \dots, f_{x_n}$, where each model $f_{x_i}$ tries to predict $E(x_i)$, the feature attribution computed by E for the attribute $x_i$. The advantage of doing so is to learn these models offline (in an environment that is not resource-constrained) and then deploy them in the constrained environment. These models, which are supposed to approximate the explainer E, are of light size and, above all, ensure feature attribution with a minimum of resources. Setting up this solution requires two steps (a sketch follows below).

Build an explanation dataset: to approximate an explainer E which provides, for a data sample $x = (x_1, \dots, x_n)$ and a prediction y made by the model to be explained, a vector of feature attribution scores $(f_{x_1}, \dots, f_{x_n})$, where each score $f_{x_i}$ tells how much the feature $X_i$ was influential in the prediction y for x, we build a dataset composed of data samples x, their predictions y by the model to be explained, and the feature attribution vectors computed by the explainer E, as illustrated in Table 1. This dataset can easily be built by taking the dataset D used to train the model to be explained, the model's predictions on D, and the explanations provided by the chosen explainer E for each data sample in D.

Build regressors to provide feature attributions: once the explanation dataset has been built, we can train the regression models (or only one model with several outputs, in the case of neural network-based regression). If we build one model per feature $f_{x_i}$, then we train the regression model on the dataset D and the corresponding explanation column $f_{x_i}$ only.
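A minimal sketch of these two steps is given below, assuming the shap, lightgbm and scikit-learn packages on toy data (exact SHAP output shapes vary across package versions, hence the normalization step):

```python
import numpy as np
import shap
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

# Model to be explained (LightGBM, as in the experiments) on toy data.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
model = LGBMClassifier(n_estimators=50).fit(X, y)

# Step 1 -- build the explanation dataset: SHAP attributions per sample.
explainer = shap.TreeExplainer(model)
attr = explainer.shap_values(X)
if isinstance(attr, list):        # some shap versions return one array per class
    attr = attr[1]
attr = np.asarray(attr)           # shape: (n_samples, n_features)

# Step 2 -- train one lightweight regressor per feature attribution column.
lightweight_explainer = [
    DecisionTreeRegressor(max_depth=8).fit(X, attr[:, i])
    for i in range(X.shape[1])
]

def explain(x):
    """Cheap feature attributions: one regressor prediction per feature."""
    return np.array([reg.predict([x])[0] for reg in lightweight_explainer])

print(explain(X[0]))              # approximate SHAP vector for one sample
```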
4 EXPERIMENTAL STUDY
4.1 Experimental Settings
To assess the effectiveness of the proposed frame-
work, four public IoT datasets related to anomaly
detection and human activity recognition in a smart
home environment facing concept drifts, are used:
NSL-KDD, CICIDS2017, IoTID20 and Aruba.
NSL-KDD is a balanced benchmark dataset for
concept drift and network intrusion detection (Liu
et al., 2020). For this dataset, it is known that there
is a sudden drift when transitioning from the train-
ing set to the test set (Yang and Shami, 2021).
Figure 2: Lightweight explanation proposed approach. (Diagram blocks: the training data D and the classifier f (class label $y_i$) feed the explainer E, whose attribution $f_{x_i}$ serves, together with the training input $(x_i, y_i)$, as the expected output for training the lightweight explainer.)

Table 1: Illustration of the explanations dataset.

Input features X                    Prediction   Feature attributions by an explainer E
X_1    X_2    ...   X_n             Y            f_{X_1}   f_{X_2}   ...   f_{X_n}
12     "B"    ...   "SF"            1            .34       .001      ...   .12
8      "C"    ...   "UG"            0            .05       .21       ...   .02
55     "A"    ...   "PS"            0            .07       .31       ...   0
...    ...    ...   ...             ...          ...       ...       ...   ...
100    "B"    ...   "SF"            1            .02       0         ...   .05

CICIDS2017 is a dataset provided by the Canadian Institute for Cybersecurity (CIC), involving cyber-attack scenarios. As different types of attacks were launched in different time periods to create this dataset, the attack patterns in the dataset change over time, causing multiple concept drifts (Sharafaldin et al., 2018).
IoTID20 is an IoT traffic dataset with unbalanced
data samples (94% normal samples versus 6% ab-
normal samples) for abnormal IoT device detec-
tion (Ullah and Mahmoud, 2020).
Aruba is a dataset collected within the CASAS
smart home project (Cook, 2012). This dataset
collected different data sources in the home of an adult volunteer. The resident of the house was a woman who regularly received visits from her children and grandchildren between 4 November 2010 and 11 June 2011. Two data sources gave rise to the information: the first, binary, was made up of movement and contact sensors; the second was made up of temperature sensors.
For the purposes of this work, a representative IoTID20 subset with 6,252 records and a sampled CICIDS2017 subset with 14,155 records, as well as a reduced Aruba dataset of 200,784 records combining the first and last months of the experiment, are used for the model evaluation.
These datasets allow us to see how our frame-
work performs in binary classification for anomaly
detection and in multi-class classification for activity
recognition. They also allow us to observe how it per-
forms in dealing with class imbalance, which is quite
prevalent in the IoTID20 and CICIDS2017 datasets.
Two validation methods, namely hold-out and pre-
quential, are utilized for evaluation. In the hold-out
evaluation, the initial 20% of the data is utilized for
the initial model training, while the remaining 80%
is reserved for online testing. In prequential valida-
tion, also known as test-and-train validation, each in-
put sample from the online test set serves a dual pur-
pose: first, it tests the learning model, and then it con-
tributes to model training and updating (Yang et al.,
2021).
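A prequential loop amounts to a few lines of code; the sketch below (reusing the hypothetical WindowingEnsemble sketched in Section 3.2) makes the test-then-train order explicit:

```python
def prequential_evaluation(model, stream):
    """Test-then-train: each sample first evaluates the model, then
    updates it, so the accuracy reflects performance over time."""
    correct = total = 0
    for x, y in stream:
        y_hat, _ = model.predict_with_confidence(x)   # 1) test
        correct += int(y_hat == y)
        total += 1
        model.observe(x, y)                           # 2) then train/update
    return correct / max(total, 1)
```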
The windowing ensemble model tested is com-
posed of three decision trees as base classifiers. The
model is updated in an incremental way by retrain-
ing the least efficient base classifier using windows of
length 2000, 50, 500 and 10000 records for the NSL-
KDD, CICIDS2017, IoTID20 and Aruba datasets, re-
spectively.
The evaluation of the proposed framework's performance relies on different metrics linked to prediction quality, model calibration, and lightness in terms of total time and memory used during learning (accuracy, F1-score, NLL, ECE, MCE, total training time, as well as occupied and peak memory).
4.2 Lightweight Adaptive Learning
Tables 2, 3, 4, 5 show the performance comparison
of the proposed framework with other state-of-the-
art drift adaptive approaches, including ARF (Gomes
et al., 2017), SRP (Gomes et al., 2019), EFDT (Man-
apragada et al., 2018), KNN-ADWIN (Losing et al.,
2016), OPA (Crammer et al., 2006), LB (Bifet et al.,
2010) and MSANA (Yang and Shami, 2023). We can
see that the ensemble model used without re-training
(Baseline) was impacted by the various concept drifts
contained in the four datasets tested, which explains
its poorer performance compared to the online mod-
els. However, after the adaptive training phase (L-
Ens), the model’s performance improved on all tested
datasets. The obtained results are better than those
achieved with state-of-the-art methods on NSL-KDD
and IoTID20 and almost similar on CICIDS2017 and
Aruba. Figures 3 and 4 illustrate the continuous varia-
tion in accuracy of all the methods tested, and we can
visually confirm that our approach adapts very well
after the onset of concept drifts, which is not the case
with the baseline model used without retraining. Fur-
thermore, according to the calibration measurements
(NLL, ECE, MCE) in Tables 2, 3, 4, 5, we can say that
with the online training performed by the state-of-the-
art methods, the calibration deteriorated over time,
which is not the case with our ensemble method even
after periodic re-training. Regarding the lightweight
nature of the proposed framework, our approach re-
quired much less training time and used less memory
than all the other state-of-the-art methods tested. This demonstrates its effectiveness and good adaptation to concept drift, while meeting the lightweight challenge.
4.3 Lightweight Explanation
Our proposal, although preliminary, has been tested on the four datasets presented above, using, first, LightGBM (Jin et al., 2020) as the main classifier f, a fast and powerful ML model based on an ensemble of several decision trees; then, the SHAP TreeExplainer (Lundberg et al., 2020) as the reference explanation model; and finally, a decision tree model as the regressor generating the lightweight explanations.
Table 6 shows the results of our preliminary approach (L-exp) compared with those obtained using SHAP. The evaluation measures used relate, on the one hand, to the quality of the generated explanations and, on the other hand, to the time and memory occupied when explaining a series of 1000 instances. The quality of the explanations is analyzed at two levels. The first is the reconstruction error (mse), describing the gap between the explanations generated by our approach (through regression) and the true explanations (computed by SHAP on the test instances). The second level concerns the attributed features: for a given explanation, we measure, among the top 2, 5 and 10 features, the rate of features shared with the SHAP explanation, the rate of cases where the feature sets are identical, and the rate where they also appear in the same order. From the results obtained, whether on the top 2, 5 or 10 features out of the 41, 77, 31 and 38 total features of the NSL-KDD, CICIDS2017, IoTID20 and Aruba datasets respectively, we note that the explanations generated by our approach are very close to the SHAP explanations, while meeting the lightweight criterion. Indeed, our approach (L-exp) takes much less time and memory than SHAP, especially on the multi-class datasets, where SHAP consumes far more memory than our framework.
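For reproducibility, the sketch below (our own illustration) shows how the per-explanation quality measures just described can be computed from a SHAP attribution vector and its lightweight approximation:

```python
import numpy as np

def topk_agreement(shap_attr, lexp_attr, k):
    """Compare one SHAP attribution vector with its lightweight
    approximation on the top-k features (ranked by |attribution|)."""
    top_s = np.argsort(-np.abs(shap_attr))[:k]
    top_l = np.argsort(-np.abs(lexp_attr))[:k]
    shared = len(set(top_s) & set(top_l)) / k          # shared-feature rate
    same_set = float(set(top_s) == set(top_l))         # identical feature set
    same_order = float((top_s == top_l).all())         # identical ranking
    return shared, same_set, same_order

def mse(a, b):
    """Reconstruction error between true and approximated attributions."""
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))
```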
4.4 Impact of Incremental Learning on
Model Calibration and
Explainability
It is also important to highlight the impact of concept
drift on the explanations generated, as illustrated in
Figures 5 and 6 for the NSL-KDD and CICIDS2017
datasets respectively. Indeed, at the concept drift detection points we observe an increase in the mse error related to the quality of the generated explanations. Furthermore, based on the calibration
measurements (NLL, ECE, MCE) in Tables 2, 3, 4
and 5, it can be observed that the online training con-
ducted by state-of-the-art methods led to a degrada-
tion in their calibration over time. Consequently, this puts into perspective the need to adapt, over time, not only the predictive performance of the classifiers but also their calibration and the quality of the explanations produced by the techniques currently in use.
5 DISCUSSION AND
CONCLUDING REMARKS
Through this work, we have highlighted the importance of taking into account the occurrence of concept drift and its impact on ML model performance.
Table 2: Performance comparison of our approach and state-of-the-art methods on NSL-KDD.
Model          Acc%    F1%     NLL    ECE%   MCE%   Training time (s)   Occupied mem (Kb)   Peak mem (Kb)
ARF-ADWIN 97.45 97.36 0.08 1.92 21.95 303.51 4977.42 8641.43
ARF-EDDM 98.05 97.99 0.07 1.14 18.34 465.56 10462.17 23007.39
SRP 97.93 97.86 0.08 1.41 22.62 1869.27 11779.89 19309.99
EFDT 90.50 89.96 1.13 3.27 12.27 6097.24 3399.01 4454.46
KNN-ADWIN 91.93 91.43 0.48 5.09 23.62 45.55 137.82 314.48
OPA 90.47 90.17 0.24 0.52 2.56 19.24 91.87 145.67
LB 97.53 97.44 0.09 0.57 12.70 1400.55 24131.05 24167.58
MSANA 92.90 92.49 0.17 2.13 6.34 2313.64 22522.52 74407.18
Baseline 95.99 95.75 1.43 0.01 3.98 0.78 57.88 6685.43
L-Ens 98.17 98.10 0.28 0.02 15.08 2.04 60.52 6708.88
Table 3: Performance comparison of our approach and state-of-the-art methods on CICIDS2017.
Model          Acc%    F1%     NLL    ECE%   MCE%   Training time (s)   Occupied mem (Kb)   Peak mem (Kb)
ARF-ADWIN 98.55 93.58 0.08 0.59 18.71 33.46 943.91 1869.14
ARF-EDDM 98.73 94.34 0.08 0.74 15.12 33.08 596.64 971.64
SRP 98.98 95.53 0.10 0.58 13.80 272.39 2125.76 7027.66
EFDT 95.01 80.33 1.13 0.99 11.72 332.31 1244.13 1532.5
KNN-ADWIN 98.77 94.72 0.16 0.68 9.30 12.95 138.86 383.19
OPA 98.26 92.43 0.11 6.05 16.81 4.92 43.70 147.05
LB 98.13 91.86 0.14 0.45 15.40 397.88 2460.46 5667.54
MSANA 98.96 95.40 0.07 2.87 15.86 562.71 2519.62 12394.82
Baseline 86.58 0.13 4.84 0.00 13.42 0.11 12.60 1052.09
L-Ens 97.27 87.74 0.38 0.00 1.37 2.43 77.38 1067.70
Table 4: Performance comparison of our approach and state-of-the-art methods on IoTID20.
Model          Acc%    F1%     NLL    ECE%   MCE%   Training time (s)   Occupied mem (Kb)   Peak mem (Kb)
ARF-ADWIN 98.26 99.08 0.09 1.44 21.17 13.78 1053.30 1369.63
ARF-EDDM 98.00 98.95 0.10 1.65 26.32 14.86 1691.27 1702.45
SRP 98.72 99.32 0.10 1.37 19.15 64.27 3002.57 3075.86
EFDT 96.02 97.92 0.40 0.94 1.39 102.42 566.49 772.45
KNN-ADWIN 95.92 97.85 0.30 3.45 25.37 2.88 83.90 243.33
OPA 93.74 96.69 0.19 4.32 11.40 1.00 75.96 115.20
LB 98.06 98.98 0.09 0.74 9.21 95.72 2482.13 2553.32
MSANA 98.58 99.25 0.06 2.12 10.49 323.75 5008.29 6888.13
Baseline 99.26 99.61 0.27 0.00 0.74 0.05 9.14 247.97
L-Ens 99.26 99.61 0.11 0.00 0.19 0.20 38.50 249.79
To address the challenges posed by this concept drift, we explored the benefits of adaptive learn-
ing in resource-constrained environments, while as-
sessing its impact on model performance and calibra-
tion, as well as the quality of explanations provided
over time. Our proposed approach is based on the use
of a lightweight windowing ensemble model that is
incrementally updated. It also includes preliminary
work related to the generation of lightweight expla-
nations. The results obtained on binary and multi-
class datasets demonstrate its effectiveness over time,
while maintaining very low resource costs. These re-
sults also raise questions about the quality of calibra-
tion after this concept drift adaptation stage, and how
to generate high-quality, adaptive explanations over
time. As a future direction, the exploration of uncer-
tainty and reliability in incremental AI facing concept
drift involves enhancing calibration and the quality
of explainability approaches while considering vari-
ous classifiers and XAI methods. This exploration
Table 5: Performance comparison of our approach and state-of-the-art methods on Aruba.
Model          Acc%    F1%     NLL    ECE%   MCE%   Training time (s)   Occupied mem (Kb)   Peak mem (Kb)
ARF-ADWIN 96.10 82.47 0.22 2.45 24.48 305.86 453.07 1687.51
ARF-EDDM 96.51 83.82 0.19 3.25 24.22 316.47 831.43 1171.09
SRP 97.51 89.77 0.16 2.13 25.30 2057.65 3583.76 9093.03
EFDT 94.33 73.17 0.78 2.93 15.56 10526.64 6333.47 6501.39
KNN-ADWIN 98.06 93.77 0.15 1.64 58.47 85.15 93.86 268.95
OPA 2.12 4.01 35.26 97.26 98.59 31.52 72.71 134.77
LB 96.24 87.69 0.34 1.45 19.95 1791.96 1184.22 2547.22
Baseline 80.35 50.81 7.08 0.00 19.65 1.92 37.36 15241.17
L-Ens 89.77 60.00 2.27 0.00 5.69 3.61 52.81 15242.80
Figure 3: Accuracy variation on NSL-KDD.
Figure 4: Accuracy variation on CICIDS2017.
Table 6: Evaluation of the quality of generated lightweight explanations.
Dataset      Mse        Shared features (%)     Same set (%)            Same order (%)          Time (s)       Memory (Kb)
                        @2     @5     @10       @2     @5     @10       @2     @5     @10       SHAP   L-exp   SHAP     L-exp
NSL-KDD      0.0004     95.59  96.34  97.31     91.84  85.57  81.35     97.51  92.33  91.74     0.21   0.002   339.71   322.41
IoTID20      0.04       99.04  99.33  97.60     98.16  96.90  83.01     99.04  96.93  94.51     5.62   0.002   276.46   243.85
CICIDS2017   9.55e-05   97.59  98.38  98.59     96.23  93.06  87.90     99.97  93.27  95.25     0.086  0.002   619.82   603.19
Aruba        84.19      93.68  93.67  95.05     90.37  80.30  75.37     98.27  92.63  87.56     8.06   0.002   3406.64  299.29
should also encompass the assessment of the influ-
ence of data pre-processing and balancing on a con-
tinual basis.
Figure 5: Impact of concept drift on the mse error associated with NSL-KDD explanations.
Figure 6: Impact of concept drift on the mse error associated with CICIDS2017 explanations.
ACKNOWLEDGEMENTS
This work has been supported by the Vivah project
’Vers une Intelligence artificielle à VisAge Humain’
supported by the ANR.
REFERENCES
Agrahari, S. and Singh, A. K. (2022). Concept drift de-
tection in data stream mining : A literature review.
Journal of King Saud University - Computer and In-
formation Sciences, 34(10, Part B):9523–9540.
Ashukha, A., Lyzhov, A., Molchanov, D., and Vetrov, D. P.
(2020). Pitfalls of in-domain uncertainty estimation
and ensembling in deep learning. In 8th International
Conference on Learning Representations, ICLR 2020,
Addis Ababa, Ethiopia, April 26-30, 2020. OpenRe-
view.net.
Baena-García, M., Campo-Ávila, J., Fidalgo-Merino, R., Bifet, A., Gavaldà, R., and Morales-Bueno, R. (2006). Early drift detection method.
Bifet, A. (2009). Adaptive learning and mining for data
streams and frequent patterns. SIGKDD Explor.,
11:55–56.
Bifet, A., Holmes, G., and Pfahringer, B. (2010). Leverag-
ing bagging for evolving data streams. In Balcázar,
J. L., Bonchi, F., Gionis, A., and Sebag, M., ed-
itors, Machine Learning and Knowledge Discovery
in Databases, pages 135–150, Berlin, Heidelberg.
Springer Berlin Heidelberg.
Cook, A. A., Mısırlı, G., and Fan, Z. (2020). Anomaly de-
tection for iot time-series data: A survey. IEEE Inter-
net of Things Journal, 7(7):6481–6494.
Cook, D. (2012). Learning setting-generalized activity
models for smart spaces. IEEE Intelligent Systems,
27(1):32–38.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S.,
and Singer, Y. (2006). Online passive-aggressive al-
gorithms. Journal of Machine Learning Research,
7:551–585.
Gomes, H., Read, J., and Bifet, A. (2019). Streaming ran-
dom patches for evolving data stream classification.
In 2019 IEEE International Conference on Data Min-
ing (ICDM), pages 240–249, Los Alamitos, CA, USA.
IEEE Computer Society.
Gomes, H. M., Bifet, A., Read, J., Barddal, J. P., Enembreck, F., Pfahringer, B., Holmes, G., and Ab-
dessalem, T. (2017). Adaptive random forests for
evolving data stream classification. Machine Learn-
ing, 106(9):1469–1495.
Jin, D., Lu, Y., Qin, J., Cheng, Z., and Mao, Z. (2020).
Swiftids: Real-time intrusion detection system based
on lightgbm and parallel intrusion detection mecha-
nism. Computers & Security, 97:101984.
Liu, J., Kantarci, B., and Adams, C. (2020). Machine learning-driven intrusion detection for contiki-ng-based iot networks exposed to nsl-kdd dataset. In
Proceedings of the 2nd ACM Workshop on Wireless
Security and Machine Learning, WiseML ’20, page
25–30, New York, NY, USA. Association for Com-
puting Machinery.
Losing, V., Hammer, B., and Wersing, H. (2016). Knn clas-
sifier with self adjusting memory for heterogeneous
concept drift. In 2016 IEEE 16th International Con-
ference on Data Mining (ICDM), pages 291–300.
Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., and Zhang,
G. (2019). Learning under concept drift: A review.
IEEE Transactions on Knowledge and Data Engineer-
ing, 31(12):2346–2363.
Lundberg, S. M., Erion, G., Chen, H., DeGrave, A., Prutkin,
J. M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N.,
and Lee, S.-I. (2020). From local explanations to
global understanding with explainable ai for trees. Na-
ture Machine Intelligence, 2(1):56–67.
Lundberg, S. M. and Lee, S.-I. (2017). A unified approach
to interpreting model predictions. In Proceedings of
the 31st International Conference on Neural Informa-
tion Processing Systems, NIPS’17, page 4768–4777,
Red Hook, NY, USA. Curran Associates Inc.
Manapragada, C., Webb, G. I., and Salehi, M. (2018). Ex-
tremely fast decision tree. In Proceedings of the 24th
ACM SIGKDD International Conference on Knowl-
edge Discovery & Data Mining, KDD ’18, page
1953–1962, New York, NY, USA. Association for
Computing Machinery.
Naeini, M. P., Cooper, G. F., and Hauskrecht, M. (2015).
Obtaining well calibrated probabilities using bayesian
binning. In Proceedings of the Twenty-Ninth AAAI
Conference on Artificial Intelligence, AAAI’15, pages
2901–2907. AAAI Press.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). ”Why should I trust you?”: Explaining the predictions of any
classifier. In Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining, KDD ’16, pages 1135–1144, New
York, NY, USA. Association for Computing Machin-
ery.
Sharafaldin, I., Lashkari, A. H., and Ghorbani, A. A.
(2018). Toward generating a new intrusion detection
dataset and intrusion traffic characterization. In Inter-
national Conference on Information Systems Security
and Privacy.
Ullah, I. and Mahmoud, Q. H. (2020). A scheme for gen-
erating a dataset for anomalous activity detection in
iot networks. In Advances in Artificial Intelligence:
33rd Canadian Conference on Artificial Intelligence,
Canadian AI 2020, Ottawa, ON, Canada, May 13–15,
2020, Proceedings, page 508–520, Berlin, Heidel-
berg. Springer-Verlag.
Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten,
F., Roll, J., and Schön, T. (2019). Evaluating model
calibration in classification. In Chaudhuri, K. and
Sugiyama, M., editors, Proceedings of the Twenty-
Second International Conference on Artificial Intelli-
gence and Statistics, volume 89 of Proceedings of Ma-
chine Learning Research, pages 3459–3467. PMLR.
Van den Broeck, G., Lykov, A., Schleich, M., and Suciu, D.
(2021). On the tractability of SHAP explanations. In
Proceedings of the 35th AAAI Conference on Artificial
Intelligence.
Yang, L., Manias, D. M., and Shami, A. (2021). Pwpae: An
ensemble framework for concept drift adaptation in iot
data streams. In 2021 IEEE Global Communications
Conference (GLOBECOM), pages 01–06.
Yang, L. and Shami, A. (2021). A lightweight concept
drift detection and adaptation framework for iot data
streams. IEEE Internet of Things Magazine, 4:96–
101.
Yang, L. and Shami, A. (2023). A multi-stage automated
online network data stream analytics framework for
IIoT systems. IEEE Transactions on Industrial Infor-
matics, 19(2):2107–2116.