A Self-Configuration Controller To Detect, Identify, and Recover
Misconfiguration at IoT Edge Devices and Containerized Cluster System
Areeg Samir (https://orcid.org/0000-0003-4728-447X) and Håvard Dagenborg (https://orcid.org/0000-0002-1637-7262)
Department of Computer Science, UiT The Arctic University of Norway, Norway
Keywords:
Markov Decision Process, Hierarchical Hidden Markov Model, Self-Configuration, Detection, Recovery,
Misconfiguration, Performance, Threat, Edge Computing, Cluster, Containers, and Medical Devices.
Abstract:
Misconfiguration of IoT edge devices and containerized backend components can lead to various complica-
tions like performance degradation, non-compliant data flows, and external vulnerabilities. In this paper, we
propose a self-configurable cluster controller that uses the hierarchical hidden Markov model to detect, iden-
tify, and recover from misconfiguration at the container and network communication level. Our experimental
evaluations show that our controller can reduce the effects of misconfiguration and improve system performance and reliability.
1 INTRODUCTION
Misconfiguration in edge devices and core backend
components can impact the workload and how data
flows within and across systems and services. This
might lead to ineffective resource utilization and per-
formance degradation or, more alarmingly, enable
unauthorized access. For critical systems like the ones
used in healthcare, avoiding misconfigurations is par-
ticularly important. For example, at the edge level,
a misconfiguration in the Conexus telemetry proto-
col (e.g., CVE-2019-6538) could affect the scores
of Medtronic and PolySomnoGraphy heart-rate mon-
itors and allow unauthorized changes to a patient’s
device. At the cluster level, misconfiguration (e.g.,
CVE-2019-5736, CVE-2022-0811) might occur due
to network rules, root/less privileged user access,
wrong pod label specifications, or forgetting to en-
force network policies after writing them. To address
these challenges in a shared multi-clustered environ-
ment, like those often found in healthcare, systems
need to manage the workload and the flow of infor-
mation from edge devices to the system and within
the system clusters.
Several recent works have looked at workload and
information-flow management (Moothedath et al.,
2020), (Kraus et al., 2021), (Sklavos et al., 2017),
(Luo et al., 2020), (Mäkitalo et al., 2018), (Guo et al.,
2019). However, more work is needed on correlating
the misconfiguration of medical edge devices and system clusters with the observed performance degradation, and identifying its root causes to optimize the system's information flow and performance.
In this paper, we propose a self-configurable con-
troller that detects, identifies, and recovers from mis-
configurations and limits their impact on the work-
load and faulty flow of information of edge devices
and container-based clusters. The proposed controller
is in accordance with the common misconfigurations
of Azure, Docker, and Kubernetes security reported
in 2022 by CVE, the National Institute of Standards
and Technology (NIST) SP 800-190, the OWASP Container Security Verification Standard¹, the OWASP Kubernetes Security Testing Guide², and OWASP A05:2021 Security Misconfiguration³. Our pro-
posed controller is based on Hierarchical Hidden
Markov Models (HHMMs) (Fine, 1998) and Markov
Decision Processes (MDPs) (Derman, 1970), which
are useful for modeling a wide range of security and
optimization problems. We used HHMMs to char-
acterize the dependency of misconfiguration in a hi-
erarchical structure by mapping the observed perfor-
mance anomalies to hidden resources and identifying
the root causes of the observed anomalies to improve
reliability. We chose MDP to determine the optimal
¹ https://owasp.org/www-project-container-security-verification-standard/
² https://owasp.org/www-project-kubernetes-security-testing-guide/
³ https://owasp.org/Top10/A05_2021-Security_Misconfiguration/
Figure 1: An example of a multi-cluster architecture for edge-based fog/cloud computing typically found in the healthcare
industry.
recovery policy, as the performance of the running medical edge devices and the system's container-based clusters at a particular time instant is uncertain and the model is memoryless. The controller adopts the Monitor, Analyze, Plan, Execute, and Knowledge (MAPE-K) adaptive control loop to optimize control over the system under observation.
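For illustration, the following minimal Python sketch shows how the MAPE-K stages could be wired together in such a controller; the stage bodies, metric names, and thresholds are placeholder assumptions, not the controller's actual implementation.

```python
# Illustrative MAPE-K control loop skeleton; stage functions are stubs.
import time

class Knowledge:
    """Shared storage for metrics, detections, and recovery outcomes."""
    def __init__(self):
        self.metrics_history = []

def monitor(knowledge):
    # Collect CPU/memory/network metrics from the managed components (stubbed here).
    sample = {"cpu": 0.42, "memory": 0.55, "network": 0.10}
    knowledge.metrics_history.append(sample)
    return sample

def analyze(sample):
    # Detect a possible misconfiguration from the observed metrics (illustrative rule).
    anomalous = sample["cpu"] > 0.9 or sample["network"] > 0.9
    return {"anomalous": anomalous, "conf_type": "Conf_2" if anomalous else None}

def plan(analysis):
    # Choose a recovery action (Risk Treatment) for the identified misconfiguration.
    return "reconfigure_and_redeploy" if analysis["anomalous"] else None

def execute(action):
    if action:
        print(f"executing recovery action: {action}")

def mape_k_loop(iterations=3, period_s=1.0):
    knowledge = Knowledge()
    for _ in range(iterations):
        sample = monitor(knowledge)
        execute(plan(analyze(sample)))
        time.sleep(period_s)

if __name__ == "__main__":
    mape_k_loop()
```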
2 MAPPING FAILURES TO
FAULTS
To analyze and classify the various misconfigurations,
failures, and faults, we use the ISO-27001 risk man-
agement framework. As a use case for our discus-
sions, we consider the multi-cluster environment of
modern medical organizations consisting of various
IoT devices and containerized services organized in a
fog-like architecture, as illustrated in Figure 1. We
focused on the most common security issues and
technical concerns for IoT edge devices, especially
medical heart-rate devices and container-based clus-
ters reported in literature studies and in the indus-
try, like Azure, Kubernetes, and Docker. We targeted
the data transmission and the configuration connec-
tivity of IoT edge medical devices to/from the sys-
tem. We consider a failure as the inability of any
such system component to perform its functions in ac-
cordance with specified requirements, both functional
and non-functional (e.g., performance). Faults are
system properties that describe an exceptional condi-
tion occurring in the system operation that may cause
one or more failures (IEEE Standard Classification,
2009). For instance, if the application gateway in Fig-
ure 1 has a fault, it may redirect edge-device traffic to
the wrong healthcare system cluster node or be used
to stage malicious activities (Jin et al., 2022).
Threats are malicious actions or interactions with
the system or its environment that can result in a fault
and, thereby, possibly in a failure (ISO/IEC/IEEE
15026, 2019). Any abnormal flow of information oc-
curring during an execution of a component is consid-
ered a fault or anomaly. Such faults can occur due to
a misconfiguration. Such abnormal flows character-
ized stealthy threat strategies conditioned on the sys-
tem model, while normal flows defined active defense
strategies, which enabled misconfiguration detection.
Observed failures were analyzed along four dimensions: risk identification (observed behavior), risk assessment, risk treatment (recovery), and risk severity (in percentage), which relates to the system performance in terms of the objectives that were not met by the observed metrics and benchmarks. We used Key Performance Indicators (KPIs) to determine whether the monitoring metrics met the maintenance goals and the system's performance targets (e.g., resource utilization, latency, response time, network congestion, throughput). The higher the severity value, the more severe the impact on the system's performance.
Sudden Stop of Edge Device. Risk Identification: the edge device, after running successfully, stopped for a specific period (e.g., a minute). In such a case, the logs indicated that the device failed to connect to the IoT hub over AMQP and WebSocket, and the edge device exited. Risk Assessment: a host network misconfiguration prevented the IoT edge agent from reaching the network. The agent attempted to connect over AMQP (port 5671) and WebSockets (port 443), as the edge device runtime sets up a network for each module to communicate, using either a bridge network or NAT. Risk Treatment: protect the IoT Hub resources against attack by reconfiguring network-related resources (e.g., firewall configuration). Risk Severity: 70%. Risk Type: Conf_1.
Spike Traffic Received by System. Risk Identification: system services are not working properly, and their resources are excessively saturated. Risk Assessment: a Distributed Denial of Service (DDoS) attack prevents access to the system's network. Risk Treatment: secure the routing protocol and enforce policy on the traffic entering or leaving the system's default network namespace. Risk Severity: 98%-99%. Risk Type: Conf_2.
Connectivity Error. Risk Identification: IoT Edge modules that connect to cloud services, including the runtime modules, stop working and return network failures. Risk Assessment: IP packet forwarding is disabled, and modules connected to cloud services aren't working. Risk Treatment: enable IP packet forwarding, specify multiple DNS servers and a public DNS server in case the device cannot reach any of the specified IP addresses, and check port configuration rules. Restart the network service and the Docker service. Risk Severity: 99%. Risk Type: Conf_3.
Empty Configuration File. Risk Identification: the device has trouble starting modules defined in the deployment. Only the edge agent is running but continually reporting empty configuration files. Risk Assessment: the device may be having trouble with DNS server name resolution within the private network. Risk Treatment: specify the DNS server in the container engine settings, and restart the container. Risk Severity: 20%-30%. Risk Type: Conf_4.
Edge Hub Failure. Risk Identification: the edge Hub module fails to start. Risk Assessment: some process on the host machine has bound a port that the edge Hub module is trying to bind. The IoT Edge hub maps ports 443, 5671, and 8883 for use in gateway scenarios. The module fails to start if another process has already bound one of those ports. Risk Treatment: stop the process using the impacted ports or change the create options in the deployment file. Risk Severity: 20%-30%. Risk Type: Conf_5.
Unable to Access Module Image. Risk Identification: the edge agent logs show a 403 error as a container fails to run. Risk Assessment: the IoT edge agent doesn't have permission to access a module's image because registry credentials are incorrectly specified. Risk Treatment: reconfigure the credentials of the container registry. Risk Severity: 20%-30%. Risk Type: Conf_6.
Unable to Access Registry. Risk Identification: the edge device voltage increased and decreased, which generated errors as the attack gained root privileges. Risk Assessment: the IoT edge agent doesn't have permission to access a module's image, as registry credentials are incorrectly specified. Risk Treatment: reconfigure the credentials for accessing the registry. Specify the correct name of the registry. Risk Severity: 40%-70%. Risk Type: Conf_7.
Data Leakage. Risk Identification: sensitive medical data was leaked. Risk Assessment: deployment of highly sophisticated malware led to the theft of sensitive medical data and the compromise of sensitive medical information. Risk Treatment: reset weak passwords and fix misconfigured endpoints. Enhance data encryption. Risk Severity: 40%-70%. Risk Type: Conf_8.
Privilege Escalation Flaw. Risk Identification: sensitive medical data was leaked. Risk Assessment: a Docker Engine option (e.g., userns-remap) gives access to the remapped root and allows privilege escalation to the root level. Risk Treatment: configure user namespace remapping. Risk Severity: 80%-90%. Risk Type: Conf_9.
Privilege Escalation Flaw and Redeployment Fail. Risk Identification: sensitive medical data were leaked. Risk Assessment: an Azure function setting (e.g., SCM_RUN_FROM_PACKAGE) gave access to the remapped root and allowed privilege escalation to the root level. Risk Treatment: reconfigure the function (e.g., the SAS token) and redeploy. Risk Severity: 80%-90%. Risk Type: Conf_10.
3 SELF-CONFIGURATION
CONTROLLER
In this section, we describe our controller, which consists of (1) Monitoring, which collects performance data such as CPU, memory, and network metrics; (2) Detection and Identification, which detects a misconfiguration and identifies its type; and (3) Recovery, which selects the optimal recovery policy.
3.1 Monitor System Under Observation
We checked the normality of flow and workload for the components under observation by utilizing Spearman's rank correlation coefficient to estimate the correlation between the emitted observations (failures) and the amount of flow. To achieve that, we wrote an algorithm to be used as a general threshold to highlight the occurrence of abnormal flow in the managed components (for more details, see (Samir and Pahl, 2019)). The controller checked the configuration settings against the benchmarks of Azure Security, CIS Docker, and CIS Kubernetes. For any mismatch between the settings and the requirements of secure deployment of a component, the controller reevaluates the deployment of the impacted component, applies the required configuration, and redeploys the component. Otherwise, the controller proceeds with the deployment.
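A minimal sketch of this monitoring check is given below, assuming SciPy is available; the threshold of 0.7 and the sample data are illustrative assumptions rather than the algorithm of (Samir and Pahl, 2019).

```python
# Illustrative check: Spearman's rank correlation between emitted failure
# observations and flow volume, compared against a general threshold.
from scipy.stats import spearmanr

def flow_is_abnormal(failure_counts, flow_amounts, threshold=0.7):
    """Flag abnormal flow when failures correlate strongly with flow volume."""
    rho, _pvalue = spearmanr(failure_counts, flow_amounts)
    return abs(rho) >= threshold, rho

failures = [0, 1, 1, 3, 5, 8]        # failures observed per interval (illustrative)
flows    = [10, 12, 15, 30, 55, 90]  # flow volume per interval (illustrative)
abnormal, rho = flow_is_abnormal(failures, flows)
print(f"rho={rho:.2f}, abnormal={abnormal}")
```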
3.2 Misconfiguration Detection and
Identification
The Hierarchical Hidden Markov Model (HHMM)
is a generalization of the Hidden Markov Model
(HMM), where the states that are hidden from the ob-
server might emit visible observation sequences that
constitute observation space.
The system under observation has a hierarchical structure: it consists of one or more clusters Cl^{j=1} (root states), each of which has internal states N_i^{j=2} (nodes), with horizontal transitions indexed by i (states at the same level) and vertical transitions indexed by j (sub-states). The sub-states C_i^{j+1} represent containers (e.g., C_1^3 at vertical level 3 and horizontal position 1), and the production states S_i^j represent services that emit the observation space OS_n = {L_CPU, L_Memory, ..., L_Responsetime}, where, e.g., L_CPU is associated with saturation of the CPU resource. The state space SP is mapped to the cluster space ClS, which consists of a set of nodes N, containers C, and services S. The edge direction indicates the information flow between states.
The model vertically calls one of its sub-states N_1^2 = {C_1^3, C_2^3}, N_2^2, N_3^2 = {C_3^3, C_4^3} with vertical transition χ and vertical index j (superscript), where j = {1, 2, 3}. Since node N_1^2 is an abstract state, it enters its child HMM sub-states, the containers C_1^3 and C_2^3. (C_1^3, C_2^3, C_3^3), and C_4^3 have the deployed services S_1^4, S_2^4, S_3^4, and S_4^4, respectively. Since (S_1^4, ..., S_4^4) are production states, they emit observations and may make horizontal transitions with horizontal index i (subscript), where i = {1, 2, 3, 4}, from S_1^4 to S_4^4. Once there is no other transition, S_2^4 transits to the end state S_eCid^4, which ends the transitions for this sub-state, to return control to the calling state C_2^3. Once control returns to state C_2^3, it makes a horizontal transition (if one exists) to state C_1^3. Once the horizontal transitions finish, the transition goes to the end state C_eNid^3 to make a vertical transition to state N_1^2. Once all the transitions under this node are completed, control returns to N_3^2. The observation O_n refers to the failure observation sequence (high CPU utilization, slow network traffic, and slow response time), which might reflect workload fluctuation. This fluctuation is associated with a probability that reflects the state transition status from AF (Abnormal Flow) to NL (Normal Flow) at a failure rate, which indicates the number of failures for an S, C, N, or Cl over a period of time. We used the Baum-Welch algorithm to train the model by calculating the probabilities of the model parameters. The algorithm's output is used by the Viterbi algorithm to find the abnormal information-flow path of the detected states under misconfiguration. Once all the recursive transitions are finished, we obtained a hierarchy of the abnormal flow path AF_seq = {Cl, N_2^2, N_3^2, C_3^3, S_3^4} that is affected by the component under misconfiguration (N_2^2).
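For illustration, the sketch below decodes the most likely hidden path with the Viterbi algorithm over a flattened two-state view of the model (AF vs. NL); the transition and emission probabilities stand in for values that Baum-Welch training would produce and are purely illustrative.

```python
# Illustrative Viterbi decoding of the most likely hidden (AF/NL) path.
import numpy as np

states = ["NL", "AF"]
obs_symbols = {"normal_latency": 0, "high_cpu": 1, "slow_response": 2}

start_p = np.array([0.8, 0.2])          # P(state at t=0), illustrative
trans_p = np.array([[0.9, 0.1],         # P(next state | NL)
                    [0.3, 0.7]])        # P(next state | AF)
emit_p = np.array([[0.7, 0.2, 0.1],     # P(observation | NL)
                   [0.1, 0.5, 0.4]])    # P(observation | AF)

def viterbi(observations):
    """Return the most likely hidden state path for an observation sequence."""
    T, N = len(observations), len(states)
    delta = np.zeros((T, N))
    backptr = np.zeros((T, N), dtype=int)
    delta[0] = start_p * emit_p[:, observations[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * trans_p[:, j]
            backptr[t, j] = np.argmax(scores)
            delta[t, j] = scores[backptr[t, j]] * emit_p[j, observations[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.insert(0, backptr[t, path[0]])
    return [states[s] for s in path]

seq = [obs_symbols[o] for o in ["normal_latency", "high_cpu", "slow_response"]]
print(viterbi(seq))  # most likely hidden path for the observed failures
```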
The model's output is used to identify the type of misconfiguration (hidden states) given the observations (observed failures) emitted from the system components. The observations are compared with the risk severity to infer the misconfiguration type. For each misconfiguration, we initialized the model state parameters Conf_ij and observation parameters F_{1,...,T} through a misconfiguration state graph of length ConfLeng and observations of length T. The probability of occurrence of Conf_ij was calculated assuming that the misconfiguration started at the initial state Conf_1. The probabilities of Conf_ij and the observations F_{1,...,T} were stored in a matrix ConfMat[ConfLeng, T]. We computed the probability of F by summing the forward-path probabilities from the previous time step t-1, weighted by their transition probabilities Confprob_{Conf,Conf'}, and multiplied by the observation probability Fprob_Conf(F_t). We sum over the probabilities of all possible hidden misconfigurations (Conf_ij, ..., Conf_Nj) that could generate the observation sequence F_{1,t+1,...,T}. Each Conf represented the probability of being in Conf_j after seeing the first F_t observations. In case of a misconfiguration, we checked its type (Conf_1, Conf_2, Conf_3, ..., Conf_n) considering the Risk Type for that state; otherwise, the model returns to check the next state. The model results were stored in the knowledge storage to enhance future detection.
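The forward computation over ConfMat described above can be sketched as follows; the candidate states, the transition matrix Confprob, and the emission matrix Fprob are illustrative assumptions, not trained values.

```python
# Illustrative forward-probability filling of ConfMat[ConfLeng, T].
import numpy as np

conf_states = ["Conf_1", "Conf_2", "Conf_3"]       # candidate misconfiguration types
conf_prob = np.array([[0.6, 0.3, 0.1],             # Confprob[Conf, Conf'] (illustrative)
                      [0.2, 0.6, 0.2],
                      [0.1, 0.3, 0.6]])
f_prob = np.array([[0.7, 0.2, 0.1],                # Fprob_Conf(F_t) for 3 failure symbols
                   [0.2, 0.6, 0.2],
                   [0.1, 0.2, 0.7]])
init = np.array([1.0, 0.0, 0.0])                   # misconfiguration starts at Conf_1

def forward(failures):
    """Fill ConfMat with forward probabilities and return it with the sequence likelihood."""
    n, T = len(conf_states), len(failures)
    conf_mat = np.zeros((n, T))
    conf_mat[:, 0] = init * f_prob[:, failures[0]]
    for t in range(1, T):
        # sum previous forward paths weighted by transitions, times emission probability
        conf_mat[:, t] = (conf_mat[:, t - 1] @ conf_prob) * f_prob[:, failures[t]]
    return conf_mat, conf_mat[:, -1].sum()

conf_mat, likelihood = forward([0, 1, 1, 2])
print(conf_states[int(np.argmax(conf_mat[:, -1]))], likelihood)
```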
We focused on the most common misconfiguration types reported for the edge, cluster, and cloud layers according to the National Vulnerability Database, the Cybersecurity & Infrastructure Security Agency, the Common Attack Pattern Enumeration and Classification, the OWASP Top 10 Security Risks, the ENISA Top 15 Threats, and AT&T Cybersecurity. To enhance the identification process, we created pre-defined misconfiguration description profiles with common misconfiguration types and stored them in the knowledge storage.
We extended HHMM with controlling labels
made up of tags, each of which stands for a certain
integrity issue (such as private data) and outlines the
information flow permitted. Thus, a tag model is de-
fined to deliver a policy formulation.
We further created an information access con-
trol list to control the access to the system compo-
nents. The list identified a set of access variables for
each participant, such as roles, actions, access to the
API, authorizations they hold, permissions, permis-
sion boundaries, and conditions.
The permission boundary defines the maximum
permissions granted to participants and roles by utiliz-
ing an enumeration-type action with two values (true
and false). If the action of permission is true, then the
permission is allowed; otherwise, it is rejected. More-
over, we assumed that if no information flow policy is
specified in the domain, the inbound and the outbound
flow will be set if the policy has any outbound rules.
The policies in the observed system do not conflict as they are additive.
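A minimal sketch of this permission-boundary check is shown below; the participant fields and action names are illustrative assumptions, not the full access-control list.

```python
# Illustrative permission-boundary evaluation: an action is allowed only if both
# the granted permission and the permission boundary set it to true.
from dataclasses import dataclass, field

@dataclass
class Participant:
    role: str
    permissions: dict = field(default_factory=dict)          # action -> bool
    permission_boundary: dict = field(default_factory=dict)  # action -> bool (maximum allowed)

def is_allowed(participant: Participant, action: str) -> bool:
    granted = participant.permissions.get(action, False)
    within_boundary = participant.permission_boundary.get(action, False)
    return granted and within_boundary

nurse = Participant(
    role="nurse",
    permissions={"read_patient_record": True, "update_device_config": True},
    permission_boundary={"read_patient_record": True, "update_device_config": False},
)
print(is_allowed(nurse, "read_patient_record"))   # True
print(is_allowed(nurse, "update_device_config"))  # False: outside the boundary
```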
3.3 Misconfiguration Recovery
We used the MDP to model the recovery as a set of
states (identified misconfiguration type) and actions
that can be performed to control the system’s behavior
through an agent. The agent interacted with the envi-
ronment by choosing the optimal actions to maximize
a long-term measure of total reward without knowing
about past situations.
The agent at state St_t chose an action A_t from the action space AS (A_t ∈ A(St_t)), where A(St_t) was the set of actions (a_1, ..., a_n) (Risk Treatment) available in state St_t at time t. Here, a_1 denoted that the system's component (a container's deployed services) is terminated when a container-escape vulnerability allows an attacker to obtain host root access. a_2 denoted that the system's component is reconfigured and redeployed; here, the affected configuration is updated and redeployed. a_3 denoted that the system's component was reconfigured and the whole cluster was restarted. a_4 denoted that no action was applied because the status of the component couldn't be obtained. Depending on the action taken, the starting state, and the subsequent state P_S^A, the agent receives a numerical reward R_{t+1} ∈ R to maximize the reward (performance). Then, the system returns a new state St_{t+1} from the state space SP at time t.
The state space includes four possible situations: 1) the state is at its initial (misconfigured) status St_Int, 2) successful recovery St_SR, 3) recovery failure St_RF, and 4) the recovered state St_R. Hence, if the state is St_Int, then one recovery action from the actions in AS is applied. If the action is applied successfully, the state transits to St_SR, which indicates the success of the recovery, and the state is marked as recovered St_R; otherwise, the state turns to St_RF, and another action is applied to recover the state, P : S × S × A → [0, 1], until the state is recovered with transition probability P(St1, St2, A1) = 1 from state St1 to state St2 upon using action A1. If the recovery isn't applied successfully, the state is marked as non-recovered with a probability of failure P(St1, St2, A1) = F and a constant failure rate FR, which results in an exponential failure distribution, as shown in (1).
The reward could be a set of possible rewards with a reward probability R_S^A: (1) a positive value, which refers to a successfully applied recovery action that enhanced the observed metrics; (2) a negative value, which refers to an unsuccessfully applied recovery action that degraded the observed metrics. The reward function R represents the gain for using action A1 in state S, with R : S × A → R. The gain of the reward is determined by defining the performance function based on the action chosen at each time slot t, denoted by PerSC_t = Q_SC. Q_SC denoted the system component's performance at t, and Q_ED denoted the edge device's performance at t. The greater the value of Q, the better the chosen action for a state under a policy π. To control the importance of immediate and future rewards for each state, we used a discount factor D to maximize the cumulative reward obtained from each state.
F(t) = FR × exp(-FR(t + t'))    (1)
The controller stops the iteration as soon as there is no more policy enhancement. We enhanced the controller performance during the selection of the optimal policy, as shown in (2), by taking the best action over all actions, considering the convergence of the expected-return sequence υ_w and the optimal returned policy υ_optimal, ∀s ∈ S.

υ_{w+1}(s) = max_a Σ_{s',r} p(s', r | s, a) × [ε υ_w(s') + r]    (2)
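For illustration, the following sketch runs value iteration in the spirit of (2) over the recovery states and actions a_1 to a_4; the transition probabilities, rewards, and discount factor ε are assumed values, not the trained recovery model.

```python
# Illustrative value iteration over the recovery MDP (assumed parameters).
import numpy as np

states = ["St_Int", "St_SR", "St_RF", "St_R"]
actions = ["a1_terminate", "a2_reconfig_redeploy", "a3_restart_cluster", "a4_no_action"]

# P[a, s, s'] transition probabilities and R[a, s, s'] rewards (illustrative only)
P = np.zeros((4, 4, 4))
R = np.zeros((4, 4, 4))
for a, p_success in enumerate([0.6, 0.8, 0.7, 0.0]):
    P[a, 0, 1] = p_success          # St_Int -> St_SR
    P[a, 0, 2] = 1.0 - p_success    # St_Int -> St_RF
    P[a, 1, 3] = 1.0                # St_SR -> St_R
    P[a, 2, 0] = 1.0                # St_RF -> retry from St_Int
    P[a, 3, 3] = 1.0                # St_R is absorbing
    R[a, 0, 1] = 10.0               # positive reward for a successful recovery action
    R[a, 0, 2] = -5.0               # negative reward for a failed recovery action

def value_iteration(epsilon=0.9, tol=1e-6):
    v = np.zeros(len(states))
    while True:
        # q[a, s] = sum_s' P[a, s, s'] * (R[a, s, s'] + epsilon * v[s'])
        q = np.einsum("asx,asx->as", P, R + epsilon * v)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=0)
        v = v_new

values, policy = value_iteration()
print({s: actions[policy[i]] for i, s in enumerate(states)})
```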
4 EVALUATION
To evaluate the controller, we ran several experiments.
Our setup consisted of three VM instances (2 VM for
the heart rate monitor application and 1 VM for the
controller). Each VM is equipped with 3 VCPUs and
2 GB VRAM; and runs Ubuntu 22.10 and Xen 4.11.
Agents are installed on each VM to collect and trans-
fer monitoring data for external storage and process-
ing. The VMs are connected through a 100 Mbps
network. For each VM, we deployed three contain-
ers. A K8s cluster, consisting of one master node and
three worker nodes, was deployed using Kubeadm,
running K8s version 1.19.2. All nodes have 4 VC-
PUs and 8GB RAM, and all were deployed on the
same machine to eliminate variations in network de-
lay. We created 30 namespaces, each with 4 microser-
vices (pods) used for performance measurements, and
assigned the same number of network policies. The
number of created policies was 900, which were or-
dered, managed, and evaluated.
4.1 Data Collection and System
Monitoring
The installed agents collect data about CPU, Mem-
ory, Network, filesystem changes, information flow
(i.e., no. of flows issued to component), patient health
information, device operation status, the device id,
and service status from the system components. The
agents exposed log files of system components to
the storage to be used in the analysis. Edge devices
with similar functionality are grouped and allocated
to a respective group (pool of heart monitor edge de-
vices). We used the Datadog tool to obtain a live data
stream for the running components and to capture the
request-response tuples and associated metadata. The
collected data are grouped and stored in a time series
database. We used the "Logman" command to trace remote procedure call (RPC) events for Kubernetes and Docker and to forward container logs as event tracing in Windows. The gathered data are stored in
real-time/historical storage to enhance future detec-
tion.
The dataset is divided into a 50% training set and
50% testing set. We used NNM iSPI Performance to
collect data about the information flow from the sys-
tem under observation (e.g., device id, device type,
max/mean/min size of the packet sent, total packets,
max/mean/min amount of time of active flow, dura-
tion of flow). We stored the configuration files of
the components in the GitOps version control to sim-
plify the rollback of configuration change. We wrote
our configuration files using YAML. We managed the
configurations, deployments, and dependencies using
kubectl and Skaffold.
After training the models on the gathered data,
we noticed at the cluster level a sudden increase in
request latency and the request rate falling, which
caused excessive consumption of resource usage
(CPU, memory, network). This occurred because of
Table 1: Detection Evaluation.
Metrics HHMM CRFs DBMs
RMSE 0.2003 0.4860 0.2600
PFD 0.3065 0.3976 0.3948
Recall 94.49% 91.69% 93.62%
Accuracy 93% 92% 92%
the deployment of the incorrectly configured version
(pod replacement) that allows root access to the host
as the privileged and hostPID were true. Moreover, a
critical improper access control occurred at the edge
level due to no encryption to secure the communica-
tion protocol, and the protocol lacks authentication
for legitimate devices.
4.2 The Detection Assessment
The performance of the detection model is evaluated
by Root Mean Square Error (RMSE) and Probabil-
ity of False Detection (PFD), which are the com-
monly used metrics for evaluating detection accuracy.
The RMSE measures the differences between the de-
tected value and the observed one by the model. A
smaller RMSE value indicates a more effective detec-
tion scheme. The PFD measures the number of normally behaving components that have been misdetected as anomalous by the model. A smaller PFD value indicates a more effective detection scheme. The efficiency of the model is compared with Conditional Random Fields (CRFs) and Deep Boltzmann Machines (DBMs); see Table 1. We noticed that the computation of CRFs is harder than that of the HHMM.
The results show that the performance of the pro-
posed detection is better than the CRF, as it correctly
identified true positives (TP) of abnormal flow and
misconfiguration with 94% recall and 93% accuracy.
The DBMs achieved promising results; however, the
learning procedure was too slow.
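The two metrics can be computed as in the sketch below; the sample arrays are illustrative and unrelated to the reported results.

```python
# Illustrative computation of RMSE and PFD for the detection assessment.
import numpy as np

def rmse(detected, observed):
    detected, observed = np.asarray(detected, float), np.asarray(observed, float)
    return float(np.sqrt(np.mean((detected - observed) ** 2)))

def pfd(predicted_labels, true_labels):
    """Probability of false detection: normal components flagged as anomalous."""
    predicted, true = np.asarray(predicted_labels), np.asarray(true_labels)
    normal = true == 0
    return float(np.sum(predicted[normal] == 1) / max(np.sum(normal), 1))

print(rmse([0.2, 0.4, 0.9], [0.1, 0.5, 0.8]))
print(pfd(predicted_labels=[1, 0, 1, 0], true_labels=[0, 0, 1, 0]))  # 1 of 3 normals flagged
```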
4.3 The Recovery Assessment
This section evaluates the controller by measuring the
reliability of recovery, deployment, and performance
of the controller. We used Mean Time to Recovery
(MTTR) to measure the average time the recovery
process takes to recover a component after observing
a failure on the monitored metrics. The failures refer
to a component that cannot meet its expected perfor-
mance metrics. A higher MTTR indicates the exis-
tence of inefficiencies within the recovery process or
the component itself. We conducted two scenarios.
The first one corresponds to the selection of the opti-
mal policy. The second relates to selecting a random
policy, where the agent randomly selects one or more
actions with uniform distribution. For each scenario,
we aimed to assess the average time the recovery pro-
cess took to recover a container and an edge device.
In the first scenario, the MTTR for the edge device
was 20 s, and the MTTR to recover the container was
approximately 43 s with a grace period of 80 s (the Kubernetes default is 30 s) for a service image size of 110 MB and 30 service images. For the second sce-
nario, the MTTR for the edge device was 53 s, and
the MTTR to recover the container was roughly 71
seconds under the same settings. We noticed that, in both scenarios, the container and the edge device functioned normally after those intervals, given the assigned rewards. The first scenario led to a significantly shorter recovery time, as the average rewards achieved through the optimal policy were remarkably higher than those of the random policy. Moreover, we found that for some actions, such as function configuration and configuring user namespace remapping, the more rewards were assigned during the recovery process, the more the average recovery time decreased, since the detection time was short, demonstrating a significant difference in controller performance. However, the average recovery time increased when the failure rate increased.
We measured the Overall Accuracy of Recovery
(OAR). The OAR measures the average rate of suc-
cessfully recovered components to the total number of
failures emitted by all components. When computed over multiple runs, the OAR was around 97.66%, which means that the trained recovery policy could not handle a small number of failures. The accuracy was still more than 96%, and the number of unhandled failures decreased dramatically with more training data.
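For illustration, MTTR and OAR can be computed as sketched below; the sample incident durations and counts are assumptions, not the measurements reported above.

```python
# Illustrative computation of MTTR and OAR for the recovery assessment.
def mttr(recovery_times_s):
    """Mean Time To Recovery over per-incident recovery durations (seconds)."""
    return sum(recovery_times_s) / len(recovery_times_s) if recovery_times_s else 0.0

def oar(recovered_count, total_failures):
    """Overall Accuracy of Recovery as a percentage of failures successfully recovered."""
    return 100.0 * recovered_count / total_failures if total_failures else 0.0

edge_device_recoveries = [18.0, 22.0, 20.0]   # seconds per incident (illustrative)
container_recoveries = [41.0, 45.0, 43.0]
print(f"MTTR edge device: {mttr(edge_device_recoveries):.1f} s")
print(f"MTTR container:   {mttr(container_recoveries):.1f} s")
print(f"OAR: {oar(recovered_count=293, total_failures=300):.2f}%")
```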
We verified the ability of the controller using a
long time-span dataset (from 1 July 2021 to 1 Novem-
ber 2022). For some misconfiguration types (e.g.,
CVE-2019-5736, CVE-2022-0811, CVE-2019-6538,
CVE-2021-21284, CVE-2019-9946, and CVE-2020-
10749), the trained recovery policy enhanced the
performance of the system resources. The results
show that the average amount of resource consumption (CPU, memory, network), with no misconfiguration, was approximately the same, with respective values varying around 30%-60% (normal behavior). Resource consumption due to misconfiguration increased to over 98% (overloaded resources), demonstrating the impact of improper configuration on the system resources. Recovering the misconfiguration relieved the saturated resources, as the values of the monitored resources varied around 38.4%-64.6% (normal behavior). The controller performance was almost the same, with a minor recovery-time deviation of around 100 seconds for some failure types, like container privileged access and wrong pod labels.
The deviation was due to the correlation with failures in the system. Hence, we used the sequence of failures occurring during the recovery process to reflect the failure type, which represents the failures that share the same observations corresponding to a unique fault. If the container-privileged-access and wrong-pod-label sequence of failures occurred, we focused on the container-privileged-access failure to represent its failure type and relate it to its fault, which is Privilege Access Escalation Management. We chose the initial failure that occurred, as it is representative enough of the observations to which it belongs, which allows us to save recovery time without trying many recovery actions. We found that some failures in the test set, such as CVE-2022-0811, are not covered by the training set, which might impact accuracy. The results show that the controller performed better as the training dataset size increased.
5 RELATED WORK
Several studies have addressed workload and in-
formation flow management in dynamic environ-
ments. Sorkunlu (2017) identified system perfor-
mance anomalies by analyzing the correlations in the
resource usage data. Wang (2018) proposed a model-
based approach to correlate the workload and the re-
source utilization of applications to characterize the
system status. In the work of Moothedath et al.
(2020), an information-flow tracking model was de-
veloped for detecting suspicious flows in the system
and performing security analysis for unauthorized use
of data. A formal model that optimizes the runtime
performance of data-flow applications focusing on de-
tecting the latency variation in IoT is developed by
Luo et al. (2020).
Many literature studies have used Markov mod-
els and their derivations to detect anomalous behav-
ior. For instance, Sohal (2018) used Markov models
to categorize and identify malicious edge devices, and
Sukhwani (2014) implemented various techniques to
detect network anomalies and intrusions. Faults can
also be detected in real-time embedded systems by
describing the healthy and faulty states of a system’s
hardware components (Ge, 2015). In contrast, Borgi
(2018) proposed an intrusion-detection solution that
collects data at the system level. This solution tracks information flows to find links between related and unrelated attacks at the network level and to recognize the reconstructed attack campaigns using an HMM.
However, the previously mentioned literature
studies provided limited scope for dynamically in-
tegrating different policies to manage medical edge
devices’ and clusters’ configurations. In particu-
lar, existing frameworks have paid limited atten-
tion to the critical role of efficient recovery man-
agement (Alessandro, 2022), (CISKubernetes, 2022),
(Darryl, 2022), (Fairwinds, 2023), (Joe, 2022), (Kyle,
2020). Hence, this paper: (1) mapped the observed
performance degradation (failure) to its hidden abnor-
mal flow of information (fault) and misconfiguration
type (error) and (2) selected the optimal recovery pol-
icy with optimum actions to optimize the performance
of the system under observation.
6 CONCLUSIONS AND FUTURE
WORK
Securing workloads and information flow against
misconfiguration in container-based clusters and edge
medical devices is an important part of overall system
security. This paper presented a controller that ana-
lyzes the misconfiguration, maps the observation to
its hidden misconfiguration type, and selects the op-
timal recovery policy to maximize the performance
of defined metrics. In the future, we will integrate
streaming from different edge devices, expand the re-
covery mechanism, and conduct more experiments.
ACKNOWLEDGEMENT
This research was funded in part by The Research
Council of Norway under grant numbers 274451 and
263248.
REFERENCES
Moothedath, S., Sahabandu, D., Allen, J., Clark, A., Bush-
nell, L., Lee, W., and Poovendran, R. (2020). Dy-
namic Information Flow Tracking for Detection of
Advanced Persistent Threats: A Stochastic Game Ap-
proach. In arXiv:2006.12327.
Kraus, S., Schiavone, F., Pluzhnikova, A., and Invernizzi,
A. C. (2021). Digital Transformation in Healthcare:
Analyzing The Current State-of-Research. Journal of
Business Research, 123:557–567.
Sklavos, N., Zaharakis, I. D., Kameas, A., and Kalapodi, A.
(2017). Security & Trusted Devices in the Context of
Internet of Things (IoT). In The proceedings of 20th
EUROMICRO Conference on Digital System Design,
Architectures, Methods, Tools (DSD’17), pages 502–
509.
Luo, Y., Li, W., and Qiu, S. (2020). Anomaly Detec-
tion Based Latency-Aware Energy Consumption Opti-
mization For IoT Data-Flow Services. Sensors, 20:1–
20.
Mäkitalo, N., Ometov, A., Kannisto, J., Andreev, S., Kouch-
eryavy, Y., and Mikkonen, T. (2018). Safe and Se-
cure Execution at the Network Edge: A Framework
for Coordinating Cloud, Fog, and Edge. IEEE Soft-
ware, 35:30–37.
Guo, M., Li, L., and Guan, Q. (2019). Energy-Efficient and
Delay-Guaranteed Workload Allocation in IoT-Edge-
Cloud. IEEE Access, 7:78685–78697.
Fine, S., Singer, Y., and Tishby, N. (1998). The Hierarchical
Hidden Markov Model: Analysis and Applications.
Machine Learning, 32:41–62.
Derman, C. (1970). Finite State Markovian Decision Pro-
cesses. Academic Press, New York
Sorkunlu, N., Chandola, V., and Patra, A. (2017). Track-
ing System Behavior from Resource Usage Data. In
The proceedings of IEEE International Conference on
Cluster Computing (ICCC), pages 410–418.
Wang, T., Xu, J., Zhang, W., Gu, Z., and Zhong, H.
(2018). Self-Adaptive Cloud Monitoring with On-
line Anomaly Detection. Future Generation Com-
puter Systems, 80:89–101.
Sohal, A. S., Sandhu, R., Sood, S. K., and Chang, V. (2018). A Cybersecurity Framework to Identify Malicious Edge Devices in Fog Computing and Cloud-of-Things Environments. Computers & Security, 74:340–354.
Sukhwani, H., Sharma, V., and Sharma, S. (2014). A Sur-
vey of Anomaly Detection Techniques and Hidden
Markov Model. International Journal of Computer
Applications, 93:975–8887.
Ge, N., Nakajima, S., and Pantel, M. (2015). Online Diag-
nosis of Accidental Faults for Real-Time Embedded
Systems Using a Hidden Markov Model. Simulation,
91:851–868.
Borgi, G. (2018). Real-Time Detection of Advanced Persistent Threats Using Information Flow Tracking and Hidden Markov Models. Doctoral Dissertation.
Alessandro, M. (2022). Nearly One Million Exposed Misconfigured Kubernetes Instances Could Cause Breaches. https://www.infosecurity-magazine.com/news/misconfigured-kubernetes-exposed/
CIS Kubernetes Benchmarks. (2022). Securing Kubernetes: An Objective, Consensus-Driven Security Guideline For The Kubernetes Server Software. https://www.cisecurity.org/benchmark/kubernetes
Darryl, T. (2022). ARMO: Misconfiguration Is Number 1 Kubernetes Security Risk. https://thenewstack.io/armo-misconfiguration-is-number-1-kubernetes-security-risk/
Fairwinds. (2023). Kubernetes Configuration Benchmark Report. https://www.fairwinds.com/kubernetes-config-benchmark-report
Joe, P. (2020). Common Kubernetes Misconfiguration Vulnerabilities. https://www.fairwinds.com/blog/kubernetes-misconfigurations
Kyle, A. (2020). Major Vulnerability Found in Open Source Dev Tool For Kubernetes. https://venturebeat.com/security/major-vulnerability-found-in-open-source-dev-tool-for-kubernetes/
Jin, X., Katsis, C., Sang, F., Sun, J., Kundu, A., and Kom-
pella, R. (2022). Edge Security: Challenges and Is-
sues. In arXiv:2206.07164.
IEEE. (2009). IEEE Standard Classification for Software
Anomalies (IEEE 1044–2009). In IEEE Std 1044-
2009 (Revision of IEEE Std 1044-1993), 2010:1–23.
ISO. (2019). Systems and Software Engineering Sys-
tems and Software Assurance — Part 1: Concepts and
Vocabulary (ISO/IEC/IEEE 15026-1:2019).
Samir, A. and Pahl, C. (2019). A Controller Architec-
ture For Anomaly Detection, Root Cause Analysis
and Self-Adaptation for Cluster Architectures. In The
Eleventh International Conference on Adaptive and
Self-Adaptive Systems and Applications (ADAPTIVE),
pages 75–83.