difference in the controller performance. However, the
average time needed to recover from failures increased
as the failure rate increased.
4.4.2 Deployment and Performance Evaluation
We verified whether the captured variation in performance
was due to a misconfiguration within the clusters and
edge devices. To this end, Spearman’s rank correlation
coefficient was used to estimate the correlation between
valid configurations and normal system performance,
taking the monitored metrics as the reference for correct
system behavior (for more details, see Samir and Pahl,
2019). A decrease in the degree of correlation indicates
that the observed degradation is not due to
misconfiguration; otherwise, it indicates that a
misconfiguration exists. Overall, we observed a higher
number of noticeable correlations between the
misconfiguration and the performance indicators. The
highest correlation was 0.82, and the lowest was 0.34.
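As an illustration of the statistic used above, the following is a minimal sketch of Spearman's rank correlation, computed as the Pearson correlation of the ranks with tied values sharing an average rank. The function names and the sample values (a configuration compliance score and a CPU utilization series) are hypothetical and are not taken from the experiments.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical samples: compliance score per run and CPU utilization (%).
compliance_score = [0.95, 0.90, 0.70, 0.55, 0.40]
cpu_utilization = [32.0, 35.5, 61.0, 88.0, 99.0]
rho = spearman_rho(compliance_score, cpu_utilization)
```

In this made-up sample the utilization rises monotonically as the compliance score falls, so the coefficient sits at the negative extreme; real monitoring data would normally yield intermediate values.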
The configuration settings were checked against
the benchmarks for misconfigurations during the
deployment of a component (edge devices, containers,
edge gateway, and clusters). The controller iterated
over all the security policies and guidelines (Azure, CIS
Docker, and Kubernetes Benchmarks). In case of a
mismatch between the settings and the requirements
of secure deployment in one or more components,
the controller reevaluated the deployment of the
impacted component, applied the required
reconfiguration, and redeployed the component. Otherwise,
the component settings were secure as per the security
guidelines, and the controller proceeded with the
deployment. The controller marked each misconfiguration
that needed to be addressed in a component with a flag.
Hence, we measured the average redeployment time
for a component, from the observation of anomalous
behavior until the successful recovery of the component.
The average container redeployment time was 210 seconds,
with no observed overhead associated with
Kubernetes and Docker Swarm. For the edge device,
the average time required to send a redeployment request
to the corresponding edge gateway and successfully
receive a response was 185 seconds for a 110 MB
redeployment package. For the edge gateway, the average
redeployment time was 95 seconds. Over multiple runs, the
average redeployment time was reduced by 15% to 30%, and
the performance improved by 20%, depending on the content
and structure of the container’s image and the
available network bandwidth. In this sense, the platform
had a significant impact on the redeployment time.
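The check, reconfigure, and redeploy cycle described in this paragraph can be sketched as follows. This is a simplified illustration under stated assumptions, not the controller's actual implementation: the `Component` class, the setting keys, and the benchmark names are hypothetical placeholders and do not correspond to real entries of the Azure, CIS Docker, or Kubernetes benchmarks.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    settings: dict
    flagged: list = field(default_factory=list)  # (benchmark, key) pairs that failed


def find_mismatches(settings, required):
    """Return the keys whose value differs from the required secure value."""
    return [key for key, value in required.items() if settings.get(key) != value]


def deploy_securely(component, benchmarks):
    """Check a component against every benchmark and reconfigure it if needed.

    Returns "redeployed" when at least one setting had to be corrected,
    otherwise "deployed".
    """
    reconfigured = False
    for benchmark_name, required in benchmarks.items():
        for key in find_mismatches(component.settings, required):
            component.flagged.append((benchmark_name, key))  # flag the misconfiguration
            component.settings[key] = required[key]          # apply the secure value
            reconfigured = True
    return "redeployed" if reconfigured else "deployed"


# Hypothetical benchmark rules and a container that violates one of them.
benchmarks = {
    "benchmark-a": {"privileged": False},
    "benchmark-b": {"read_only_root_fs": True},
}
container = Component("container-1", {"privileged": True, "read_only_root_fs": True})
first_status = deploy_securely(container, benchmarks)
second_status = deploy_securely(container, benchmarks)
```

In this sketch the first call corrects the privileged setting and reports a redeployment, while the second call finds no mismatch and proceeds with a plain deployment. Timing the interval between the first flag and the successful redeployment would give the redeployment time measured above.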
Moreover, the results show that the average
resource consumption (CPU, memory, network)
with no misconfiguration was approximately
the same, with the respective values ranging from
about 30% to 60% (normal behavior). Under
misconfiguration, resource consumption increased to over
98% (overloaded resources), demonstrating the
impact of improper configuration on the system
resources. After the misconfiguration was recovered,
the saturated resources returned to normal behavior,
with the monitored values ranging from about 38.4% to
64.6%. The controller performance was almost the
same, with a minor recovery time deviation of around
100 seconds for some failure types, such as container
privileged access and wrong pod label. The deviation
was attributable to the correlation between failures in
the system. Hence, we used the sequence of failures
occurring during the recovery process to reflect the type
of failure, which represents the failures that share the
same observations corresponding to a unique fault. If
the sequence of container privileged access and wrong
pod label failures occurred, we focused on the
container privileged access failure to represent the failure
type and related it to its fault, which is Privilege
Access Escalation Management. We chose the initial
failure that occurred because it is representative enough of
the observations to which it belongs, which allowed us
to reduce the recovery time by not trying many
recovery actions.
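The choice of the initial failure as the representative of a failure sequence can be sketched as below. The function names and the dictionary are hypothetical; the only mapping included is the one named in the text (container privileged access to Privilege Access Escalation Management), and the lookup is an assumption about how a failure might be related to its fault.

```python
# Failure-to-fault mapping; only the pair stated in the text is included.
FAULT_OF_FAILURE = {
    "container privileged access": "Privilege Access Escalation Management",
}


def representative_failure(failure_sequence):
    """Pick the first failure observed in the sequence as its representative."""
    if not failure_sequence:
        raise ValueError("empty failure sequence")
    return failure_sequence[0]


def fault_for_sequence(failure_sequence, fault_of_failure=FAULT_OF_FAILURE):
    """Return (representative failure, related fault or None if unknown)."""
    initial = representative_failure(failure_sequence)
    return initial, fault_of_failure.get(initial)


observed = ["container privileged access", "wrong pod label"]
failure, fault = fault_for_sequence(observed)
```

Resolving the whole sequence through its first failure means only the recovery action for one fault has to be attempted, which is the time saving the paragraph describes.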
Finally, we found that some anomalous behavior
in the test set, such as CVE-2022-0811, was not
covered by the training set, which might impact accuracy.
The results showed that the controller performed better
as the size of the training dataset increased.
Moreover, we measured the average ratio of successfully
recovered components to the total number of
misconfigurations in all anomalous components. After
multiple runs, the average rate was around 97.66%,
which means that the recovery could not handle a
small number of misconfigurations, though the
unhandled anomalous behavior decreased dramatically
with more training data.
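For reference, the recovery rate described above can be computed per run and then averaged, as in this minimal sketch. The function names and the per-run counts are hypothetical and are not the experimental data; they only show the arithmetic.

```python
def recovery_rate(recovered, total_misconfigurations):
    """Percentage of misconfigurations that were successfully recovered."""
    if total_misconfigurations <= 0:
        raise ValueError("total number of misconfigurations must be positive")
    if not 0 <= recovered <= total_misconfigurations:
        raise ValueError("recovered count must lie between 0 and the total")
    return 100.0 * recovered / total_misconfigurations


def average_recovery_rate(runs):
    """Mean of the per-run recovery rates; runs is a list of (recovered, total)."""
    rates = [recovery_rate(recovered, total) for recovered, total in runs]
    return sum(rates) / len(rates)


# Hypothetical counts for three runs: (recovered, total misconfigurations).
runs = [(48, 50), (49, 50), (50, 50)]
average_rate = average_recovery_rate(runs)
```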
5 RELATED WORK
This section reviews work on misconfiguration
recovery in the literature.
Various frameworks for managing workload and
information flow in Edge/Fog environments have
been developed; however, they provide limited
scope for integrating different policies to manage the
configurations of medical edge devices and clusters
dynamically. In particular, existing frameworks have
paid limited attention to the critical role of efficient
recovery management (Mascellino, 2022), (Nie et al.,
2021b), (Nie et al., 2021a), (Tang et al., 2018), (Taft,
CLOSER 2023 - 13th International Conference on Cloud Computing and Services Science