which reduces the performance of the classifier
(Kulkarni et al., 2020). The problem of imbalanced
data becomes even more apparent when attackers
actively modify malware to evade detection and pose
a constant threat. In this study, an imbalanced data
environment was modeled to analyze its effect on
model performance. Furthermore, to address these
challenges, an optimal iterative learning-based IDS
that incorporates new input data at each learning
iteration is proposed. This study offers the
following major contributions:
• The challenges in ML-based anomaly detection
  were comprehensively analyzed.
• The data imbalance scenario was modeled by
  collecting data using an endpoint detection and
  response (EDR) tool, demonstrating the
  performance degradation in real-world IDS
  environments.
• An optimal iterative learning-based IDS that
  determines the optimal number of iterations to
  achieve efficient learning was proposed.
The remainder of this paper is organized as
follows. Section 2 provides a research overview on
ML-based anomaly detection and efforts to mitigate
data imbalance in ML. Section 3 analyzes the
challenges of ML-based IDS, and Section 4 proposes
an optimal iterative learning method to address the
IDS performance degradation caused by data
imbalance. Finally, Section 5 concludes the paper and
discusses future research directions.
2 RELATED WORK
With the rise of intelligent cyberattacks, ML-based
anomaly detection for cybersecurity has recently
become a research focus. However, models trained in
conventional ML-based anomaly detection studies are
difficult to deploy in real-world environments
because these studies often overlook the differences
between training and deployment environments. In this section,
existing ML-based anomaly detection research and its
limitations are examined.
2.1 Machine Learning-Based Anomaly Detection
Abbasi et al. (2022) proposed a particle swarm
optimization (PSO)-based clustering algorithm to
overcome the limitations of behavior-based
ransomware detection models. They detected the
presence of attacks using a random forest (RF)
classifier and achieved 97% accuracy. However,
their approach shows significant performance
degradation when applied to multi-class
classification. In addition, the study does not report
an F-score, making it difficult to evaluate
performance in environments with data imbalance.
Sun et al. (2022) proposed an intelligent attack
detection technique based on a frequency-differential
selection (FDS) feature selection algorithm and a
weighting calculation. They collected attack and
normal data from Android devices and achieved an
accuracy of 99%. However, this study did not
consider the data imbalance problem in ML, and the
performance degraded when the model was deployed
in a real-world environment.
Gezer et al. (2019) applied four ML techniques to
detect Trojan horse malware and selected the optimal
hyperparameters and features. Using an RF classifier,
the model achieved an accuracy of approximately
99.95%. However, it is difficult to say whether the
model was tested in a reliable evaluation environment,
as the authors measured the accuracy on an
imbalanced dataset.
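
The concern raised above, that accuracy alone is unreliable on imbalanced data, can be illustrated with a minimal Python sketch. The class sizes and the two classifiers below are hypothetical and are not taken from any of the cited studies; they only show that two models with identical accuracy can have very different F-scores.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (attack) class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no attack detected: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced test set: 9,990 normal samples (0), 10 attacks (1).
y_true = [0] * 9990 + [1] * 10

# Classifier A never raises an alert.
pred_a = [0] * 10000

# Classifier B raises 5 false alerts and detects 5 of the 10 attacks.
pred_b = [1] * 5 + [0] * 9985 + [1] * 5 + [0] * 5

acc_a, f1_a = accuracy(y_true, pred_a), f1_score(y_true, pred_a)
acc_b, f1_b = accuracy(y_true, pred_b), f1_score(y_true, pred_b)
print(f"A: accuracy={acc_a:.4f}, F1={f1_a:.2f}")
print(f"B: accuracy={acc_b:.4f}, F1={f1_b:.2f}")
```

Both classifiers reach 99.9% accuracy, yet classifier A detects no attacks at all (F1 of 0), whereas classifier B has an F1 of 0.5. Only the F-score separates the two.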
2.2 Mitigating the Data Imbalance in Cybersecurity
Thirumuruganathan et al. (2024) proposed a method
to effectively detect and mitigate data imbalance even
in environments with unlabeled data. Their method
addresses unknown attacks and data imbalance
problems in real-world environments by assigning
pseudo-labels to unlabeled data through unsupervised
learning. However, their method is limited in that
performance degrades when there is a distributional
difference between the test and training data. In
addition, manual labeling was required for some data
during the initial training stage.
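
The general idea of pseudo-labeling from a small manually labeled seed can be sketched as follows. This is not the method of Thirumuruganathan et al.; it substitutes a simple nearest-centroid rule with a distance threshold, and the function names, feature values, and the `max_distance` parameter are all assumptions made for illustration.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def pseudo_label(seed, unlabeled, max_distance):
    """Assign each unlabeled point the label of its nearest seed centroid.

    Points farther than max_distance from every centroid stay unlabeled
    (None), since they may belong to a distribution the seed does not cover.
    """
    centroids = {label: centroid(pts) for label, pts in seed.items()}
    result = []
    for point in unlabeled:
        label, dist = min(
            ((lab, math.dist(point, c)) for lab, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        result.append((point, label if dist <= max_distance else None))
    return result


# Hypothetical 2-D features: a few manually labeled samples per class.
seed = {
    "normal": [(0.0, 0.0), (0.2, 0.1)],
    "attack": [(5.0, 5.0), (4.8, 5.2)],
}
unlabeled = [(0.1, 0.3), (5.1, 4.9), (2.5, 2.5)]

labeled = pseudo_label(seed, unlabeled, max_distance=1.0)
for point, label in labeled:
    print(point, label)
```

The third point lies far from both centroids and is left unlabeled, which mirrors the limitation noted above: pseudo-labels become unreliable once incoming data drift away from the distribution seen during initial training.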
Balla et al. (2023) mitigated the performance
degradation caused by data imbalance in IDS and
intrusion prevention systems (IPS). They divided
imbalanced datasets into majority and minority
classes, applying undersampling to the majority
class data and oversampling to the minority class data.
They demonstrated that their proposed method
outperforms conventional methods by measuring the
accuracy, precision, detection rate, and F-score for
four public datasets. However, the proposed method
struggles to detect attacks in real time because it
requires retraining to compensate for data imbalance.
Moreover, because it relies on sampling, it is
difficult to counter adversarial attacks.
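
Combining undersampling of the majority class with oversampling of the minority class can be sketched as below. Plain random resampling is used here as a stand-in; the exact sampling procedure of Balla et al. is not described in this section, and the `rebalance` function, its `target_per_class` parameter, and the class sizes are hypothetical.

```python
import random


def rebalance(samples, labels, target_per_class, seed=0):
    """Resample every class to exactly target_per_class items.

    Classes larger than the target are randomly undersampled without
    replacement; smaller classes keep all original items and are padded
    with random duplicates (oversampling with replacement).
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) > target_per_class:
            chosen = rng.sample(xs, target_per_class)
        else:
            chosen = xs + rng.choices(xs, k=target_per_class - len(xs))
        out_x.extend(chosen)
        out_y.extend([y] * target_per_class)
    return out_x, out_y


# Hypothetical dataset: 950 normal samples (0) and 50 attack samples (1).
samples = list(range(1000))
labels = [0] * 950 + [1] * 50

bal_x, bal_y = rebalance(samples, labels, target_per_class=200)
print(bal_y.count(0), bal_y.count(1))
```

Because the resampled set depends on the class ratio at sampling time, any shift in that ratio requires resampling and retraining, which is the real-time limitation noted above.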
Wang et al. (2021) proposed an oversampling
method to overcome the challenges posed by the
imbalance between attack and normal data in a
network IDS. Their approach uniformly adjusts the