Anomaly Detection in IoT Networks: A Performance Comparison of Transformer, 1D-CNN, and GrowNet Models on the Bot-IoT Dataset

Aurelia Kusumastuti, Denis Rangelov^{1,a}, Philipp Lämmel^{1,b}, Michell Boerger^{1,c}, Andrei Aleksandrov^{1,d} and Nikolay Tcholtchev^{1,2,e}

Fraunhofer Institute for Open Communication Systems (FOKUS), Berlin, Germany

RheinMain University of Applied Sciences, Wiesbaden, Germany


Keywords:

Transformer, Random Forest, Deep Neural Networks, CNN, Anomaly Detection, Intrusion Detection, IoT.

Abstract:

This paper presents an exploratory analysis of deep learning techniques for intrusion detection in IoT networks.

Speciﬁcally, we investigate three innovative intrusion detection systems based on transformer, 1D-CNN and

GrowNet architectures, comparing their performance against random forest and three-layer perceptron models

as baselines. For each model, we study the multiclass classiﬁcation performance using the publicly available

IoT network trafﬁc dataset Bot-IoT. We use the most important performance indicators, namely, accuracy,

F1-score, and ROC, but also training and inference time to gauge the utility and efﬁcacy of the models. In

contrast to earlier studies where random forests were the dominant method for ML-based intrusion detection,

our ﬁndings indicate that the transformer architecture outperforms all other methods in our approach.

1 INTRODUCTION

Internet of Things (IoT) refers to a group of intercon-

nected devices which exchange and collect data with-

out human intervention (Hounsell et al., 2009), e.g.

over the cloud or a blockchain infrastructure (Kullig

et al., 2020). As more IoT devices connect to the in-

ternet on a daily basis, the potential for the technology

to alter businesses and sectors grows rapidly. Among

the beneﬁts of using so-called smart devices are valu-

able data insights, cost savings due to task and con-

trolling automation, and the ability to connect various

types of devices and data. In IoT’s strengths, however,

also lie its weaknesses. For example, data collected

by smart devices in the health and insurance industry

could be leaked, exposing large volumes of sensitive

medical and ﬁnancial data (Shahid et al., 2022) (Chat-

terjee and Ahmed, 2022). The ability to carry out

tasks remotely means possibly exposing equipment

to bad actors that might try to tamper with IoT de-

vices, be it on a software or a hardware level (Stel-

lios et al., 2018). In addition, the vast number of devices introduces heterogeneous software and hardware stacks, which enlarges the attack surface (Stoyanova et al., 2020). One way to detect possible vulnerabilities or attacks is to use anomaly detection methods for incoming traffic at various levels, from the IoT network to the data center. Anomaly detection is a data analysis process that looks for irregular patterns, unexpected behavior, or deviations from the standard mode of operation of a given system.

https://orcid.org/0000-0002-2006-4218

https://orcid.org/0000-0002-4411-0557

https://orcid.org/0000-0002-5741-9043

https://orcid.org/0000-0002-4717-4206

https://orcid.org/0000-0001-6821-4417

As mentioned previously, IoT devices are widely

used for collecting and transporting sensitive data in

a variety of sectors, including medical, manufactur-

ing, and ﬁnance. Therefore, anomaly detection in IoT

networks is crucial to obtain critical actionable information,

e.g. for fraud detection (Min et al., 2021)

or condition monitoring (Li et al., 2020).

Most IoT anomaly detection approaches require

extensive human involvement and optimizations, de-

spite initiatives for autonomic and automated network

resilience (Chaparadza et al., 2013) (Tcholtchev and

Chaparadza, 2010). Establishing an automated model

in an IoT setting presents various challenges. It is dif-

ﬁcult and not always possible to appropriately char-

acterize and categorize various types of anomalous

data, especially when labeled training data is not ac-

cessible. The concept of normal behavior is continu-

ally changing and evolving in various domains (Ryu

et al., 2021). Furthermore, noise is always present in observed data, and when the signal-to-noise ratio is low, the magnitude of the noise resembles that of actual anomalies. The number of interconnected systems and data types further increases the complexity (Chatterjee and Ahmed, 2022).

Kusumastuti, A., Rangelov, D., Lämmel, P., Boerger, M., Aleksandrov, A., and Tcholtchev, N.

Anomaly Detection in IoT Networks: A Performance Comparison of Transformer, 1D-CNN, and GrowNet Models on the Bot-IoT Dataset.

DOI: 10.5220/0013637600003967

In Proceedings of the 14th International Conference on Data Science, Technology and Applications (DATA 2025), pages 633-643

ISBN: 978-989-758-758-0; ISSN: 2184-285X

To address some of these challenges, researchers

often rely on AI-based mechanisms due to AI’s abil-

ity not just to process and analyze large amounts of

data in real time, but also adapt to new incoming data.

Deep learning (DL) algorithms have evolved into the

most extensively used and viable intrusion detection

technology in networks. Deep learning is in general

widely employed in cybersecurity because it can de-

tect previously unknown patterns in raw data (Khan

et al., 2022). Due to the large body of work in the DL

domain, it would be of interest to compare the differ-

ent approaches implementing the concept of DL using

different architectures.

1.1 Objective and Scope of the Paper

With this study we aim to provide a performance

comparison of different DL architectures for detecting

IoT-trafﬁc anomalies. To this end, we present the re-

sults of experiments utilizing three novel DL architec-

tures based on transformer, 1D-CNN, and GrowNet

models, as well as a simple random forest classiﬁer

and a multilayer perceptron for baseline benchmark-

ing. The experiments are run on Bot-IoT - a pub-

licly available dataset containing malign as well as

benign network trafﬁc observations from IoT-devices.

We then present the empirical results of the individual

performances based on previously deﬁned metrics.

While previous research identiﬁed random forests as

the top-performing method for ML-based intrusion

detection, we will reveal that the transformer architec-

ture exceeds the performance of all other techniques

in our approach.

1.2 Structure of the Paper

Having explained the background and scope of this

paper, in the following sections we will proceed with

reviewing the related work in anomaly detection, the

role of DL algorithms for intrusion detection, and

publicly available datasets for network intrusion de-

tection. Then we describe in brief the different DL

architectures up for comparison, as well as the rele-

vant methodology we use to compare them, including

the metrics used to evaluate the models’ performances

and the dataset on which we run our experiments. Us-

ing the aforementioned metrics we will then discuss

the experiment results and examine their implications.

Finally, we conclude with a summary of the experi-

ments and discuss possible future opportunities based

on our empirical ﬁndings.

2 RELATED WORK &

TECHNOLOGICAL CONTEXT

2.1 Anomaly Detection in IoT Networks

At present, most techniques for identifying anoma-

lies in IoT networks use a considerable amount of

human intervention to set up the systems and ana-

lyze the data produced. As the number of intercon-

nected devices grows, effectively analyzing and inter-

preting data becomes increasingly complex, even for

experts. In the following paragraphs, existing auto-

mated methods that enable experts to focus only on

the most signiﬁcant observed events are brieﬂy de-

scribed. However, a detailed comparison of related

approaches with our presented approach and results

is discussed in Section 4.3.

Statistical and probabilistic methods use previ-

ously recorded data to model expected network be-

havior. New observations are then compared against

the statistical model. If an observation fails to ﬁt the

model, it is classiﬁed as an anomaly (Markou and

Singh, 2003). Examples for probabilistic methods

are Hidden Markov Models (Görnitz et al., 2015) and

Bayesian Networks (Hill et al., 2007).
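The comparison of new observations against a statistical model can be sketched as follows. This is a minimal illustration only: it assumes the model is a single Gaussian fitted to one traffic feature, and the function names, the sample values, and the threshold of three standard deviations are hypothetical choices rather than details of the cited works.

```python
import statistics

def fit_gaussian(history):
    """Estimate mean and standard deviation from previously recorded values."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomaly(value, mean, std, threshold=3.0):
    """Flag a value whose z-score exceeds the (assumed) threshold."""
    return abs(value - mean) / std > threshold

# Hypothetical packets-per-second measurements of normal traffic.
history = [100, 102, 98, 101, 99, 103, 97, 100]
mean, std = fit_gaussian(history)

print(is_anomaly(101, mean, std))  # close to the historical mean
print(is_anomaly(250, mean, std))  # far outside the observed range
```

Hidden Markov Models and Bayesian Networks replace the single Gaussian with a richer probabilistic model, but the decision rule (low likelihood under the model means anomaly) follows the same pattern.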

Using long-term trafﬁc trends, it is also possi-

ble to create regression models that predict expected

network trafﬁc behavior. New observations are then

compared to the previously generated expected be-

havior. If the observed and expected values vary

greatly, the observed incident is marked as anomalous

(Giannoni et al., 2018). The complexity of the pre-

diction model depends on the choice of architecture.

Simple Support Vector Machines (SVMs) (Shahid

et al., 2015) could already yield useful results, but

more complex models using Deep Neural Networks

(DNNs) and Long Short-Term Memory (LSTM) are

also viable, if not more widely used nowadays (Mal-

hotra et al., 2015).
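The regression-based approach can be illustrated with a deliberately simple predictor. The sketch below assumes an ordinary least-squares line as the prediction model and a fixed tolerance for the deviation; both are illustrative stand-ins for the SVM, DNN, or LSTM predictors mentioned above, and all names and values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def flag_anomaly(x, observed, slope, intercept, tolerance):
    """Mark an observation as anomalous if it deviates too far from the forecast."""
    expected = slope * x + intercept
    return abs(observed - expected) > tolerance

# Hypothetical long-term trend: traffic volume per hour.
hours = [0, 1, 2, 3, 4]
traffic = [10.0, 12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(hours, traffic)

print(flag_anomaly(5, 20.5, slope, intercept, tolerance=3.0))  # near the forecast
print(flag_anomaly(5, 45.0, slope, intercept, tolerance=3.0))  # far from the forecast
```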

2.2 Deep Learning Algorithms for

Anomaly Detection

A large number of deep anomaly detection meth-

ods have been introduced, demonstrating signiﬁcantly

better performance than conventional anomaly detection

in addressing network intrusion. Major challenges

in anomaly detection, which deep learning

tackles, include class imbalances (anomalies are by


deﬁnition much rarer than the standard case) and sen-

sitivity to noise (Pang et al., 2020). For example,

Meidan et al. train deep autoencoders to learn nor-

mal behavior in IoT network trafﬁc and try to re-

construct new, unseen trafﬁc (Meidan et al., 2018).

If the autoencoder fails to reconstruct the input data

accurately, then it is a strong indication that the ob-

served behavior is anomalous. Beneﬁts of using au-

toencoders to detect anomalous behavior are heterogeneity

tolerance and the ability to flag previously

unseen behavior.

Another deep learning architecture recently used

for anomaly detection is Generative Adversarial Net-

work (GAN). The model is composed of two neural

networks, a generator and a discriminator, which are

trained in an adversarial manner. Iliyasu and Deng (Iliyasu

and Deng, 2022) use GAN in a semi-supervised man-

ner, in that the discriminator trains on a few mali-

cious examples in addition to the normal ones to learn

adequate representations. The generator then recon-

structs from the latent features in the network trafﬁc

feature space to compute the anomaly score.

Architectures based on Long short-term memory

(LSTM) are also commonly used for anomaly detec-

tion, especially to detect anomalous behaviors over

time. This is because LSTM models use so-called

gates which choose what to keep or discard in the

memory as well as incorporate changes over time.

This gives LSTM the ability to capture long-term de-

pendencies thereby leading to LSTM being used as

a basis algorithm for anomaly detection. For exam-

ple, Imrana et al. extend the LSTM architecture by

building a bidirectional LSTM network (Imrana et al.,

2021). The solution the authors propose is making use

of two LSTM networks. The ﬁrst LSTM trains us-

ing the normal training data and the second uses a re-

versed version of the data, thus providing more time-

related context to the algorithm and solving the van-

ishing gradient problem. When compared with other

machine learning methods like SVM, Multilayer Per-

ceptron, and Recurrent Neural Network, LSTM per-

formed better by 6-20 percent (Imrana et al., 2021).

2.3 Publicly Available IoT Datasets for

Anomaly Detection

Several datasets for anomaly detection in IoT net-

works have been made publicly available for develop-

ing machine learning algorithms to detect and prevent

IoT malware infections.

IoT-23 (Garcia et al., 2020) is among the most widely

used publicly available datasets for network intrusion

detection. It has twenty malware as well as three nor-

mal IoT-trafﬁc captures containing more than 50 mil-

lion records ranging from 2018 to 2019. The malware

traffic is generated by executing the respective mal-

ware on a Raspberry Pi. Among the executed attacks

are Mirai and Okiru botnets as well as DDoS and

C&C. The benign trafﬁc records are recorded from

three different IoT devices.

Another widely used dataset is Bot-IoT (Koroni-

otis et al., 2019), which was collected in a virtual

environment and also contains normal and malware

IoT traffic. The attacks launched for generating traffic

data include data exﬁltration, keylogging, and DDoS.

The entire dataset consists of more than 72 million

records. We will provide a more detailed description

including a feature and class analysis of the Bot-IoT

dataset in Section 3.2.

The N-BaIoT dataset (Meidan et al., 2018) gathered malware traffic data by infecting commercial IoT devices with the Mirai and BASHLITE botnets. The botnets carry out ten different attacks in total, including network flooding, scanning for vulnerable devices, and sending spam data. N-BaIoT also contains normal IoT traffic data. In total, the dataset comprises more than seven million records.

3 METHODOLOGY

This section explains the details of the approach we

use in this study, from the choice of DL algorithms

to the datasets and metrics we chose to evaluate said

algorithms.

3.1 Selection of DL Algorithms

The algorithm selection process employed during our

research efforts was guided by two main criteria: in-

novation and performance. More speciﬁcally, we se-

lected relatively recent and novel deep learning ar-

chitectures, examined their practical applications in

the ﬁeld of intrusion detection and compared them

to more traditional methods. By focusing on these

novel algorithms/architectures, we aim at contribut-

ing to the current body of research and achieving pos-

sible improvements over the existing anomaly detec-

tion methods. Furthermore, performance is a criti-

cal factor in assessing the efﬁciency and reliability of

deep learning algorithms, so the selected algorithms

were evaluated based on their accuracy in network

intrusion detection. In our evaluation, we compare

the novel DL algorithms with baseline algorithms,

namely random forest classiﬁers (RFC) and deep neu-

ral networks (DNN).

RFCs are a type of machine learning algorithm that

predicts the outcome of a certain event by combining


multiple decision trees (Breiman, 2001). A decision

tree in a classiﬁcation context is a graph that decides

which class a sample belongs to. The algorithm starts

from a root node and traverses through the next nodes

conditionally, evaluating a speciﬁc criterion at each

node. It then arrives at a leaf node representing a

class. An RFC is made up of many decision trees,

each trained on a different subset of the data and us-

ing a random selection of features. This makes the

model less prone to overﬁtting than a single decision

tree. To make a prediction, the random forest clas-

siﬁer combines the output of all the decision trees to

come up with a ﬁnal decision.
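The traversal of a decision tree and the combination of several trees by majority vote can be sketched as follows. The trees are hand-built stand-ins for trained ones; the thresholds are hypothetical, the feature names are borrowed from Table 1 purely for illustration, and the training step (bootstrap sampling and random feature selection) is omitted.

```python
from collections import Counter

def predict_tree(node, sample):
    """Walk from the root to a leaf, testing one feature per inner node."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

def predict_forest(trees, sample):
    """Majority vote over the predictions of all trees."""
    votes = Counter(predict_tree(tree, sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Three hand-built trees with hypothetical thresholds.
trees = [
    {"feature": "srate", "threshold": 50.0,
     "left": {"label": "Normal"}, "right": {"label": "DoS"}},
    {"feature": "drate", "threshold": 5.0,
     "left": {"label": "DoS"}, "right": {"label": "Normal"}},
    {"feature": "srate", "threshold": 80.0,
     "left": {"label": "Normal"},
     "right": {"feature": "drate", "threshold": 1.0,
               "left": {"label": "DoS"}, "right": {"label": "Normal"}}},
]

flow = {"srate": 120.0, "drate": 0.5}
print(predict_forest(trees, flow))
```

In the third example of the test data below, the trees disagree and the vote decides, which is the mechanism that makes the ensemble less sensitive to the errors of any single tree.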

In contrast, DNNs are made up of multiple lay-

ers of connected nodes or artiﬁcial neurons (see Fig.

1). Each layer takes input from the previous layer and

performs calculations, which are then passed on to the

next layer. The idea is that each layer extracts more

abstract information from the input, allowing the net-

work to perform complex tasks.

As a baseline model, we opt for a simple multi-

layer perceptron (MLP). MLPs can be seen as a spe-

ciﬁc type of DNN with a simpler architecture; our

MLP has only one hidden layer (see Figure 1a). The

ﬁrst layer of the neural network receives the input

data. Each node in this layer represents a feature or

attribute of the input data. The values of each neuron

are then multiplied by their respective weights and

added up to compute the weighted sum of neurons.

Each connection between nodes in the neural network

has an associated weight that controls the strength of

the connection. The network learns to adjust these

weights during training in order to minimize the error

between the predicted output and the actual output.

The weighted sum of neurons is then forwarded to

the activation function that decides whether the neu-

ron should be activated or not. An activation func-

tion is a non-linear function applied to the output of

each node in a layer. The activation function intro-

duces non-linearities into the model and enables it to

capture complex relationships between the input and

output data. The output of the activation function is then forwarded to the hidden layer. After computing the weighted sum of neurons and passing it through an activation function, the result is then forwarded to the final output layer.

Figure 1: Structure of a neural network. (a) Example of a simple neural network with an input layer, a hidden layer, and an output layer. (b) Inner working of a neuron (Σ: weighted sum of inputs; f: activation function).

Figure 2: Example structure of a CNN. Feature extraction: input image, convolution, pooling, convolution, pooling. Classification: flatten, fully-connected layer, output probability distribution.
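The forward pass described above can be sketched in a few lines. This is an illustration under stated assumptions, not the model used in the experiments: the weights are hypothetical fixed values instead of learned ones, ReLU is assumed as the hidden activation, and the output layer applies softmax, a common choice for multiclass problems.

```python
import math

def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One layer: weighted sum of the inputs plus bias, then the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Turn raw scores into class probabilities that sum to one."""
    shifted = [s - max(scores) for s in scores]  # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, hidden_w, hidden_b, out_w, out_b):
    hidden = dense(x, hidden_w, hidden_b, relu)
    scores = dense(hidden, out_w, out_b, lambda z: z)
    return softmax(scores)

# Hypothetical weights: 2 input features, 2 hidden neurons, 3 classes.
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_b = [0.0, 0.0, 0.0]

probs = forward([2.0, 1.0], hidden_w, hidden_b, out_w, out_b)
print(probs)
```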

Since our problem is that of a multiclass classiﬁca-

tion, we use softmax as the activation function for the

output layer. The softmax function outputs the prob-

ability of the input belonging to a certain class (Bri-

dle, 1989). To assess the model’s accuracy, we com-

pute the loss function, which calculates the difference

between the predicted and the original class or label

of the input data. To decrease prediction error, the

gradient of the cost function with respect to the net-

work weights is computed using the chain rule. This

process is commonly known as backpropagation. We

then use the gradient to update the network’s param-

eters by moving in the direction of steepest descent,

which is why this process is called gradient descent.

With $\theta_i$ denoting the network parameters at iteration $i$, $\eta$ denoting the learning rate, and $\nabla J(\theta_i)$ denoting the gradient of the cost function with respect to $\theta_i$, gradient descent is defined as follows:

$$\theta_{i+1} = \theta_i - \eta \nabla J(\theta_i)$$

Moving towards the direction of steepest descent

minimizes the cost function, and the rate of the move-

ment is denoted by the aforementioned learning rate

η. If the learning rate for a deep neural network is set

too high, the training process of the model might be-

come unstable and fail to converge to an optimal solu-

tion. This might cause strong oscillations or spikes in

the loss function, preventing the network from learn-

ing successfully. The optimization method may over-

shoot the ideal solution, causing the loss function to

increase rather than decrease. This divergence is

closely related to so-called "exploding gradients". Further-

more, the model will have low accuracy with unseen

data, resulting in high generalization error.
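The effect of the learning rate can be reproduced on a toy problem. The sketch below assumes the one-dimensional cost function J(θ) = θ², chosen only because its gradient 2θ and its minimum at θ = 0 are known exactly; the two learning rates are illustrative values.

```python
def gradient_descent(grad, theta, learning_rate, steps):
    """Repeatedly apply theta <- theta - learning_rate * grad(theta)."""
    for _ in range(steps):
        theta = theta - learning_rate * grad(theta)
    return theta

# Toy cost J(theta) = theta^2 with gradient 2 * theta and minimum at 0.
grad = lambda theta: 2.0 * theta

small = gradient_descent(grad, theta=1.0, learning_rate=0.1, steps=50)
large = gradient_descent(grad, theta=1.0, learning_rate=1.1, steps=50)

print(abs(small))  # shrinks towards the minimum
print(abs(large))  # grows with every step: the update overshoots
```

With a learning rate of 0.1 each step multiplies θ by 0.8, so the iterate converges; with 1.1 each step multiplies θ by -1.2, so the iterate oscillates around the minimum with growing amplitude.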

Most of the DL algorithms we are comparing in

this paper are based on Convolutional Neural Net-

works (CNNs). CNNs were introduced to solve im-

age recognition problems and are similar to regu-

lar DNNs in that they are made up of neurons that

optimize themselves through learning (LeCun et al.,

1995). The basic components of a CNN include con-

volution layers, pooling layers, and fully connected

layers. The convolution layer is the core component

of a CNN, where the input image is convolved with


a set of ﬁlters, each of which captures different fea-

tures of the image. The ﬁlters move across the image

and perform dot products at each position, generat-

ing a feature map that represents the presence or ab-

sence of various features in the input image. The ﬁl-

ters are learned during training, meaning the network

will learn to automatically identify features that are

useful for the classiﬁcation task.

Pooling layers are used to reduce the spatial di-

mensions of the feature maps generated by the con-

volution layers. Pooling typically involves summa-

rizing a group of neighboring pixels by taking their

maximum or average value. This helps to reduce the

dimensions of the feature maps while preserving im-

portant information.
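Convolution and pooling can be illustrated in one dimension, the setting relevant for the 1D-CNN considered in this paper. The sketch assumes a fixed, hand-chosen kernel instead of a learned one and, as is usual in deep learning frameworks, computes the sliding dot product without flipping the kernel.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal and take a dot product at each position."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(feature_map, size):
    """Summarize non-overlapping windows by their maximum value."""
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, size)]

signal = [0, 0, 1, 1, 1, 0, 0, 1, 0]
edge_kernel = [-1, 1]  # hypothetical filter that responds to a rising edge

feature_map = conv1d(signal, edge_kernel)
pooled = max_pool1d(feature_map, 2)

print(feature_map)  # [0, 1, 0, 0, -1, 0, 1, -1]
print(pooled)       # [1, 0, 0, 1]
```

The feature map is positive exactly where the signal rises, and pooling halves its length while keeping the information that a rising edge occurred in the first and last window.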

Finally, fully connected layers are used to clas-

sify the features learned by the convolution layers

into speciﬁc classes or categories. These ﬁnal lay-

ers take in the output of the convolution and pooling

layers, and apply traditional deep learning concepts of

weights and biases to classify objects.

Standard DNNs were not effective in recognizing

patterns in images because they did not take into ac-

count the spatial relationships between the pixels in

an image. CNNs solve this problem by using convo-

lutional layers to ﬁlter the image and extract features

at different scales. This allows the network to identify

patterns and features in the image regardless of their

location in the image. Additionally, pooling layers are

used to downsample the image and reduce the dimen-

sionality of the convolved feature map, which helps

to reduce overﬁtting.

With this in mind, the DL architectures examined

throughout our work are deﬁned as follows:

• SoftOrdering1DCNN uses one-dimensional CNNs to classify tabular data. CNNs are a widely used NN architecture for solving computer vision problems, because convolutional kernels exploit both the local connectivity and the spatial locality of the input image. Because tabular features are often not spatially correlated, a tabular dataset cannot be fed directly into a convolutional layer. SoftOrdering1DCNN therefore uses a fully connected layer to expand the size of the (tabular) input, apply non-linear combinations, and reorder the original features so as to create spatial correlation. Following the reshaping, features are extracted in several one-dimensional convolutional layers linked by skip-like connections. After flattening, the extracted features are used to predict targets via a fully connected layer. This algorithm won second place in the Mechanisms of Action (MoA) Prediction Research Code Competition, whose goal was to develop accurate and efficient computational models for predicting the mechanism of action of new drugs (noa, ).

• The FT-Transformer (Feature Tokenizer +

Transformer) algorithm (Gorishniy et al., 2021)

modiﬁes the transformer architecture (Vaswani

et al., 2017) for the use on tabular data. FT-

Transformer ﬁrst tokenizes the input features into

embeddings, which are then processed by the

transformer. In comparison to eight other deep

learning methods for tabular data classiﬁcation

over eleven datasets, FT-Transformer outperforms

the other methods in six cases, including real es-

tate (Pace and Barry, 1997), income prediction

(Kohavi et al., 1996), and simulated physical par-

ticles (Baldi et al., 2014).

• GrowNet is a novel neural network model that

combines the strengths of neural networks and

boosted trees (Badirli et al., 2020). The model

trains using boosted trees that have a shared fea-

ture map. In a boosted tree model, individual

decision trees are trained sequentially on subsets

of the data, with each subsequent tree attempt-

ing to correct the errors made by the previous

trees. Each decision tree is built using a split-

ting criterion that optimizes the reduction in the

error of the current model. The ﬁnal prediction is

made by combining the predictions of all the indi-

vidual trees, weighted by their accuracy (Hastie

et al., 2009). The trees are built by simultane-

ously predicting the gradient of the loss function

and the output at each iteration. The authors of

the GrowNet paper compare it with other boost-

ing methods, and GrowNet achieves better perfor-

mance in regression, classiﬁcation and learning to

rank on multiple datasets.
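The feature tokenization step of the FT-Transformer can be sketched for numerical features. The sketch follows our reading of Gorishniy et al. (2021), where each scalar feature x_j is mapped to an embedding x_j · W_j + b_j with a per-feature weight vector W_j and bias vector b_j. The parameter values below are hypothetical; in the real model they are learned, and categorical features and the subsequent transformer layers are omitted here.

```python
def tokenize_numeric(features, weights, biases):
    """Map each scalar feature x_j to the embedding vector x_j * W_j + b_j."""
    return [[x * w + b for w, b in zip(w_j, b_j)]
            for x, w_j, b_j in zip(features, weights, biases)]

# Hypothetical parameters: 3 numerical features, embedding dimension 2.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
biases = [[0.0, 0.0], [1.0, 1.0], [0.0, -1.0]]

tokens = tokenize_numeric([2.0, 3.0, 4.0], weights, biases)
print(tokens)  # one embedding vector ("token") per input feature
```

The result is a sequence with one token per feature, which is the input format a transformer expects.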

3.2 Datasets

We choose the Bot-IoT dataset (Koroniotis et al.,

2019) as the benchmark dataset to evaluate the per-

formance of the deep learning algorithms for network

intrusion detection. Bot-IoT is a publicly available

dataset emulating the real-world network trafﬁc of

IoT devices. The dataset has been used in several pre-

vious research studies to assess the performance of

various intrusion detection algorithms (Bovenzi et al.,

) (Saba et al., 2022) (Ibitoye et al., 2019), providing a

sound basis for comparison.

3.2.1 Description of the Dataset

Koroniotis et al. (Koroniotis et al., 2019) constructed

the Bot-IoT dataset by emulating a normal and botnet

network trafﬁc scenario. The dataset includes DDoS,


DoS, OS and Service Scan, Keylogging and Data ex-

ﬁltration attacks. They then sampled 5% of the origi-

nal dataset. The sample consists of about three mil-

lion records. Using statistical measures of correla-

tion coefﬁcient (Sedgwick, 2012) and entropy (Lesne,

2014), the authors selected the most important fea-

tures of the data, which are described in Table 1.

The labels in the dataset describe different attack sce-

narios:

• Denial of Service (DoS): attacks aim to ﬂood a

network, server, or website with trafﬁc and re-

quests, resulting in system failure or slowdown.

The attacker’s purpose is to disrupt the target sys-

tem’s regular operation, denying legitimate users

access to its services or information. The dataset

contains DoS as well as Distributed DoS (DDoS),

each on TCP, UDP, and HTTP based networks

(Koroniotis et al., 2019). While DoS attacks are

carried out by a single device, DDoS involves

many devices, usually through networks of con-

trolled computers that have been compromised

with malware, also known as botnets (Kolias

et al., 2017).

• Information Theft: refers to a class of attacks

in which an adversary attempts to steal sensitive

data. Two types of information theft attacks are

included in the dataset, namely data exﬁltration

and keylogging. As the name suggests, data ex-

ﬁltration attacks target a remote machine and at-

tempt to obtain unauthorized access to data (Sabir

et al., 2021). Keylogging, on the other hand, at-

tempts to compromise a remote machine in order

to capture a user’s keystrokes and potentially steal

user credentials (Singh et al., 2021).

• Scanning: or ﬁngerprinting attacks are malicious

operations that search remote machines for user

information based on the speciﬁcations of said

machines. Based on the type of information

scanned, there are two key subcategories of scanning

attacks, namely OS and service scanning (Hoque

et al., 2014). In an OS scanning attack, an adver-

sary acquires information on the remote operat-

ing system by comparing its replies to pre-existing

ones or based on OS differences in TCP/IP stack

implementations. In a service scan attack, an ad-

versary sends request packets to identify the ser-

vices that run behind the system ports (0-65535)

(Hoque et al., 2014).

3.2.2 Data Preprocessing

Before training and evaluating the investigated ML models using the Bot-IoT dataset, we first preprocess the data by imputing missing values, standardizing the numerical features, and one-hot-encoding string-formatted features. We also combined the labels "category" and "subcategory", resulting in ten possible classes into which network flows fall. Table 2 depicts the distribution of labels. As we can see, the dataset exhibits a strong class imbalance, which can cause the ML models to overfit on the more strongly represented classes and perform poorly in recognizing the underrepresented classes. To avoid this, we oversample the data using the SMOTE (Synthetic Minority Over-sampling Technique) algorithm (Chawla et al., 2002). SMOTE operates on the instances of a minority class: the algorithm chooses one of them and determines its k nearest neighbors within the same class. Then, at random, it selects one of these nearest neighbors and constructs a new synthetic instance between the two, based on the difference between the features of the two selected instances. This process is repeated until a specified number of synthetic instances has been produced. This approach reduces the class imbalance by balancing the number of instances in the minority and majority classes.

Table 1: Description of the features in the Bot-IoT dataset.

Feature              Description
proto                Transaction protocol
sport                Source port number
dport                Destination port number
seq                  Argus sequence number
stddev               Standard deviation of aggregated records
N_IN_Conn_P_SrcIP    Num. of inbound conn. per source IP
min                  Minimum duration of aggregated records
state                Feature state
mean                 Average duration of aggregated records
N_IN_Conn_P_DstIP    Num. of inbound conn. per dest IP
drate                Destination-to-source packets per second
srate                Source-to-destination packets per second
category             Traffic category
subcategory          Traffic subcategory
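The interpolation step of SMOTE can be sketched as follows. This is a simplified illustration rather than the implementation used in the experiments: the function name, the value k = 2, the fixed random seed, and the toy data points are all assumptions made for the example.

```python
import random

def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic points on segments between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the chosen instance
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbor)])
    return synthetic

# Hypothetical minority-class samples with two standardized features.
minority = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]]
new_points = smote(minority, n_synthetic=5)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority samples, so the new instances stay inside the region already occupied by that class.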

Table 2: Distribution of labels in the Bot-IoT dataset, sorted from most to fewest records.

Label                Count
DoS (UDP)          1032975
DDoS (TCP)          977380
DDoS (UDP)          948255
DoS (TCP)           615800
Service Scan         73168
OS Scan              17914
DDoS (HTTP)           1485
DoS (HTTP)             989
Normal                 477
Keylogging              73
Data exfiltration        6


3.3 Evaluation Metrics

To measure the performance of the above-mentioned

algorithms, we evaluate not only the correctness of

the result, but also the efficiency. The correctness met-

rics (accuracy, F1-score, and AUC) are based on the

components of the confusion matrix. The confusion

matrix compares the actual and predicted classes and

partitions the data as follows:

• True Positive (TP): is the number of samples cor-

rectly classiﬁed as belonging to a certain class.

• False Positive (FP): is the number of samples in-

correctly classiﬁed as belonging to a certain class.

• False Negative (FN): is the number of samples

incorrectly classiﬁed as not belonging to a certain

class.

• True Negative (TN): is the number of samples

correctly classiﬁed as not belonging to a certain

class.
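For a multiclass problem, these four counts are obtained per class in a one-vs-rest view. The sketch below illustrates this with a handful of hypothetical labels; the function name and the example data are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, target):
    """Count TP, FP, FN, TN for one class in a one-vs-rest view."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == target:
            counts["TP" if actual == target else "FP"] += 1
        else:
            counts["FN" if actual == target else "TN"] += 1
    return counts

# Hypothetical actual and predicted classes of six network flows.
y_true = ["DoS", "DoS", "Normal", "Scan", "DoS", "Scan"]
y_pred = ["DoS", "Scan", "Normal", "Scan", "DoS", "DoS"]

print(confusion_counts(y_true, y_pred, "DoS"))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```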

Following are the metrics we use in evaluating the DL

algorithms.

• Accuracy: is one of the most widely used met-

rics for multi-class classiﬁcation problems. It is

the ratio of correctly classiﬁed samples to the to-

tal amount of samples.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

• To calculate the F1-score, we need to ﬁrst calcu-

late precision and recall, which are deﬁned as fol-

lows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F1-score is consequently given as the har-

monic mean of precision and recall.

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

In our analysis, we employed micro-averaging for

multiclass classiﬁcation to compute precision and

recall, ensuring that all instances across classes

are considered equally, which helps provide a

more comprehensive measure of model perfor-

mance across imbalanced datasets.

• A receiver operating characteristic curve (ROC

curve) is a graph that depicts the performance of

a classification model over all classification thresholds

by plotting the True Positive Rate (TPR) against the

False Positive Rate (FPR). The area under the

ROC curve, mostly just called Area Under Curve

(AUC), indicates how well the model can discrim-

inate between classes. AUC may be interpreted as

the likelihood that the model rates a random posi-

tive case higher than a random negative example.

• Training time measures the time it takes to train a DL model. In its training phase, a DL model learns to recognize relationships and patterns by being exposed to a large amount of data, so that it can accurately classify similar data in the future. The training process can be time-consuming and intensive in terms of computing resources.

• Inference time, on the other hand, measures the time it takes for a previously trained model to classify previously unseen data. Because the model relies on pre-learned weights and does not need to perform as many calculations as during the training phase, inference time is typically much shorter than training time.
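The micro-averaged scores described above can be sketched as follows. The sketch pools TP, FP, and FN over all classes before forming the ratios; the function names and toy labels are hypothetical. In single-label multiclass classification, every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the actual class), so micro-averaged precision, recall, and F1-score all coincide with the fraction of correctly classified samples.

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1: pool counts over all classes."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in labels:
        for actual, predicted in zip(y_true, y_pred):
            if predicted == c and actual == c:
                tp += 1
            elif predicted == c and actual != c:
                fp += 1
            elif actual == c and predicted != c:
                fn += 1
    # Assumes at least one correct prediction, otherwise the ratios are undefined.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def fraction_correct(y_true, y_pred):
    """Share of samples whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (hypothetical, not Bot-IoT records).
y_true = ["normal", "ddos", "ddos", "dos", "dos", "dos"]
y_pred = ["normal", "ddos", "dos", "dos", "dos", "ddos"]
precision, recall, f1 = micro_scores(y_true, y_pred)
correct = fraction_correct(y_true, y_pred)
```

This identity holds regardless of class balance, which is worth keeping in mind when comparing accuracy and micro-averaged F1-score values.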

4 EXPERIMENTS AND RESULTS

4.1 Experimental Setup

We use the complete Bot-IoT dataset (Koroniotis et al., 2019) with over three million traffic records to evaluate the performance of the deep learning algorithms. We repeat the experiments using cross-validation. We perform the experiments on a JupyterHub server running CUDA 11.8 with four 12 GB NVIDIA GPUs.
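A cross-validated measurement of accuracy, training time, and inference time can be sketched as below. This is only an illustration under stated assumptions: the number of folds (`k=5`), the shuffling seed, the placeholder `MajorityClassifier`, and the toy data are all hypothetical and do not reflect the actual configuration or models of our experiments.

```python
import random
import time
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


class MajorityClassifier:
    """Hypothetical placeholder model: predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def timed_cross_validation(X, y, k=5):
    """Record accuracy, training time, and inference time for each fold."""
    results = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_test = [X[i] for i in test_idx]
        y_test = [y[i] for i in test_idx]

        start = time.perf_counter()
        model = MajorityClassifier().fit(X_train, y_train)
        train_s = time.perf_counter() - start

        start = time.perf_counter()
        y_hat = model.predict(X_test)
        infer_s = time.perf_counter() - start

        acc = sum(p == t for p, t in zip(y_hat, y_test)) / len(y_test)
        results.append({"accuracy": acc, "train_s": train_s, "infer_s": infer_s})
    return results


# Toy data (hypothetical): 15 attack records and 5 normal records.
X = [[float(i)] for i in range(20)]
y = ["attack"] * 15 + ["normal"] * 5
results = timed_cross_validation(X, y, k=5)
mean_accuracy = sum(r["accuracy"] for r in results) / len(results)
```

Timing the `fit` and `predict` calls separately keeps the two measurements comparable across models, since data preparation is excluded from both.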

4.2 Results and Discussion

The experiment results are illustrated in Figures 3a, 3b, 3c, and 3d, and the ROC-AUC is plotted in Figure 3e. As seen in Figures 3a and 3b, for the performed multiclass classification task, FT-Transformer achieved the highest accuracy and F1-score of 0.991, followed by RFC with 0.974, SoftOrdering1DCNN, MLP, and lastly GrowNet with 0.904. In Figure 3c, FT-Transformer takes the least time to train at 226 seconds, followed by MLP with 256 seconds, while GrowNet takes the longest at 3088 seconds. As shown in Figure 3d, RFC has the shortest inference time of 2.663 seconds, followed by MLP with 7.64 seconds, and GrowNet has the longest inference time of 15 seconds. In Figure 3e, FT-Transformer has the highest AUC of 0.99, followed by MLP and SoftOrdering1DCNN with 0.95, RFC with 0.90, and GrowNet with 0.88.

Figure 3: Experiment results. (a) Accuracy of the algorithms. (b) F1-score of the algorithms. (c) Training time of the algorithms in seconds. (d) Inference time of the algorithms in seconds. (e) Area Under Curve of the algorithms.

In the results above, particularly in Figures 3a and 3b, we see very similar values of accuracy and F1-score. This is because we oversampled the dataset so that it is balanced across all classes. Had we not done so, the models would over-predict the majority class, resulting in an F1-score significantly lower than the accuracy.
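To make the balancing step concrete, the sketch below duplicates randomly chosen minority-class samples until every class is as large as the largest one. This plain random oversampling is used here only as a simple stand-in and is not necessarily the technique applied in our pipeline; the function name, seed, and toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes have equal size."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = max(len(rows) for rows in by_class.values())

    X_bal, y_bal = [], []
    for label, rows in by_class.items():
        # Draw extra copies with replacement from the existing rows of this class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_bal.append(features)
            y_bal.append(label)
    return X_bal, y_bal


# Toy data (hypothetical): a heavily imbalanced three-class problem.
X = [[float(i)] for i in range(10)]
y = ["ddos"] * 6 + ["dos"] * 3 + ["theft"] * 1
X_bal, y_bal = random_oversample(X, y)
class_sizes = Counter(y_bal)
```

Oversampling is typically applied to the training split only, so that duplicated samples do not leak into the data used for evaluation.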

In our experiments, FT-Transformer shows the shortest training time. We attribute this to the self-attention mechanism and parallelization in FT-Transformer. The self-attention mechanism allows the model to learn dependencies between distinct segments of the input sequence. During training, it allows the model to focus on key sections of the input, which can lead to faster convergence and improved performance (Vaswani et al., 2017). Moreover, transformers process input sequences in parallel: each token in the input sequence attends to all other tokens concurrently. By removing the sequential computation required in classic recurrent neural networks, this parallel processing capability speeds up training (Schlag et al., 2021). However, RFC shows the fastest inference time by far, followed by MLP. This is because RFC and MLP have the simplest model structures among the algorithms we use.
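The self-attention computation referred to above can be sketched as scaled dot-product attention, softmax(QKᵀ/√d_k)V, following Vaswani et al. (2017). The sketch below is a minimal pure-Python version; using the raw token vectors directly as queries, keys, and values (that is, identity projections) and the toy tokens themselves are simplifying assumptions, not the FT-Transformer implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:  # each query row is independent of the others
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Toy tokens (hypothetical); queries, keys, and values are the tokens themselves.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Because the loop body for one query does not depend on the result for any other query, all rows can be computed at once as matrix products, which is the source of the parallelism discussed above.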

Additionally, the RFC is a collection of decision trees that can be evaluated in parallel given a suitable software and hardware setup, which also speeds up inference. Overall, the RFC achieves very good results, even outperforming MLP, SoftOrdering1DCNN, and GrowNet in terms of accuracy and F1-score, although all three are more powerful neural-network-based models. One possible explanation is its robustness against noise in the training data (Ishii and Ljunggren, 2021; Xu et al., 2023).

The MLP shows a high AUC with somewhat lower accuracy. The confusion matrix offers an explanation: one entire class, namely data exfiltration, is misclassified. The same holds for the SoftOrdering1DCNN. Most likely, the data exfiltration patterns extracted during training are not sufficient to distinguish this class. The MLP is a simple three-layer perceptron with one hidden layer, while SoftOrdering1DCNN has multiple fully connected and convolutional layers, which possibly explains SoftOrdering1DCNN's higher accuracy. This would also explain the shorter training and inference times of the MLP.
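That a high AUC can coexist with lower accuracy follows from AUC being a ranking measure, while accuracy depends on the final class decision. The sketch below illustrates this for a simplified binary case with a fixed decision threshold of 0.5; the scores, labels, threshold, and function names are hypothetical, and the multiclass setting of our experiments is more involved.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(scores, labels, threshold=0.5):
    """Accuracy after turning scores into hard decisions at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


# Toy scores (hypothetical): positives are ranked above negatives,
# but every score stays below the decision threshold.
scores = [0.10, 0.20, 0.30, 0.40]
labels = [0, 0, 1, 1]
auc = pairwise_auc(scores, labels)
acc = accuracy_at(scores, labels)
```

Here the ranking is perfect, yet no sample is ever assigned to the positive class, which mirrors a model that separates a class reasonably well in score space but never predicts it.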

GrowNet shows the poorest performance in all aspects of our experiments. The slow training, for one, can be explained by the architecture. The model is first initialized with a single neural network and then iteratively adds new neural networks to the ensemble. Each iteration goes through forward propagation, loss calculation, backward propagation, and the addition of the new neural network, with virtually no parallelization (Badirli et al., 2020). Moreover, with each iteration there are more neural networks already in the ensemble, meaning each iteration takes longer than the last. This training strategy might also explain the suboptimal accuracy, F1-score, and AUC: GrowNet relies on its weak neural networks, and its sequential

DATA 2025 - 14th International Conference on Data Science, Technology and Applications


training technique may be less successful in captur-

ing complex dependencies in the data than other deep

learning algorithms’ end-to-end training approach.
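The sequential nature of this kind of training can be illustrated with a generic residual-boosting loop. The sketch below is not GrowNet itself: a one-dimensional least-squares line stands in for the shallow neural network used as weak learner, and the function names, shrinkage value, and toy data are hypothetical. It shows that each new learner is fitted to the residuals of all learners added before it, so the rounds cannot run in parallel, and that the prediction pass grows with the size of the ensemble.

```python
def fit_linear(xs, targets):
    """Least-squares fit of targets ~ a * x + b; stands in for a weak learner."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(targets) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    a = (
        sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, targets)) / var_x
        if var_x
        else 0.0
    )
    b = mean_t - a * mean_x
    return lambda x: a * x + b


def boost(xs, ys, n_learners=5, shrinkage=0.5):
    """Add weak learners one at a time, each fitted to the current residuals."""
    ensemble, history = [], []
    for _ in range(n_learners):
        # The prediction pass evaluates every learner added so far,
        # so its cost grows from one round to the next.
        preds = [sum(shrinkage * h(x) for h in ensemble) for x in xs]
        residuals = [y - p for y, p in zip(ys, preds)]
        history.append(sum(r * r for r in residuals) / len(xs))
        ensemble.append(fit_linear(xs, residuals))
    return ensemble, history


# Toy regression data (hypothetical): y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
ensemble, history = boost(xs, ys)
```

The list `history` holds the mean squared residual before each round and shrinks as learners are added, but only one learner can be trained at a time.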

In contrast, FT-Transformer shows the best classification performance (accuracy, F1-score, and AUC) together with the shortest training time. This comes down largely to its attention mechanism, as explained previously. The self-attention mechanism enables the model to represent complex relationships between features in the dataset. By attending to key features and learning how they interact, the Transformer can capture the connections between distinct features, resulting in higher classification accuracy (Vaswani et al., 2017).

4.3 Comparison with State of the Art

We compare our experiment results with other works that also use Bot-IoT as their baseline dataset. Ferrag et al. (Ferrag et al., 2020) compared, among others, the performance of a CNN, a Recurrent Neural Network (RNN), and a DNN. The authors showed that the CNN achieves the highest accuracy of 98.371%, followed by the RNN (98.331%) and the DNN (98.221%), though the DNN has the fastest training time of 991.6 seconds. This is consistent with our results (Figure 3a), which show that a CNN (SoftOrdering1DCNN) achieves higher accuracy (93%) than a DNN/MLP (91%). We also show that the MLP trains faster than the SoftOrdering1DCNN (256 and 376 seconds, respectively). The differences in accuracy and training time between our results and those of Ferrag et al. can be explained by the difference in model complexity: our MLP has only one hidden layer while Ferrag et al.'s DNN has at least two, and our SoftOrdering1DCNN has one-dimensional convolutional layers, while Ferrag et al. use two-dimensional convolutional layers.

A comparison between a simple neural network and a random forest classifier was done by Churcher et al. (Churcher et al., 2021), who report higher accuracy and F1-score for the neural network (97%) than for the random forest classifier (95%). This differs from our results, which show higher accuracy and F1-score for RFC than for MLP (Figures 3a and 3b). This might again be explained by the low complexity of our MLP, which has only one hidden layer. Our RFC scores higher than Churcher et al.'s because we use a balanced version of RFC, which, in addition to oversampling the dataset, adds more stability to the model.

In another work by Özer et al. (Özer et al., 2021), RFC has slightly higher accuracy and F1-score (99.9%) than a neural network (99.7%), which in general matches the results of our experiments. A similar result is achieved by Alkadi et al. (Alkadi et al., 2023), who use the ANOVA F-score for feature selection, with RFC scoring 99.9% and MLP 98% accuracy. The ANOVA F-score relates the variation between sample means to the variation within the samples, which might explain the performance improvement in comparison to our results. Usoh et al. (Usoh et al., 2023) also show RFC and a neural network achieving a very similar accuracy of 99.8%, but with an F1-score of 99.9% for RFC and 96.33% for the neural network, possibly because of remaining class imbalance due to the lack of resampling of the dataset.
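For reference, the one-way ANOVA F-statistic used in such feature selection can be sketched as the ratio of the between-class mean square to the within-class mean square of a single feature. The sketch below is a generic textbook computation, not the pipeline of Alkadi et al.; the function name and toy feature values are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic for one feature, with values grouped by class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own class mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Toy feature values per class (hypothetical).
informative = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
uninformative = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
f_informative = anova_f(informative)
f_uninformative = anova_f(uninformative)
```

A feature whose class means differ strongly relative to the spread inside each class receives a high score, whereas a feature with identical class means scores zero and would be discarded first.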

5 CONCLUSION AND FUTURE WORK

In this study, we evaluated various deep learning algorithms for intrusion detection using the Bot-IoT dataset, employing a comprehensive experimental setup to assess their multiclass classification performance. Specifically, we analyzed three intrusion detection systems based on deep learning methods, namely SoftOrdering1DCNN, FT-Transformer, and GrowNet, and benchmarked their results against random forest and MLP baselines.

Differing from recent studies, our findings indicate that the FT-Transformer outperforms all other methods, achieving an accuracy and F1-score of 0.991, while also demonstrating the fastest training time. This superior performance is likely due to the transformer's self-attention mechanism and parallelization. In contrast, GrowNet shows the weakest performance across all metrics, with the lowest accuracy, F1-score, and AUC, as well as the longest training and inference times, likely due to its reliance on weak neural networks and an iterative training process. The RFC performs better than the MLP and SoftOrdering1DCNN in accuracy and F1-score, although it scores lower in AUC. RFC and MLP benefit from simpler architectures, resulting in short inference times, with RFC achieving the lowest inference time overall. In summary, our research positions the FT-Transformer as a promising alternative for ML-based intrusion detection, highlighting a paradigm shift in the effectiveness of deep learning approaches.

However, our study presents opportunities for further exploration, particularly regarding the number of models and datasets utilized. While our findings offer valuable insights from evaluating deep learning algorithms on the Bot-IoT dataset, incorporating a broader range of datasets and models in future studies could enhance the generalizability and depth of our findings. Hence, our future research will address this by incorporating additional IoT anomaly detection datasets, such as IoT-23 and N-BaIoT. By validating our models across multiple datasets, we hope to enhance the robustness and applicability of our conclusions, thereby providing a more comprehensive understanding of the performance of deep learning techniques in diverse IoT environments.

REFERENCES

Mechanisms of Action (MoA) Prediction.

Alkadi, S., Al-Ahmadi, S., and Ben Ismail, M. M. (2023).

Toward improved machine learning-based intrusion

detection for internet of things trafﬁc. Computers,

12(8).

Badirli, S., Liu, X., Xing, Z., Bhowmik, A., Doan, K., and

Keerthi, S. S. (2020). Gradient boosting neural net-

works: Grownet. arXiv preprint arXiv:2002.07971.

Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching

for exotic particles in high-energy physics with deep

learning. Nature communications, 5(1):4308.

Bovenzi, G., Aceto, G., Ciuonzo, D., Persico, V., and Pescapè, A. (2020). A Hierarchical Hybrid Intrusion Detection Approach in IoT Scenarios. In GLOBECOM 2020 - 2020 IEEE Global Communications Conference, pages 1–7.

Breiman, L. (2001). Random forests. Machine learning,

45:5–32.

Bridle, J. (1989). Training stochastic model recognition al-

gorithms as networks can lead to maximum mutual in-

formation estimation of parameters. Advances in neu-

ral information processing systems, 2.

Chaparadza, R., Wodczak, M., Ben Meriem, T., De Lutiis,

P., Tcholtchev, N., and Ciavaglia, L. (2013). Stan-

dardization of resilience & survivability, and auto-

nomic fault-management, in evolving and future net-

works: An ongoing initiative recently launched in etsi.

In 2013 9th International Conference on the Design

of Reliable Communication Networks (DRCN), pages

331–341.

Chatterjee, A. and Ahmed, B. S. (2022). IoT anomaly de-

tection methods and applications: A survey. Internet

of Things, 19.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,

W. P. (2002). Smote: synthetic minority over-

sampling technique. Journal of artiﬁcial intelligence

research, 16:321–357.

Churcher, A., Ullah, R., Ahmad, J., ur Rehman, S., Masood,

F., Gogate, M., Alqahtani, F., Nour, B., and Buchanan,

W. J. (2021). An experimental analysis of attack clas-

siﬁcation using machine learning in iot networks. Sen-

sors, 21(2).

Ferrag, M. A., Maglaras, L., Moschoyiannis, S., and Jan-

icke, H. (2020). Deep learning for cyber security in-

trusion detection: Approaches, datasets, and compar-

ative study. Journal of Information Security and Ap-

plications, 50:102419.

Garcia, S., Parmisano, A., and Erquiaga, M. J. (2020). IoT-

23: A labeled dataset with malicious and benign IoT

network trafﬁc. Type: dataset.

Giannoni, F., Mancini, M., and Marinelli, F. (2018).

Anomaly detection models for iot time series data.

arXiv preprint arXiv:1812.00890.

Gorishniy, Y., Rubachev, I., Khrulkov, V., and Babenko, A.

(2021). Revisiting deep learning models for tabular

data. Advances in Neural Information Processing Sys-

tems, 34:18932–18943.

Görnitz, N., Braun, M., and Kloft, M. (2015). Hidden

markov anomaly detection. In International confer-

ence on machine learning, pages 1833–1842. PMLR.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). Boosting and additive trees. The elements of statistical learning: data mining, inference, and prediction, pages 337–387.

Hill, D. J., Minsker, B. S., and Amir, E. (2007). Real-time

bayesian anomaly detection for environmental sensor

data. In Proceedings of the Congress-International

Association for Hydraulic Research, volume 32, page

503.

Hoque, N., Bhuyan, M. H., Baishya, R. C., Bhattacharyya,

D. K., and Kalita, J. K. (2014). Network attacks: Tax-

onomy, tools and systems. Journal of Network and

Computer Applications, 40:307–324.

Hounsell, N., Shrestha, B., Piao, J., and McDonald, M.

(2009). Review of urban trafﬁc management and the

impacts of new vehicle technologies. IET intelligent

transport systems, 3(4):419–428.

Ibitoye, O., Shaﬁq, O., and Matrawy, A. (2019). Analyz-

ing adversarial attacks against deep learning for intru-

sion detection in iot networks. In 2019 IEEE global

communications conference (GLOBECOM), pages 1–

6. IEEE.

Iliyasu, A. S. and Deng, H. (2022). N-gan: a novel

anomaly-based network intrusion detection with gen-

erative adversarial networks. International Journal of

Information Technology, 14(7):3365–3375.

Imrana, Y., Xiang, Y., Ali, L., and Abdul-Rauf, Z. (2021).

A bidirectional lstm deep learning approach for in-

trusion detection. Expert Systems with Applications,

185:115524.

Ishii, S. and Ljunggren, D. (2021). A comparative analysis

of robustness to noise in machine learning classiﬁers.

Khan, A. R., Kashif, M., Jhaveri, R. H., Raut, R., Saba,

T., and Bahaj, S. A. (2022). Deep learning for intru-

sion detection and security of internet of things (iot):

Current analysis, challenges, and possible solutions.

Security and Communication Networks, 2022.

Kohavi, R. et al. (1996). Scaling up the accuracy of naive-

bayes classiﬁers: A decision-tree hybrid. In Kdd, vol-

ume 96, pages 202–207.

Kolias, C., Kambourakis, G., Stavrou, A., and Voas, J.

(2017). Ddos in the iot: Mirai and other botnets. Com-

puter, 50(7):80–84.

Koroniotis, N., Moustafa, N., Sitnikova, E., and Turnbull,

B. (2019). Towards the development of realistic botnet

dataset in the Internet of Things for network forensic


analytics: Bot-IoT dataset. Future Generation Com-

puter Systems, 100:779–796.

Kullig, N., Lämmel, P., and Tcholtchev, N. (2020). Prototype implementation and evaluation of a blockchain

component on iot devices. Procedia Computer Sci-

ence, 175:379–386. The 17th International Confer-

ence on Mobile Systems and Pervasive Computing

(MobiSPC),The 15th International Conference on Fu-

ture Networks and Communications (FNC),The 10th

International Conference on Sustainable Energy Infor-

mation Technology.

LeCun, Y., Bengio, Y., et al. (1995). Convolutional net-

works for images, speech, and time series. The

handbook of brain theory and neural networks,

3361(10):1995.

Lesne, A. (2014). Shannon entropy: a rigorous notion at the

crossroads between probability, information theory,

dynamical systems and statistical physics. Mathemat-

ical Structures in Computer Science, 24(3):e240311.

Li, C., Mo, L., Tang, H., and Yan, R. (2020). Lifelong con-

dition monitoring based on nb-iot for anomaly detec-

tion of machinery equipment. Procedia Manufactur-

ing, 49:144–149. Proceedings of the 8th International

Conference on Through-Life Engineering Services –

TESConf 2019.

Malhotra, P., Vig, L., Shroff, G., Agarwal, P., et al. (2015).

Long short term memory networks for anomaly detec-

tion in time series. In ESANN, volume 2015, page 89.

Markou, M. and Singh, S. (2003). Novelty detection: a re-

view—part 1: statistical approaches. Signal process-

ing, 83(12):2481–2497.

Meidan, Y., Bohadana, M., Mathov, Y., Mirsky, Y., Breit-

enbacher, D., Shabtai, A., and Elovici, Y. (2018). N-

BaIoT: Network-based Detection of IoT Botnet At-

tacks Using Deep Autoencoders. IEEE Pervasive

Computing, 17(3):12–22. arXiv:1805.03409 [cs].

Min, M., Lee, J. J., Park, H., and Lee, K. (2021). Detecting

anomalous transactions via an iot based application:

A machine learning approach for horse racing betting.

Sensors, 21(6).

Pace, R. K. and Barry, R. (1997). Sparse spatial autoregres-

sions. Statistics & Probability Letters, 33(3):291–297.

Pang, G., Shen, C., Cao, L., and van den Hengel, A.

(2020). Deep learning for anomaly detection: A re-

view. CoRR, abs/2007.02500.

Ryu, J.-Y., Kim, D.-W., and Kim, M.-K. (2021). House-

hold differentiation and residential electricity demand

in korea. Energy Economics, 95.

Saba, T., Rehman, A., Sadad, T., Kolivand, H., and Ba-

haj, S. A. (2022). Anomaly-based intrusion detection

system for iot networks through deep learning model.

Computers and Electrical Engineering, 99:107810.

Sabir, B., Ullah, F., Babar, M. A., and Gaire, R. (2021). Ma-

chine learning for detecting data exﬁltration: a review.

ACM Computing Surveys (CSUR), 54(3):1–47.

Schlag, I., Irie, K., and Schmidhuber, J. (2021). Linear

transformers are secretly fast weight programmers.

In International Conference on Machine Learning,

pages 9355–9366. PMLR.

Sedgwick, P. (2012). Pearson’s correlation coefﬁcient. Bmj,

345.

Shahid, J., Ahmad, R., Kiani, A. K., Ahmad, T., Saeed,

S., and Almuhaideb, A. M. (2022). Data protection

and privacy of the internet of healthcare things (iohts).

Applied Sciences, 12(4):1927.

Shahid, N., Naqvi, I. H., and Qaisar, S. B. (2015). One-class

support vector machines: analysis of outlier detection

for wireless sensor networks in harsh environments.

Artiﬁcial Intelligence Review, 43:515–563.

Singh, A., Choudhary, P., et al. (2021). Keylogger detection

and prevention. In Journal of Physics: Conference

Series, volume 2007, page 012005. IOP Publishing.

Stellios, I., Kotzanikolaou, P., Psarakis, M., Alcaraz, C.,

and Lopez, J. (2018). A survey of iot-enabled cyberat-

tacks: Assessing attack paths to critical infrastructures

and services. IEEE Communications Surveys & Tuto-

rials, 20(4):3453–3495.

Stoyanova, M., Nikoloudakis, Y., Panagiotakis, S., Pallis,

E., and Markakis, E. K. (2020). A survey on the inter-

net of things (iot) forensics: challenges, approaches,

and open issues. IEEE Communications Surveys &

Tutorials, 22(2):1191–1221.

Tcholtchev, N. and Chaparadza, R. (2010). Autonomic

fault-management and resilience from the perspective

of the network operation personnel. In 2010 IEEE

Globecom Workshops, pages 469–474.

Usoh, M., Asuquo, P., Ozuomba, S., Stephen, B., and In-

yang, U. (2023). A hybrid machine learning model for

detecting cybersecurity threats in iot applications. In-

ternational Journal of Information Technology, pages

1–12.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,

L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.

(2017). Attention is all you need. Advances in neural

information processing systems, 30.

Xu, J., Xu, J., Tong, Z., Yu, S., Liu, B., Mu, X., Du,

B., Gao, C., Wang, J., Liu, Z., et al. (2023). Im-

pact of different classiﬁcation schemes on discrimina-

tion of proteins with noise-contaminated spectra using

laboratory-measured ﬂuorescence data. Spectrochim-

ica Acta Part A: Molecular and Biomolecular Spec-

troscopy, 296:122646.

Özer, E., İskefiyeli, M., and Azimjonov, J. (2021). Toward

lightweight intrusion detection systems using the op-

timal and efﬁcient feature pairs of the bot-iot 2018

dataset. International Journal of Distributed Sensor

Networks, 17(10):15501477211052202.
