Decomposing Training Data to Improve Network Intrusion Detection Performance

Roberto Saia, Alessandro Sebastian Podda, Gianni Fenu and Riccardo Balia

Department of Mathematics and Computer Science,

University of Cagliari, Via Ospedale 72, 09124 Cagliari, Italy

Keywords:

Intrusion Detection, Security, Networking, Data Decomposition, Algorithms.

Abstract:

Anyone working in the field of network intrusion detection has observed that it involves an ever-increasing
number of techniques and strategies aimed at overcoming the issues that affect state-of-the-art solutions.
Data imbalance and heterogeneity are only two representative examples, and each misclassification made in
this context can have enormous repercussions in crucial areas such as finance, privacy, and public
reputation, because the current scenario is characterized by a huge number of public and private
network-based services. The idea behind the proposed work is to decompose the canonical classification
process into several sub-processes, where the final classification depends on the results of all the
sub-processes, plus the canonical one. The proposed Training Data Decomposition (TDD) strategy is applied
to the training dataset, which it decomposes into regions according to a defined number of events and
features. This process is motivated by the observation that the same network event can be evaluated
differently when it is observed in different time periods and/or through different features. Accordingly,
the proposed approach adopts several classification models, each trained on a different data region
characterized by different time periods and features, and classifies each event on the basis of both the
results of all these models and the canonical strategy that involves all the data.

1 INTRODUCTION

The exponential growth of network-based services,

now even more impressive due to the COVID-

19 (Chang and Meyerhoefer, 2020; Dhawan, 2020;

Rapanta et al., 2020) emergency, has made the prob-

lem of their security a central aspect of the every-

day life, public and private. The high degree of

heterogeneity (Zuech et al., 2015) that characterizes

network events, in terms both of type and behavior,

makes the detection of attacks a very difﬁcult task.

This troubling scenario is made even more difﬁcult by

considering that many attacks are actually conducted

by operating in apparently legitimate ways (i.e., they

are characterized by the same patterns of normal net-

work activities (Carta et al., 2020b)), whereas other

attacks are detected for the first time, so that we have no
prior knowledge of them (zero-day attacks
(Radhakrishnan et al., 2019)); everything is further
worsened by the high dynamism of network scenarios.

For this reason, tools widely used in the past with

good results, such as the Intrusion Detection Systems

(IDSs) (Tidjon et al., 2019), nowadays appear unable

to face the new threats with the same effectiveness.

This is a big problem, since there is the need to ensure

a high level of security to many crucial services such

as, for example, those related to the ﬁnance (Wazid

et al., 2019), health (Yüksel et al., 2017), and
education (Luker and Petersen, 2003).

To cope with the new threats, approaches dif-

ferent from the canonical ones have therefore been

sought, integrating increasingly sophisticated tech-

nologies and strategies that involve areas such as

machine learning (Gao et al., 2019), artiﬁcial intel-

ligence (Kanimozhi and Jacob, 2019), neural net-

works (Le et al., 2019), and so on, also by combin-

ing them to improve the intrusion detection perfor-

mance (Li and Lu, 2019; Meyer and Labit, 2020).

The idea behind the proposed strategy stems from
several works in this field (Carta et al., 2020b; Saia
et al., 2018b; Saia et al., 2019; Saia et al., 2020),
from which we observed that the data used to train an
evaluation model refer to different periods of time, and
that the features characterizing each event refer to
different aspects of it.

Saia, R., Podda, A., Fenu, G. and Balia, R.

Decomposing Training Data to Improve Network Intrusion Detection Performance.

DOI: 10.5220/0010661400003064

In Proceedings of the 13th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2021) - Volume 1: KDIR, pages 241-248

ISBN: 978-989-758-533-3; ISSN: 2184-3228


Given that the data to which we refer are those
collected by network security devices (e.g., an IDS),
on the basis of the previous observation we propose
to divide the classification process into several
sub-processes, and therefore into several evaluation
models, each of them trained on a different part of the
dataset, in terms of events and features. We believe
that such a strategy can mitigate the event
heterogeneity problem, since events are classified on
the basis of considerations made on different parts of
the training data, and therefore on different time
periods and event characteristics, rather than by a
single model trained on the entire dataset.

In more detail, in this work we propose a Training

Data Decomposition (TDD) strategy to intrusion de-

tection, which is based on the subdivision of the train-

ing dataset in several regions, according to a number

of rows and columns that bound it in a certain number

of events (rows) and features (columns). Each region

is used to train a different evaluation model, and the

ﬁnal classiﬁcation of a new event is given by ensem-

bling all the models through a majority voting crite-

rion.

The main scientific contributions of the proposed
work can be summarized as follows:

- formalization of a Training Data Decomposition

(TDD) strategy aimed to bound the training set on

the basis of a certain number of events (rows) and

features (columns);

- formalization of the comparison process between

an unevaluated network event and a series of eval-

uation models deﬁned on the basis of the proposed

TDD strategy, along with the formalization of the

classiﬁcation criterion;

- formalization of a classiﬁcation algorithm able to

exploit the TDD strategy in order to classify each

new network event as normal or intrusion.

The remainder of this paper is structured as
follows: Section 2 discusses the background and
related work of the intrusion detection domain,
including the most suitable evaluation metrics;
Section 3 provides details about the idea behind the
proposed strategy, besides its practical
implementation; Section 4 ends this work with some
remarks on the proposed strategy and mentions
future research directions.

2 BACKGROUND AND RELATED WORK

The term intrusion detection was introduced in the
eighties by Anderson (1980), whose technical report
tried to define a kind of guideline to improve the
computer security auditing and surveillance capability
of computer systems. In the following years the
literature proposed numerous works focused on the
intrusion detection topic, both of a theoretical nature,
where aspects such as taxonomy have been discussed
(Axelsson, 2000), and of a practical nature (Khraisat
et al., 2019), up to the present day, where the
discussion has extended to recent areas, such as the
Internet of Things (IoT) (Zarpelão et al., 2017), cloud
computing (Modi et al., 2013), and smart cities
(Aloqaily et al., 2019).

The concept of intrusion leads back to a series

of attacks based on techniques and/or strategies that

evolve over time, an activity whose effectiveness

can depend on the skill of a human operator (Latha

and Prakash, 2017) or on a speciﬁc software/mal-

ware (Rehman et al., 2011), or both of them. To this

must also be added strategies not directly related to

a direct attack, such as, for instance, the social engi-

neering (Salahdine and Kaabouch, 2019) one.

An IDS is aimed to detect and identify unau-

thorized network activities in a network, and it

can perform this activity by following different ap-

proaches. On the basis of the literature, the most com-

mon of them are: anomaly-based, signature-based,

speciﬁcation-based, and hybrid-based.

In more detail: the anomaly-based analyzes and

classiﬁes the network events without comparing them

to those in a dataset of known event patterns, since it

adopts a heuristic/rules-based strategy, detecting the

intrusions on the basis of their atypical network ac-

tivity (Samrin and Vasumathi, 2017); the signature-

based works by comparing each detected network

event to those in a dataset of known event pat-

terns (Bronte et al., 2016); the speciﬁcation-based

operates by inspecting the network protocols, with

the aim to identify non canonical sequences of com-

mands that can be part of an attack (Liao et al., 2013);

the hybrid-based approach can be considered a com-

bination of one or more of the aforementioned ap-

proaches (Li et al., 2005).

On the basis of the current literature, we can also
summarize the open problems that affect the different
intrusion detection approaches: the anomaly-based
approach is able to identify novel forms of network
attacks (zero-days), as well as the anomalous
exploitation of privileges, but its effectiveness is
limited by the high dynamism of the network scenario
and by its high response time; the signature-based
approach works well in the context of known attacks or
their variations, but it is not able to inspect the
involved protocols, and it also generates a high
computational load; the specification-based approach
is able to inspect the involved protocols, detecting
their anomalous exploitation, but it is not able to
distinguish those attacks characterized by the same
behavior as a legitimate network activity and, in
addition, it presents a high computational cost related
to protocol inspection and tracing; the hybrid-based
approach presents the same pros and cons of the
combined approaches.

KDIR 2021 - 13th International Conference on Knowledge Discovery and Information Retrieval

According to the approach and placement in the

network area, an IDS can be also classiﬁed in four

(most common) categories: Host-based, Network-

based, Network-node-based, and Distributed-based.

In more detail: Host-based Intrusion Detection
Systems (HIDSs) (Jose et al., 2018) adopt several
hosts to monitor the network activity, detecting the
attacks by comparing the events to a series of known
patterns (signature-based approach); Network-based
Intrusion Detection Systems (NIDSs) (Mazini et al.,
2019) adopt a single host to monitor the network
activity, exploiting a database of known patterns and
analyzing only the events characterized by unknown
patterns (hybrid signature-based and analysis-based
approach); Network-Node-based Intrusion Detection
Systems (NNIDSs) (Potluri and Diedrich, 2016) adopt
a single host strategically placed in the network,
operating by combining the HIDS and NIDS strategies;
Distributed-based Intrusion Detection Systems
(DIDSs) (Amrita, 2018) adopt a hybrid strategy based
on the combination of all the aforementioned ones.

Moreover, some security features offered by
techniques typically exploited in other areas have
recently started to be considered for network security
tasks, such as, for instance, those leveraging
blockchain primitives (Vieira et al., 2020; Longo
et al., 2020).

2.1 Performance Evaluation

A preliminary consideration regarding the evaluation

metrics used in this domain is related to the fact that,

similar to some other domains (Saia et al., 2017; Carta

et al., 2020a; Saia and Carta, 2017a; Saia, 2017; Saia

and Carta, 2016; Saia et al., 2018a; Saia and Carta,

2017b), the involved data are usually characterized by

a high degree of imbalance (in the intrusion detection

context the minority class is the intrusion one), re-

quiring assessment metrics that are not biased by this

characteristic.

Since, in the literature, an intrusion detection task

is commonly expressed in terms of a binary problem,

then addressed as a classiﬁcation task (Chuang et al.,

2019), the best approach requires to adopt multiple

evaluation metrics, in order to get a reliable evalua-

tion of the effectiveness of a classiﬁcation model. For

this reason, simple metrics based on the confusion-

matrix, e.g., accuracy, sensitivity, and speciﬁcity, and

more sophisticated ones, such as those based on the

Receiver Operating Characteristic (ROC) curve, e.g.,

the Area Under the Receiver Operating Character-

istic curve (AUC), are often combined in the litera-

ture (Munaiah et al., 2016).

Moreover, considering that the main objective of

an intrusion detection system is the correct identiﬁ-

cation and classiﬁcation of the negative cases (intru-

sions), as their misclassiﬁcation would have a higher

cost than that of the positive ones (normals), many of

the works in the literature use the speciﬁcity and the

AUC metrics.
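The effect of class imbalance on the confusion-matrix metrics mentioned above can be illustrated with a minimal sketch. It follows the convention of this section (normal events as the positive class, intrusions as the negative one); the function names and the toy labels are hypothetical and are not part of the proposed strategy.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[str], y_pred: Sequence[str],
                     positive: str = "normal") -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(y_true: Sequence[str], y_pred: Sequence[str],
            positive: str = "normal") -> Dict[str, float]:
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Toy imbalanced sample: a classifier that always answers "normal".
y_true = ["normal"] * 8 + ["intrusion"] * 2
y_pred = ["normal"] * 10
print(metrics(y_true, y_pred))
```

In this toy case the accuracy looks acceptable while the specificity is zero, which is why accuracy alone is not a reliable indicator when intrusions are the minority class.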

3 PROPOSED STRATEGY

This section provides the formal notation adopted in

this work, along with the deﬁnition of the problem to

face, and the implementation of the proposed strategy.

3.1 Preliminary Notation

Premising that we adopted the notation $|set|$ to indicate the cardinality of a set, we denote as
$E = \{e_1, e_2, \ldots, e_N\}$ a series of network events composed by:

- a subset $E^{+} = \{e^{+}_1, e^{+}_2, \ldots, e^{+}_{|E^{+}|}\}$ of normal events, then $E^{+} \subseteq E$;
- a subset $E^{-} = \{e^{-}_1, e^{-}_2, \ldots, e^{-}_{|E^{-}|}\}$ of intrusion events, then $E^{-} \subseteq E$;
- a subset $\hat{E} = \{\hat{e}_1, \hat{e}_2, \ldots, \hat{e}_{|\hat{E}|}\}$ of unclassified events, then $\hat{E} \subseteq E$.

So we have that $E = (E^{+} \cup E^{-} \cup \hat{E})$, and each event $e \in E$ is characterized by the
features in the set $F = \{f_1, f_2, \ldots, f_W\}$, and it can belong to one of the classes in the set
$C = \{normal, intrusion\}$. We also formalize:

- the training set $T = \{e_1, e_2, \ldots, e_K\}$ given by $E^{+} \cup E^{-}$;
- the possibility to divide $T$ into $R = \{r_1, r_2, \ldots, r_Z\}$ regions, according to the $T$ events (set rows) and features (set columns);
- the regions definition operation as $R_{(ER,FC)}$, with $ER$ the number of Event Rows, and $FC$ the number of Feature Columns, then $|R| = Z = (ER \times FC)$.

From this it follows that:


- generalizing on set $E$, each region can be composed by $\frac{N}{ER}$ events and $\frac{W}{FC}$ features, since $|E| = N$ and $|F| = W$;
- the bounds of $ER$ and $FC$ are, respectively, $1 \le ER \le |T|$ and $1 \le FC \le |F|$, but it must be considered that $ER = FC = 1$ defines the canonical data configuration, where all the events and the features are involved, and that each region defined according to the $ER$ value must contain samples (events) of both classes in $C$, in order to allow the training of an evaluation model.
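The constraints listed above can be checked programmatically before any model is trained. The following is a minimal sketch under stated assumptions: the helper names `region_shape` and `row_blocks_have_both_classes` are hypothetical, the events are assumed to be stored in temporal order, and the set sizes are assumed to be already divisible by $ER$ and $FC$.

```python
from typing import Sequence, Tuple


def region_shape(n_events: int, n_features: int, er: int, fc: int) -> Tuple[int, int]:
    """Return (events per region, features per region) for an ER x FC decomposition."""
    if not 1 <= er <= n_events:
        raise ValueError("ER must satisfy 1 <= ER <= |T|")
    if not 1 <= fc <= n_features:
        raise ValueError("FC must satisfy 1 <= FC <= |F|")
    if n_events % er or n_features % fc:
        raise ValueError("sizes are not divisible by ER/FC: a padding step is needed first")
    return n_events // er, n_features // fc


def row_blocks_have_both_classes(labels: Sequence[str], er: int,
                                 classes: Sequence[str] = ("normal", "intrusion")) -> bool:
    """Check that each of the ER consecutive row blocks contains every class."""
    size = len(labels) // er
    return all(set(classes) <= set(labels[i * size:(i + 1) * size]) for i in range(er))


print(region_shape(4, 4, 2, 2))   # four regions of 2 events x 2 features
print(region_shape(4, 4, 1, 1))   # canonical configuration: one region with all the data
print(row_blocks_have_both_classes(["normal", "intrusion", "normal", "normal"], 2))
```

The last call returns `False` because the second row block contains only normal events, so a model could not be trained on it.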

3.2 Problem Statement

We face the intrusion detection in terms of binary

classiﬁcation, according to the classes in the set C

(i.e., normal and intrusion). Hence, the problem can

be formalized as shown in Equation 1, where $\Psi$ denotes a generic intrusion detection approach,
whereas $eval(\hat{e}, \Psi)$ is the evaluation function of the event $\hat{e}$, which returns 1 when
a classification has been performed correctly, 0 otherwise. This allows us to express this problem in
terms of maximization of the $\omega$ value, since it is given by the sum of the correct
classifications (then the $\omega$ upper bound is $|\hat{E}|$).

\[
\max_{0 \le \omega \le |\hat{E}|} \; \omega = \sum_{m=1}^{|\hat{E}|} eval(\hat{e}_m, \Psi) \tag{1}
\]
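As a small illustration of Equation 1, the value $\omega$ is simply the count of correctly classified events. In the sketch below the function `omega`, the threshold rule used as a stand-in for $\Psi$, and the toy events are all hypothetical.

```python
from typing import Callable, Sequence


def omega(events: Sequence[Sequence[float]], true_labels: Sequence[str],
          approach: Callable[[Sequence[float]], str]) -> int:
    """Sum of eval(e, approach) over the events: 1 per correct classification, 0 otherwise."""
    return sum(1 for event, label in zip(events, true_labels) if approach(event) == label)


def toy_approach(event: Sequence[float]) -> str:
    """Stand-in for a generic approach: flags an event when its first feature exceeds 5."""
    return "intrusion" if event[0] > 5 else "normal"


events = [(1.0,), (9.0,), (7.0,)]
labels = ["normal", "intrusion", "normal"]
print(omega(events, labels, toy_approach), "correct out of", len(events))
```

Here the third event is misclassified, so $\omega = 2$ against an upper bound of 3.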

3.3 Strategy Overview

We now brieﬂy introduce the TDD strategy, a domain-

speciﬁc case of the bootstrap aggregation (Breiman,

1996) strategy, here proposed as an alternative to a

canonical intrusion detection evaluation model that

exploits, during the training process, all the data that

have been collected and correctly classiﬁed in the

past, involving all the events and features.

Indeed, we propose to combine different evalua-

tion models in a fusion fashion, each of them trained

on a different region of the training set. The regions

are bounded in terms of events and features, with the

aim to selectively capture the properties of each sub-

region by focusing on speciﬁc periods of time (rows of

events) and behavior (columns of features). This has

been made according to the simple scheme reported

in the following, where four regions of the training

set T , with two events and two features each one (one

of them has been highlighted), have been selected by

way of example:

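\[
T =
\begin{pmatrix}
e_1(f_1) & e_1(f_2) & \cdots & e_1(f_W) \\
e_2(f_1) & e_2(f_2) & \cdots & e_2(f_W) \\
\vdots & \vdots & \ddots & \vdots \\
e_K(f_1) & e_K(f_2) & \cdots & e_K(f_W)
\end{pmatrix}
\]

(columns: ←− behavior −→; rows: ←− time −→)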

This leads to the division of the training process

into several sub-processes, each of them based on a

different part of the dataset, in terms of events and

features. In other words, we speculate that such a

strategy is able to mitigate the issues related to the

event heterogeneity, since the classiﬁcation of the new

events is now based on decisions taken on the basis

of different time periods and characteristics (aggre-

gated according to a criterion), rather than on the en-

tire dataset.

3.4 Strategy Formalization

The problem formalized in Equation 1 needs to be rearranged according to the subdivision into regions
we introduced, substantially by subdividing the evaluation process into $Z$ sub-processes (i.e.,
$|R| = Z$). More formally, a generic intrusion detection approach $\Psi$ is used $Z$ times, and the
final event classification is determined on the basis of all the classifications performed by these
approaches, as shown in the example of Equation 2, where we hypothesized $K = 4$, $W = 4$, $ER = 2$,
and $FC = 2$, thus subdividing the training set $T$ into $|R| = Z = (2 \times 2) = 4$ regions, each of
them composed by $\frac{K}{ER} = 2$ events and $\frac{W}{FC} = 2$ features. This generates the four
$m_1, m_2, m_3, m_4$ evaluation models.

\[
R_{(2,2)}
\begin{pmatrix}
f_{1,1} & f_{2,1} & f_{3,1} & f_{4,1} \\
f_{1,2} & f_{2,2} & f_{3,2} & f_{4,2} \\
f_{1,3} & f_{2,3} & f_{3,3} & f_{4,3} \\
f_{1,4} & f_{2,4} & f_{3,4} & f_{4,4}
\end{pmatrix}
\Rightarrow
\begin{pmatrix}
r_1 & r_2 \\
r_3 & r_4
\end{pmatrix}
\tag{2}
\]

Here $f_{j,i}$ denotes the value of the feature $f_j$ in the event $e_i$, and each region $r_z$ is a
$2 \times 2$ block of $T$.

In other words, the training of an evaluation model $m$ of a classification algorithm is performed by
using the events and features in each of the $r_1, r_2, r_3, r_4$ regions of Equation 2, generating
the four $m_1, m_2, m_3, m_4$ evaluation models.
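The decomposition of the example with $K = 4$, $W = 4$, and $ER = FC = 2$ can be sketched as follows. The function name `get_regions` mirrors the one used in Algorithm 1, but its signature, the row-major numbering of the regions, and the string placeholders used as cell values are assumptions made for illustration.

```python
from typing import List, Sequence


def get_regions(rows: Sequence[Sequence[str]], er: int, fc: int) -> List[List[List[str]]]:
    """Split a K x W table (one row per event) into ER * FC rectangular regions.

    Regions are returned in row-major order: all the feature blocks of the first
    block of events, then those of the second block of events, and so on.
    """
    k, w = len(rows), len(rows[0])
    if k % er or w % fc:
        raise ValueError("K and W must be divisible by ER and FC (pad the data first)")
    rows_per_region, cols_per_region = k // er, w // fc
    regions = []
    for i in range(er):
        event_block = rows[i * rows_per_region:(i + 1) * rows_per_region]
        for j in range(fc):
            start, stop = j * cols_per_region, (j + 1) * cols_per_region
            regions.append([list(row[start:stop]) for row in event_block])
    return regions


# 4 events x 4 features; each cell is labeled with its event and feature index.
T = [[f"e{i}f{j}" for j in range(1, 5)] for i in range(1, 5)]
for number, region in enumerate(get_regions(T, 2, 2), start=1):
    print(f"r{number}:", region)
```

Each printed region contains two events restricted to two features, so four independent models can be trained, one per region.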

Assuming the evaluation of a new event $\hat{e} \in \hat{E}$ which, on the basis of the scenario we
hypothesized, is composed by the $f_1, f_2, f_3, f_4$ features, its evaluation and classification are
performed by comparing it (we denote this operation as $\Leftrightarrow$) to all the evaluation models
trained on the different regions, independently, and this process returns the four classifications
$c_1, c_2, c_3, c_4$, according to the criterion shown in Equation 3.

\[
\hat{e} \Leftrightarrow m_z \;\Rightarrow\; c_z, \qquad z = 1, 2, 3, 4 \tag{3}
\]

3.4.1 Padding Rule

Considering that the number of regions given by the

values of FC and ER may not exactly partition the

sets F and T in the cases reported in Equation 4, we

need to formalize a padding rule aimed to face this

problem.

\[
\begin{array}{l}
(|F| \bmod FC) \neq 0 \\
(|T| \bmod ER) \neq 0
\end{array}
\tag{4}
\]

Introducing the notation $\mu_F = (|F| \bmod FC)$ and $\mu_T = (|T| \bmod ER)$, according to the
preliminary notation provided in Section 3.1 (i.e., $F = \{f_1, f_2, \ldots, f_W\}$ and
$T = \{e_1, e_2, \ldots, e_K\}$), we formalize in Equation 5 the padding rule $pad$.

\[
\begin{aligned}
pad(F) &= \{f_1, f_2, \ldots, f_W, f_{W+1}, f_{W+2}, \ldots, f_{W+\mu_F}\} \\
&\quad \text{with } f_{W+1} = f_{W+2} = \ldots = f_{W+\mu_F} = f_W \\
pad(T) &= \{e_1, e_2, \ldots, e_K, e_{K+1}, e_{K+2}, \ldots, e_{K+\mu_T}\} \\
&\quad \text{with } e_{K+1} = e_K,\; e_{K+2} = e_{K-1},\; \ldots,\; e_{K+\mu_T} = e_{K-\mu_T+1}
\end{aligned}
\tag{5}
\]

Practically, with the aim of not significantly altering the involved information during the padding
process, the problem is faced by following two different strategies: (i) in the case of $F$, we
duplicate the last column (i.e., the last feature of the set $F$) $\mu_F = (|F| \bmod FC)$ times;
(ii) in the case of $T$, we duplicate the last $\mu_T = (|T| \bmod ER)$ rows (i.e., the last events of
the set $T$), which reduces the risk that all the added events belong to the same class in $C$. This
naive solution is not expected to significantly affect the machine learning process, as it is applied
to both training and test data. For simplicity, we will always consider this rule applied during the
definition of the regions (as an internal preprocessing step), without mentioning it each time.
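A possible implementation of the padding step is sketched below. The function names are hypothetical. One deliberate deviation is worth stating: the sketch pads by the number of elements needed to reach the next multiple of $FC$ (or $ER$), computed as `(-size) % parts`. This coincides with the modulo-based count used in the text when $FC = 2$ or $ER = 2$, but it can differ for larger values, where adding exactly the remainder would not always make the size divisible. This choice is an assumption of the sketch, not a statement of the original formalization.

```python
from typing import List, Sequence


def pad_count(size: int, parts: int) -> int:
    """Number of elements to add so that `size` becomes a multiple of `parts`."""
    return (-size) % parts


def pad_features(rows: Sequence[Sequence[float]], fc: int) -> List[List[float]]:
    """Duplicate the last feature column until the column count is divisible by FC."""
    extra = pad_count(len(rows[0]), fc)
    return [list(row) + [row[-1]] * extra for row in rows]


def pad_events(rows: Sequence[Sequence[float]], er: int) -> List[List[float]]:
    """Append copies of the last events (e_K, e_{K-1}, ...) until divisible by ER."""
    extra = pad_count(len(rows), er)
    mirrored = [list(rows[-1 - i]) for i in range(extra)]
    return [list(row) for row in rows] + mirrored


# Toy table with 5 events and 3 features.
table = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
padded = pad_events(pad_features(table, 2), 2)
print(len(padded), "events x", len(padded[0]), "features")
for row in padded:
    print(row)
```

In a real dataset the class label of each duplicated event would have to be duplicated together with its row; the sketch omits labels for brevity.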

3.4.2 Classiﬁcation Rule

As previously formalized, except for the case $ER = FC = 1$, which bounds a single region (i.e., the
canonical data configuration), the classification of an event $\hat{e} \in \hat{E}$ involves a series
of evaluation models $m_1, m_2, \ldots, m_Z$, whose number depends on the number of regions, since
$|R| = Z$. Considering that such models generate the classifications $c_1, c_2, \ldots, c_Z$, one of
the two cases reported in Equation 6 occurs.

\[
\begin{array}{ll}
\text{Case 1}: & Z = 2n, \quad n \in \mathbb{N} \\
\text{Case 2}: & Z = 2n - 1, \quad n \in \mathbb{N}
\end{array}
\tag{6}
\]

Whereas in Case 2 a univocal classification of the event is possible on the basis of the majority
classification, in Case 1 we need to introduce a discriminating element. For this purpose we use a
further classification $c_{Z+1}$, obtained by using an evaluation model trained on the entire training
set $T$, so that we have the classifications $c_1, c_2, \ldots, c_Z, c_{Z+1}$.

In more detail, assuming an example scenario related to Case 1, for instance by using $ER = FC = 2$,
which generates the classification models $m_1, m_2, m_3, m_4$, providing the classifications
$c_1, c_2, c_3, c_4$ (each of them belonging to a class in the set $C$, then normal or intrusion) for
an event $\hat{e} \in \hat{E}$, the final classification of the event $\hat{e}$ is given by adding the
classification $c_5$ (i.e., $c_{Z+1}$) obtained by training an evaluation model on the entire set $T$.

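A majority criterion is then applied by following the classification rule $\rho$ formalized in
Equation 7, where $\mathit{normal}$ and $\mathit{intrusion}$ are the two elements of the set $C$.

\[
\rho(\hat{e}) =
\begin{cases}
\mathit{normal}, & \text{if } \sum_{i=1}^{Z} \phi(c_i, \mathit{normal}) > \sum_{i=1}^{Z} \phi(c_i, \mathit{intrusion}) \\
\mathit{intrusion}, & \text{if } \sum_{i=1}^{Z} \phi(c_i, \mathit{normal}) < \sum_{i=1}^{Z} \phi(c_i, \mathit{intrusion}) \\
\mathit{normal}, & \text{if } \sum_{i=1}^{Z} \phi(c_i, \mathit{normal}) = \sum_{i=1}^{Z} \phi(c_i, \mathit{intrusion}) \wedge c_{Z+1} = \mathit{normal} \\
\mathit{intrusion}, & \text{if } \sum_{i=1}^{Z} \phi(c_i, \mathit{normal}) = \sum_{i=1}^{Z} \phi(c_i, \mathit{intrusion}) \wedge c_{Z+1} = \mathit{intrusion}
\end{cases}
\]
\[
\text{with} \quad
\phi(a, b) =
\begin{cases}
0, & \text{if } a \neq b \\
1, & \text{if } a = b
\end{cases}
\tag{7}
\]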
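The classification rule can be expressed compactly in code. The sketch below is a direct transcription of the majority criterion with the tie-break on the additional classification; the function names `phi` and `rho` follow the symbols of Equation 7, while the signature and the string class labels are assumptions made for illustration.

```python
from typing import Optional, Sequence


def phi(a: str, b: str) -> int:
    """Indicator function: 1 when the two labels coincide, 0 otherwise."""
    return 1 if a == b else 0


def rho(region_votes: Sequence[str], whole_set_vote: Optional[str] = None) -> str:
    """Majority vote over the Z region classifications.

    `whole_set_vote` plays the role of the additional classification produced by
    the model trained on the entire training set; it is consulted only on a tie.
    """
    normal_votes = sum(phi(c, "normal") for c in region_votes)
    intrusion_votes = sum(phi(c, "intrusion") for c in region_votes)
    if normal_votes > intrusion_votes:
        return "normal"
    if normal_votes < intrusion_votes:
        return "intrusion"
    if whole_set_vote is None:
        raise ValueError("tie between classes: the whole-set classification is required")
    return whole_set_vote


print(rho(["normal", "intrusion", "normal"]))                            # odd Z, no tie possible
print(rho(["normal", "intrusion", "intrusion", "normal"], "intrusion"))  # even Z, tie broken
```

When $Z$ is odd and only two classes exist, a tie cannot occur, so the additional classification is never needed.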

3.5 Algorithm Deﬁnition

On the basis of the proposed strategy, we formalized Algorithm 1, which is aimed at classifying each
new network event. It takes as input a classification algorithm $\alpha$, the set of classified events
$T$ (i.e., the training set), the set $\hat{E}$ of events to classify, and the number of rows ($ER$)
and columns ($FC$) for the regions definition, returning the classification of all the events in the
set $\hat{E}$.

At Steps 2 to 4 an evaluation model is trained by using the entire set $T$, if the number of regions
(i.e., $Z = ER \times FC$) is even. At Step 5 the training set $T$ is processed in order to define the
regions, according to the $ER$ and $FC$ values. For each region, at Steps 6 to 9, an evaluation model
of the algorithm $\alpha$ is trained. The classification of the events in $\hat{E}$ is performed from
Step 10 to Step 21, where: at Step 11 the features of the event $\hat{e}$ to evaluate are divided into
regions, according to the $ER$ and $FC$ values; at Steps 12 to 15 each region is classified according
to the models in $M$ previously trained; at Steps 16 to 19 an additional classification, based on the
model trained at Step 3, is added to the set $C$ of collected classifications when the number of
regions is even. At Step 20 the event $\hat{e}$ is classified, and the classification is stored in the
set $\kappa$. The set $\kappa$, with the classification of all the events in the set $\hat{E}$, is
finally returned by the algorithm at Step 22.

Algorithm 1: Classifier algorithm.

Require: α = Classification algorithm, T = Evaluated events, Ê = Unevaluated events, ER = Number of event rows, FC = Number of feature columns
Ensure: κ = Classification of the Ê events

 1: procedure CLASSIFIER(α, T, Ê, ER, FC)
 2:   if Z is even then                          ▷ Verifies if the number of regions is even
 3:     m_{Z+1} ← getTraining(α, T)              ▷ Trains evaluation model by using the whole set T
 4:   end if
 5:   R ← getRegions(T, ER, FC)                  ▷ Divides training set into regions
 6:   for each r ∈ R do                          ▷ Trains an evaluation model for each region
 7:     m ← getTraining(α, r)                    ▷ Trains evaluation model
 8:     M.add(m)                                 ▷ Stores evaluation model
 9:   end for
10:   for each ê ∈ Ê do                          ▷ Processes events in Ê
11:     R_ê ← getRegions(ê, ER, FC)              ▷ Divides event into regions
12:     for each m ∈ M do                        ▷ Gets all event classifications
13:       c ← getEventClass(m, R_ê)              ▷ Classifies event according to regions
14:       C.add(c)                               ▷ Stores classification
15:     end for
16:     if Z is even then                        ▷ Verifies if the number of regions is even
17:       c_{Z+1} ← getEventClass(m_{Z+1}, ê)    ▷ Classifies event according to the whole set T
18:       C.add(c_{Z+1})                         ▷ Adds classification to the set C
19:     end if
20:     κ.add(getFinalClassification(ê, C))      ▷ Gets and stores final event classification
21:   end for
22:   return κ                                   ▷ Returns classification of Ê events
23: end procedure
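The overall flow of Algorithm 1 can be sketched end to end as follows. This is an illustration under several assumptions, not the reference implementation: a nearest-centroid rule is used as a stand-in for the generic classification algorithm $\alpha$, the data are assumed to be already padded, every block of events is assumed to contain both classes, and all function names are hypothetical. The list of votes is re-created for each event.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Model = Dict[str, List[float]]  # class label -> centroid


def train_centroids(rows: Sequence[Sequence[float]], labels: Sequence[str]) -> Model:
    """Stand-in for getTraining: stores the mean feature vector of each class."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(model: Model, x: Sequence[float]) -> str:
    """Stand-in for getEventClass: returns the class with the closest centroid."""
    return min(model, key=lambda label: sum((a - b) ** 2 for a, b in zip(x, model[label])))


def tdd_classify(train_rows, train_labels, events, er: int, fc: int) -> List[str]:
    k, w = len(train_rows), len(train_rows[0])
    if k % er or w % fc:
        raise ValueError("pad the data so that K and W are divisible by ER and FC")
    rows_per_region, cols_per_region = k // er, w // fc
    z = er * fc
    # Steps 2-4: the whole-set model is needed only when Z is even.
    whole = train_centroids(train_rows, train_labels) if z % 2 == 0 else None
    # Steps 5-9: one model per region, remembering which features it covers.
    models = []
    for i in range(er):
        rs = slice(i * rows_per_region, (i + 1) * rows_per_region)
        for j in range(fc):
            cs = slice(j * cols_per_region, (j + 1) * cols_per_region)
            region_rows = [row[cs] for row in train_rows[rs]]
            models.append((cs, train_centroids(region_rows, train_labels[rs])))
    # Steps 10-21: each event is classified by all models, then by majority.
    results = []
    for event in events:
        votes = [predict(model, event[cs]) for cs, model in models]
        normal, intrusion = votes.count("normal"), votes.count("intrusion")
        if normal != intrusion:
            results.append("normal" if normal > intrusion else "intrusion")
        else:  # a tie is possible only when Z is even, so `whole` is available
            results.append(predict(whole, event))
    return results


rows = [[0, 0, 0, 0], [10, 10, 10, 10], [1, 1, 1, 1], [9, 9, 9, 9]]
labels = ["normal", "intrusion", "normal", "intrusion"]
print(tdd_classify(rows, labels, [[0.5] * 4, [9.5] * 4, [0, 0, 10, 9]], 2, 2))
```

For the third toy event the four region models split two against two, so the model trained on the whole training set decides the final class.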

4 CONCLUSIONS AND FUTURE DIRECTIONS

This work proposes a Training Data Decomposition

(TDD) strategy aimed to improve the performance of

an intrusion detection system. It is based on the con-

sideration that by dividing the training dataset into

several regions, it is possible to characterize differ-

ent scenarios in terms of time (number of events) and

characteristics (features), allowing the deﬁnition of a

more effective analysis model. Each region is used in

order to train an independent evaluation model, and

the ﬁnal classiﬁcation of each new event is performed

by taking into account the classifications of all the
involved models, in a fusion fashion regulated by a
majority voting criterion. In order to apply the TDD
strategy regardless of the dataset used (i.e., with
different numbers of events and features), some rules
have also been formalized.

As a next step, we will evaluate the proposed
strategy on a real-world dataset such as, for instance,
NSL-KDD, which is largely used in the literature and
includes a large number of different network events in
terms of protocols and intrusions, allowing us to
verify the hypothesis behind this work.

REFERENCES

Aloqaily, M., Otoum, S., Al Ridhawi, I., and Jararweh, Y.

(2019). An intrusion detection system for connected

vehicles in smart cities. Ad Hoc Networks, 90:101842.

Amrita, K. K. R. (2018). A hybrid intrusion detection

system: Integrating hybrid feature selection approach

with heterogeneous ensemble of intelligent classiﬁers.

International Journal of Network Security (IJNS’18),

20(1):41–55.

Anderson, J. P. (1980). Computer security threat monitoring

and surveillance.

Axelsson, S. (2000). Intrusion detection systems: A survey

and taxonomy. Technical report.

Breiman, L. (1996). Bagging predictors. Machine learning,

24(2):123–140.

Bronte, R., Shahriar, H., and Haddad, H. M. (2016). A

signature-based intrusion detection system for web

applications based on genetic algorithm. In Proceed-

ings of the 9th International Conference on Security

of Information and Networks, pages 32–39.

Carta, S., Corriga, A., Ferreira, A., Podda, A. S., and Recu-

pero, D. R. (2020a). A multi-layer and multi-ensemble

stock trader using deep learning and deep reinforce-

ment learning. Applied Intelligence, pages 1–17.

Carta, S., Podda, A. S., Reforgiato Recupero, D. R., and

Saia, R. (2020b). A local feature engineering strategy

to improve network anomaly detection. Future Inter-

net, 12(10):177.

Chang, H.-H. and Meyerhoefer, C. D. (2020). Covid-19 and

the demand for online food shopping services: Empir-

ical evidence from taiwan. American Journal of Agri-

cultural Economics.

Chuang, H.-M., Huang, H.-Y., Liu, F., and Tsai, C.-H.

(2019). Classiﬁcation of intrusion detection system

based on machine learning. In International Cogni-

tive Cities Conference, pages 492–498. Springer.

Dhawan, S. (2020). Online learning: A panacea in the time

of covid-19 crisis. Journal of Educational Technology

Systems, 49(1):5–22.

Gao, X., Shan, C., Hu, C., Niu, Z., and Liu, Z. (2019). An

adaptive ensemble machine learning model for intru-

sion detection. IEEE Access, 7:82512–82521.

NSL-KDD dataset: https://github.com/defcom17/NSL_KDD


Jose, S., Malathi, D., Reddy, B., and Jayaseeli, D. (2018). A

survey on anomaly based host intrusion detection sys-

tem. In Journal of Physics: Conference Series, vol-

ume 1000, page 012049. IOP Publishing.

Kanimozhi, V. and Jacob, T. P. (2019). Artiﬁcial intelli-

gence based network intrusion detection with hyper-

parameter optimization tuning on the realistic cy-

ber dataset cse-cic-ids2018 using cloud computing.

In 2019 International Conference on Communication

and Signal Processing (ICCSP), pages 0033–0036.

IEEE.

Khraisat, A., Gondal, I., Vamplew, P., and Kamruzzaman,

J. (2019). Survey of intrusion detection systems:

techniques, datasets and challenges. Cybersecurity,

2(1):20.

Latha, S. and Prakash, S. J. (2017). A survey on network at-

tacks and intrusion detection systems. In 2017 4th In-

ternational Conference on Advanced Computing and

Communication Systems (ICACCS), pages 1–7. IEEE.

Le, T.-T.-H., Kim, Y., Kim, H., et al. (2019). Network intru-

sion detection based on novel feature selection model

and various recurrent neural networks. Applied Sci-

ences, 9(7):1392.

Li, Y. and Lu, Y. (2019). Lstm-ba: Ddos detection approach

combining lstm and bayes. In 2019 Seventh Interna-

tional Conference on Advanced Cloud and Big Data

(CBD), pages 180–185. IEEE.

Li, Z., Das, A., and Zhou, J. (2005). Usaid: Unifying

signature-based and anomaly-based intrusion detec-

tion. In Paciﬁc-Asia Conference on Knowledge Dis-

covery and Data Mining, pages 702–712. Springer.

Liao, H.-J., Lin, C.-H. R., Lin, Y.-C., and Tung, K.-Y.

(2013). Intrusion detection system: A comprehensive

review. Journal of Network and Computer Applica-

tions, 36(1):16–24.

Longo, R., Podda, A. S., and Saia, R. (2020). Analysis of a

consensus protocol for extending consistent subchains

on the bitcoin blockchain. Computation, 8(3):67.

Luker, M. A. and Petersen, R. J. (2003). Computer and

network security in higher education. Jossey-Bass San

Francisco, CA.

Mazini, M., Shirazi, B., and Mahdavi, I. (2019). Anomaly

network-based intrusion detection system using a re-

liable hybrid artiﬁcial bee colony and adaboost algo-

rithms. Journal of King Saud University-Computer

and Information Sciences, 31(4):541–553.

Meyer, M. L. B. and Labit, Y. (2020). Combining machine

learning and behavior analysis techniques for network

security. In 2020 International Conference on Infor-

mation Networking (ICOIN), pages 580–583. IEEE.

Modi, C., Patel, D., Borisaniya, B., Patel, H., Patel, A., and

Rajarajan, M. (2013). A survey of intrusion detection

techniques in cloud. Journal of network and computer

applications, 36(1):42–57.

Munaiah, N., Meneely, A., Wilson, R., and Short, B. (2016).

Are intrusion detection studies evaluated consistently?

a systematic literature review.

Potluri, S. and Diedrich, C. (2016). High performance in-

trusion detection and prevention systems: A survey. In

ECCWS2016-Proceedings of the 15th European Conference
on Cyber Warfare and Security, page 260.

Academic Conferences and publishing limited.

Radhakrishnan, K., Menon, R. R., and Nath, H. V. (2019).

A survey of zero-day malware attacks and its detection

methodology. In TENCON 2019-2019 IEEE Region

10 Conference (TENCON), pages 533–539. IEEE.

Rapanta, C., Botturi, L., Goodyear, P., Guàrdia, L., and

Koole, M. (2020). Online university teaching during

and after the covid-19 crisis: Refocusing teacher pres-

ence and learning activity. Postdigital Science and Ed-

ucation, 2(3):923–945.

Rehman, R., Hazarika, G., and Chetia, G. (2011). Malware

threats and mitigation strategies: a survey. Journal

of Theoretical and Applied Information Technology,

29(2):69–73.

Saia, R. (2017). A discrete wavelet transform approach to

fraud detection. In International Conference on Net-

work and System Security, pages 464–474. Springer.

Saia, R. and Carta, S. (2016). A linear-dependence-based

approach to design proactive credit scoring models. In

KDIR, pages 111–120.

Saia, R. and Carta, S. (2017a). Evaluating credit card trans-

actions in the frequency domain for a proactive fraud

detection approach. In SECRYPT, pages 335–342.

Saia, R. and Carta, S. (2017b). A fourier spectral pattern

analysis to design credit scoring models. In Proceed-

ings of the 1st International Conference on Internet of

Things and Machine Learning, page 18. ACM.

Saia, R., Carta, S., et al. (2017). A frequency-domain-

based pattern mining for credit card fraud detection.

In IoTBDS, pages 386–391.

Saia, R., Carta, S., and Fenu, G. (2018a). A wavelet-based

data analysis to credit scoring. In Proceedings of the

2nd International Conference on Digital Signal Pro-

cessing, pages 176–180. ACM.

Saia, R., Carta, S., and Recupero, D. R. (2018b). A

probabilistic-driven ensemble approach to perform

event classiﬁcation in intrusion detection system. In

KDIR, pages 139–146.

Saia, R., Carta, S., Recupero, D. R., and Fenu, G. (2020).

A feature space transformation to intrusion detection

systems. In KDIR. ScitePress.

Saia, R., Carta, S., Recupero, D. R., Fenu, G., and Stan-

ciu, M. (2019). A discretized extended feature space

(defs) model to improve the anomaly detection per-

formance in network intrusion detection systems. In

KDIR, pages 322–329.

Salahdine, F. and Kaabouch, N. (2019). Social engineering

attacks: A survey. Future Internet, 11(4):89.

Samrin, R. and Vasumathi, D. (2017). Review on anomaly

based network intrusion detection system. In 2017

International Conference on Electrical, Electronics,

Communication, Computer, and Optimization Tech-

niques (ICEECCOT), pages 141–147. IEEE.

Tidjon, L. N., Frappier, M., and Mammar, A. (2019).

Intrusion detection systems: A cross-domain

overview. IEEE Communications Surveys & Tutori-

als, 21(4):3639–3681.


Vieira, E., Ferreira, J., and Bartolomeu, P. C. (2020).

Blockchain technologies for iot applications: Use-

cases and limitations. In 2020 25th IEEE Interna-

tional Conference on Emerging Technologies and Fac-

tory Automation (ETFA), volume 1, pages 1560–1567.

IEEE.

Wazid, M., Zeadally, S., and Das, A. K. (2019). Mobile

banking: evolution and threats: malware threats and

security solutions. IEEE Consumer Electronics Mag-

azine, 8(2):56–60.

Yüksel, B., Küpçü, A., and Özkasap, Ö. (2017). Research
issues for privacy and security of electronic health
services. Future Generation Computer Systems, 68:1–13.

Zarpelão, B. B., Miani, R. S., Kawakani, C. T., and de
Alvarenga, S. C. (2017). A survey of intrusion detection

in internet of things. Journal of Network and Com-

puter Applications, 84:25–37.

Zuech, R., Khoshgoftaar, T. M., and Wald, R. (2015). Intru-

sion detection and big heterogeneous data: a survey.

Journal of Big Data, 2(1):3.
