An Effective Intrusion Detection Model based on Random Forest Algorithm with I-SMOTE

Weijinxia, Longchun¹,², Wanwei¹,², Zhaojing, Duguanyao¹,² and Yangfan

Department of Security, Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100190, China

Chinese Academy of Sciences University, Beijing, 101408, China

Keywords: Intrusion Detection, Class Imbalance, I-SMOTE, Feature Reduction, Random Forest.

Abstract: With the wide application of networks in our daily lives, network security is becoming increasingly prominent. Intrusion detection systems have been widely used to detect various types of malicious network activity that cannot be detected by a conventional firewall, and various machine-learning techniques have been proposed to improve their performance. However, the balance between different data classes is critical and affects detection performance. To reduce the impact of class imbalance in the intrusion dataset, this paper proposes a scheme that applies the improved synthetic minority oversampling technique (I-SMOTE) to balance the dataset, employs correlation analysis and random forest to reduce features, and uses the random forest algorithm to train the classifier for detection. The experimental results on the NSL-KDD dataset show that it achieves better and more robust performance in terms of accuracy, detection rate, false alarms and training speed.

1 INTRODUCTION

The aim of cloud computing is to provide on-demand network security services over the Internet. Therefore, the security and privacy of network data are crucial issues in cloud computing. With the development of network technology and the continuous expansion of cloud computing applications, network attacks have become more diverse (Amin Hatef M et al., 2018), and people are facing security threats from the Internet. To protect the security of computers and the Internet, various machine-learning techniques have been proposed to improve the performance of intrusion detection systems. However, as intrusion and attack techniques grow more complex, some existing approaches exhibit limited ability. Thus, the performance optimization of intrusion detection systems has received much attention (Luo et al., 2014; Tan et al., 2019).

Much related work focuses on the task of intrusion detection based on various artificial intelligence and machine learning algorithms. Many recent studies combine or assemble several algorithms in order to improve the detection model's performance in terms of detection rate, accuracy, area under the curve and false alarm rate.

However, existing works have two limitations. First, although advanced and complicated detection models have been built, very few have considered the effects of high dimensionality and sample imbalance simultaneously, both of which are important for improving a model's performance. Second, the time taken for training and testing these models makes them difficult to apply in practice.

Therefore, in this work, we propose a multi-classification scheme that applies the improved synthetic minority oversampling technique (I-SMOTE) to balance the dataset, employs correlation analysis and random forest to reduce features, and uses the random forest algorithm to train the classifier for detection. Details are as follows:

1) I-SMOTE is an improved oversampling algorithm used to solve the imbalance problem of the dataset; it considers the impact of both majority samples and minority samples on the oversampling results.

2) Correlation analysis and random forest are applied to reduce the feature dimensions. The purpose of correlation analysis is to calculate the correlation between each feature and the classification label, and to form a new feature sequence in descending order of correlation.

Weijinxia, Longchun, Wanwei, Zhaojing, Duguanyao and Yangfan.

An Effective Intrusion Detection Model based on Random Forest Algorithm with I-SMOTE.

DOI: 10.5220/0010393801750182

In Proceedings of the 23rd International Conference on Enterprise Information Systems (ICEIS 2021) - Volume 1, pages 175-182

ISBN: 978-989-758-509-8

© 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

Then, the accuracy (ACC) of the random forest model is regarded as the fitness of feature reduction, and the dimension corresponding to the largest ACC is the optimal dimension.

3) Since the random forest is better than other commonly used detection models in terms of training and testing time, it is used as the multi-attack detection classifier in this work.

2 RELATED WORKS

Many machine learning approaches have been applied to construct intrusion detection systems. The most widely used supervised algorithms include decision trees, support vector machines (SVM), K-nearest neighbors and random forest (Horng et al., 2011; Manjula et al., 2016).

(Bamakan et al., 2016) introduced an SVM-based intrusion detection model, in which time-varying chaos particle swarm optimization was used to optimize the parameters of the SVM classifier. An SVM-based intrusion detection approach that combined the genetic algorithm (GA) and kernel principal component analysis (KPCA) was proposed by (Kuang et al., 2014). The KPCA was used to reduce feature dimensions and the GA was applied to optimize the parameters of the SVM. (Horng et al., 2011) proposed an SVM-based intrusion detection system integrated with the BIRCH hierarchical clustering algorithm, which was used to preprocess the original data so that the data used for training the intrusion detection model is of high quality.

Generally speaking, an intrusion detection system should produce a low false-positive rate and handle imbalanced datasets, a large number of features and a large amount of training and testing data (Chen Tongbao et al., 2020). In this scenario, the random forest algorithm offers great advantages in training time and prediction time compared with SVM (Li et al., 2020; Chauhan et al., 2013; Amira et al., 2017; Manjula et al., 2016; Golrang et al., 2020). In (Chauhan et al., 2013), top-ten classification algorithms, namely J48, Logistic, IBK, JBayesNet, SGD, PART, Rip, Random Forest, Random Tree and REPTree, were selected for experimental comparison. In the experimental results, the Random Forest classifier ranked first with 99.75% accuracy, and its sensitivity was also the highest. In (Manjula et al., 2016), classification and predictive models for intrusion detection were built using the machine learning classifiers Random Forest, Gaussian Naïve Bayes, Support Vector Machine and Logistic Regression. These models were evaluated on the NSL-KDD dataset. The experimental results showed that the Random Forest classifier had clear advantages over the other methods in identifying whether the data is normal or abnormal. To improve classification accuracy and reduce training time, (Li et al., 2020) proposed an effective deep learning method, namely AE-IDS (Auto-Encoder Intrusion Detection System), based on the random forest algorithm. This method constructed the training set with feature selection and feature grouping by using random forest. The experimental results show that the proposed method is superior to traditional machine learning based intrusion detection methods in terms of easy training, strong adaptability, and high detection accuracy.

In the above works, the random forest was used as a standalone classification model. In recent years, many hybrid models that use Random Forest have also been proposed. Malik et al. (Malik et al., 2011; Malik et al., 2013; Malik et al., 2015) applied Particle Swarm Optimization and Symmetrical Uncertainty for feature selection and used the Random Forest model as the classifier to detect probe attacks. A hybrid model based on Random Forest and Neural Networks was introduced by (Zhong et al., 2011), where the Neural Networks algorithm is used for classification and the random forest is applied to reduce the feature dimension. (Hasan et al., 2014) proposed a hybrid model using support vector machine and random forest. The experimental results showed that the detection performance of the hybrid model was better than that of the individual models.

Considering the imbalance of the dataset, (Tesfahun et al., 2013) used SMOTE to reduce the imbalance and the Random Forest algorithm as a classifier to establish an intrusion detection model. (Tan et al., 2019) proposed an intrusion detection algorithm based on SMOTE and random forest. The simulations were conducted on a benchmark intrusion dataset, and the accuracy of the model reached 92.39%. However, the oversampling operation in SMOTE considers only the K nearest neighbors within the minority class and ignores the majority class samples that appear among the K nearest neighbors, which easily leaves the density of the minority class unchanged after oversampling (Wang et al., 2018).

Based on the above analysis, we propose a scheme that applies the improved synthetic minority oversampling technique (I-SMOTE) to balance the dataset, employs correlation analysis and random forest to reduce features, and uses the random forest algorithm to train the classifier for detection. First, I-SMOTE is an improved oversampling algorithm used to solve the imbalance problem of the dataset; it considers the impact of both majority samples and minority samples on the oversampling results. Second, correlation analysis and random forest are applied to reduce the feature dimensions. The purpose of correlation analysis is to calculate the correlation between each feature and the classification label, and to form a new feature sequence in descending order of correlation. Then, the ACC of the random forest model is regarded as the fitness of feature reduction, and the dimension corresponding to the largest ACC is the optimal dimension. Finally, since the random forest is better than other commonly used detection models in terms of training and testing time, it is used as the classifier in this work.

3 PRELIMINARY

3.1 Correlation Analysis

Correlation metrics are widely applied in machine learning and statistical correlation analysis to evaluate the correlation between features. The selection of correlation metrics greatly affects the efficiency of feature selection. The degree of correlation between two random variables is usually measured by entropy and mutual information, which are defined in information theory.

Definition 1. For a discrete feature vector $X \in \{x'_1, x'_2, \ldots, x'_n\}$ whose probability distribution can be expressed as $\{p(x'_1), p(x'_2), \ldots, p(x'_n)\}$, the entropy of feature $X$ is as follows:

$$H(X) = -\sum_{i=1}^{n} p(x'_i) \log p(x'_i) \tag{1}$$

If all the values of $X$ are the same, then the entropy of $X$ is 0. Thus, the feature $X$ is useless for data classification.

Definition 2. For two discrete features $X \in \{x'_1, x'_2, \ldots, x'_n\}$ and $Y \in \{y_1, y_2, \ldots, y_m\}$, their joint probability density is $p(x'_i, y_j)$, $1 \le i \le n$, $1 \le j \le m$, and the conditional density is $p(x'_i \mid y_j)$. The entropy of $X$ under the condition $Y$ can then be expressed as

$$H(X \mid Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p(x'_i, y_j) \log \frac{p(y_j)}{p(x'_i, y_j)} \tag{2}$$

Mutual information is derived from entropy. For two features $X$ and $Y$ in one dataset, the mutual information between them is as follows:

$$I(X;Y) = H(X) - H(X \mid Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p(x'_i, y_j) \log \frac{p(x'_i, y_j)}{p(x'_i)\, p(y_j)} \tag{3}$$

The mutual information has the following characteristics:

Symmetry: $I(X;Y) = I(Y;X)$

Monotonicity: if $A \subseteq B \subseteq C$, then $I(A;C) \le I(B;C)$

The mutual information reflects the amount of information shared between two random variables. The greater the value of the mutual information, the greater the correlation between the two variables. If the mutual information between two variables is 0, the two variables are completely uncorrelated and statistically independent.
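The quantities in Eqs. (1)-(3) can be estimated from observed frequencies. The following is a minimal sketch, assuming base-2 logarithms and empirical probabilities; the toy columns `protocol`, `label` and `constant` are hypothetical and are not taken from the paper's data.

```python
import math
from collections import Counter


def entropy(xs):
    """H(X) = -sum_x p(x) * log2 p(x), as in Eq. (1); base 2 is an assumption."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X|Y) = sum_{x,y} p(x,y) * log2(p(y) / p(x,y)), as in Eq. (2)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    p_y = Counter(ys)
    # p(y) / p(x,y) reduces to count(y) / count(x,y) because both share the 1/n factor.
    return sum((c / n) * math.log2(p_y[y] / c) for (_, y), c in joint.items())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) - H(X|Y), as in Eq. (3)."""
    return entropy(xs) - conditional_entropy(xs, ys)


# Hypothetical toy data: one informative feature and one constant feature.
protocol = ["tcp", "tcp", "udp", "udp", "tcp", "udp"]
label = [0, 0, 1, 1, 0, 1]
constant = ["tcp"] * 6

mi_informative = mutual_information(protocol, label)
mi_constant = mutual_information(constant, label)
print(mi_informative, mi_constant)
```

In this toy case the informative feature determines the label completely, while the constant feature has zero entropy and therefore shares no information with the label, which matches the remark after Definition 1.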

3.2 SMOTE Algorithm (Wang et al., 2018)

The SMOTE algorithm oversamples minority samples. For a given minority sample $x$, find the $k$ minority samples closest to it. If the up-sampling ratio is $N$, we randomly select $N$ samples $y'_1, y'_2, \ldots, y'_N$ from these $k$ minority samples. Then $N$ new minority samples $x_{new1}, x_{new2}, \ldots, x_{newN}$ can be generated by the following equation:

$$x_{newj} = x + \mathrm{rand}(0, 1) \cdot (y'_j - x), \quad j = 1, 2, \ldots, N \tag{4}$$

where $\mathrm{rand}(0, 1)$ is a random number in $(0, 1)$.
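The interpolation in Eq. (4) can be sketched in a few lines. This is an illustrative sketch only, assuming Euclidean distance, neighbours drawn with replacement, and at least two minority samples; the function name `smote` and its parameters `k` and `n_new` are hypothetical.

```python
import random


def smote(minority, k=5, n_new=2, rng=None):
    """Generate n_new synthetic samples per minority sample following Eq. (4)."""
    rng = rng or random.Random(0)
    synthetic = []
    for i, x in enumerate(minority):
        # The k minority samples closest to x (excluding x itself).
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbours = others[:k]
        for _ in range(n_new):
            y = rng.choice(neighbours)   # one of the k nearest minority samples
            gap = rng.random()           # plays the role of rand(0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, y)])
    return synthetic


# Hypothetical toy minority class: four points on the unit square.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(toy_minority, k=2, n_new=3)
print(len(new_samples))
```

Every synthetic point lies on the segment between a minority sample and one of its minority neighbours, so majority samples play no role here; this is the limitation that I-SMOTE addresses.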

4 PROPOSED FRAMEWORK FOR INTRUSION DETECTION

In this section, we describe the main steps of our method. First, we introduce an improved SMOTE (I-SMOTE) algorithm. Second, correlation analysis is applied to sort the features of the dataset after oversampling: the correlation between each feature and the classification label is calculated, and the features are arranged in descending order of correlation. Then, the ACC of the random forest model is regarded as the fitness of feature reduction, and the dimension corresponding to the largest ACC is the optimal dimension. Finally, the dataset with reduced features is used to train the random forest classifier. The specific procedures of the proposed intrusion detection are illustrated in Fig. 1. As shown in Fig. 1, the framework of the detection model mainly consists of three parts: oversampling based on I-SMOTE, feature reduction and intrusion detection.

Figure 1: The procedures of the proposed intrusion detection.

4.1 Dataset Oversampling based on I-SMOTE

For each sample $x$ in the minority samples $S$, we search for the $k$ samples closest to $x$, where $k$ is obtained by cross-validation. Among these $k$ samples, if the number of minority class samples is larger than the number of majority class samples, $x$ is a safe sample; if the number of minority class samples is less than the number of majority class samples and at least one minority class sample exists, $x$ is a dangerous sample; if all $k$ samples belong to the majority class, $x$ is a noise sample.

Figure 2: The oversampling steps based on I-SMOTE.

• If $x$ is a noise sample, the oversampling operation may introduce noise into the dataset. To reduce this risk, a sample $x'$ is randomly chosen from the minority class samples. Then we generate a new sample that is close to the minority class as follows:

$$x_{new} = x + \mathrm{rand}(0.5, 1) \cdot (x' - x) \tag{5}$$

• If $x$ is not a noise sample, we select one sample $x'$ from the $k$ samples closest to $x$. If $x'$ belongs to the majority class, we generate a new sample that is close to $x$ as follows:

$$x_{new} = x + \mathrm{rand}(0, 0.5) \cdot (x' - x) \tag{6}$$

If $x'$ belongs to the minority class, we generate a new sample as follows:

$$x_{new} = x + \mathrm{rand}(0, 1) \cdot (x' - x) \tag{7}$$

• Output the new minority samples $S'$.
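The three cases of Eqs. (5)-(7) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes Euclidean distance, one synthetic sample per minority sample, at least two minority samples, and that ties between minority and majority neighbours count as dangerous (the text does not specify this case). The names `i_smote`, `categorize` and `sq_dist` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def categorize(n_minority, k):
    """Safe / dangerous / noise rule; ties are treated as dangerous (assumption)."""
    if n_minority == 0:
        return "noise"
    return "safe" if n_minority > k - n_minority else "dangerous"


def i_smote(minority, majority, k=5, rng=None):
    """Return one synthetic sample per minority sample, plus each sample's category."""
    rng = rng or random.Random(0)
    # Minority samples come first, so pool index i corresponds to minority[i].
    pool = [(s, True) for s in minority] + [(s, False) for s in majority]
    synthetic, categories = [], []
    for i, x in enumerate(minority):
        others = [entry for j, entry in enumerate(pool) if j != i]
        others.sort(key=lambda entry: sq_dist(x, entry[0]))
        neighbours = others[:k]
        n_min = sum(1 for _, is_min in neighbours if is_min)
        categories.append(categorize(n_min, len(neighbours)))
        if n_min == 0:
            # Noise sample, Eq. (5): pull the new point towards a random minority sample.
            x_prime = rng.choice([s for j, s in enumerate(minority) if j != i])
            gap = rng.uniform(0.5, 1.0)
        else:
            x_prime, is_min = rng.choice(neighbours)
            # Eq. (7) for a minority neighbour, Eq. (6) for a majority neighbour.
            gap = rng.uniform(0.0, 1.0) if is_min else rng.uniform(0.0, 0.5)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_prime)])
    return synthetic, categories


# Hypothetical toy data: a minority cluster near the origin plus one minority
# point surrounded by majority samples.
toy_minority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
toy_majority = [[5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
toy_synthetic, toy_categories = i_smote(toy_minority, toy_majority, k=2, rng=random.Random(1))
print(toy_categories)
```

Note that, as written in the text, the safe and dangerous categories share the same generation rules; only the noise case and the class of the chosen neighbour $x'$ change the interpolation range.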

4.2 Feature Sorting based on Correlation Analysis

After oversampling, the new dataset is denoted $S'$. The correlation between each feature and the classification label is calculated by correlation analysis. The detailed steps are as follows:

• For a given feature vector $X \in \{x'_1, x'_2, \ldots, x'_n\}$ and its corresponding label $Y \in \{0, 1\}$, where $n$ is the number of samples in $S'$, their correlation metrics are generated by equations (1)-(3). The correlation metric can be further standardized and expressed as

$$SU(X, Y) = \frac{2\, I(X;Y)}{H(X) + H(Y)} \tag{8}$$

Thus, the values of the correlation metric between features and labels lie in $[0, 1]$. The value 1 indicates that the feature and label are completely related, and 0 means that they are independent of each other.

• After calculating the correlation, feature ranking is performed. The higher the correlation of a feature, the more information it carries about the label, so the ranking determines which features in the given feature vectors are useful for classification. The features with the greatest correlation are placed first and the features with the least correlation last, so the most useful features have the highest ranking.
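The ranking step can be sketched as below. The sketch assumes that Eq. (8) is the standard symmetrical uncertainty, $SU = 2 I(X;Y) / (H(X) + H(Y))$, and computes the mutual information through the equivalent identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$. The column names and values are hypothetical toy data.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X;Y) / (H(X) + H(Y)), assumed form of Eq. (8)."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    mi = hx + hy - entropy(list(zip(xs, ys)))  # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return 2.0 * mi / (hx + hy)


def rank_features(columns, labels):
    """Return feature names sorted by decreasing SU with the label, plus the scores."""
    scores = {name: symmetrical_uncertainty(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical toy dataset with three discrete features.
toy_labels = [0, 0, 1, 1, 0, 1]
toy_columns = {
    "constant": [1, 1, 1, 1, 1, 1],
    "noisy": [0, 1, 0, 1, 0, 1],
    "flag": ["SF", "SF", "REJ", "REJ", "SF", "REJ"],
}
ranking, su_scores = rank_features(toy_columns, toy_labels)
print(ranking)
```

Continuous features would need to be discretized before this estimate is meaningful; the paper does not state how that is done, so the sketch handles discrete columns only.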

4.3 Feature Reduction

In this section, we use the random forest to reduce the sorted features. The goal of feature reduction is to choose as few features as possible while obtaining higher intrusion detection accuracy. Here, we regard the accuracy as the feature fitness. The random forest algorithm is abbreviated as RF. Suppose $p$ is the dimension of the sample. The detailed feature reduction process is shown in Algorithm 1.

Algorithm 1: Feature reduction.

Input: $S_0$
Output: $S_t$ and its dimension $p - t$
For $p - 1$ feature reductions do
    Remove the last feature in $S_0$ to form $S_1$
    Compute $ACC_{S_1}$
    ... ...
    Remove the last feature to form $S_{p-1}$
    Compute $ACC_{S_{p-1}}$
    Output $ACC_{S_t}$, which is the largest
    $S_t$ is chosen as the dataset with the optimal features
End
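The backward elimination in Algorithm 1 can be sketched generically. In this illustration the fitness is a caller-supplied function; in the paper it is the ACC of an RF classifier trained on the candidate feature subset. Two details are assumptions not stated in the source: the full feature set is also evaluated as a baseline, and ties are resolved in favour of fewer features. The names `reduce_features` and `toy_fitness`, and the accuracy values, are hypothetical.

```python
def reduce_features(ranked_features, fitness):
    """Drop the lowest-ranked feature step by step and keep the best-scoring subset.

    ranked_features: feature names sorted by decreasing correlation with the label.
    fitness: callable mapping a list of feature names to an accuracy value.
    """
    subset = list(ranked_features)
    best_subset, best_acc = subset, fitness(subset)   # baseline with all features (assumption)
    while len(subset) > 1:
        subset = subset[:-1]                          # remove the last (least correlated) feature
        acc = fitness(subset)
        if acc >= best_acc:                           # ties favour fewer features (assumption)
            best_subset, best_acc = subset, acc
    return best_subset, best_acc


# Hypothetical fitness: a lookup table standing in for the RF accuracy per subset size.
toy_acc_by_size = {4: 0.90, 3: 0.95, 2: 0.95, 1: 0.80}


def toy_fitness(features):
    return toy_acc_by_size[len(features)]


best_features, best_accuracy = reduce_features(["f1", "f2", "f3", "f4"], toy_fitness)
print(best_features, best_accuracy)
```

Because each step retrains the classifier, the cost grows linearly with the number of features, which is one reason a fast learner such as RF is a practical choice of fitness function.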

4.4 Intrusion Detection Model based on Random Forest

After performing feature reduction on $S'$, the new dataset $S_t$ is used to train the RF classifier to establish the intrusion detection model. The framework of the proposed intrusion detection system consists of three steps: oversampling, feature reduction and model testing.

Step 1: Oversampling.

The purpose of oversampling is to oversample the minority samples and reduce the imbalance between minority samples and majority samples.

Step 2: Feature Reduction.

The feature reduction algorithm is applied to the oversampled dataset to generate the new transformed dataset.

Step 3: Test Model.

The new transformed dataset from Step 1 and Step 2 is used to train the RF classifier and obtain the intrusion detection model. New testing samples are then fed into the intrusion detection model to test its performance.

5 EXPERIMENTAL SETTING

5.1 Dataset Description

The dataset used in this paper is the NSL-KDD dataset, which is a modified version of the KDD Cup 99 dataset (Tavalaee et al., 2009). The KDD Cup 99 dataset contains 494,020 samples. Each sample represents a TCP connection record described by a 41-dimensional feature vector, consisting of 9 nominal features and 32 continuous features. Each sample is labeled as one of five classes: normal traffic and four classes of attacks, i.e., Probing, DOS, R2L and U2R. The NSL-KDD dataset is also imbalanced: the number of DOS samples is about 10,000 times the number of U2R samples.

Although the KDD Cup 99 dataset is widely applied in the intrusion detection field, it has some flaws. For example, it contains many duplicated samples, which biases classifiers trained on it toward the classes with more samples. Thus, (Tavalaee et al., 2009) proposed a more complete dataset, namely the NSL-KDD dataset, whose data distribution is more reasonable. (Bhattacharya et al., 2015), (De La Hoz et al., 2015) and (Kim et al., 2014) illustrated that the NSL-KDD dataset can be regarded as an ideal dataset in the intrusion detection field.

5.2 Experimental Setup

The empirical experiments in our work were all implemented on a computer with an Intel Core i7-7700 CPU @ 3.60GHz with 16.0 GB RAM running Windows 10. The feature reduction, transformation and RF classifier test were run using Python.

The dataset is divided into two disjoint parts, which are used for training and testing respectively. The 10-fold cross-validation method was applied to train and validate the proposed classifier on the training dataset. In this method, the dataset is divided into 10 non-overlapping subsets; nine of the ten are used for training and the remaining one for testing. After 10 runs, each subset has been used exactly once for testing, so the RF classifier is trained and tested 10 times. Finally, the performance of intrusion detection is evaluated on the testing dataset.
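The 10-fold procedure described above can be sketched with index lists only. This is a minimal illustration, assuming a shuffled, non-stratified split; the paper does not say whether the folds were stratified by class. The function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint subsets covering all samples
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each of the 25 hypothetical samples is used for testing exactly once.
splits = list(k_fold_indices(25, k=10))
print(len(splits))
```

With a strongly imbalanced dataset, a plain shuffle can leave some folds without any U2R samples, so a stratified split may be preferable in practice.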

5.3 Experimental Results

This section shows the benefits of the proposed intrusion detection model for the multi-class classification problem. We consider the detection rate, the false alarm rate and the accuracy, which are widely applied in related work to indicate the performance of an intrusion detection model. To verify the effectiveness of the model, we present the experiments from four aspects. First, we explain why we choose the random forest as the multi-class classifier. Second, we confirm that the oversampling operation improves the performance of the classifier. Third, we show the advantages of our scheme in terms of training and testing time compared with other classic schemes. Finally, to further investigate the advantages of the proposed model, we compare its overall efficiency with that of other related methods.
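The three metrics can be computed from true and predicted labels as sketched below. The definitions are assumptions, since the paper does not give formulas: accuracy is the fraction of exactly correct labels, the overall detection rate is the fraction of attack samples predicted as any attack class, and the false alarm rate is the fraction of normal samples predicted as an attack. The last definition agrees with Tables 1 and 2, where each false alarm rate equals one minus the detection rate of the Normal class. The label values in the toy example are hypothetical.

```python
def detection_metrics(y_true, y_pred, normal="Normal"):
    """Return accuracy, overall detection rate and false alarm rate (assumed definitions)."""
    pairs = list(zip(y_true, y_pred))
    attacks = [p for t, p in pairs if t != normal]
    normals = [p for t, p in pairs if t == normal]
    return {
        "acc": sum(t == p for t, p in pairs) / len(pairs),
        "detection_rate": sum(p != normal for p in attacks) / len(attacks),
        "false_alarm_rate": sum(p != normal for p in normals) / len(normals),
    }


def per_class_detection_rate(y_true, y_pred):
    """Fraction of samples of each true class that are assigned exactly that class."""
    rates = {}
    for cls in set(y_true):
        preds = [p for t, p in zip(y_true, y_pred) if t == cls]
        rates[cls] = sum(p == cls for p in preds) / len(preds)
    return rates


# Hypothetical toy predictions over eight connections.
toy_true = ["Normal", "Normal", "Normal", "Normal", "DOS", "DOS", "Probe", "U2R"]
toy_pred = ["Normal", "Normal", "Normal", "DOS", "DOS", "DOS", "DOS", "Normal"]
toy_metrics = detection_metrics(toy_true, toy_pred)
toy_rates = per_class_detection_rate(toy_true, toy_pred)
print(toy_metrics, toy_rates)
```

The sketch assumes that both normal and attack samples are present; otherwise the corresponding ratio is undefined.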

The experimental results shown in Table 1 and Table 2 illustrate the advantages of random forest compared with SVM, GBDT, LRL1 and LRL2, where LRL1 and LRL2 represent logistic regression with L1 regularization and L2 regularization respectively.

Table 1: The detection rate obtained by different classifiers in five-class detection.

| Classifier | Probe | DOS | U2R | R2L | Normal |
|---|---|---|---|---|---|
| SVM | 0.952471 | 0.999629 | 0.230769 | 0.183919 | 0.993079 |
| RF | 0.996139 | 0.999695 | 0.519231 | 0.958793 | 0.999584 |
| GBDT | 0.940803 | 0.992862 | 0.269231 | 0.688442 | 0.997015 |
| LRL1 | 0.932481 | 0.998492 | 0.346153 | 0.358794 | 0.984778 |
| LRL2 | 0.929993 | 0.998524 | 0.230769 | 0.098492 | 0.983011 |

Table 2: The overall performance of different classifiers in five-class detection.

| Classifier | ACC | Detection Rate | False Alarm Rate |
|---|---|---|---|
| SVM | 0.985003 | 0.975728 | 0.006921 |
| RF | 0.993785 | 0.997867 | 0.000416 |
| GBDT | 0.987561 | 0.976701 | 0.002985 |
| LRL1 | 0.979731 | 0.973937 | 0.015222 |
| LRL2 | 0.976464 | 0.968941 | 0.016989 |

As shown in Table 1, the random forest has the highest detection rate in every class, especially in detecting U2R and R2L attacks. For U2R, the detection rates of SVM, GBDT, LRL1 and LRL2 are 0.23, 0.27, 0.35 and 0.23 respectively, whereas the detection rate of RF is 0.52, which significantly outperforms the other classifiers. For R2L, the detection rates of SVM, GBDT, LRL1 and LRL2 are 0.18, 0.69, 0.36 and 0.098 respectively, whereas the detection rate of RF is 0.96.

The overall performance of the different classifiers is illustrated in Table 2. RF produces the highest ACC and detection rate and the lowest false alarm rate. The ACC values of SVM, GBDT, LRL1 and LRL2 are 0.985, 0.988, 0.980 and 0.976, while that of RF is 0.994. The overall detection rates of SVM, GBDT, LRL1 and LRL2 are 0.976, 0.977, 0.974 and 0.969, while that of RF is 0.998. The most obvious advantage of random forest is reflected in the false alarm rate. The false alarm rates of SVM, GBDT, LRL1 and LRL2 are 0.0069, 0.0030, 0.0152 and 0.0170, while that of RF is 0.0004, which is roughly 7 to 41 times lower than those of the other classifiers.

The experimental results shown in Table 3 and Table 4 illustrate the advantages of preRF compared with preSVM, preGBDT, preLRL1 and preLRL2. Here the prefix "pre" denotes the combination of oversampling and feature reduction with the corresponding classifier: preRF combines oversampling, feature reduction and random forest, and preSVM, preGBDT, preLRL1 and preLRL2 combine the same two preprocessing steps with SVM, GBDT, LRL1 and LRL2 respectively.

As shown in Table 3, our method has the highest detection rate in all five classes, especially in detecting U2R attacks. For U2R, the detection rates of preSVM, preGBDT, preLRL1 and preLRL2 are 0.46, 0.60, 0.59 and 0.56 respectively, whereas the detection rate of preRF is 0.96, which significantly outperforms the other classifiers. The overall performance of the different combination methods is illustrated in Table 4. The ACC values of preSVM, preGBDT, preLRL1 and preLRL2 are 0.987, 0.989, 0.979 and 0.982, while that of preRF is 0.999. The overall detection rates of preSVM, preGBDT, preLRL1 and preLRL2 are 0.977, 0.973, 0.974 and 0.974, while that of preRF is 0.9985. The most obvious advantage of random forest is reflected in the false alarm rate. The false alarm rates of preSVM, preGBDT, preLRL1 and preLRL2 are 0.0031, 0.0029, 0.0149 and 0.0106, while that of preRF is 0.0003, which is nearly 10 times lower than those of preSVM and preGBDT and roughly 33 to 47 times lower than those of preLRL2 and preLRL1. In summary, preRF is good at detecting the Probe, DOS, U2R, R2L and Normal classes.

Table 3: The detection rate obtained by different combination methods.

| Method | Probe | DOS | U2R | R2L | Normal |
|---|---|---|---|---|---|
| preSVM | 0.955387 | 0.999673 | 0.458042 | 0.934307 | 0.996864 |
| preRF | 0.996654 | 0.999738 | 0.963286 | 0.997168 | 0.999683 |
| preGBDT | 0.949011 | 0.994316 | 0.596153 | 0.929009 | 0.997065 |
| preLRL1 | 0.933537 | 0.998543 | 0.590909 | 0.931841 | 0.985142 |
| preLRL2 | 0.930934 | 0.998671 | 0.562937 | 0.936683 | 0.989411 |

Table 4: The overall performance of different combination methods.

| Method | ACC | Detection Rate | False Alarm Rate |
|---|---|---|---|
| preSVM | 0.986988 | 0.977365 | 0.003136 |
| preRF (our method) | 0.999088 | 0.998508 | 0.000317 |
| preGBDT | 0.988894 | 0.973033 | 0.002935 |
| preLRL1 | 0.979315 | 0.973637 | 0.014858 |
| preLRL2 | 0.981514 | 0.973818 | 0.010589 |

The advantages of oversampling and feature reduction can be seen by comparing Table 2 and Table 4. In Table 4, preRF provides an ACC, detection rate and false alarm rate of 0.999088, 0.998508 and 0.000317 respectively, all of which are better than the corresponding values of the plain RF classifier in Table 2 (0.993785, 0.997867 and 0.000416). Most of the other methods show a similar trend, although the detection rate of GBDT and the ACC and detection rate of LRL1 decrease slightly.

The advantages of our scheme in terms of training and testing time compared with other classic schemes are shown in Table 5. The training time is about 306 s for preSVM, 10 s for preRF, 38 s for preGBDT, 59 s for preLRL1 and 20 s for preLRL2, so the method proposed in this paper has the shortest training time. The testing time is about 1 ms for preSVM, preGBDT, preLRL1 and preLRL2, and about 11 ms for preRF. Although preRF has the longest testing time, it has the shortest total time over the whole process.

Table 5: The training time and testing time of different combination methods.

| Method | Training time | Testing time | Sum time |
|---|---|---|---|
| preSVM | 306s | 1ms | 306.02s |
| preRF | 10s | 11ms | 10.18s |
| preGBDT | 38s | 1ms | 38.02s |
| preLRL1 | 59s | 1ms | 59.02s |
| preLRL2 | 20s | 1ms | 20.02s |

Building on the previous results, we further examine the proposed method over the five classes. The performance of different methods is illustrated in Fig. 3. CANN (6), CANN (19), k-NN (6) and k-NN (19) are described in detail in (Wei-Chao et al., 2015), where 6 and 19 represent the dimensions of CANN and k-NN with feature reduction, and preRF (30) indicates that the optimal dimension after feature reduction is 30 in our method. For the Normal, Probe and Dos classes, preRF shows no obvious advantage over CANN and k-NN. However, preRF detects U2R and R2L with the highest detection rate. These results indicate that classification with oversampling and feature reduction is suitable for the detection of U2R and R2L in the five-class problem.

Figure 3: The performance of preRF, CANN and k-NN. (Bar chart over the Normal, Probe, Dos, U2R and R2L classes; vertical axis from 0.00% to 150.00%; series: CANN(6), CANN(19), k-NN(6), k-NN(19) and preRF(30).)

6 CONCLUSIONS

This paper proposes a multi-classification method that applies the improved synthetic minority oversampling technique (I-SMOTE) to balance the dataset, employs correlation analysis and random forest to reduce features, and uses the random forest algorithm to train the classifier for multi-attack type detection. The experimental results based on the NSL-KDD dataset show that it achieves better and more robust performance in terms of accuracy, detection rate, false alarms and training speed.

ACKNOWLEDGEMENTS

The authors would like to thank the editorial board and reviewers. This work was supported by the Research on Key Technologies of High Security and Trustworthy Mobile Terminal Operating System Security Protection (2017YFB0801902).

REFERENCES

Amin Hatef M., Shaker V., Reza Jabbarpour M. et al. HIDCC: A hybrid intrusion detection approach in cloud computing. Concurrency Computat: Pract Exper., 2018; 30: e4171.

Luo B., Xia J., A novel intrusion detection system based on feature generation with visualization strategy. Expert Syst. Appl. 41 (9) (2014) 4139-4147.

Tan L. L., Li C. and Xia J. M., Application of self-organizing feature map neural network based on K-means clustering in network intrusion detection. CMC-Computers Materials & Continua, 61 (1) (2019) 275-288.

Bamakan S. M. H., Wang H., Tian Y., Shi Y., An effective intrusion detection framework based on mclp/svm optimized by time-varying chaos particle swarm optimization, Neurocomputing 199 (2016) 90-102.

Kuang F., Xu W., Zhang S., A novel hybrid kpca and svm with ga model for intrusion detection. Appl. Soft. Comput. 18 (4) (2014) 178-184.

Horng S. J., Su M. Y., Chen Y. H., Kao T. W., A novel intrusion detection system based on hierarchical clustering and support vector machines, Expert Syst. Appl. 38 (1) (2011) 306-313.

Li X. K., Chen W., and Zhang Q. R., 2020. Building Auto-Encoder Intrusion Detection System based on random forest feature selection. 95. 10.1016/j.cose.2020.101851.

Chauhan, V. Kumar, S. Pundir, and E. S. Pilli. 2013. A comparative study of classification techniques for intrusion detection. In 2013 International Symposium on Computational and Business Intelligence. 40–43.

Amira Sayed A. Aziz, Sanaa EL-Ola Hanafi, and Aboul Ella Hassanien. 2017. Comparison of classification techniques applied for network intrusion detection and classification. Journal of Applied Logic 24 (2017), 109–118.

Manjula C. Belavagi and Balachandra Muniyal. 2016. Performance evaluation of supervised machine learning algorithms for intrusion detection. Procedia Computer Science 89 (Jan. 2016), 117–123.

Golrang A., Golrang A. M., and Yayilgan S. Y. 2020. A Novel Hybrid IDS Based on Modified NSGAII-ANN and Random Forest. 9 (4) 2020.

Tesfahun A. and Bhaskari D. L., 2013. Intrusion detection using random forests classifier with SMOTE and feature reduction. In 2013 International Conference on Cloud Ubiquitous Computing Emerging Technologies. 127–132.

Malik A. J., Shahzad W., and Khan F. A., 2011. Binary PSO and random forests algorithm for PROBE attacks detection in a network. In 2011 IEEE Congress of Evolutionary Computation (CEC'11). 662–668.

Malik A. J. and Khan F. A., 2013. A hybrid technique using multi-objective particle swarm optimization and random forests for PROBE attacks detection in a network. In 2013 IEEE International Conference on Systems, Man, and Cybernetics. 2473–2478.

Arif Jamal Malik, Waseem Shahzad, and Farrukh Aslam Khan. 2015. Network intrusion detection using hybrid binary PSO and random forests algorithm. 8, 16 (Nov. 2015), 2646–2660.

Zhong SH, Huang HJ, Chen AB. 2011. An effective intrusion detection model based on random forest and neural networks. Advanced Materials Research, vol. 267 (308), pp. 308–313.

Md. Al Mehedi Hasan, Mohammed Nasser, Biprodip and Shamim Ahmad. 2014. Support Vector Machine and Random Forest Modeling for IDS, JILSA, pp. 45–52.

Tan X. P., Su S. J., Huang Z. P., Guo X. J., Wireless Sensor Networks Intrusion Detection Based on SMOTE and the Random Forest Algorithm, Sensors 2019. 19, 203.

Wang L., Chen H. M., Unbalanced dataset classification method based on NKSMOTE algorithm. Computer Science, 2018, 9 (45), 260-265.

Tavalaee M., Bagheri E., Lu W., Ghorbani A. A., 2009. A detailed analysis of the kdd cup 99 data set, in: Proceedings of the Second IEEE Symposium on Computational Intelligence for Security and Defense Applications, 2009.

Bhattacharya S., Selvakumar S., 2015. LAWRA: a layered wrapper feature selection approach for network attack detection, Security Commun Netw 8 (18), 3459–3468.

De La Hoz E., Ortiz A., Ortega J., Prieto B., 2015. PCA filtering and probabilistic som for network intrusion detection, Neurocomputing 164, 71–81.

Kim G., Lee S., Kim S., 2014. A novel hybrid intrusion detection method integrating anomaly detection with misuse detection, Expert Syst. Appl. 41 (4), 1690–1700.

Wei-Chao L., Shih-Wen Ke, Chih-Fong T., CANN: An intrusion detection system based on combining cluster centers and nearest neighbors. Knowledge-Based Systems. 78 (2015) 13-21.

Chen Tongbao, Wen Liangming, Li Jianhui. A Data Prediction Method Based on Feature Selection and Transfer Learning[J]. Frontiers of Data & Computing, 2020, 2(2): 145-154.