Evaluating Pattern Recognition Techniques in Intrusion
Detection Systems

M. Esposito (1,2), C. Mazzariello (1), F. Oliviero (1), S.P. Romano (1) and C. Sansone (1)

(1) University of Napoli Federico II, Dipartimento di Informatica e Sistemistica
Via Claudio 21, 80125 Napoli, Italy

(2) CRIAI, Consorzio Campano di Ricerca per l'Informatica e l'Automazione Industriale
Piazzale E. Fermi 1, 80055 Portici (Napoli), Italy
Abstract. Pattern recognition is the discipline studying the design and operation
of systems capable of recognizing patterns with specific properties in data sources.
Intrusion detection, on the other hand, is in charge of identifying anomalous activities
by analyzing a data source, be it the logs of an operating system or the network
traffic. It is easy to find similarities between the two research fields, and it is
straightforward to think of a way to combine them. Given the descriptions above,
we can imagine an Intrusion Detection System (IDS) that uses techniques from the
pattern recognition field in order to discover attack patterns within the network
traffic. What we propose in this work is such a system, which exploits the results
of research in the field of data mining in order to discover potential attacks. The
paper also presents some experimental results dealing with the performance of our
system in a real-world operational scenario.
1 Introduction
Security of computer networks has been the subject of intensive research activity
in recent years, and new solutions and techniques have been proposed to tackle the
security issue. Firewalls and Intrusion Detection Systems (IDSs) are the most
well-known tools that can be employed to protect a network from malicious activities:
firewalls are used to prevent intrusions from happening, whereas an IDS detects an
intrusion while it is happening. In particular, in order to accomplish its task, an
intrusion detection system needs a pre-defined set of models, or "patterns", describing
the behaviour of both "normal" and "malicious" network users. By monitoring the real
traffic on the network, the system computes a current user profile, which is compared
with the set of pre-defined models in order to detect potential intrusions.
If we look at both normal and anomalous behaviors as patterns, we can use common
pattern recognition techniques to find attack instances within the network traffic.

(Research outlined in this paper is partially funded by the Italian "Ministero
dell'Istruzione, dell'Università e della Ricerca (MIUR)" in the framework of the FIRB
Project "Middleware for advanced services over large-scale, wired-wireless distributed
systems (WEB-MINDS)".)
IDSs commonly base their detection ability on a set of attack models, which makes it
both easy and fast to discover well-known attacks. However, they are often unable to
detect unknown anomalies: even slight modifications of an attack pattern may result in
a missed detection. The easiest techniques to use are based on attack signatures, which
are organized in databases and are usually hand-coded by a network administrator. An
attack signature is the fingerprint of a specific attack; it is statically defined and
strictly related to the attack type it is meant to detect. Pattern recognition
techniques, instead, have proved to have a higher generalization capability.
Based on the above considerations, our work aims to define a framework for ex-
tracting high-level knowledge from a large data set by means of pattern recognition
techniques in order to discover a set of patterns able to distinguish between normal ac-
tivities on one side and intrusions on the other. The framework is first presented and
then evaluated by means of experiments conducted on real data.
2 Related Work
This work is closely related to both intrusion detection and data mining.
As to the first research field, intrusion detection is the art of detecting inappropriate,
incorrect or anomalous activity within a system, be it a single host or a whole network.
An Intrusion Detection System (IDS) analyzes a data source and, after preprocessing
the input, lets a detection engine decide, based on a set of classification criteria, whether
the analyzed input instance is normal or anomalous, given a suitable behavior model.
Intrusion Detection Systems can be grouped into three main categories: Network-based
Intrusion Detection Systems (N-IDS) [1], Host-based Intrusion Detection Systems (H-
IDS) [2] [3] and Stack-based Intrusion Detection Systems (S-IDS) [4]. This classifica-
tion depends on the information sources analyzed to detect an intrusive activity.
Intrusion Detection Systems can be roughly classified as belonging to two main
groups as well, depending on the detection technique employed: anomaly detection and
misuse detection [5]. Both such techniques rely on the existence of a reliable character-
ization of what is normal and what is not, in a particular networking scenario.
The main problem related to both anomaly and misuse detection techniques resides
in the encoded models, which define normal or malicious behaviors. Although some re-
cent open source IDSs, such as SNORT (http://www.snort.org) [6] or Bro
(http://www.bro-ids.org) [7], provide mechanisms to write new
rules that extend the detection ability of the system, such rules are usually hand-coded
by a security administrator. This represents a weakness in the definition of new normal
or malicious behaviors. Recently, many research groups have focused on the definition
of systems able to automatically build a set of models. Data mining techniques are fre-
quently applied to audit data in order to compute specific behavioral models (MADAM
ID [8], ADAM [9]).
Coming to the second related research field, we recall that data mining refers to the
process of extracting specific models from a large amount of stored data [10]. Machine
learning or pattern recognition processes are usually exploited in
order to perform this extraction. These may be considered off-line processes. In fact,
all the techniques used to build intrusion detection models need a proper set of audit
data, in which each item must be labelled as either "normal" or "attack" in order to
define suitable behavioral models for these two categories. Such audit data are quite
difficult to obtain.
We finally mention that this work also entails an analysis of the network traffic
aimed at defining a comprehensive set of so-called connection features. Such a process
requires that an ad-hoc classifier is defined and implemented. The greater the capability
of the set of features to discriminate among different categories, the better the classifier.
Many researchers have been working on this topic in the last few years.
In particular, we have adopted a model descending from the one proposed by Stolfo et
al. [11], who propose a set of connection features that can be classified into three
main groups: intrinsic features, content features, and traffic features. Intrinsic
features specify general information on the current session, such as the duration in
seconds of the connection, the protocol type, the port number (i.e. the service), the
number of bytes from the source to the destination, and so on.
The content features are related to the semantic content of the connection payload:
for example, they specify the number of failed login attempts, or the number of shell
prompts.
Finally, the traffic features can be divided into two groups: the same host and the
same service features. The same host features examine all the connections in the last
two seconds directed to the same destination host as the current connection, computing,
for instance, the number of such connections or the rate of connections with a "SYN"
error. The same service features, instead, examine all the connections in the last two
seconds directed to the same destination service as the current one. A minimal sketch
of how such windowed features can be computed is given below.
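To make the traffic features concrete, the following Python fragment is a minimal
sketch, under our own naming assumptions (the Conn fields and the TrafficFeatures
class are illustrative, not the original implementation), of how the same host and
same service features could be computed over a sliding two-second window.

from collections import deque
from dataclasses import dataclass

@dataclass
class Conn:
    timestamp: float   # arrival time in seconds
    dst_host: str      # destination host
    dst_port: int      # destination port, i.e. the service
    syn_error: bool    # whether the connection showed a "SYN" error

class TrafficFeatures:
    """Keeps the connections seen in the last two seconds and derives the
    same-host / same-service features for the current connection."""
    WINDOW = 2.0

    def __init__(self):
        self.recent = deque()

    def update(self, conn: Conn) -> dict:
        # Evict connections that fell out of the two-second window.
        while self.recent and conn.timestamp - self.recent[0].timestamp > self.WINDOW:
            self.recent.popleft()
        same_host = [c for c in self.recent if c.dst_host == conn.dst_host]
        same_srv = [c for c in self.recent if c.dst_port == conn.dst_port]
        count = len(same_host)
        syn_error_rate = sum(c.syn_error for c in same_host) / count if count else 0.0
        self.recent.append(conn)
        return {"count": count,
                "syn_error_rate": syn_error_rate,
                "srv_count": len(same_srv)}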
3 Rationale and Motivation
One of the main issues related to pattern recognition in intrusion detection is the use of
a proper data set, containing user profiles on which the data mining processes work in
order to extract the patterns. In principle, an effective set of detection patterns
has to cover all possible user behaviors. Moreover, as in all pattern recognition
processes, the data set has to properly label the behavior profile items as either
"normal" or "attack". Although this might look like an easy task, labelling the data
imposes a pre-classification process: one has to know exactly which profile is
"normal" and which is not.
In order to solve the issue related to data set building, two main approaches are
possible: the former relies on simulating a real-world network scenario, the latter builds
the set using actual traffic.
The first approach is usually adopted when applying pattern recognition techniques
to intrusion detection. The most well-known data set is the so-called KDD Cup 1999
Data (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), created for the Third
International Knowledge Discovery and Data Mining Tools Competition, held within
KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. It
is derived from the traffic traces produced by the Lincoln Laboratory at MIT in order
to conduct a comparative evaluation of intrusion detection systems, developed under
DARPA (Defense Advanced Research Projects Agency) and AFRL (Air Force Research
Laboratory) sponsorship (http://www.ll.mit.edu/IST/ideval).
This set was created in order to evaluate the ability of data mining algorithms to
build predictive models able to distinguish between normal and malicious behavior.
The KDD Cup 1999 Data contains a set of "connection" records derived from the
processing of raw TCPdump data. Each connection is labelled as either "normal" or
"attack". The connection records are built from a set of higher-level connection
features, defined by Stolfo et al. [11], that are able to tell apart normal activities
from illegal network activities.
Although widely employed, the KDD Cup 1999 Data has also attracted criticism [12].
Indeed, numerous research works analyze the difficulties arising when trying to
reproduce actual network traffic patterns by means of simulation [13]; the major issue
resides in effectively reproducing the behavior of network traffic sources.
Based on the considerations above, we have concluded that the KDD Cup 1999 Data can
only be used to evaluate the effectiveness of the pattern recognition algorithms under
study, rather than in a real intrusion detection application.
Collecting real traffic can be considered a viable alternative approach for the
construction of the traffic data set [14]. Although it can prove effective for real
time intrusion detection, it still raises some concerns. In particular, building the
data set from real traffic requires a data pre-classification process: as stated
before, the pattern recognition process needs a data set in which packets are labelled
as either "normal" or "attack", yet no information is available in real traffic to
distinguish the normal activities from the malicious ones. Hence we face a paradox: we
need pre-classified traffic in order to extract the models able to classify the
traffic. Last but not least, the issue of the privacy of the information contained in
real network data has to be considered: payload anonymizers and IP address spoofing
tools are needed in order to preserve sensitive information.
This work aims to develop a real time intrusion detection system based on pattern
recognition techniques. We have adopted the real traffic collection approach to
extract the network behavior models. In the paper we present a method to: (i) collect
real data from a network; (ii) process such information in order to build and
appropriately label the associated data set. We also provide some figures about the
effectiveness of the proposed method.
4 A feature-based framework for detecting attacks
The first issue that a real time intrusion detection system has to face is the
computation of an up-to-date user behavior profile every time a new event occurs on
the network. In particular, the set of features characterizing the current traffic
profile has to be determined every time a new packet is captured from the network.
Issues related to real time feature computation have usually been neglected by
previous research on pattern recognition applied to intrusion detection. For example,
few works have dealt with the packet loss issue: if the profile extraction time is
longer than the packet inter-arrival time (due to either a low computation speed or a
high traffic rate), some packets may be lost, to the detriment of the detection
ability.
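As a back-of-the-envelope illustration of this constraint (not part of our system),
the steady-state fraction of lost packets can be estimated from the packet rate and
the per-packet feature extraction time:

def dropped_fraction(pkt_rate_pps: float, extraction_time_s: float) -> float:
    """Fraction of packets lost in steady state, assuming the analyzer
    can serve at most 1/extraction_time_s packets per second."""
    service_rate = 1.0 / extraction_time_s
    if service_rate >= pkt_rate_pps:
        return 0.0
    return 1.0 - service_rate / pkt_rate_pps

# E.g. 2000 packets/s against a 1 ms extraction time: half the packets are lost.
print(dropped_fraction(2000, 0.001))  # 0.5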
In our previous work [15], we evaluated the feasibility of a real time intrusion
detection system. The system we developed is available at SourceForge
(http://sourceforge.net/projects/s-predator). In this section we present a different
contribution, dealing with an approach to the off-line extraction of models that can
be profitably exploited in the real time system.
As stated before, we need a proper data set on which the pattern recognition algorithm
can work in order to extract the "detection patterns" needed for the real time
classification process. With our approach, we collect real traffic traces. We deem
that such an approach represents a desirable solution in case the computed patterns
have to be applied in an actual operational scenario (see section 3). Our data set has
been built by collecting real traffic on the local network of the National Research
Council (CNR) in Genova. The raw traffic data set contains about one million packets,
equivalent to 1 GByte of data. The network traffic has been captured by means of the
TCPdump tool and logged to a file. In order to solve the pre-classification problem
(which, as already stated, requires labelling the items in the data set), we have
leveraged previous work by the Genova research team: by running two different
intrusion detection systems, researchers in Genova analyzed the generated alert files
and manually identified, in the logged traffic, a set of known intrusions. We have
exploited the results of this work in order to extract the connection feature records
and properly label each of them with either a normal or an attack tag, as will be
clarified in section 5.
After building the data set, we focused on the management of the data in order to
carry out the pattern recognition process. Every record in the data set is composed of
26 connection features, namely Stolfo's "intrinsic" and "traffic" features. In
practice, just a few features suffice to tell apart normal from anomalous traffic in
the analyzed network scenario; in fact, some attacks can be classified with only a
small set of connection features. This can be considered an advantage: we can reduce
the dimensionality of the data set, making the pattern recognition process simpler. As
in all data mining processes, the issue of choosing a suitable subset of features is
known as the feature selection problem.
In our context, we have adopted ToolDiag
(http://www.inf.ufes.br/~thomas/home/tooldiag.html), a pattern recognition toolbox, in
order to perform feature selection. As a selection strategy we adopted Sequential
Forward Selection with the Estimated Minimum Error Probability criterion.
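As an illustration only, the following sketch mimics the same strategy using
scikit-learn's SequentialFeatureSelector in place of ToolDiag, with a k-NN
cross-validated error estimate standing in for the Estimated Minimum Error Probability
criterion; the data arrays are random placeholders for the 26-feature connection
records.

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 26))      # placeholder for the 26 connection features
y = rng.integers(0, 2, size=500)    # placeholder normal(0)/attack(1) labels

sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=8,         # the "optimum" 8 out of the 26 features
    direction="forward",            # Sequential Forward Selection
    scoring="accuracy",             # maximizing accuracy = minimizing error
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the selected features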
The last step in our work has concerned the extraction of network behavior patterns
from the data set.
By using Stolfo's connection features, which cover a wide range of attack types, it is
possible to characterize the attacks by means of a set of rules. Supposing that each
traffic data item can be represented in a vector space, a data mining process
partitions such a space into a normal region and an attack region, based on the rule
set; if the feature vector related to the current packet falls in the attack region,
an intrusive action is
detected. In this way a rule does not refer to a single attack; it is rather used in a
more complex classification process.
In order to extract the set of rules from the data set, we have adopted the SLIPPER
tool (http://www-2.cs.cmu.edu/~wcohen/slipper/) [16]. SLIPPER is a rule-learning
system exploiting the boosting technique [17].
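Although the learned rules are specific to the data set, the following fragment
sketches, with invented rules and confidences, how a SLIPPER-style boosted rule set
classifies a connection record: each rule that fires contributes its confidence, and
the record is flagged as an attack when the total outweighs the default-rule score.

from typing import Callable

Rule = tuple[Callable[[dict], bool], float]   # (condition, confidence)

ruleset: list[Rule] = [
    (lambda r: r["syn_error_rate"] > 0.5, 1.2),
    (lambda r: r["srv_count"] > 20 and r["duration"] < 1.0, 0.8),
]
default_confidence = -0.5   # favors "normal" when no rule fires

def classify(record: dict) -> str:
    score = default_confidence + sum(conf for cond, conf in ruleset if cond(record))
    return "attack" if score > 0 else "normal"

print(classify({"syn_error_rate": 0.9, "srv_count": 3, "duration": 5.0}))  # attack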
5 Experimental results
In this section we present some experimental results concerning the attack detection
capabilities attained by the proposed approach. We mainly focus on the missed
detection rate and, more importantly, on the false alarm rate, a low value of which is
a critical requirement for an effective intrusion detection system [18]. Though in
other pattern recognition applications a false positive rate below 5% may be very
satisfactory, in intrusion detection such a rate may not be acceptable. For example,
on a network carrying 1,000,000 packets per hour, a false alarm rate of 0.1% would
lead to 1000 annoying alert messages sent to the administrator every hour: though
characterized by a very low false alarm rate, the system would produce so many
unjustified alerts that the administrator would end up ignoring them, or even
switching the intrusion detection system off.
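The arithmetic behind this example is immediate (assuming essentially all of the
traffic is benign):

packets_per_hour = 1_000_000
false_alarm_rate = 0.001        # 0.1%
alerts_per_hour = packets_per_hour * false_alarm_rate
print(alerts_per_hour)          # 1000 unjustified alerts every hour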
Table 1. Detection accuracy after feature selection
Train Error Rate Test Error Rate Hypothesis Size Learning Time
Test 1 0.25% 0.35% 9 Rules, 29 Conditions 196.14s
Test 2 0.18% 0.31% 12 Rules, 46 Conditions 211.78s
Test 3 0.22% 0.28% 9 Rules, 34 Conditions 202.08s
Test 4 0.23% 0.26% 9 Rules, 31 Conditions 182.32s
Test 5 0.21% 0.32% 9 Rules, 41 Conditions 264.87s
Test 6 0.20% 0.35% 9 Rules, 31 Conditions 222.76s
Test 7 0.20% 0.31% 9 Rules, 31 Conditions 202.62s
Test 8 0.15% 0.29% 13 Rules, 45 Conditions 243.16s
Test 9 0.20% 0.30% 10 Rules, 29 Conditions 233.16s
Test 10 0.24% 0.31% 24 Rules, 85 Conditions 244.37s
Test 11 0.17% 1.38% 10 Rules, 38 Conditions 198.12s
Test 12 0.25% 0.32% 9 Rules, 31 Conditions 225.62s
Test 13 0.19% 0.29% 13 Rules, 40 Conditions 195.63s
Test 14 0.17% 0.32% 7 Rules, 24 Conditions 188.12s
Test 15 0.21% 0.30% 11 Rules, 43 Conditions 223.65s
Test 16 0.23% 0.28% 4 Rules, 9 Conditions 186.62s
Test 17 0.21% 0.28% 7 Rules, 26 Conditions 246.65s
Test 18 0.17% 0.18% 14 Rules, 59 Conditions 244.29s
We ran several tests on the large amount of data collected at the CNR laboratories in
Genova (Italy). As stated before, we have a log of about 1,000,000 packets.
Table 2. Detection accuracy after feature selection – Average values
Train Error Rate Test Error Rate Hypothesis Size Learning Time
0.20% 0.36% 10 Rules, 37 Conditions 217.33s
Table 3. Detection accuracy after filtering and feature selection
Training Set Test Set Missed Detections False Alarms
1st Half 2nd Half 33.59% 0.06%
2nd Half 1st Half 50.41% 0.03%
Table 4. Detection accuracy without feature selection
Training Set Test Set Missed Detections False Alarms
1st Half 2nd Half 13.57% 0.16%
2nd Half 1st Half 55.32% 0.07%
Table 5. Detection accuracy after filtering without feature selection
Training Set Test Set Missed Detections False Alarms
1st Half 2nd Half 13.79% 0.16%
2nd Half 1st Half 62.19% 0.05%
Table 6. Detection accuracy without feature selection – Trin00 attack
Training Set Test Set Missed Detections False Alarms
1st Half 2nd Half 4% 0%
2nd Half 1st Half 0% 0%
Table 7. Detection accuracy after filtering and without feature selection – Trin00 attack
Training Set Test Set Missed Detections False Alarms
1st Half 2nd Half 0% 0%
2nd Half 1st Half 0% 0%
Table 8. Detection accuracy without feature selection – Scan SOCKS attack
Training Set Test Set Missed Detections False Alarms
1st Half 2nd Half 5.91% 0%
2nd Half 1st Half 38.98% 0%
Table 9. Detection accuracy after filtering and without feature selection – Scan SOCKS attack
Training Set Test Set Missed Detections False Alarms
1st Half 2nd Half 6.54% 0%
2nd Half 1st Half 48.30% 0%
First of all, we decided to subsample the data by a factor of 1/10 in order to reduce
the computation time; as stated before, we use ToolDiag for the feature selection step
and SLIPPER for the classification. In the first experiment we subsample the data set
by choosing one connection record out of ten, then we split the resulting subsets into
two parts. On each of the half-subsets obtained we perform feature selection and, by
examining the discriminating power of the selected features and their number of
occurrences over the whole data set, we choose an "optimum" set of 8 features out of
the 26 available. By "optimum" feature, we mean a feature whose ability to
discriminate between attacks and normal traffic, within the training data, is the
highest among all the examined features. We then consider, in turn and for each
subset, the first half as the training set and the second half as the test set;
afterwards we swap training and test sets, using the second half of each subset as the
training set and the first half as the test set (a sketch of this protocol is given
below). All these experiments are useful to understand which is the best data set we
have, as we assume no prior knowledge about the discriminating power of the connection
records included in each of them. Table 1 shows the performance attained over the 18
experiments. It is worth noticing that the classification error over the test set is,
except for a couple of cases, below 0.50%, and that the number of rules and conditions
is much lower than the number of rules commonly used in SNORT, which is about 1500.
Furthermore, the computation times for the classification criteria seem very
reasonable when compared with the time required if feature selection is not employed.
Table 2 reports the average values emerging from these results.
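For clarity, the protocol just described can be summarized by the following sketch,
where records and labels stand for the full data set:

def half_split_experiments(records, labels):
    # 1/10 subsampling: keep one connection record out of ten.
    sub_r, sub_y = records[::10], labels[::10]
    mid = len(sub_r) // 2
    halves = [(sub_r[:mid], sub_y[:mid]), (sub_r[mid:], sub_y[mid:])]
    # Train on one half and test on the other, then swap the roles.
    experiments = []
    for (train, y_train), (test, y_test) in (halves, halves[::-1]):
        experiments.append({"train": (train, y_train), "test": (test, y_test)})
    return experiments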
It is worth pointing out that the data we are working on contain some connection
records tagged as uncertain. During data preparation, we decided to label as attacks
the connection records corresponding to the packets classified as attacks by both of
the IDSs used at the Genova CNR; in case only one of the tools raised an alert, in
this first experiment we decided to label the corresponding packet as normal. A doubt
naturally arises about this approach: what if the uncertain packets were attack
packets? Would this affect in a meaningful way the detection capability of the system?
We had two options: we could consider the uncertain packets as attacks as well, though
this would have led us to a mistake complementary to the one committed so far; or we
could simply discard such packets, considering them as belonging to an unknown class
of traffic. Thus we built and processed a "filtered" data set, made up of all the
connection records corresponding to packets whose classification was clear enough,
obtained by deleting the uncertain connection records from the set; the resulting
labelling policy is sketched below.
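The labelling policy just described amounts to the following rule, with illustrative
argument names:

def label(flagged_by_ids1: bool, flagged_by_ids2: bool, filtered: bool):
    if flagged_by_ids1 and flagged_by_ids2:
        return "attack"                        # both IDSs agree
    if flagged_by_ids1 or flagged_by_ids2:
        return None if filtered else "normal"  # uncertain: discard or keep as normal
    return "normal"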
Again we proceeded with feature selection and obtained, in the same way as before, the
best set of eight features. On the filtered data we decided to perform a test using
the whole data set, with no subsampling. We divided the data set into two halves: in
Test 1 we considered the first half as the training set and the second half as the
test set; in Test 2, instead, we considered the second half of the data set as the
training set and the first half as the test set.
Furthermore, to test the effect of feature selection on the detection capability of
the system, we decided not to apply subsampling, and to test the classifier on the
data sets before and after the filtering process described above (tables 4 and 5). We
notice a very low false alarm rate, which is good, and a missed detection rate that is
sometimes around 60%. This might seem a poor result, but it is not: missing an attack
packet does not mean missing the whole attack. In fact, an attack pattern may consist
of a burst of packets; thus, failing to detect a few of such packets does not mean
losing the attack. Stressing again the false alarm rate problem, we notice that the
rate obtained in our experiments is very low, which is encouraging for the development
of this kind of detection technique.
In order to strengthen these observations, we also report, in tables 6-9, the
detection capabilities measured on two specific attacks: Trin00 and the Scan SOCKS
probe. Trin00 is always detected by our IDS, while for Scan SOCKS we miss some attack
packets; anyway, as probes are not real attacks, we can consider such results good as
well. Probe attacks are usually just the preliminary phase of an attack, thus it is
important to detect their occurrence as soon as possible, as our IDS does; once we
know a scan is in progress, we can strengthen our defences in order to protect the
system from the attack which will likely follow.
While the missed detection rate is slightly lower when feature selection is not used,
we noticed an order-of-magnitude increase in both rule computation time and the number
of rules. This confirms that a balance has to be struck between detection accuracy,
the number of adopted criteria, and computation time.
6 Conclusions and Future Work
Intrusion detection systems based on pattern recognition techniques definitely
represent a very interesting tool. Our system shows a very low false alarm rate, which
is the most important requirement for an effective IDS. Though the missed detection
rate is not as low as the false alarm rate, it is encouraging to point out that
missing a single attack packet does not mean missing the whole attack; particular
attack types, like scans or probes, might make the job of an IDS harder. As to the
evolving nature of scan attacks, when evaluating the capability of detecting them we
have to take into account the transient to which the IDS, like any system, is subject:
after a short transient, indeed, probe attacks are also discovered and reported to the
administrator. When using pattern recognition in intrusion detection, we have to face
the trade-off between detection accuracy and resource consumption, where the resource
is, namely, computation time. In our case, when not using feature selection, we obtain
a slightly more precise detection at the cost of producing, in a very long time (about
2000 seconds), a huge number of classification criteria (over 100) with an even higher
number of conditions. Thus, when designing and configuring such a system, a
preliminary phase of trade-off evaluation is mandatory.

It will certainly be helpful, in the future, to test the proposed approach over
different classes of network traffic, as well as to inject new attacks and evaluate in
greater detail the attack prediction capability. Furthermore, the analysis of
techniques based on multiple classifiers will be the subject of our future research:
by combining the results attained by different classification strategies, we can
improve the overall attack detection capability of the system.
References
1. Vigna, G., Kemmerer, R.: NetSTAT: a network-based intrusion detection system.
Journal of Computer Security 7 (1999)
2. Anderson, D.: Detecting unusual program behavior using the statistical component of
the Next-generation Intrusion Detection Expert System (NIDES). Technical report,
Computer Science Laboratory, SRI International (1995)
3. Tyson, M.: DERBI: diagnosis, explanation and recovery from computer break-ins.
Technical report (2000)
4. Laing, B., Alderson, J.: How to guide - implementing a network based intrusion
detection system. Technical report, Internet Security Systems, Sovereign House, 57/59
Vastern Road, Reading (2000)
5. Bace, R.G.: Intrusion Detection. Macmillan Technical Publishing (2000)
6. Baker, A.R., Caswell, B., Poor, M.: Snort 2.1 Intrusion Detection, Second Edition.
Syngress (2004)
7. Paxson, V., Tierney, B.: Bro reference manual (2004)
8. Lee, W., Stolfo, S.J.: A framework for constructing features and models for
intrusion detection systems. ACM Transactions on Information and System Security
(TISSEC) 3 (2000) 227–261
9. Barbara, D., Couto, J., Jajodia, S., Popyack, L., Wu, N.: ADAM: detecting
intrusions by data mining. In: Proceedings of the IEEE Workshop on Information
Assurance and Security (2001) 11–16
10. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge
discovery in databases. AI Magazine (1996) 37–52
11. Stolfo, S.J., Fan, W., Lee, W., Prodromidis, A., Chan, P.: Cost-based modeling for
fraud and intrusion detection: results from the JAM project. In: Proceedings of the
2000 DARPA Information Survivability Conference and Exposition (DISCEX '00) (2000)
12. McHugh, J.: Testing intrusion detection systems: a critique of the 1998 and 1999
DARPA intrusion detection system evaluations as performed by Lincoln Laboratory. ACM
Transactions on Information and System Security 3 (2000) 262–294
13. Paxson, V., Floyd, S.: Difficulties in simulating the Internet. IEEE/ACM
Transactions on Networking 9 (2001) 392–403
14. Mahoney, M.: A Machine Learning Approach to Detecting Attacks by Identifying
Anomalies in Network Traffic. PhD thesis, Florida Institute of Technology (2003)
15. Esposito, M., Mazzariello, C., Oliviero, F., Romano, S.P., Sansone, C.: Real time
detection of novel attacks by means of data mining. In: Proceedings of the 2005 ICEIS
Conference, ACM (2005)
16. Cohen, W., Singer, Y.: A simple, fast, and effective rule learner. In: Proceedings
of the Sixteenth National Conference on Artificial Intelligence, AAAI Press / The MIT
Press (1999) 335–342
17. Meir, R., Rätsch, G.: An introduction to boosting and leveraging. In Mendelson,
S., Smola, A., eds.: Advanced Lectures on Machine Learning, Springer Verlag (2003)
119–184
18. Axelsson, S.: The base-rate fallacy and the difficulty of intrusion detection. ACM
Transactions on Information and System Security 3 (2000) 186–205