Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by Means of Categorical Classifiers

A. M. Mora, P. De las Cuevas and J. J. Merelo

Depto. Arquitectura y Tecnología de Computadores, ETSIIT-CITIC, University of Granada, Granada, Spain

Keywords:

Data Mining, Corporate Security Policies, URL request, Machine Learning, Classification.

Abstract:

Corporate systems can be secured using a large number of methods, and the implementation of Black or White lists is among them. With these lists it is possible to restrict (or allow) the execution of applications or the access to certain URLs, among other actions. This paper focuses on the latter option. It describes the whole processing of a dataset composed of URL sessions performed by the employees of a company, from the preprocessing stage, including labelling and data balancing processes, to the application of several classification algorithms. The aim is to define a method for automatically deciding whether to allow or deny future URL requests, considering a set of corporate security policies. Thus, this work goes a step beyond the usual black and white lists, which can only control those URLs that are specifically included in them and cannot make decisions based on similarity (through classification techniques) or on other variables of the session, as is proposed here. The results show a set of classification methods which obtain very good classification percentages (95-97%), and which infer some useful rules based on additional features (rather than just the URL string) related to the user's access. This leads us to consider that this kind of tool would be very useful for an enterprise.

1 INTRODUCTION

With the spread and evolution of so-called smartphones in society, a new scenario defined by the Bring Your Own Device (BYOD) tendency has been created, meaning that people no longer use their smart devices for only one purpose (personal life or business). This scenario, in which the devices that access the company system are owned by the users (employees) and could contain both personal and professional information, has turned the security focus on those users, who have become one of the main threats (even if not on purpose) to corporate security (Oppliger, 2011).

This has given rise to new security issues, which are normally dealt with by means of Corporate Security Policies (CSPs). These are basically a set of security rules aimed at protecting the company assets by defining the permissions to be considered for every different action performed inside the security system.

These CSPs usually include policies that allow or deny access to untrusted (or non-certified) web sites (referenced by their URLs in this work). Moreover, several web pages might also be controlled for productivity or suitability reasons. Thus, some of the CSPs usually define sets of allowed or denied pages/sites that could be accessed by the enterprise users/employees. These sets are usually included in a White (permitted) or a Black (non-permitted) List. These lists act as a good control tool for the URLs included in them as well as for their complement; for instance, the URLs not included in a Whitelist are automatically denied access.
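
As a rough illustration of how such lists behave, the following Python sketch checks the host of a URL against a Whitelist or a Blacklist. The list contents, the function name list_decision and the exact-host matching are assumptions made for this example only; they are not taken from the CSPs of the company.

```python
from urllib.parse import urlsplit

# Hypothetical lists, invented for illustration only.
WHITELIST = {"intranet.example.com", "www.example.org"}
BLACKLIST = {"ads.example.net"}


def list_decision(url, mode="whitelist"):
    """Return 'ALLOW' or 'DENY' for a URL using exact host matching.

    In whitelist mode any host that is not listed is denied;
    in blacklist mode any host that is not listed is allowed.
    """
    host = urlsplit(url).hostname or ""
    if mode == "whitelist":
        return "ALLOW" if host in WHITELIST else "DENY"
    return "DENY" if host in BLACKLIST else "ALLOW"
```

Note that a host that appears in neither list is decided purely by the mode of operation, which is the limitation that the classification approach of this work tries to overcome.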

In this work we go a step beyond, trying to define a tool for automatically making an allow or deny decision with respect to URLs that are not included in the aforementioned lists. This decision would be based on the one made for similar URL accesses (those with similar features), but considering other parameters of the request/connection instead of just the URL string, as those lists do.

Thus, the problem has been transformed into a classification one, in which we have started from a set of unlabelled patterns that model the connection properties of a huge amount of real¹ URL accesses (known as sessions). Then we have assigned a label to many of them, considering a set of real² security rules (CSPs) defined by the Chief Security Officer (CSO) of the company. The resulting dataset has been processed by means of different classification methods, in order to find the best algorithm for dealing with these data. Beforehand, data balancing techniques were applied, namely undersampling and oversampling (Japkowicz and Stephen, 2002), due to the high imbalance present in the dataset (more than two thirds of the patterns belonged to the majority class).

¹ Taken from a log file given by a volunteer Spanish company.

² The set of rules has been written by the same company, with respect to its employees.

Mora A., Cuevas P. and Merelo J..

Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by Means of Categorical Classifiers.

DOI: 10.5220/0005170601250134

In Proceedings of the International Conference on Evolutionary Computation Theory and Applications (ECTA-2014), pages 125-134

ISBN: 978-989-758-052-9

© 2014 SCITEPRESS (Science and Technology Publications, Lda.)

Different data partitions have been used in the experiments, including one in which consecutive URL sessions are kept together in the training and test files. The results are quite good, with classification accuracies around 95-97% in the test phase, even when using the unbalanced datasets. Then, after analysing the resulting sets of classification rules and trees, several rules can be identified that are based on features other than the URL itself, which is the aim of this work.

The paper is structured as follows. The next section describes related work on the application of Data Mining and Machine Learning techniques to security issues inside a company. Section 3 presents the problem we solve and the dataset we have worked with. The methodology followed is described in Section 4, covering the data preprocessing and a first round of experiments comparing different classification methods. Once the best of them were selected, a set of experiments was conducted, and the results are described and discussed in Section 5. Finally, the conclusions and future lines of research are presented in Section 6.

2 STATE OF THE ART

Our work tries to obtain a URL classification tool for enhancing security on the client side, since in the end we want to determine whether a certain URL is secure or not, taking as reference a set of rules (derived from a CSP) that allow or deny a set of known HTTP requests. To this end, Data Mining (DM) and Machine Learning (ML) techniques have been applied. This section gives an overview of a number of solutions proposed to protect the user, or the company, against insecure situations.

Due to the nature of the data (URL accesses performed by humans), the dataset used is highly unbalanced (Chawla, 2005). Several methods exist in the literature to deal with this problem, but they can mainly be grouped into three techniques (Japkowicz and Stephen, 2002):

• Undersampling the Over-sized Classes: i.e. reducing the number of patterns considered for the majority classes.

• Oversampling the Small Classes: i.e. introducing additional (normally synthetic) patterns in the minority classes.

• Modifying the Cost Associated to Misclassifying the Positive and the Negative Class: to compensate for the imbalance ratio of the two classes. For example, if the imbalance ratio is 1:10 in favour of the negative class, the penalty of misclassifying a positive example should be 10 times greater.
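
The cost-based option in the list above can be sketched as follows in Python. This is a minimal illustration, assuming that each class is weighted by the ratio between the size of the largest class and its own size; the function names class_weights and weighted_error are hypothetical and do not correspond to any method used in this work.

```python
from collections import Counter


def class_weights(labels):
    """Weight each class by (size of largest class) / (size of this class)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / n for label, n in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Sum the weights of the misclassified examples, using their true class."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)


# Imbalance ratio 1:10 in favour of the negative class, as in the example above.
labels = ["positive"] * 1 + ["negative"] * 10
weights = class_weights(labels)  # positive errors weigh 10 times more
```

With these weights, misclassifying one positive example costs as much as misclassifying ten negative ones.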

The first option has been applied in some works, following a random undersampling approach (Guo et al., 2008), but it has the drawback of losing valuable information.

The second has been the most widely used so far, following different approaches such as SMOTE (Synthetic Minority Oversampling Technique) (Chawla et al., 2002), a method proposed by Chawla et al. for creating 'artificial' samples for the minority class, in order to balance their number with respect to the majority class. However, this technique is based on numerical computations, which consider different distance measures, in order to generate useful patterns (i.e. realistic or similar to the existing ones).

The third option implies using a method in which a cost can be associated to the classifier accuracy at every step. This was done, for instance, by Alfaro-Cid et al. (Alfaro-Cid et al., 2007), who used a Genetic Programming (GP) approach in which the fitness function was modified in order to apply a penalty when the classifier produces a false negative (an element from the minority class classified as belonging to the majority class). However, almost all of these approaches deal with numerical (real, integer) data.

One interesting point about URL classification is that the distance between URLs may be studied as the distance between two strings, but Blanco et al. (Blanco et al., 2011) argue that the lexical distance between two URLs is not enough to classify them. In addition, the heuristic study of URLs for security purposes on the user side is not a novel practice. Also, the use of Blacklists (in this work, the denied URLs) and Whitelists (allowed URLs) is a very widespread practice. For instance, phishing is a security problem that Sheng et al. and Khonji et al. (Khonji et al., 2011) tried to solve. The first work uses Blacklists as a reference to avoid phishing attacks made by e-mail; the second one performs a heuristic analysis of the URLs' domain names and their ranks, so that a phishing URL can be detected.

Also, after some web searching we have found that many companies advocate using either a Blacklist or a Whitelist³. While whitelisting is the more restrictive solution and therefore the more secure, we think that the best solution is to use both, and for this reason the set of rules that we used covers both allowed and denied web sites.

³ http://kevtownsend.wordpress.com/2011/08/24/whitelisting-vs-blacklisting/

With regard to the techniques used, DM, as well as ML, has long been used in many scientific fields. Although research in computer security had been growing since the eighties (Anderson, 1980), it was in the nineties when these techniques began to be applied to security issues (Clifton and Marks, 1996). On the one hand, DM helped to develop new solutions for computer forensics (de Vel et al., 2001), enabling researchers to extract information from large files with events gathered from infected computers. Another important advance took place after the 9/11 events, when clustering techniques and social network analysis started to be used in order to detect potential crime networks (Chen et al., 2003). On the other hand, and more focused on the user side like our approach, there exist some user-centric solutions to problems like user authentication in a personal device, which Greenstadt and Beal (Greenstadt and Beal, 2008) proposed to address using collected user biometrics along with machine learning techniques.

Then, when an Information Security Policy (ISP) is going to be applied, Kelley et al. (Kelley et al., 2008) found it important to include the user in the machine learning process for refining the policy model. They called it user-controllable policy learning. Another approach to the refinement of users' privacy policies has been described by Danezis (Danezis, 2009), who uses ML techniques over the user's settings in a social network, being capable of restricting permissions to other people depending on their interaction with the user.

Along the same lines, Lim et al. propose a system (Lim et al., 2008b; Lim et al., 2008a) that evolves a set of computer security policies by means of GP, again taking into account the user's feedback. Furthermore, Suarez-Tangil et al. (Suarez-Tangil et al., 2009) take the same approach as Lim et al., but also bring event correlation in. The works of these two latter groups of authors are relevant to ours, though they are not focused on company ISPs (for instance, our case with the allowed or denied HTTP requests).

Finally, a system named MUSES (from Multiplatform Usable Endpoint Security System) (Mora et al., 2014) is being developed under the European Seventh Framework Programme (FP7). This system will include event treatment on the user actions inside a company, DM techniques for applying the set of policies from the company ISP to the actions, allowing or denying them, and ML techniques for improving the set of rules derived from these policies, according to the user's feedback and behaviour after the system decisions (Seigneur et al., 2013).

3 PROBLEM AND DATA DESCRIPTION

The problem to solve is related to the application of corporate security policies in order to deal with potential URL accesses inside an enterprise. To this end, a dataset of URL sessions (requests and accesses) is analysed. These data are labelled with the corresponding permission or denial for each access, following the aforementioned rules. The problem is then transformed into a classification one, in which every new URL request will be classified, and thus a grant or deny action will be assigned to that pattern.

The analysed data come from an access.log of the Squid proxy application (Team, 2013a), in a real Spanish company. This open source tool works as a proxy, but with the advantage of storing a cache of recent transactions so that future requests may be answered without asking the origin server again (Wessels, 2004). Every pattern, namely a URL session, has ten variables associated, which are described in Table 1, indicating whether each variable is numerical or nominal/categorical.

The dependent variable or class is a label which inherently assigns a decision (and so the subsequent action) to every request. It can be ALLOW, if the access is permitted according to the CSPs, or DENY, if the connection is not permitted. These patterns are labelled using an 'engine' based on a set of security rules that specify the decision to make. This process is described in Subsection 4.1.

These data were gathered over a period of two hours, from 8.30 to 10.30 am (starting 30 minutes after the beginning of the working day), monitoring the activity of all the employees in a medium-size Spanish company (80-100 people), and obtaining 100000 patterns. We consider this dataset quite complete because it contains a very diverse set of connection patterns, ranging from personal issues (traditionally addressed during the first hour of work) to professional ones (the rest of the day). Moreover, the results derived from the experiments (described in Section 5) show that this quantity of data might be big enough, although a more accurate outcome would be obtained with, for instance, a 24-hour log.

GoingaStepBeyondtheBlackandWhiteListsforURLAccessesintheEnterprisebyMeansofCategoricalClassifiers

127

Table 1: Independent variables corresponding to a URL session (a connection to a URL for some time). The URLs are parsed as detailed in Subsection 4.2.

Variable name            Description                                              Type         Range / number of values (if categorical)
-----------------------  -------------------------------------------------------  -----------  ----------------------------------------------------
http reply code          Status of the server response                            Categorical  20 values
http method              Desired action to be performed                           Categorical  6 values
duration milliseconds    Session duration                                         Numerical    integer in [0,357170]
content type             Media type of the entity-body sent to the recipient      Categorical  11 values (main content), 85 values (whole content)
server or cache address  IP address                                               Categorical  2343 values
time                     connection hour (in the day)                             Date         00:00:00 to 23:59:59
squid hierarchy          It indicates how the next-hop cache was selected         Categorical  3 values
bytes                    Number of transferred bytes during the session           Numerical    integer in [0,85135242]
client address           IP address                                               Categorical  105 values
URL                      Core domain of the URL, not taking into account the TLD  Categorical  976 values

4 METHODOLOGY

Before classification techniques are applied, a data preprocessing step is performed. First, the raw dataset is labelled according to a set of initial corporate security rules, i.e. every pattern is assigned a label indicating whether the corresponding URL request/access would be ALLOWED or DENIED considering these rules. This step is necessary in order to transform the problem into a classification one. However, in order to apply the rules they must be transformed from their initial format into another one that can be used in our programs (a hash in Perl⁴). This is described in Subsection 4.1.

Subsection 4.2 details how the patterns of the navigation data log (URL sessions) are also converted to a Perl hash to perform the matching/labelling process.

At the end of these two steps, the two hashes are compared in order to determine which entries of the log should be labelled ALLOW or DENY; this is known as the labelling step. It is similar to performing a decision process in a security system. This step yields 38972 patterns belonging to class ALLOW (positive class) and 18530 of class DENY (negative class), so 67.78% of the samples belong to the majority class. This represents an important problem, since a classifier trained with these proportions could classify all the samples as ALLOW and still reach a seemingly good classification accuracy of about 68%. However, in Section 5 we will see that, despite the fact that some denied patterns are classified as ALLOW, the overall performance of the classifiers is better than expected.
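
The majority-class figure quoted above can be checked with a few lines of Python. The class counts are those reported in the text; the variable names are chosen only for this sketch.

```python
# Class counts reported after the labelling step.
allow, deny = 38972, 18530

total = allow + deny
majority_share = max(allow, deny) / total

# Accuracy of a trivial classifier that always answers ALLOW:
# about 67.78%, the figure quoted in the text.
print(f"{total} labelled patterns, majority share = {majority_share:.2%}")
```

Any classifier of interest must therefore clearly exceed this baseline on the unbalanced data.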

⁴ A hash in Perl is an object that represents a hash table, which is a set of key-value pairs. Sometimes, the value can be another hash itself.

Given that the dataset contains a majority of categorical/nominal data, we have followed different approaches for data balancing:

• Undersampling: we remove random samples of the majority class until the numbers of samples in both classes are similar.

• Oversampling: we duplicate random samples of the minority class, in order to get a similar number of patterns in both classes. This has to be done due to the impossibility of creating synthetic data when dealing with categorical values (there is no proper distance measure between two values in a category). Actually, since the number of samples in the majority class is roughly twice that of the minority one, we have simply duplicated all of those belonging to the minority class.
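
The two balancing approaches above can be sketched in Python as follows. This is only an illustration under stated assumptions: the patterns are replaced by toy tuples with the class sizes reported in the text, the function names are hypothetical, and the fixed random seed is there only to make the example repeatable.

```python
import random


def undersample(majority, minority, seed=0):
    """Randomly keep as many majority-class samples as there are minority ones."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def oversample_by_duplication(majority, minority):
    """Duplicate every minority-class sample once (no synthetic patterns)."""
    return list(majority), list(minority) * 2


# Toy stand-ins for the labelled patterns, with the reported class sizes.
allow = [("ALLOW", i) for i in range(38972)]
deny = [("DENY", i) for i in range(18530)]

under_allow, under_deny = undersample(allow, deny)
over_allow, over_deny = oversample_by_duplication(allow, deny)
```

Under these assumptions, undersampling leaves 18530 patterns per class, whereas duplication gives 38972 ALLOW patterns against 37060 DENY patterns.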

Finally, in Subsection 4.3 we explain the selection of the methods applied to classify the data. We have considered only the patterns correctly labelled in the preprocessing phase. Thus, a supervised classification process (MacQueen et al., 1967) has been conducted on the balanced datasets. The Weka Data Mining Software⁵ has been used in order to select the best set of methods for dealing with these data. These classifiers will be further tested in Section 5.

4.1 Security Rules Parsing

In this work we have considered Drools (Team, 2013c) as the tool to create, and therefore manage, rules in a business environment. This so-called Business Rule Management System (BRMS) has been developed by the JBoss community under an Apache License and is written in Java. Though this platform consists of many components, here we focus on Drools Expert and the Drools Rule Language (DRL) (Team, 2013b). The rules defined for a certain company are stored in a file with a .drl extension, which needs to be parsed to obtain the final set of rules. Figure 1(a) shows the typical rule syntax in DRL. Two main things should be obtained from the parsing method: the left and right sides of each rule, taking into account that the left side is where the company specifies the conditions required to apply the action indicated in the right side. For describing the conditions, Squid syntax is used (see Section 3), with the following structure: squid:Squid(conditions). Finally, the ALLOW or DENY label to apply to the data that match the conditions is extracted from the right side of the rule. The Perl parser that we have implemented applies two regular expressions, one for each side of the rule, and returns a hash with all the rules, including the conditions and actions defined. The 'before and after' of parsing the .drl file is shown in Figure 1.

⁵ http://www.cs.waikato.ac.nz/ml/weka/

(a) Drools Rule

    rule "name"
        attributes
        when
            /* Left Side of the Rule */
        then
            /* Right Side of the Rule */
    end

(b) Hash Rule

    %rules = (
        rule => {
            field    => xxx,
            relation => xxx,
            value    => xxx,
            action   => [allow, deny],
        },
    );

Figure 1: (a) Structure of a rule in Drools Expert. (b) Resulting rule, after the parsing, in a global hash of rules.
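
The parsing step can be sketched as follows. The parser of this work is written in Perl and its regular expressions are not reproduced here, so the Python code below is only an approximation under explicit assumptions: the content of the .drl text, the rule names, the field names, the form of the condition (field, relation, quoted value) and the way the action appears in the right side are all hypothetical. The sketch uses one expression per side of the rule, plus an additional one (our own addition) to split the text into rule blocks.

```python
import re

# Hypothetical .drl content; condition and action formats are assumptions.
DRL_TEXT = '''
rule "deny_social"
when
    squid:Squid(url == "socialnetwork")
then
    DENY
end

rule "allow_images"
when
    squid:Squid(content_type == "image")
then
    ALLOW
end
'''

# Splits the text into (name, attributes, left side, right side) blocks.
RULE = re.compile(r'rule\s+"([^"]+)"(.*?)\bwhen\b(.*?)\bthen\b(.*?)\bend\b', re.S)
# Left side: a single condition of the form field relation value.
LEFT = re.compile(r'squid\s*:\s*Squid\(\s*(\w+)\s*(==|!=|<=|>=|<|>)\s*"?([^")]*?)"?\s*\)')
# Right side: the ALLOW or DENY label.
RIGHT = re.compile(r'\b(ALLOW|DENY)\b')


def parse_rules(text):
    """Return a dict shaped like the hash of rules in Figure 1(b)."""
    rules = {}
    for name, _attributes, left, right in RULE.findall(text):
        condition = LEFT.search(left)
        action = RIGHT.search(right)
        if condition and action:
            field, relation, value = condition.groups()
            rules[name] = {"field": field, "relation": relation,
                           "value": value, "action": action.group(1).lower()}
    return rules


rules = parse_rules(DRL_TEXT)
```

Each entry of the resulting dictionary holds the field, relation, value and action of one rule, mirroring the structure of Figure 1(b).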

4.2 URL Log Data Parsing

Usually, the entries of a log file have a number of fields, in order to record the client who asks for a resource, the time of day when the request is made, and so on. In this case, we have worked with an access.log file (see Section 3), converted into a CSV file so that it could be parsed and transformed into another hash of data. All ten fields of the Squid log yield a hash like the one depicted in Fig. 2.

Once the two hashes of data were created, they were compared in such a way that, for each rule in the hash of rules, it was determined how many entries in the data log hash are covered by the rule, and the covered entries were assigned the label that appears as 'action' in the rule.
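
The comparison of the two hashes can be illustrated with the following Python sketch. The rule structure follows Figure 1(b), but the rule names, the sample entries, the supported relations and, in particular, the policy of keeping the action of the first matching rule when several rules cover the same entry are assumptions of this example.

```python
import operator

# Relations supported by this sketch (an assumption).
RELATIONS = {"==": operator.eq, "!=": operator.ne,
             "<": operator.lt, ">": operator.gt}


def label_entries(rules, entries):
    """Label each log entry with the action of the first rule covering it.

    Returns the labelled entries and the number of entries covered by
    each rule. Entries covered by no rule keep the label None.
    """
    covered = {name: 0 for name in rules}
    labelled = []
    for entry in entries:
        label = None
        for name, rule in rules.items():
            test = RELATIONS[rule["relation"]]
            if rule["field"] in entry and test(entry[rule["field"]], rule["value"]):
                covered[name] += 1
                if label is None:
                    label = rule["action"].upper()
        labelled.append({**entry, "label": label})
    return labelled, covered


# Hypothetical rules and log entries.
rules = {
    "deny_social": {"field": "url", "relation": "==",
                    "value": "socialnetwork", "action": "deny"},
    "allow_small": {"field": "bytes", "relation": "<",
                    "value": 1000, "action": "allow"},
}
entries = [{"url": "socialnetwork", "bytes": 500},
           {"url": "news", "bytes": 200},
           {"url": "video", "bytes": 5000}]

labelled, covered = label_entries(rules, entries)
```

Entries that no rule covers remain unlabelled, which is consistent with only part of the 100000 log entries receiving an ALLOW or DENY label.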

One of the problems was to extract from a whole URL the part that was most interesting for our purposes. It is important to point out that in a log with thousands of entries an enormous variety of URLs can be found, since some belong to advertisements, images or videos, and some others do not even have a domain name but are given directly by an IP address. For this reason, we have taken into account that a domain name may include many subdomains (separated by dots), whose hierarchy grows from right to left. The highest level of the domain name space is the Top-Level Domain (TLD), at the right-most part of the domain name, which is divided into country code TLDs and generic TLDs. Then, a domain and a number of subdomains follow the TLD (again, from right to left). Thus, the URLs in the log used have the form http://subdomain...subdomain.domain.TLD/other subdirectories. However, for the ARFF⁶ file to be created, only the domain (without the subdomains and the TLD) is considered, because there are too many different URLs to take into consideration. Hence, by applying another regular expression, the data parser implemented in Perl obtains all the core domains of the URLs, which amount to 976 domains in total.
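
A possible way of extracting the core domain is sketched below in Python. It is not the regular expression of the Perl parser, which is not shown in this paper; the function name core_domain and the sample URLs used with it are hypothetical. The sketch assumes that the URL includes its scheme and that the TLD consists of a single label, so that a host ending in a two-label suffix (such as a country code preceded by a generic second-level label) would not be handled correctly.

```python
import re
from urllib.parse import urlsplit

# Hosts given directly as an IPv4 address are returned unchanged.
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def core_domain(url):
    """Return the label immediately to the left of the TLD, or the IP address."""
    host = urlsplit(url).hostname or ""
    if IPV4.match(host):
        return host
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host
```

For instance, under these assumptions http://mail.example.com/inbox is reduced to the core domain example, while a URL given by an IP address keeps that address as its value.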

    %logdata = (
        entry => {
            http reply code         => xxx,
            http method             => xxx,
            duration miliseconds    => xxx,
            content type            => xxx,
            server or cache address => xxx,
            time                    => xxx,
            squid hierarchy         => xxx,
            bytes                   => xxx,
            url                     => xxx,
            client address          => xxx,
        },
    );

Figure 2: Perl hash with an example entry. The actual hash used for this work has a total of 100000 entries, more than half of which are labelled as ALLOW or DENY after the comparison process.

4.3 Classiﬁcation Methods

As said in Section 3, the data used in this work are neither purely numerical nor purely nominal; thus, only classification algorithms that support both types of data have been considered. Weka offers a great number of algorithms to work with, so we have conducted a pre-selection phase, trying to choose those which would yield better results in the experiments. More specifically, we have focused on rule-based and decision-tree-based algorithms.

⁶ Format of Weka files.

A decision-tree algorithm builds a group of conditions organised in a top-down recursive manner, in such a way that a class is assigned by following a path of conditions from the root of the tree to one of its leaves. Generally speaking, the possible classes to choose from are mutually exclusive. These algorithms are also called “divide-and-conquer” algorithms. On the other hand, there are the “separate-and-conquer” algorithms, which create rules one at a time: the instances covered by the created rule are removed, and the next rule is generated from the remaining instances.
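
The “separate-and-conquer” scheme can be illustrated with a deliberately simple covering loop in Python. This is not any of the Weka algorithms: the rules are restricted to a single attribute = value test, the quality criterion (precision first, then coverage) is an assumption of this sketch, and the sample rows and attribute names are invented. Every row is assumed to have at least one attribute besides the class label.

```python
from collections import Counter


def best_rule(rows, target="label"):
    """Pick the attribute=value test with the highest precision, then coverage."""
    best = None
    for row in rows:
        for attr, value in row.items():
            if attr == target:
                continue
            covered = [r for r in rows if r[attr] == value]
            cls, hits = Counter(r[target] for r in covered).most_common(1)[0]
            score = (hits / len(covered), len(covered))
            if best is None or score > best[0]:
                best = (score, attr, value, cls)
    return best[1:]


def separate_and_conquer(rows):
    """Create one rule at a time and remove the instances it covers."""
    rules, remaining = [], list(rows)
    while remaining:
        attr, value, cls = best_rule(remaining)
        rules.append((attr, value, cls))
        remaining = [r for r in remaining if r[attr] != value]
    return rules


# Hypothetical labelled patterns.
rows = [
    {"http_method": "GET", "content_type": "text", "label": "ALLOW"},
    {"http_method": "GET", "content_type": "video", "label": "DENY"},
    {"http_method": "POST", "content_type": "text", "label": "ALLOW"},
    {"http_method": "POST", "content_type": "video", "label": "DENY"},
]
rule_list = separate_and_conquer(rows)
```

In this toy dataset the loop produces two rules on the content type, one per class, and stops when no instances remain.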

A reference for each Weka classifier can be found in (Frank and Witten, 2011). The top five techniques, selected according to the best results (see Table 2) of the experiments done in this stage, are described below along with more specific bibliography. The Naïve Bayes method (Domingos and Pazzani, 1997), normally used in text categorization problems, has been included as a baseline. According to the results, the five selected classifiers perform better than this method.

Table 2: Results (percentage of correctly classified patterns) of all the tested classification methods on balanced data. The best ones are marked in boldface.

Method            Undersampling   Oversampling
----------------  -------------   ------------
Naïve Bayes       91.12           91.77
Conjunctive Rule  60.14           60.02
Decision Table    94.08           90.29
DTNB              94.75           95.65
JRip              90.08           92.47
NNge              96.49           98.76
One R             93.45           93.70
PART              96.45           97.54
Ridor             87.22           89.87
Zero R            51.39           51.26
AD Tree           77.73           77.68
Decision Stump    60.14           60.02
J48               97.02           98.00
LAD Tree          79.95           79.97
Random Forest     96.87           98.84
Random Tree       95.14           98.35
REP Tree          96.79           97.67

J48. This classifier generates a pruned or unpruned C4.5 decision tree. First described by Quinlan (Quinlan, 1993), this machine learning method builds a decision tree by selecting, for each node, the best attribute for splitting the data and creating the next nodes. An attribute is selected as 'the best' by evaluating the difference in entropy (information gain) resulting from choosing that attribute for splitting the data. In this way, the tree continues to grow until there are no more attributes for further splitting, meaning that the resulting nodes contain instances of single classes.
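
The attribute selection criterion described above can be illustrated with a short Python sketch that computes the entropy of a set of labels and the information gain of a categorical attribute. The sample rows and attribute names are invented for the example, and the code follows the plain information gain mentioned in the text; it is not the J48 implementation of Weka.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target="label"):
    """Entropy of the whole set minus the weighted entropy after the split."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical labelled patterns.
rows = [
    {"http_method": "GET", "content_type": "text", "label": "ALLOW"},
    {"http_method": "GET", "content_type": "video", "label": "DENY"},
    {"http_method": "POST", "content_type": "text", "label": "ALLOW"},
    {"http_method": "POST", "content_type": "video", "label": "DENY"},
]
best = max(["http_method", "content_type"],
           key=lambda a: information_gain(rows, a))
```

In this toy case the content type separates the two classes perfectly (a gain of one bit), whereas the HTTP method provides no gain, so the former would be chosen for the split.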

Random Forest. This method can be seen as a randomization of the previous C4.5 process, applied to build a set of decision trees whose predictions are combined. It was proposed by Breiman (Breiman, 2001): instead of choosing 'the best' attribute among all of them at each node, the algorithm chooses it within a randomly selected group of attributes. The size of this group is customizable in Weka.

REP Tree. This is another kind of decision tree; the name stands for Reduced Error Pruning Tree. Originally proposed by Quinlan (Quinlan, 1987), this method builds a decision tree using information gain, like C4.5, and then prunes it using reduced-error pruning. That means that the training dataset is divided into two parts: one devoted to growing the tree and another to pruning it. Every subtree (not a class/leaf) in the tree is replaced by the best possible leaf, and the pruning set is then used to test whether the prune has improved the results. A deep analysis of this technique and its variants can be found in (Elomaa and Kaariainen, 2001).

NNge. A nearest-neighbour machine learning method that generates rules using non-nested generalised exemplars, i.e. the so-called 'hyperrectangles', which are multidimensional rectangular regions of the attribute space (Martin, 1995). The NNge algorithm builds a rule set from the creation of these hyperrectangles. They are non-nested (overlapping is not permitted), which means that, when a new hyperrectangle is proposed from a new generalisation, the algorithm checks whether it conflicts with any region of the attribute space. This is done in order to prevent an example from being covered by more than one rule.

PART. The name comes from 'partial' decision trees, since it builds its rule set from them (Frank and Witten, 1998). The way of generating a partial decision tree is a combination of the two aforementioned strategies, “divide-and-conquer” and “separate-and-conquer”, thus gaining flexibility and speed. When a tree begins to grow, the node with the lowest entropy is the one chosen to be expanded first. When a subtree is complete (it has reached its leaves), its substitution by a single leaf is considered. In the end the algorithm obtains a partial decision tree instead of a fully explored one, because the leaves with the largest coverage become rules and some subtrees are thus discarded.

These methods will be thoroughly tested on the dataset (balanced and unbalanced) in the following section.

ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications

130

5 EXPERIMENTS AND RESULTS

Several experiments have been conducted once a subset of classification methods was chosen in the previous section. To this end, several training and test datasets have been created from the set of labelled patterns, which contains 57502 samples, 38972 belonging to class ALLOW and 18530 to class DENY.

In order to better test the methods, two different (training-test) divisions have been made, namely 90%-10% and 80%-20%. Moreover, two additional splits have been considered in every case, using either a random or a sequential approach for selecting samples from the original file. In the random approach, the patterns are selected at random. In the sequential one, consecutive patterns have been included in the training file up to the desired percentage, and the rest compose the test file.
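
The two ways of building the training and test files can be sketched in Python as follows. The patterns are replaced by their positions in the original file, and the function names and the fixed random seed are assumptions of this example.

```python
import random


def sequential_split(patterns, train_fraction):
    """Put the first patterns in the training set and the rest in the test set."""
    cut = int(len(patterns) * train_fraction)
    return patterns[:cut], patterns[cut:]


def random_split(patterns, train_fraction, seed=0):
    """Shuffle a copy of the patterns and then split it sequentially."""
    shuffled = list(patterns)
    random.Random(seed).shuffle(shuffled)
    return sequential_split(shuffled, train_fraction)


# Positions of the 57502 labelled patterns in the original file.
patterns = list(range(57502))
train_seq, test_seq = sequential_split(patterns, 0.9)
train_rnd, test_rnd = random_split(patterns, 0.9)
```

In the sequential split every test pattern was recorded after every training pattern, which is what allows checking whether later sessions can be 'predicted' from earlier ones.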

The aim of the sequential division is to check whether the online activity of the employees regarding URL sessions could be somehow 'predicted' just using data from previous minutes or hours.

With respect to the data, the initial file was unbalanced, as can be seen in the number of patterns per class. Hence, as stated in Section 4, two data balancing methods have been applied to all the files in order to get similar numbers in both classes: undersampling (random removal of ALLOW patterns) and oversampling (duplication of DENY patterns).

Results for unbalanced data are presented in Table 3. Three different tests have been run for the random pattern distribution approach, so the mean and standard deviation are shown in the corresponding columns.

As can be seen, all five methods achieved a high performance, correctly classifying the test dataset. Also, these results are not due to chance, as shown by the low standard deviation. Although it was expected that the results from the 90%-10% division would be slightly better, in the future a more aggressive division will be tried so that the methods can be really tested with much less training data.

As regards the results of the experiments with the sequential data, they are worse than those obtained with the random data, but they are still good (> 85%). This is due to the occurrence of new patterns from a certain time onwards (maybe some requests are made just at one specific time of the day, or on certain days), so that there is not sufficient similarity with the training data and the classification of the test dataset may fail. The loss of 5 to 6 points in the results of the 90%-10% division is the first unexpected or illogical result of the experiments, but it also reinforces the previous explanation.

The technique that slightly stands out over the others is Random Forest, which is the best in almost every case, even in the experiments with the most complex sequential divisions. However, if we focus on the standard deviation, REP Tree is the method of choice, as its results are the most robust.

For their part, the results obtained from balanced data are shown in Table 4. Again, those corresponding to the random partitions are the mean of three blocks of experiments, so the standard deviations are also given. The table presents two groups of results, obtained from the undersampled data and from the oversampled data. For each one, the 90%-10% and 80%-20% divisions were also made.

Applying Undersampling. Compared with the results in Table 3, these go down from one point (for randomly made divisions) to six points (sequential divisions). The reason is that randomly removing ALLOW patterns really loses information, i.e. key patterns that could be decisive for correctly classifying a certain set of test patterns.

Applying Oversampling. Here the DENY patterns have been duplicated so that their number matches that of the ALLOW patterns. However, this does not work as well as other approaches that use numerical computations to create the new patterns to include in the minority class. Consequently, the results have decreased.

In both cases it is noticeable that taking the data sequentially, instead of randomly, lowers the results. Since undersampling loses some patterns while oversampling keeps them all, the oversampling results are better, at least for the random divisions. In this case the algorithm with the best performance is J48, though Random Forest follows it very closely when processing random datasets, as does REP Tree, which is better than the rest when working with sequential data. Nevertheless, generally speaking and given the aforementioned reasons, applying data balancing methods yields worse results.

Furthermore, we have found that for the data sets taken consecutively, the methods always classify the DENY patterns worse, labelling them as ALLOW. This is worth further study because it is the worst situation. It would be preferable to have a false positive for the DENY class (denying a permitted request) rather than a false negative (permitting a request that is forbidden in the ISP).
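The two kinds of error can be counted separately, rather than relying only on the overall percentage of correctly classified patterns. The sketch below is a minimal illustration with hypothetical labels and predictions; it is not the evaluation procedure used in the experiments.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how many patterns fall in each (true label, predicted label) pair."""
    return Counter(zip(true_labels, predicted_labels))

# Toy example (hypothetical labels and predictions).
true = ["ALLOW", "ALLOW", "DENY", "DENY", "DENY", "ALLOW"]
pred = ["ALLOW", "ALLOW", "ALLOW", "DENY", "ALLOW", "DENY"]

counts = confusion_counts(true, pred)
missed_denies = counts[("DENY", "ALLOW")]    # forbidden requests that were let through
blocked_allows = counts[("ALLOW", "DENY")]   # permitted requests that were blocked

# Fraction of the DENY patterns that were correctly recognised as DENY.
total_deny = sum(n for (t, _), n in counts.items() if t == "DENY")
deny_recall = counts[("DENY", "DENY")] / total_deny
```

In this toy case two of the three DENY patterns are labelled as ALLOW, which is the costly kind of mistake described above, even though half of the patterns overall are classified correctly.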

Table 3: Percentage of correctly classified patterns for non-balanced data. Column headers give the training-test division and the way the patterns were taken; random results are mean ± standard deviation.

| Method | 80%-20% Random (mean) | 80%-20% Sequential | 90%-10% Random (mean) | 90%-10% Sequential |
|---|---|---|---|---|
| J48 | 97.56 ± 0.20 | 88.48 | 97.70 ± 0.15 | 82.28 |
| Random Forest | 97.68 ± 0.20 | 89.77 | 97.63 ± 0.13 | 82.59 |
| REP Tree | 97.47 ± 0.11 | 88.34 | 97.57 ± 0.01 | 83.20 |
| NNge | 97.23 ± 0.10 | 84.41 | 97.38 ± 0.36 | 80.34 |
| PART | 97.06 ± 0.19 | 89.11 | 97.40 ± 0.16 | 84.17 |

Table 4: Percentage of correctly classified patterns for balanced data (under- and oversampling). Column headers give the training-test division, the balancing method (Under = undersampling, Over = oversampling) and the way the patterns were taken (Rand = random, mean ± standard deviation; Seq = sequential).

| Method | 80%-20% Under Rand | 80%-20% Under Seq | 80%-20% Over Rand | 80%-20% Over Seq | 90%-10% Under Rand | 90%-10% Under Seq | 90%-10% Over Rand | 90%-10% Over Seq |
|---|---|---|---|---|---|---|---|---|
| J48 | 97.05 ± 0.25 | 84.29 | 97.40 ± 0.03 | 85.66 | 96.85 ± 0.35 | 76.44 | 97.37 ± 0.06 | 74.24 |
| Random Forest | 96.61 ± 0.17 | 88.59 | 97.16 ± 0.19 | 89.03 | 96.99 ± 0.13 | 79.98 | 97.25 ± 0.33 | 81.33 |
| REP Tree | 96.52 ± 0.13 | 85.54 | 97.13 ± 0.25 | 85.41 | 96.55 ± 0.10 | 77.65 | 97.14 ± 0.09 | 76.81 |
| NNge | 96.56 ± 0.42 | 85.28 | 96.90 ± 0.28 | 83.46 | 96.33 ± 0.05 | 81.93 | 96.91 ± 0.06 | 78.73 |
| PART | 96.19 ± 0.14 | 85.16 | 96.82 ± 0.09 | 84.50 | 96.09 ± 0.10 | 79.70 | 96.68 ± 0.11 | 78.16 |

Regarding the obtained rules/trees, we want to remark that the majority are based on the URL in order to discriminate between the two classes; however, we also found several that consider other variables/features to make the decision. For instance:

IF server_or_cache_address = "90.84.53.17"
    THEN DENY

IF server_or_cache_address = "173.194.78.103"
    THEN ALLOW

IF content_type = "application/vnd.google.safebrowsing-update"
    THEN DENY

IF server_or_cache_address = "173.194.78.94"
    AND content_type_MCT = "text"
    AND content_type = "text/html"
    AND http_reply_code = "200"
    AND bytes > 772
    THEN ALLOW

IF server_or_cache_address = "173.194.34.225"
    AND http_method = "GET"
    AND duration_milliseconds > 52
    THEN ALLOW

IF server_or_cache_address = "90.84.53.49"
    AND time <= 33758000
    THEN ALLOW

These are the interesting rules for our purposes, since they are somehow independent of the URL that the client requests. Thus, it would potentially be possible to allow or deny access to unknown URLs just by taking into account some parameters of the session.
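To make the idea concrete, rules of this kind can be evaluated against the parameters of a session without looking at the URL at all. The sketch below encodes three of the rules listed above; the data layout, the function name `decide`, the first-match evaluation order and the `None` default for sessions that no rule covers are assumptions made for illustration only.

```python
import operator

# Comparison operators that appear in the rules above.
OPS = {"==": operator.eq, ">": operator.gt, "<=": operator.le}

# Each rule is (list of conditions, decision); a condition is (feature, operator, value).
RULES = [
    ([("server_or_cache_address", "==", "90.84.53.17")], "DENY"),
    ([("server_or_cache_address", "==", "173.194.34.225"),
      ("http_method", "==", "GET"),
      ("duration_milliseconds", ">", 52)], "ALLOW"),
    ([("server_or_cache_address", "==", "90.84.53.49"),
      ("time", "<=", 33758000)], "ALLOW"),
]

def decide(session, rules=RULES, default=None):
    """Return the decision of the first rule whose conditions all hold, else the default."""
    for conditions, decision in rules:
        if all(feature in session and OPS[op](session[feature], value)
               for feature, op, value in conditions):
            return decision
    return default

# A session towards an unknown URL, described only by its parameters (hypothetical values).
session = {
    "server_or_cache_address": "173.194.34.225",
    "http_method": "GET",
    "duration_milliseconds": 60,
}
decision = decide(session)
```

Here the session is allowed because it matches the second rule, although no URL has been consulted. A session that matches no rule returns the default, which a real deployment would have to map to an explicit policy.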

Of course, some of these features depend on the session itself, i.e. they can only be computed after the session is over, but the idea in that case would be to ‘refine’ somehow the existing set of URLs in the White List. Thus, when a client requests access to a whitelisted URL, it will be allowed, but after the session is over, depending on the obtained values and on one of these classifiers, the URL could be labelled as DENIED for further requests. This could be a useful decision-aid tool for the CSO in a company, for instance. If the features considered in the rule can be known in advance, such as http_method or server_or_cache_address, the decision could be made in real time, and thus a granted (whitelisted) URL could be DENIED, or the other way round.
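The distinction between the two situations can be sketched as follows. The set `KNOWN_IN_ADVANCE` contains only the two features named in the text; treating every other feature as available only after the session is an assumption, as are the function names and the representation of a rule as a tuple of feature names plus a decision.

```python
# Features that the text names as known before the session takes place.
KNOWN_IN_ADVANCE = {"http_method", "server_or_cache_address"}

def split_rules(rules):
    """Separate rules usable in real time from those that need post-session values."""
    real_time, post_session = [], []
    for features, decision in rules:
        target = real_time if set(features) <= KNOWN_IN_ADVANCE else post_session
        target.append((features, decision))
    return real_time, post_session

def refine_whitelist(whitelist, url, post_session_decision):
    """After a session ends, drop the URL from the White List if it was classified as DENY."""
    if post_session_decision == "DENY":
        return whitelist - {url}
    return whitelist

# Each rule is reduced to the features it tests and its decision (taken from the rules above).
rules = [
    (("server_or_cache_address",), "DENY"),
    (("server_or_cache_address", "http_method", "duration_milliseconds"), "ALLOW"),
    (("content_type",), "DENY"),
]
real_time, post_session = split_rules(rules)

# Hypothetical White List and URL, used only to show the refinement step.
whitelist = {"http://intranet.example/portal", "http://updates.example/feed"}
refined = refine_whitelist(whitelist, "http://updates.example/feed", "DENY")
```

Only the first rule can be applied before the request is served; the other two would feed the later refinement of the White List.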

The tree-based methods also yield several useful branches in this sense, but they have not been plotted here because of the difficulty of visualizing them properly.

6 CONCLUSIONS AND FUTURE WORK

In this paper a set of classification methods has been applied in order to perform a decision process inside a company, according to some predefined corporate security policies. This decision is focused on allowing or denying URL access requests, but considering only previous decisions on similar requests, rather than specific rules defined for those URLs in a White/Black List. Thus, the proposed method could allow or deny access to a URL based on additional terms rather than just the specific URL string. This could be very useful since new URLs could be automatically ‘whitelisted’ or ‘blacklisted’, depending only on some of the connection parameters, such as the content_type of the access or the IP of the client which makes the request.

To this aim, we have started from a big dataset (100000 patterns) with information about employees’ URL sessions and, considering a set of URL access permissions, we have composed a labelled dataset

ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications


(57000 patterns). Over that set of data, we have tested several classification methods, after applying some data balancing techniques. Then, the best five have been thoroughly tested over several training and test divisions, and with two approaches: using sequential patterns (consecutive URL accesses), and taking them randomly.

The results show that classification accuracies are between 95% and 97%, even when using the unbalanced datasets. However, they decrease because of the loss of data caused by undersampling (removing patterns), or when the training and test sets are taken sequentially from the main log file, because certain URL requests may be made only at a certain time.

In this way, we can conclude that the approach has

been successful and it would be a useful tool in an

enterprise.

Future lines of work include conducting a deeper set of experiments to test the generalisation power of the method, maybe considering bigger data divisions, bigger data sets (from a whole day or working day), or adding some kind of ‘noise’ to the dataset. Given the good classification results obtained in this work, the next step could be the application of these methods in the real system from which the data was gathered, counting on the opinion of expert CSOs, in order to know the real value of the proposal. The study of other classification methods could be another research branch, along with the implementation of a Genetic Programming approach, which could deal with the imbalance problem by modifying the cost associated with misclassification (as the authors did in (Alfaro-Cid et al., 2007)).
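The idea of modifying the cost associated with misclassification can be illustrated with a small sketch. It is not the method of (Alfaro-Cid et al., 2007); the inverse-frequency weighting, the function names and the toy labels are assumptions chosen only to show why a class-dependent cost penalises ignoring the minority class.

```python
from collections import Counter

def inverse_frequency_costs(labels):
    """Give each class a cost inversely proportional to its frequency."""
    counts = Counter(labels)
    return {cls: len(labels) / n for cls, n in counts.items()}

def misclassification_cost(true_labels, predicted_labels, cost_per_class):
    """Sum the cost of every mistake, weighted by the true class of the pattern."""
    return sum(cost_per_class[t]
               for t, p in zip(true_labels, predicted_labels) if t != p)

# Toy unbalanced data: 8 ALLOW patterns and 2 DENY patterns (hypothetical).
labels = ["ALLOW"] * 8 + ["DENY"] * 2
costs = inverse_frequency_costs(labels)           # ALLOW: 1.25, DENY: 5.0

always_allow = ["ALLOW"] * 10                     # misses both DENY patterns
cautious = ["DENY"] * 2 + ["ALLOW"] * 6 + ["DENY"] * 2   # blocks two ALLOW patterns

cost_always_allow = misclassification_cost(labels, always_allow, costs)
cost_cautious = misclassification_cost(labels, cautious, costs)
```

Both predictors make two mistakes out of ten, so their accuracy is the same, but the weighted cost is 10.0 for the one that lets the DENY patterns through and 2.5 for the one that blocks two permitted requests.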

Finally, we also aim to extract additional information from the URL string, which could be transformed into additional features that may be more discriminative than the current set. Moreover, a data process that summarizes session data (such as the number of requests per client, or the average connection time) will also be considered.
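A minimal sketch of both ideas follows. The particular URL features (host, top-level domain, path depth, presence of a query string), the `client` field and the function names are hypothetical choices for illustration; only `duration_milliseconds` is a feature name that appears in the rules above.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def url_features(url):
    """Derive a few simple features from the URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "host": host,
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
        "path_depth": len([segment for segment in parts.path.split("/") if segment]),
        "has_query": bool(parts.query),
    }

def summarize_by_client(entries):
    """Compute the number of requests and the average duration for each client."""
    totals = defaultdict(lambda: {"requests": 0, "duration_ms": 0})
    for entry in entries:
        client_totals = totals[entry["client"]]
        client_totals["requests"] += 1
        client_totals["duration_ms"] += entry["duration_milliseconds"]
    return {
        client: {
            "requests": t["requests"],
            "avg_duration_ms": t["duration_ms"] / t["requests"],
        }
        for client, t in totals.items()
    }

features = url_features("http://www.example.com/a/b/index.html?x=1")

# Toy log entries (hypothetical values).
entries = [
    {"client": "10.0.0.1", "duration_milliseconds": 100},
    {"client": "10.0.0.1", "duration_milliseconds": 300},
    {"client": "10.0.0.2", "duration_milliseconds": 50},
]
summary = summarize_by_client(entries)
```

Features of this kind could be appended to each pattern before training, so that the classifiers have more than the raw URL string and the per-request values to work with.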

ACKNOWLEDGEMENTS

This paper has been funded in part by European

project MUSES (FP7-318508), along with Span-

ish National project TIN2011-28627-C04-02 (ANY-

SELF), project P08-TIC-03903 (EVORQ) awarded

by the Andalusian Regional Government, and

projects 83 (CANUBE), and GENIL PYR-2014-17,

both awarded by the CEI-BioTIC UGR.

REFERENCES

Alfaro-Cid, E., Sharman, K., and Esparcia-Alcázar, A. (2007). A genetic programming approach for bankruptcy prediction using a highly unbalanced database. In Giacobini, M., editor, Applications of Evolutionary Computing, volume 4448 of Lecture Notes in Computer Science, pages 169–178. Springer Berlin Heidelberg.

Anderson, A. J. P. (1980). Computer security threat monitoring and surveillance. Technical report, James P. Anderson Co., Fort Washington, PA.

Blanco, L., Dalvi, N., and Machanavajjhala, A. (2011). Highly efficient algorithms for structural clustering of large websites. In WWW ’11 Proceedings of the 20th international conference on World wide web, pages 437–446. ACM.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Chawla, N. (2005). Data mining for imbalanced datasets: An overview. In Maimon, O. and Rokach, L., editors, Data Mining and Knowledge Discovery Handbook, pages 853–867. Springer US.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res., 16(1):321–357.

Chen, H., Chung, W., Qin, Y., Chau, M., Xu, J. J., Wang, G., Zheng, R., and Atabakhsh, H. (2003). Crime data mining: An overview and case studies. In Proceedings of the 3rd National Conference for Digital Government Research (dg.o 2003), volume 130, pages 1–5. Digital Government Society of North America.

Clifton, C. and Marks, D. (1996). Security and privacy implications of data mining. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 15–19.

Danezis, G. (2009). Inferring privacy policies for social networking services. In Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence, AISec ’09, pages 5–10, New York, NY, USA. ACM.

de Vel, O., Anderson, A., Corney, M., and Mohay, G. (2001). Mining e-mail content for author identification forensics. SIGMOD Record, 30(4):55–64.

Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–137.

Elomaa, T. and Kääriäinen, M. (2001). An analysis of reduced error pruning. Artificial Intelligence Research, 15(-):163–187.

Frank, E. and Witten, I. H. (1998). Generating accurate rule sets without global optimization. In Shavlik, J., editor, Fifteenth International Conference on Machine Learning, pages 144–151. Morgan Kaufmann.

Frank, E. and Witten, I. H. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers, third edition.

Greenstadt, R. and Beal, J. (2008). Cognitive security for personal devices. In Proceedings of the 1st ACM Workshop on Workshop on AISec, AISec ’08, pages 27–30, New York, NY, USA. ACM.

GoingaStepBeyondtheBlackandWhiteListsforURLAccessesintheEnterprisebyMeansofCategoricalClassifiers

133

Guo, X., Yin, Y., Dong, C., Yang, G., and Zhou, G. (2008). On the class imbalance problem. In Natural Computation, 2008. ICNC ’08. Fourth International Conference on, volume 4, pages 192–201.

Japkowicz, N. and Stephen, S. (2002). The class imbalance problem: A systematic study. Intell. Data Anal., 6(5):429–449.

Kelley, P. G., Hankes Drielsma, P., Sadeh, N., and Cranor, L. F. (2008). User-controllable learning of security and privacy policies. In Proceedings of the 1st ACM Workshop on Workshop on AISec, AISec ’08, pages 11–18, New York, NY, USA. ACM.

Khonji, M., Jones, A., and Iraqi, Y. (2011). A novel phishing classification based on URL features. In GCC Conference and Exhibition (GCC), pages 221–224. IEEE.

Lim, Y. T., Cheng, P. C., Clark, J., and Rohatgi, P. (2008a). Policy evolution with genetic programming: A comparison of three approaches. In Evolutionary Computation, 2008. CEC 2008. (IEEE World Congress on Computational Intelligence). IEEE Congress on, pages 1792–1800.

Lim, Y. T., Cheng, P. C., Rohatgi, P., and Clark, J. A. (2008b). MLS security policy evolution with genetic programming. In Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation, GECCO ’08, pages 1571–1578, New York, NY, USA. ACM.

MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, number 14, pages 281–297. California, USA.

Martin, B. (1995). Instance-based learning: Nearest neighbor with generalization. Master’s thesis, University of Waikato, Hamilton, New Zealand.

Mora, A., De las Cuevas, P., Merelo, J., Zamarripa, S., Juan, M., Esparcia-Alcázar, A., Burvall, M., Arfwedson, H., and Hodaie, Z. (2014). MUSES: A corporate user-centric system which applies computational intelligence methods. In et al., D. S., editor, 29th Symposium On Applied Computing, pages 1719–1723.

Oppliger, R. (2011). Security and privacy in an online world. IEEE Computer, 44(9):21–22.

Quinlan, J. R. (1987). Simplifying decision trees. Man-Machine Studies, 27(3):221–234.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Seigneur, J.-M., Kölndorfer, P., Busch, M., and Hochleitner, C. (2013). A Survey of Trust and Risk Metrics for a BYOD Mobile Working World. In Third International Conference on Social Eco-Informatics.

Suarez-Tangil, G., Palomar, E., Fuentes, J., Blasco, J., and Ribagorda, A. (2009). Automatic rule generation based on genetic programming for event correlation. In Herrero, Á., Gastaldo, P., Zunino, R., and Corchado, E., editors, Computational Intelligence in Security for Information Systems, volume 63 of Advances in Intelligent and Soft Computing, pages 127–134. Springer Berlin Heidelberg.

Team, S. (2013a). Squid website.

Team, T. J. D. (2013b). Drools documentation. Version 6.0.1.final.

Team, T. J. D. (2013c). Drools website.

Wessels, D. (2004). Squid: The Definitive Guide. O’Reilly Media, Inc., 1 edition.

ECTA2014-InternationalConferenceonEvolutionaryComputationTheoryandApplications

134