Self-Organizing Maps in the Design of Anti-spam Filters

A Proposal based on Thematic Categories

Ylermi Cabrera-León, Patricio García Báez and Carmen Paz Suárez-Araujo

Universidad de Las Palmas de Gran Canaria (ULPGC), Canary Islands, Las Palmas de Gran Canaria, Spain

Departamento de Ingeniería Informática y de Sistemas, Universidad de La Laguna, Canary Islands, San Cristóbal de La Laguna, Spain

Instituto Universitario de Ciencias y Tecnologías Cibernéticas, ULPGC, Canary Islands, Las Palmas de Gran Canaria, Spain

Keywords:

Anti-spam, Spam, Ham, Artificial Neural Networks, Self-Organizing Maps (SOMs), Thematic Category, Term Frequency, Inverse Category or Class Frequency.

Abstract:

Spam, or unsolicited messages sent massively, is one of the threats that affect email and other media. Its high volume generates substantial time and economic losses. A solution to this problem is presented: a hybrid anti-spam filter based on unsupervised Artificial Neural Networks (ANNs). It consists of two steps, preprocessing and processing, each based on a different computation model: programmed and neural (using Kohonen SOM), respectively. This system has been optimized using, as a data corpus, ham from “Enron Email” and spam from two different sources: traditional (user’s inbox) and spamtrap-honeypot. It has been shown that thematic categories can be found in both spam and ham words. 1260 system configurations were analyzed, comparing their quality and performance with the most used metrics. All of them achieved AUC > 0.90 and the best 204 achieved AUC > 0.95, despite using just 13 attributes for the input vectors of the SOM, one for each thematic category. Results were similar to other researchers’ over the same corpus, though they make use of different Machine Learning (ML) methods and a number of attributes several orders of magnitude greater. The system was further tested with datasets not utilized during design, obtaining 0.77 < AUC < 0.96 with normalized data.

1 INTRODUCTION AND BACKGROUND

Nowadays, the use and importance of telecommunication has increased, primarily due to the rise of Information and Communications Technology (ICT). Among the multiple ways to make such communication, email can be highlighted, mainly because it has been used extensively for decades. Unfortunately, this popularity has brought with it the appearance of threats such as hoaxes, cyber-attacks, computer viruses and, to a greater extent, spam.

Although there are many different ways of defining the word “spam” (Subramaniam et al., 2010), in this paper spam refers to any message, mostly email although other media are affected too, sent massively without the recipients having requested or desired it. The characteristic of “massive” must be highlighted because, for years, both spam volume and overall spam rate (in other words, the quantity of spam and the percentage of spam relative to all messages, respectively) have been extremely high: in 2008 about 62 trillion unwanted messages were sent (McAfee and ICF International, 2009) and fewer than 1 out of 10 emails could be considered ham (legitimate or desired messages), fortunately improving to 4 out of 10 in 2014 (Statista, 2016).

Considering that, in most cases, its content is offensive or questionable - e.g. scam, phishing, illegal drugs, pornography, replicas... (Cabrera León et al., 2015) - it can be asserted that spam is a great scourge, creating quite substantial losses of both time and money: annually, American firms and consumers experience costs of nearly $20 billion due to spam, whereas spammers (people or companies that send spam) and spam-advertised firms collect $200 million worldwide (Rao and Reiley, 2012). This occurs mainly because sending spam is easy and cheap, and because the recipient carries the bulk of the cost, in contrast to what happens with more traditional or off-line unsolicited marketing methods (Lieb, 2002).

There has been a wide variety of proposals to solve the problem of email spam detection so far, and therefore there is a huge proliferation of papers in this regard, as will be discussed in Subsection 1.1. In this paper another solution to this problem is presented: a hybrid spam filtering system. It can be considered hybrid not only because its two main stages, preprocessing and processing, are based on different computing models - programmed and neural computation, respectively - but also because, in the processing stage, the SOM, an unsupervised ANN, is followed by a non-neural supervised labeling part.

Cabrera-León, Y., Báez, P. and Suárez-Araujo, C.

Self-Organizing Maps in the Design of Anti-spam Filters - A Proposal based on Thematic Categories.

DOI: 10.5220/0006041400210032

In Proceedings of the 8th International Joint Conference on Computational Intelligence (IJCCI 2016) - Volume 3: NCTA, pages 21-32

ISBN: 978-989-758-201-1

This system has been optimized using, as a data corpus, ham from “Enron Email” (Cohen, 2004) and spam from two different sources: obtained through traditional ways (user’s inbox), or through spamtraps and honeypots. Its quality and performance have been analyzed on several different datasets with both Kohonen maps’ most used quality measures, Mean Quantization Error (MQE) and Topographic Error (TE), and the most common performance metrics for classifiers, such as Receiver Operating Characteristic (ROC) curves and AUC, among others.

The remainder of this paper is organized as fol-

lows. In Subsection 1.1 some of the numerous related

works developed throughout the last decades are de-

scribed. Through Section 2 the dataset and methods

are explained. Section 3 shows the experimental re-

sults, followed by a discussion of them. Finally, the

conclusions can be found in Section 4.

1.1 Related Works

As is common to defensive and security systems in all areas (such as pathogenic diseases, armament, crime and predation), attackers (spammers in our case) are always one step ahead of defenders (Postini, Inc, 2004). Therefore, in this area, defenders need to continuously face new threats and counter the shortcomings, weaknesses and security flaws that attackers have found, and later exploited, in anti-spam filters and in other software or hardware (Spammer-X et al., 2004). In fact, this evolution explains the proliferation of anti-spam techniques developed over the past decades (Wang et al., 2013).

Anti-spam methods may filter at any of the network hierarchical levels (mostly in the Application, Transport and Network layers of the TCP/IP model), i.e. in any of the steps involved in sending emails: in the sender device, en route and in the recipient. In addition to this criterion, spam classifiers can also be grouped in two other ways: based on the design method, and based on the source of information. This section’s scope has been restricted to user-level and administrator-level filters, whose techniques belong to either of these two groups, because a user-level filtering system was developed here.

1.1.1 Based on the Design Method

There are two subgroups in this group, the latter being

the most popular one:

Manual: easier to implement and more useful for email administrators. Despite their slow adaptation to changes in spam, whitelists and blacklists (respectively, lists of good and bad mail servers and ISPs) achieve just 1% of False Positives (FP) and False Negatives (FN) (Erickson et al., 2008). Furthermore, greylisting temporarily blocks email delivery from unrecognized senders, forcing resending, something not usually done by spammers (Kucherawy and Crocker, 2012). It reduces bandwidth waste at the expense of delaying ham too (Harris, 2003).

Based on ML:

it is the largest subgroup (Guzella and

Caminhas, 2009), where ﬁlters can be classiﬁed, at

the same time, depending on the kind of architec-

ture used (neural or not), or the quantity of human

interaction required (supervised, semi-supervised

or unsupervised). Making use of the latter, we

could classify some anti-spam techniques as fol-

lows:

•

Supervised: the most popular non-neural one

is the Bayesian (Meyer and Whateley, 2004;

Sahami et al., 1998), and, therefore, the most

attacked by spammers through Bayesian poi-

soning (Lowd and Meek, 2005; Sprengers and

Heskes, 2009; Wittel and Wu, 2004).

(Metsis et al., 2006) must be described separately from other Bayesian filters because this paper’s dataset was built from theirs and our results were later compared with theirs, as explained in Section 2. Their system performs very well (with average sensitivity of 0.9753 and specificity of 0.9726, and nearly perfect ROC curves) when 3000 attributes, the greatest number they tested, are used.

Nowadays, another popular non-neural one, due to its performance, is the Support Vector Machine (SVM) (Drucker et al., 1999; Xie et al., 2009), which is greatly kernel-dependent (Chhabra et al., 2010) and offers better results when several are combined with a voting strategy (Blanco et al., 2007).

On the other hand, for a long time the perceptron, a neural method, dominated as an anti-spam technique (Kufandirimbwa and Gotora, 2012; Sculley et al., 2006), but the rise of Bayesian and SVM filters changed this.

Another supervised ANN is the Learning Vector Quantization (LVQ), with which (Chuan et al., 2005) built a quite good filter (96.20% F-measure, 98.97% precision and 93.58% sensitivity) if enough iterations, at least 1500, were made.

NCTA 2016 - 8th International Conference on Neural Computation Theory and Applications

•

Unsupervised: because no previous data labeling process is needed, emails should occupy less disk space and be more recent (Cabrera León and Acosta Padrón, 2011). Among non-neural filters, we could find: the SpamCampaignAssassin (Qian et al., 2010), based on the detection of spam campaigns; one based on alienness, or the search for similarities in substrings (Narisawa et al., 2007); and another which uses suffix trees (Uemura et al., 2008).

On the other hand, there are also unsupervised neural techniques. The SOM Based Sequence Analysis (Luo and Zincir-Heywood, 2005) makes use of a two-level hierarchical SOM, where the second SOM is connected a posteriori with a k-Nearest Neighbors (k-NN) classifier for categorization and sequence analysis.

(Vrusias and Golledge, 2009a; Vrusias and Golledge, 2009b) should be introduced separately from others due to their importance for this paper’s research. They compare their SOM-based system with what they consider to be the best spam classifiers: the Multinomial Naïve Bayes Boolean, SVM and Boosted Decision Trees. Their system is a 10x10 SOM, sequentially trained for 1000 cycles, whose input vectors have 26 or 500 attributes, and where keywords were grouped (only when 26 attributes were used) and identified with Term Frequency · Inverse Document Frequency (TF·IDF) and weirdness. The main differences between their filters and this paper’s are that a larger SOM, smaller input vectors, another learning algorithm and a similar way of identifying keywords were used here, as described in Section 2. They also used datasets based on the “Enron-Spam” corpus from (Metsis et al., 2006).

•

Semi-supervised: few labeled data are used, due to high costs (Chapelle et al., 2006), together with a lot of unlabeled examples. Regularized Discriminant Expectation-Maximization (Gao et al., 2009) combines transductive (for labeling unlabeled data) and inductive (to make a model in order to classify new data) methods, obtaining 91.66% detection rate and 2.96% FP.

Learning with Local and Global Consistency, of which there is a variant proposed by (Pfahringer, 2006), obtains better results than k-NN and SVM (Santos et al., 2011; Zhou et al., 2004).

Although SpamAssassin (Mason, 2009) is generally considered supervised, it can also utilize a semi-supervised learning rule (Xu et al., 2009).

A semi-supervised version of SVM also exists, Transductive Support Vector Machine (Shunli and Qingshuang, 2010; Zhou et al., 2007), which sometimes performs worse than SVM as an anti-spam filter (Mojdeh and Cormack, 2008).

1.1.2 Based on the Source of Information

These methods, which use any part of an email, i.e. envelope, header and body (P. Resnick, 2008), can also be subdivided into the following three groups:

Content of the Email: the most prevalent way, either using the whole message (Cormack and Mojdeh, 2009) or just selecting parts with different methods: rules (Malathi, 2011), detecting anchored parts (Pitsillidis et al., 2010) or spam campaigns (Qian et al., 2010), signatures of messages (Kolcz et al., 2004), or combining several techniques, as in the popular anti-spam filters SpamAssassin (Mason, 2009; The Apache SpamAssassin Project, 2014) and CRM114 (Yerazunis et al., 2010).

User Feedback: in spite of users being considered the most reliable and robust anti-spam method, especially against content obfuscation by spammers, (Graham-Cumming, 2006) found that they make up to 2% of classification errors, due to ham being very similar to spam - a.k.a. “hard ham” (Feroze et al., 2015) - or the presence of “grey cases” (Bruce, 2012), where categorization has an important subjective factor.

Information Relative to the System: these methods frequently take advantage of the inherent difficulty for spammers of changing the message headers and producing valid ones (Ramachandran and Feamster, 2006). This can be detected by checking fields of some network protocols, especially the ones which contain the sender’s IP and port, and the local sending time (countless emails sent during the sender’s sleeping time may indicate that the sender’s device belongs to a botnet). On the other hand, they can also detect the presence or absence of specific characteristics in the message, such as only images (usually no text at all, hence called “image spam”) (Fumera et al., 2006; Gao et al., 2009), and attached files, prone to be malware-infected.


Table 1: Original email corpus. Our datasets use the folders E1, E2, E4, E5 and SA2.

FOLDER | HAM ORIGIN | SPAM ORIGIN | NO. OF HAM | NO. OF SPAM | HAM DATES | SPAM DATES
E1 | farmer-d | GP | 3672 | 1500 | 12/1999 - 01/2002 | 12/2003 - 09/2005
E2 | kaminski-v | SH | 4361 | 1496 | 12/1999 - 05/2001 | 05/2001 - 07/2005
E3 | kitchen-l | BG | 4012 | 1500 | 02/2001 - 02/2002 | 08/2004 - 07/2005
E4 | williams-w3 | GP | 1500 | 4500 | 04/2001 - 02/2002 | 12/2003 - 09/2005
E5 | beck-s | SH | 1500 | 3675 | 01/2000 - 05/2001 | 05/2001 - 07/2005
E6 | lokay-m | BG | 1500 | 4500 | 06/2000 - 03/2002 | 08/2004 - 07/2005
SA2 | easy_ham_2 | spam_2 | 1400 | 1397 | 02/2003 | 02/2003

2 DATASET AND METHODS

2.1 Dataset

In order to work with ANNs effectively, it is crucial to have a broad and representative dataset, i.e. a set of emails where both spam and ham are widely represented (Borovicka et al., 2012). Most email corpora are both restricted and costly in order to preserve their users’ privacy and obstruct spammers’ countermeasures. In spite of this, several corpora are freely available nowadays (Cabrera León et al., 2015; Guzella and Caminhas, 2009).

The proposed system has been developed using a free and gratis one: a subset of emails from a corpus created by (Metsis et al., 2006) known as “Enron-Spam”. Our dataset was built using only the “Enron1” (E1) and “Enron5” (E5) folders, Table 1. By doing this, we have worked with a balanced dataset (5172 ham and 5175 spam) from the preprocessed version of their corpus, wherein:

•

Ham belonged to the “Enron Email Corpus”,

which has been widely used with different pre-

processing techniques applied on it (Cohen, 2004;

Skillicorn, 2013; Styler, 2011). In fact, (Metsis

et al., 2006) use ham from 6 out of 7 Enron users’

inboxes from the preprocessed version of (Bekker-

man et al., 2004), as seen in the column with the

ham origin in Table 1.

•

Spam came from two different sources: received

in a traditional way by one of the authors of the

mentioned corpus, Georgios Paliouras (GP); and

through spamtraps (The Apache SpamAssassin

Project, 2013) and honeypots (SH), which are anti-

spam techniques intended, respectively, to lure

spam, and to bait, investigate and punish spam-

mers. Unwanted messages from Bruce Guenter’s

“Spam Archive” (BG) (Guenter, 1998) were not

used in our case, unlike (Metsis et al., 2006).

Our dataset was subsequently partitioned into the following balanced sets: 80% for training and validating the ANN and 20% for testing the system over data never seen before.
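The 80/20 partition can be illustrated with a minimal sketch. The paper does not state how the partition was drawn, so the per-class random shuffle, the `balanced_split` name and the seed are assumptions made only for illustration; the email lists are placeholders with the class sizes given above.

```python
import random

def balanced_split(ham, spam, train_fraction=0.8, seed=0):
    """Split each class separately so both partitions keep the class balance.

    Assumption: a seeded random shuffle per class; the source does not
    describe the actual partitioning procedure.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, emails in (("ham", ham), ("spam", spam)):
        shuffled = list(emails)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train += [(email, label) for email in shuffled[:cut]]
        test += [(email, label) for email in shuffled[cut:]]
    return train, test

# Placeholder identifiers standing in for the 5172 ham and 5175 spam emails.
ham = [f"ham_{i}" for i in range(5172)]
spam = [f"spam_{i}" for i in range(5175)]
train, test = balanced_split(ham, spam)
print(len(train), len(test))
```

With these class sizes the test partition holds 2069 emails, the figure reported for the 20% E1E5 test set.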

Apart from E1 and E5 (E1E5), used during the design of the system, other datasets which came from different email corpora were utilized to test the design more independently, in order to evaluate in a more realistic way the methodology of the system: both preprocessing and processing stages. Thus, the system was further tested on two additional balanced datasets, Table 1: the “Enron2” (E2) and “Enron4” (E4) combination, E2E4 hereafter, which is similar to E1E5 but with ham from other Enron users and different quantities of spam from the same sources; and “SpamAssassin_2” (SA2), built by choosing the newest ham and spam folders from the SpamAssassin dataset, which differ considerably from our previous datasets in terms of content, topics, origins and dates (The Apache SpamAssassin Project, 2013).

2.2 Methods

The proposed intelligent anti-spam system consisted

of two different computing stages or modules, Figure

1, and it can be considered hybrid because each of

them was based on a different computing scheme. The

ﬁrst one was the preprocessing stage, Subsection 2.2.1,

which was based on programmed computation (i.e.

digital electronic computing together with stored pro-

grams) whereas the second one, the processing stage,

Subsection 2.2.2, made use of a neural computing

scheme. The preprocessing module was responsible for obtaining a semantic and compact representation of the information environment, a set of feature vectors for the emails to analyze. These vectors were the input data for the subsequent hybrid processing module, where the detection of spam was performed by a system based on the SOM, an unsupervised ANN, followed by a non-neural supervised labeling method which worked with the outputs of the SOM.

2.2.1 Preprocessing Module

The preprocessing stage is quite important (Hovold, 2005; Zhang, 2012), especially with unsupervised methods, because neither corrective signals nor correct outputs are provided. The main purposes of this stage were reducing the dimensionality of the vocabulary and making use of only the most relevant words. Three key concepts - thematic categories, Inverse Category or Class Frequency (ICF) and Top k% of words - were used, which will be explained in the next paragraphs.

Figure 1: Anti-spam filtering system scheme.

This preprocessing is founded on the premise that there are several spam thematic categories (Cabrera León et al., 2015; Wang et al., 2013) and that, based on them, detection and differentiation from ham would be feasible. The thematic categories most frequently found in spam and ham, which were expected to exist within our dataset, were identified, and the most similar ones were lumped together into only 13 thematic categories, Table 2. Email words can belong to more than one category at the same time (e.g.: obfuscation & medicine, links & trading...). Initially, our categorization of words into thematic categories was made manually, keeping in mind each word’s use and context. Later, and based on this, the process was automated. Manual word categorization brought with it two useful advantages:

•

No lemmatization nor stemming were required

(Porter, 1980), common in many anti-spam ﬁlters.

•

Robust against words deliberately obfuscated, a

countermeasure frequently applied by spammers

to deceive or defeat anti-spam techniques (Freschi

et al., 2006; Liu and Stamm, 2007).

The existence of the aforementioned thematic categories directed this research to the usage of ICF, which is recommended by some authors when categories or classes exist in the data (Wang and Zhang, 2013), rather than using Inverse Document Frequency (IDF). Both ICF and IDF are used for similar purposes: reducing the importance/weight given by Term Frequency (TF) to “stop words”, i.e. extremely common words in categories or documents, respectively. The three different variants of ICF described by (Lertnattee and Theeramunkong, 2004) have been checked, which are defined by Equations 1, 2 and 3, where $|C|$ is the total number of categories, and $f_t$ is the number of categories where token $t$ happens.

$$ICF_{Log} = \log\left(\frac{|C|}{f_t}\right) \tag{1}$$

$$ICF_{Linear} = \frac{|C|}{f_t} \tag{2}$$

$$ICF_{Sqrt} = \sqrt{\frac{|C|}{f_t}} \tag{3}$$
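As an illustration of Equations 1, 2 and 3, the three ICF variants can be sketched as follows. The function and argument names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated in the text.

```python
import math

def icf_variants(num_categories, categories_with_token):
    """Return the log, linear and sqrt ICF weights of one token.

    num_categories: total number of thematic categories.
    categories_with_token: number of categories in which the token occurs.
    Assumption: natural logarithm for the log variant.
    """
    ratio = num_categories / categories_with_token
    return {
        "log": math.log(ratio),
        "linear": ratio,
        "sqrt": math.sqrt(ratio),
    }

# Hypothetical token that occurs in 2 of the 13 thematic categories.
weights = icf_variants(13, 2)
print(weights)
```

A token present in every category gets a log weight of zero and linear and sqrt weights of one, which is how these variants reduce the influence of words that are common to all categories.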

Another interesting aspect in this process was to know whether using all the words within a thematic category was better than using fewer, so the system was tested with different numbers of words in each one. This number is given by a percentage, k, which means that we chose the k% of the words with the greatest $TF_{category} \cdot ICF$, so extremely common words were given less importance. Ordering categories using those values permitted the obtainment of several Top k% of words to be tested. Building a 13-dimensional, floating point, array where each element represented the sum of the $TF_{document} \cdot ICF$, similar to Phase 2 but now weighted, of all words belonging to a certain thematic category. Same two labels at the end of each line-email.

Table 2: Description of the 13 thematic categories within our spam and ham words.

THEMATIC CATEGORY | DESCRIPTION
Sex & Relationships | Mostly pornography, casual sex and dating websites
Medicine, Drugs & Body | Selling medicines or illegal drugs. Includes words related to body parts, and surgical procedures and tools
Betting, Gambling & Divination | Includes lotto, sports betting, casino, tarot, etc.
Banking, Investment, Insurance & Trading | Commerce, offers, funds, stock markets...
Links & Email addresses | Parts of links to websites and emails, mainly web extensions and domains
Other languages | Most of the emails were written in English but some were in Spanish, German, French, Dutch and Turkish, among others
Obfuscation | Very common. Words badly written on purpose by spammers to interfere with most content-based anti-spam filters
Business, Companies & Government | Names of firms, governmental agencies and analogous
Internet & Technology | ICT vocabulary
Documents & Paperwork | CV, diplomas, business documents, etc.
Names & Family | Names and surnames, also family members
Tourism & Regions | Countries, cities, holidays...
Attached files | Several file extensions and words related to “attach”
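The Top k% selection and the per-category feature vector described in this subsection can be sketched as follows. This is only an illustration under stated assumptions: raw counts are used as term frequency, the toy example has two categories instead of 13, and every word, score and ICF value is hypothetical rather than taken from the paper's vocabulary.

```python
import math
from collections import Counter

def top_k_percent(scores, k):
    """Keep the k% highest-scoring words of one category.

    scores maps each word to its TF_category * ICF value.
    """
    keep = math.ceil(len(scores) * k / 100)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:keep])

def feature_vector(tokens, category_words, icf):
    """One element per category: sum of TF_document * ICF over the email's
    words that belong to that category (assumption: TF is a raw count)."""
    tf = Counter(tokens)
    return [sum(tf[w] * icf[w] for w in words if w in tf)
            for words in category_words]

# Hypothetical TF_category * ICF scores for two toy categories.
medicine = {"pills": 3.0, "v1agra": 2.5, "doctor": 0.4, "the": 0.1}
links = {"http": 2.0, "com": 1.5, "www": 1.0, "the": 0.1}
categories = [top_k_percent(medicine, 50), top_k_percent(links, 50)]

# Hypothetical ICF weights of the retained words.
icf = {"pills": 2.0, "v1agra": 2.0, "http": 1.5, "com": 1.5}
email = ["buy", "pills", "pills", "http", "example", "com"]
vector = feature_vector(email, categories, icf)
print(vector)
```

In the system described here the resulting array would have 13 elements, one per thematic category in Table 2.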

2.2.2 Processing Module

The system’s processing stage has as its information environment the feature vectors obtained in the preprocessing stage. It is hybrid because the first part was based on a type of ANN, the well-known Kohonen Self-Organizing Maps (Kohonen, 2001), whereas the second one was non-neural. Both parts will be explained below.

SOM, as an unsupervised neural architecture, is a very appropriate method for facing the problem to be solved. It quantizes the input space into different regions represented by a specific number of output neurons, a.k.a. detectors. In our case, there are two types of detectors: spam detectors and ham detectors.

Moreover, SOMs might be used as a visualization tool of high-dimensional data by projection onto lower-dimensional maps (Rojas, 1996). During this projection process, akin to multidimensional scaling, it is easily seen that SOMs try to extract the features of the input space in order to preserve its topological properties, similar to the idea of topographic maps that exist in the brains of highly developed animals (Haykin, 1999).

Its structure is made of an input layer fully interconnected, by excitatory connections, with the neurons in the output layer. The latter is organized in an m-dimensional space, the most common being the 2D matrix. Within this matrix there is a neighborhood relationship between its nodes that is usually defined by a hexagonal or rectangular lattice. Also, the matrix shape can vary, the sheet one being the most common. All neurons within the output layer simultaneously present inhibitory lateral connections among neural neighbors as well as excitatory self-connections. Their neurodynamics is simplified by computing the least distance (most frequently Euclidean, and hence used in this work) between the inputs and a model (Kohonen, 2001), which is a parametric real vector that can be considered as the weight vector in the neural architecture. The winning neuron, a.k.a. Best Matching Unit (BMU), will be the one with the minimum distance value.
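The BMU search can be sketched in a few lines. The function name and the toy weight vectors are hypothetical; only the minimum Euclidean distance criterion comes from the text.

```python
import math

def best_matching_unit(x, weights):
    """Return the index of the neuron whose weight vector is closest
    (Euclidean distance) to the input vector x."""
    distances = [math.dist(x, w) for w in weights]
    return min(range(len(weights)), key=distances.__getitem__)

# Three hypothetical neurons with 2-dimensional weight vectors.
weights = [[0.0, 0.0], [1.0, 1.0], [0.2, 0.9]]
bmu = best_matching_unit([0.1, 1.0], weights)
print(bmu)
```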

The learning process belongs to a winner-take-all, unsupervised and competitive training paradigm. The main variations are seen in the modification of the synaptic weights, which not only affects the winning neuron but also, to a lesser extent, the set of neurons in the winner’s neighborhood $N$ (thus, SOM training can be considered cooperative too), consequently being able to generate topological relations, Equation 4. During the training period, the neighborhood relationship between nodes $h$ decreases both in time and distance (commonly a Gaussian function), affecting only the BMU during the final phase. The learning rate $\alpha$ normally decreases with time, usually beginning near unity and finishing close to zero during the fine tuning done in the last training cycles, although a fixed value may be utilized but is not recommended (Tan and George, 2004).

$$\Delta w_i = \begin{cases} \alpha (x - w_i) & \text{if } i \in N_{i^*},\ i^* = \arg\min_k \{net_k(x)\} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$
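One sequential training step in the spirit of Equation 4 can be sketched as follows. The neighborhood is modeled here as every neuron within a fixed grid radius of the BMU, which is a simplifying assumption (the text notes that a Gaussian function is more common); the function name, grid positions and numeric values are hypothetical.

```python
import math

def sequential_update(weights, positions, x, alpha, radius):
    """Move the BMU and its grid neighbors towards x; leave the rest untouched.

    weights: list of weight vectors, modified in place.
    positions: grid coordinates of each neuron on the map.
    Assumption: the neighborhood is the set of neurons within `radius`
    grid units of the BMU.
    """
    bmu = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
    for i, w in enumerate(weights):
        if math.dist(positions[i], positions[bmu]) <= radius:
            weights[i] = [wi + alpha * (xi - wi) for wi, xi in zip(w, x)]
    return bmu

# Three hypothetical neurons arranged on a line.
positions = [(0, 0), (1, 0), (2, 0)]
weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
winner = sequential_update(weights, positions, [0.0, 0.0], alpha=0.5, radius=1)
print(winner, weights)
```

In a full training run, both `alpha` and `radius` would shrink over the training cycles, as described above.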

SOMs can use two different learning methods: sequential and batch, Equation 5, where $\bar{x}_j$ is the mean of the elements in a list of weight updates; $h_{ij}$, the neighborhood function; and $n_{V_j}$, the number of elements in that list. The batch learning method, which can be better and converges faster (Kohonen, 2013), has been employed in this article.

$$w_i(n+1) = \frac{\sum_j n_{V_j} \cdot h_{ij} \cdot \bar{x}_j}{\sum_j n_{V_j} \cdot h_{ij}} \tag{5}$$
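A single batch epoch following the structure of Equation 5 can be sketched as below. The Gaussian neighborhood with a fixed width `sigma`, the function name and the toy data are assumptions made for illustration; this is not the SOM Toolbox implementation used in the experiments.

```python
import math

def batch_epoch(weights, positions, data, sigma):
    """Recompute every weight vector as a neighborhood-weighted mean of the
    per-neuron input lists (one list per BMU).

    Assumption: Gaussian neighborhood of width sigma on the map grid.
    """
    dim = len(weights[0])
    lists = [[] for _ in weights]
    for x in data:
        bmu = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
        lists[bmu].append(x)
    # (count, mean vector, grid position) of every non-empty list.
    stats = [(len(m), [sum(col) / len(m) for col in zip(*m)], positions[j])
             for j, m in enumerate(lists) if m]
    new_weights = []
    for i in range(len(weights)):
        numerator, denominator = [0.0] * dim, 0.0
        for count, mean, pos in stats:
            h = math.exp(-math.dist(positions[i], pos) ** 2 / (2 * sigma ** 2))
            denominator += count * h
            numerator = [a + count * h * m for a, m in zip(numerator, mean)]
        if denominator > 0:
            new_weights.append([a / denominator for a in numerator])
        else:
            new_weights.append(list(weights[i]))  # no usable input: keep weight
    return new_weights

positions = [(0, 0), (1, 0)]
weights = [[0.0, 0.0], [1.0, 1.0]]
data = [[0.0, 0.2], [0.2, 0.0], [1.0, 0.8]]
updated = batch_epoch(weights, positions, data, sigma=0.1)
print(updated)
```

With such a narrow neighborhood each neuron moves essentially to the mean of its own list; a wider `sigma` blends in the lists of neighboring neurons.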

The second part of the processing stage was a non-neural supervised classification method, which was appended after the SOM, Figure 1. Its main aim was to label the results obtained by the SOM, thereby classifying as spam or ham the emails fed into the anti-spam filter. It was based on confidence thresholds, i.e. the minimum spam percentage from which an email is considered spam. These confidence thresholds, which were empirically chosen, allow us to adjust the system relative to the FP, an important factor in this kind of filter. In a nutshell, an email was labeled as ham if the spam ratio $\frac{\#spam}{\#spam + \#ham}$ was lower than this threshold, or as spam otherwise, Equation 6.

$$label = \begin{cases} spam & \text{if } \frac{\#spam}{\#spam + \#ham} \geq threshold \\ ham & \text{otherwise} \end{cases} \tag{6}$$
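The thresholding rule of Equation 6 is small enough to sketch directly. It is assumed here that the two counts are the numbers of labeled spam and ham training emails associated with the email's BMU, and that a unit with no associated emails defaults to ham; neither detail is stated in the text, and the function name is hypothetical.

```python
def label_email(spam_hits, ham_hits, threshold):
    """Label an email from the spam ratio of the unit it falls on.

    Assumptions: spam_hits and ham_hits are counts of labeled training
    emails on the email's BMU; an empty unit is treated as ham.
    """
    total = spam_hits + ham_hits
    ratio = spam_hits / total if total else 0.0
    return "spam" if ratio >= threshold else "ham"

print(label_email(3, 7, threshold=0.37))   # ratio 0.3, below the threshold
print(label_email(4, 6, threshold=0.37))   # ratio 0.4, at or above it
```

Raising the threshold makes the filter more conservative about calling a message spam, which is the lever for trading false positives against false negatives.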

3 RESULTS AND DISCUSSION

MATLAB was the main development environment for our anti-spam filter, using the SOM Toolbox (Vesanto et al., 2000) for the SOM architecture and visualization tools, and the Parallel Computing Toolbox (MathWorks, 2014) to reduce the high computational costs of the experiments.

Table 3: Modified characteristics and their tested values for all the 1260 configurations.

CHARACTERISTIC | TESTED VALUES
Normalization | Scenario 1 (None) or Scenario 2 (Variance is normalized to one, linear operation)
ICF | log, linear & sqrt
Top k% of words | 100, 95, 90, 75, 50, 25,
SOM size | 13x13, 20x20, 25x25, 40x40 & 50x50
SOM training algorithm | batch
Number of epochs | 100, 500, 1000, 3000, 5000 & 8000
Neighborhood function | gaussian
SOM shape | sheet
Lattice | hexagonal
Weight initialization | linear

The experiments have been performed using origi-

nal (non-normalized) and normalized data, which are

indicated as Scenario 1 and Scenario 2 respectively.

1260 different system conﬁgurations were developed,

which differed between them by varying several char-

acteristics related to the information environment and

the SOM structure, Table 3.
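The Scenario 2 normalization can be illustrated per attribute as follows. Table 3 only states that the variance is normalized to one by a linear operation; subtracting the mean and using the sample standard deviation are assumptions of this sketch, as is the function name.

```python
import statistics

def normalize_variance(column):
    """Linearly rescale one attribute to unit sample variance.

    Assumptions: the mean is also shifted to zero and the sample (n - 1)
    standard deviation is used.
    """
    mean = statistics.fmean(column)
    std = statistics.stdev(column)
    if std == 0:
        return [0.0 for _ in column]  # constant attribute: nothing to scale
    return [(value - mean) / std for value in column]

# Hypothetical values of one of the 13 attributes across four emails.
raw = [0.0, 2.0, 4.0, 10.0]
scaled = normalize_variance(raw)
print(scaled)
```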

The efﬁciency and quality of the proposed anti-

spam system were determined through the usage of

two different families of metrics:

• Quality of the SOM map (Tan and George, 2004): MQE and TE, which measure the map resolution (how accurately the inputs are reflected in the output space) and the topology preservation (the order of the map), respectively.

• On-use performance measures such as F-score, accuracy, precision, specificity, sensitivity, ROC curve and AUC. All of them can be expressed in terms of the elements of the confusion matrix: True Positives (TP), True Negatives (TN), FP and FN. This group might also include one simple and low-cost metric found during this research, which measures the least Euclidean distance between the ROC curve and the point of perfect classification, (0, 1).
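As a worked example of how these measures follow from the confusion matrix, the sketch below uses the standard definitions and the Scenario 2 counts reported in Table 4 (TP = 962, FP = 74, TN = 901, FN = 35). The function name is hypothetical, and the distance is computed for a single operating point rather than minimized over a whole ROC curve.

```python
import math

def classifier_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix measures plus the distance from the
    operating point (1 - specificity, sensitivity) to the point (0, 1)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    distance = math.hypot(1 - specificity, 1 - sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_score": f_score,
        "distance_to_0_1": distance,
    }

metrics = classifier_metrics(tp=962, fp=74, tn=901, fn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Rounded to six decimals, these values agree with the Scenario 2 column of Table 4.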

For evaluation purposes and comparison with other researchers’ results, the performance measures considered best are ROC curves and AUC (Fawcett, 2004; Metz, 1978; Slaby, 2007).
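A ROC curve and its AUC can be obtained by sweeping the confidence threshold, as in the following sketch. The threshold grid, the function names and the six toy emails are hypothetical, and the trapezoidal rule is one common way of integrating the curve, not necessarily the one used in the experiments.

```python
def roc_points(spam_ratios, is_spam, thresholds):
    """One (false positive rate, true positive rate) point per threshold,
    labeling an email as spam when its spam ratio reaches the threshold."""
    positives = sum(is_spam)
    negatives = len(is_spam) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for r, s in zip(spam_ratios, is_spam) if s and r >= t)
        fp = sum(1 for r, s in zip(spam_ratios, is_spam) if not s and r >= t)
        points.append((fp / negatives, tp / positives))
    return sorted(points)

def auc_trapezoid(points):
    """Area under a ROC curve given as sorted (fpr, tpr) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical spam ratios and true labels of six emails.
ratios = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [True, True, True, False, False, False]
thresholds = [i / 100 for i in range(102)]  # assumed grid from 0.00 to 1.01
auc = auc_trapezoid(roc_points(ratios, labels, thresholds))
print(auc)
```

The grid starts at zero, where everything is labeled spam, and ends just above one, where nothing is, so the curve spans from (1, 1) to (0, 0).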

Relative to the E1E5 dataset, results obtained with all metrics were quite positive. All the 1260 analyzed configurations obtained AUC > 0.90, and as many as 204 got AUC > 0.95, which can be described as “excellent” classifiers in the anti-spam context. Additionally, most metrics’ results were quite similar between configurations, even more so when comparing within the same scenario. It has been observed that all pairs of Top 100% and Top 95% configurations, with otherwise identical parameters, shared the same results. Consequently, Top 95% ones were preferred because of their faster learning due to using a smaller vocabulary. Furthermore, none of the best classifiers for each scenario used the biggest SOM size, 50x50, but smaller ones. Besides, it was found that normalized data behaved better, which usually happens with Kohonen networks. The MQE and TE obtained with normalized data are in the same range of values as other authors’ (Cabrera León et al., 2015).

Comparing our results in Table 4 with other re-

searchers’ in Table 5, the proposed system achieves

worse than desired

and

(around 7% and 3%,

Each

ROC

curve was drawn with 102 speciﬁcity and

sensitivity values, given by the same number of conﬁdence

thresholds, for enhanced ROC curve detail.

Self-Organizing Maps in the Design of Anti-spam Filters - A Proposal based on Thematic Categories

Table 5: Results obtained by some researchers, indicating dataset, methodology and metric utilized.

| RESEARCH | DATASET | METHOD | BEST RESULTS |
|---|---|---|---|
| (Metsis et al., 2006) | Enron-Spam | Bayesian (several) | Sensitivity = [0.9232 - 0.9753] |
| (Vrusias and Golledge, 2009b) | Enron-Spam | SOM | Precision = 0.992867; Sensitivity = 0.920067 |
| (Chuan et al., 2005) | SpamAssassin | LVQ | Precision = 0.9897; Sensitivity = 0.9358 |
| (Holden, 2004) | SpamAssassin | Bayesian (several, commercial) | Precision = [0.328 - 1]; Sensitivity = [0.837 - 0.988] |
| (Kufandirimbwa and Gotora, 2012) | SpamAssassin | Perceptron algorithm | Precision = 0.97149; Sensitivity = 0.77859 |
| (Luo and Zincir-Heywood, 2005) | Ling-Spam | Two-level SOMs + k-NN | Precision = [0.933 - 1]; Sensitivity = [0.675 - 0.975] |
| (Shunli and Qingshuang, 2010) | ECML-PKDD 2006 | Transductive SVM | AUC = 0.9321 |
| (Xie et al., 2009) | PU1 & PU2 | SVM (several) | Accuracy (PU1) = [0.926 - 0.941]; Accuracy (PU2) = [0.932 - 0.945] |

Table 4: Results of the anti-spam filter for each scenario (testing phase) with E1E5 (20%, 2069 emails) dataset.

| E1E5 (20%): PERFORMANCE MEASUREMENTS | Scenario 1 | Scenario 2 |
|---|---|---|
| AUC | 0.970809 | 0.977740 |
| Discrete AUC (threshold) | 0.924172 (38) | 0.944499 (37) |
| Accuracy | 0.924595 | 0.944726 |
| Precision | 0.898782 | 0.928571 |
| F-score | 0.927915 | 0.946385 |
| Specificity | 0.889344 | 0.924103 |
| Sensitivity | 0.959000 | 0.964895 |
| TP | 959 | 962 |
| FP | 108 | 74 |
| TN | 868 | 901 |
| FN | 41 | 35 |
| % FP | 10.44% | 7.15% |
| % FN | 3.96% | 3.38% |
| Distance to (0, 1) | 0.118007 | 0.083623 |

| SOM MAP QUALITY | Scenario 1 | Scenario 2 |
|---|---|---|
| MQE | 39.592070 | 0.452661 |
| TE | 0.118567 | 0.078204 |

respectively), which should be the priority for correction in future work, while its performance metrics show good and comparable values. Still, it should be noted that this comparison would have been more realistic if exactly the same emails and preprocessing methods had been tested with the other processing techniques. This is expected to be done in future work, together with a more advanced system.
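As a cross-check of how the threshold-dependent rows of Table 4 relate to the raw counts, the following minimal sketch recomputes them from TP, FP, TN and FN for Scenario 1. The function name `confusion_metrics` is hypothetical, and the formulas used for "Discrete AUC" (mean of sensitivity and specificity) and "Distance to (0, 1)" (Euclidean distance in ROC space) are assumptions inferred from the tabulated values, not definitions quoted from this section.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Derive threshold-dependent metrics from raw confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    # Assumed definition: distance from the ROC point
    # (1 - specificity, sensitivity) to the ideal corner (0, 1).
    distance = math.hypot(1 - specificity, 1 - sensitivity)
    # Assumed definition: area under the two-segment ROC curve
    # that passes through that single operating point.
    discrete_auc = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f_score": f_score,
        "distance": distance,
        "discrete_auc": discrete_auc,
    }

# Counts taken from the Scenario 1 column of Table 4.
scenario1 = confusion_metrics(tp=959, fp=108, tn=868, fn=41)
for name, value in scenario1.items():
    print(f"{name:>12}: {value:.6f}")
```

Under these assumptions the printed values agree with the Scenario 1 column of Table 4 to six decimal places; the % FP and % FN rows are not reproduced here because their denominators are not stated in this section.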

The optimal system configuration used normalized data (Scenario 2), Top 95% of words, ICF sqrt, Gaussian neighborhood, hexagonal lattice, 20x20 sheet-shaped map, and was trained for 8000 epochs with the batch algorithm. It outperformed its Scenario 1 counterpart, which used data without any kind of normalization, Top 25% of words, ICF log, Gaussian neighborhood, hexagonal lattice, 20x20 sheet-shaped map, and was trained for 100 epochs (Table 4 and Figure 2(a)).
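To make the ingredients of such a configuration concrete, the following is a minimal pure-Python sketch of batch SOM training with a Gaussian neighborhood on a sheet-shaped hexagonal lattice, followed by the two map-quality measures (MQE and TE). It is not the authors' implementation: the function names, the sample-based weight initialization, the exponentially shrinking radius and the synthetic 13-dimensional data are all assumptions made for illustration, and the toy run uses a 5x5 map and 15 epochs rather than 20x20 and 8000.

```python
import math
import random

def hex_grid(rows, cols):
    """Unit coordinates on a sheet-shaped hexagonal lattice (odd rows shifted)."""
    return [(c + 0.5 * (r % 2), r * math.sqrt(3) / 2)
            for r in range(rows) for c in range(cols)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_two(weights, x):
    """Indices of the best and second-best matching units for sample x."""
    order = sorted(range(len(weights)), key=lambda i: sq_dist(weights[i], x))
    return order[0], order[1]

def train_batch_som(data, rows, cols, epochs, seed=0):
    rng = random.Random(seed)
    grid = hex_grid(rows, cols)
    dim = len(data[0])
    weights = [list(rng.choice(data)) for _ in grid]    # assumed initialization
    sigma_start, sigma_end = max(rows, cols) / 2, 0.5   # assumed radius schedule
    for epoch in range(epochs):
        frac = epoch / max(epochs - 1, 1)
        sigma = sigma_start * (sigma_end / sigma_start) ** frac
        bmus = [best_two(weights, x)[0] for x in data]
        updated = []
        for j, old in enumerate(weights):
            num, den = [0.0] * dim, 0.0
            for x, b in zip(data, bmus):
                # Gaussian neighborhood between unit j and the sample's BMU.
                h = math.exp(-sq_dist(grid[j], grid[b]) / (2 * sigma ** 2))
                den += h
                for k in range(dim):
                    num[k] += h * x[k]
            # Batch rule: each weight becomes a neighborhood-weighted mean.
            updated.append([v / den for v in num] if den > 0 else old)
        weights = updated
    return grid, weights

def map_quality(data, grid, weights):
    """Mean quantization error (MQE) and topographic error (TE)."""
    qe, te = 0.0, 0
    for x in data:
        b1, b2 = best_two(weights, x)
        qe += math.sqrt(sq_dist(weights[b1], x))
        if sq_dist(grid[b1], grid[b2]) > 1.0 + 1e-9:   # BMUs are not adjacent
            te += 1
    return qe / len(data), te / len(data)

# Toy run: 120 synthetic 13-dimensional vectors with components in [0, 1).
rng = random.Random(42)
data = [[rng.random() for _ in range(13)] for _ in range(120)]
grid, weights = train_batch_som(data, rows=5, cols=5, epochs=15)
mqe, te = map_quality(data, grid, weights)
print(f"MQE = {mqe:.4f}, TE = {te:.4f}")
```

In this sketch, neighboring units on the hexagonal lattice are exactly one unit apart, which is what the TE check relies on. The magnitude of MQE depends on the scale of the input vectors, which is consistent with the large gap between the MQE values of the non-normalized and normalized scenarios in Table 4.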

The proposed anti-spam filter was further tested with other datasets, different from the one used during its design (E1E5). Indeed, the datasets were mixed in order to analyze both best configurations in a more realistic situation (i.e. with emails from diverse origins and in different proportions):

• E2E4: from different folders of the same email corpus, “Enron-Spam”.

• E2E4 + SA2: a mix of emails from “Enron-Spam” and “SpamAssassin_2”.

• E1E5 + E2E4 + SA2: an even more realistic mix of the three datasets used, including unseen emails from the dataset utilized during the design (E1E5) and without taking into account the mail distribution among them, using the whole set of emails.

Regarding Scenario 2, the one that achieved better performance, the results obtained for the previous dataset mixtures range from excellent performance (AUC over 0.964 and 0.906 with E2E4 and E1E5 + E2E4 + SA2, respectively) to satisfactory performance (AUC > 0.778 with E2E4 + SA2), as seen in Table 6 and Figure 2(b). These values reflect a coherent behavior and a good performance of the proposed system.
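The AUC values discussed above summarize ROC curves built from a sweep of confidence thresholds. The following minimal sketch shows one way such a curve and its area can be obtained with 102 thresholds. The function names, the threshold grid (0.00 to 1.01 in steps of 0.01), the assumption that scores lie in [0, 1] with label 1 as the positive class, and the toy scores are all illustrative assumptions; how the filter's confidence score is actually computed is not described in this section.

```python
def roc_points(scores, labels, n_thresholds=102):
    """(false-positive rate, sensitivity) pairs for evenly spaced thresholds."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for i in range(n_thresholds):
        # Assumed grid: 0.00, 0.01, ..., 1.00, 1.01 when n_thresholds == 102.
        t = i / (n_thresholds - 2)
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        points.append((fp / neg, tp / pos))
    return sorted(points)   # ascending false-positive rate

def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the sorted ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy scores: higher means "more likely positive"; labels use 1 for positive.
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
points = roc_points(scores, labels)
auc = auc_trapezoid(points)
print(len(points), round(auc, 3))
```

The lowest threshold classifies every email as positive and the highest classifies none, so the curve always spans from (0, 0) to (1, 1); a finer threshold grid only adds detail between those two end points.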

When the analyzed emails share similar characteristics (e.g. content, topics and origins) with the dataset used for design and training, performance is excellent. The system is still able to achieve good to very good performance with newly received emails whose attributes are highly dissimilar. These differences account for the results in each case (Tables 4 and 6). At the same time, results are quite promising and indicative of the goodness of the proposed methods.

Still, a generalization of the thematic categories, of the words inside them, or of both might be one of the potential solutions in order to improve even

NCTA 2016 - 8th International Conference on Neural Computation Theory and Applications

Table 6: Results of the anti-spam filter for each scenario (testing phase) with E2E4 (2000 random emails), E2E4 + SA2 (2000 random, 50% each) and E1E5 + E2E4 + SA2 (20% of E1E5, all E2E4 and SA2, 16719 emails) datasets.

| PERFORMANCE MEASUREMENTS | E2E4, Scenario 1 | E2E4, Scenario 2 | E2E4 + SA2, Scenario 1 | E2E4 + SA2, Scenario 2 | E1E5 + E2E4 + SA2, Scenario 1 | E1E5 + E2E4 + SA2, Scenario 2 |
|---|---|---|---|---|---|---|
| AUC | 0.951306 | 0.964116 | 0.730900 | 0.778584 | 0.886687 | 0.906290 |
| Discrete AUC (threshold) | 0.890036 (71) | 0.925391 (48) | 0.709345 (71) | 0.726712 (27) | 0.835745 (71) | 0.856466 (48) |
| Accuracy | 0.889952 | 0.925509 | 0.707214 | 0.724960 | 0.835752 | 0.856403 |
| Precision | 0.897137 | 0.918835 | 0.646072 | 0.677951 | 0.802223 | 0.816712 |
| F-score | 0.890995 | 0.927034 | 0.752889 | 0.750601 | 0.844410 | 0.864702 |
| Specificity | 0.895135 | 0.915401 | 0.516667 | 0.612735 | 0.780211 | 0.794249 |
| Sensitivity | 0.884937 | 0.935381 | 0.902023 | 0.840689 | 0.891279 | 0.918684 |
| TP | 846 | 883 | 847 | 781 | 7001 | 7174 |
| FP | 97 | 78 | 464 | 371 | 1726 | 1610 |
| TN | 828 | 844 | 496 | 587 | 6127 | 6215 |
| FN | 110 | 61 | 92 | 148 | 854 | 635 |
| % FP | 9.97% | 8.01% | 46.44% | 37.13% | 20.80% | 19.11% |
| % FN | 10.71% | 5.94% | 9.19% | 14.78% | 10.13% | 7.53% |
| Distance to (0, 1) | 0.155679 | 0.106454 | 0.493164 | 0.418753 | 0.245208 | 0.221237 |

Figure 2: (a) ROC curve of the anti-spam filter for each scenario (testing phase, “E1E5” dataset). (b) Comparison of the ROC curves obtained by the proposed system for Scenario 2 with several datasets.

more the performance of the proposed system when dealing with any email type. Another solution might be updating the filter periodically with new messages (i.e. providing it with online-learning capabilities) to counteract the evolution of both spam and ham content and topics. The latter phenomenon is known as “topic drift” (Wang et al., 2013) and it is related to the general problem of “concept drift” in ML: unexpected variations of the statistical properties of the target variable over time, which usually imply increasingly inaccurate predictions over the course of time (Gama et al., 2014; Žliobaitė et al., 2015).

Compared with the anti-spam filters proposed by Metsis et al. (2006), from whom most of the datasets used in this paper were obtained, our results in Table 4 are comparable to theirs (Table 5), and our ROC curves in Figure 2(a) are similar to their separate curves for the “Enron1” and “Enron5” folders: specificity = 0.94433 and sensitivity = 0.96436 (averaged over only those two folders). The main difference is that their filters use a non-neural methodology with a supervised learning strategy and up to 3000 attributes, while our proposal is a SOM-based system which used only 13 for analogous results. Consequently, we could infer that we utilized both appropriate preprocessing methods, which let us obtain smaller yet more informative input vectors, and a powerful processing tool, which is able to work with such unlabeled vectors.

4 CONCLUSIONS

In this paper a hybrid and modular anti-spam filtering system based on Kohonen Self-Organizing Maps and supervised non-neural computing has been presented. It has been shown that thematic categories can be found in spam and ham; accordingly, both spam and ham words have been classified into the 13 categories found. The proposed system is robust to word obfuscation, quite frequent in spam, and it does not need stemming or lemmatization algorithms, unlike other anti-spam filters.

All the studied configurations obtained good results in all metrics with E1E5, the dataset used during the design of the system. Results were identical when using the whole set (Top 100%) of keywords from each of the 13 categories or just the Top 95%, which also lowered runtime. Our optimal configuration was attained with normalized data, which is usual with Kohonen SOMs. The results obtained were similar to those of other researchers over the same corpus (Metsis et al., 2006), though they use a number of Bayesian methods and input vectors with a dimensionality about two orders of magnitude greater than ours (up to 3000 versus 13).

The developed anti-spam filter was additionally tested with data different from those used during its design, yielding important findings. Results with the E2E4 dataset were similar, but worse with the non-Enron SA2. Testing with a mix of the previous three datasets in different numbers and proportions confirmed that the filter’s detection power was affected when newer and off-topic spam and ham were encountered. This is common to other offline-training anti-spam solutions because topics drift over time as both spam and spammers’ techniques evolve (Wang et al., 2013). This situation can be addressed with periodic retraining (as in online-training filters) or, alternatively, by improving the generalization of the system or of the thematic categories.

The results obtained confirmed the goodness and high quality of the proposed system. The usage of computational intelligence methods and hybrid schemes for designing anti-spam filtering systems was quite beneficial. Both facts encourage us to continue research on these topics. A major upgrade might be the use of powerful hybrid neural architectures such as the Counterpropagation network or the Hybrid Unsupervised Modular Adaptive Neural Network (Suárez Araujo et al., 2010).

ACKNOWLEDGEMENTS

The authors would like to thank all six reviewers for their valuable comments, which led to the improvement of this article.

REFERENCES

Bekkerman, R., McCallum, A., and Huang, G. (2004). Automatic categorization of email into folders: Benchmark experiments on Enron and SRI corpora. Technical Report IR-418, Center for Intelligent Information Retrieval - University of Massachusetts Amherst.

Blanco, Á., Ricket, A. M., and Martín-Merino, M. (2007). Combining SVM classifiers for email anti-spam filtering. In Sandoval, F., Prieto, A., Cabestany, J., and Graña, M., editors, Computational and Ambient Intelligence, volume 4507 of Lecture Notes in Computer Science. 9th International Work-Conference on Artificial Neural Networks, IWANN 2007, pages 903–910. Springer Berlin Heidelberg, San Sebastián, Spain.

Borovicka, T., Jirina Jr., M., Kordik, P., and Jirina, M. (2012). Selecting Representative Data Sets. In Karahoca, A., editor, Advances in Data Mining Knowledge Discovery and Applications. InTech.

Bruce, J. (2012). Grey Mail: The New Email Nuisance To Hit Your Inbox.

Cabrera León, Y. and Acosta Padrón, O. (2011). Spam: Definition, statistics, anti-spam methods and legislation. Course project, Politechnika Wroclawska, Wroclaw, Poland.

Cabrera León, Y., Suárez Araujo, C. P., and García Báez, P. (2015). Análisis del Uso de las Redes Neuronales Artificiales en el Diseño de Filtros Antispam: una Propuesta Basada en Arquitecturas Neuronales No Supervisadas. Final project, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria.

Chapelle, O., Schölkopf, B., and Zien, A. (2006). Semi-Supervised Learning, volume 2. MIT Press, Cambridge, MA, USA.

Chhabra, P., Wadhvani, R., and Shukla, S. (2010). Spam Filtering using Support Vector Machine. In Special Issue of IJCCT Vol.1 Issue 2, 3, 4; 2010 for International Conference [ACCTA-2010], pages 166–171.

Chuan, Z., Xianliang, L., Mengshu, H., and Xu, Z. (2005). A LVQ-based neural network anti-spam email approach. ACM SIGOPS Operating Systems Review, 39(1):34–39 (6).

Cohen, W. W. (2004). Enron Email Dataset.

Cormack, G. V. and Mojdeh, M. (2009). Machine learning for information retrieval: TREC 2009 web, relevance feedback and legal tracks. In The Eighteenth Text REtrieval Conference Proceedings (TREC 2009), pages 1–9, Gaithersburg, MD, USA.

Drucker, H., Wu, D., and Vapnik, V. N. (1999). Support Vector Machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048–1054.

Erickson, D., Casado, M., and McKeown, N. (2008). The Effectiveness of Whitelisting: A User-Study. In Proc. of Conference on Email and Anti-Spam, pages 1–10.

Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine Learning, 31(1):1–38.

Feroze, M. A., Baig, Z. A., and Johnstone, M. N. (2015). A Two-Tiered User Feedback-based Approach for Spam Detection. In Becker Westphall, C., Borcoci, E., and Manoharan, S., editors, ICSNC 2015: The Tenth International Conference on Systems and Networks Communications, November 15-20, 2015, Barcelona, Spain, pages 12–17. Curran Associates, Inc, Red Hook, NY.

Freschi, V., Seraghiti, A., and Bogliolo, A. (2006). Filtering obfuscated email spam by means of phonetic string matching. In Advances in Information Retrieval, pages 505–509. Springer.

Fumera, G., Pillai, I., and Roli, F. (2006). Spam filtering based on the analysis of text information embedded into images. The Journal of Machine Learning Research, 7:2699–2720.

Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. (2014). A Survey on Concept Drift Adaptation. ACM Computing Surveys (CSUR), 46(4):1–37.

Gao, Y., Yan, M., and Choudhary, A. (2009). Semi Supervised Image Spam Hunter: A Regularized Discriminant EM Approach. In International Conference on Advanced Data Mining and Applications, pages 152–164. Springer Berlin Heidelberg.

Graham-Cumming, J. (2006). SpamOrHam. Virus Bulletin, pages 22–24.

Guenter, B. (1998). SPAM Archive: Email spam received yearly, since early 1998.

Guzella, T. S. and Caminhas, W. M. (2009). A review of machine learning approaches to Spam filtering. Expert Systems with Applications, 36(7):10206–10222.

Harris, E. (2003). The Next Step in the Spam Control War: Greylisting.

Haykin, S. S. (1999). Neural Networks. A Comprehensive Foundation. Prentice-Hall International, Ontario, Canada, 2nd edition.

Holden, S. (2004). Spam Filtering II: Comparison of a number of Bayesian anti-spam filters over different email corpora.

Hovold, J. (2005). Naive Bayes Spam Filtering Using Word-Position-Based Attributes. In CEAS.

Kohonen, T. (2001). Self-Organizing Maps. Springer-Verlag New York, Secaucus, NJ, USA, 3rd edition.

Kohonen, T. (2013). Essentials of the self-organizing map. Neural Networks, 37:52–65.

Kolcz, A., Chowdhury, A., and Alspector, J. (2004). The impact of feature selection on signature-driven spam detection. In Proceedings of the 1st Conference on Email and Anti-Spam (CEAS-2004), pages 1–8.

Kucherawy, M. and Crocker, D. (2012). RFC 6647 - Email Greylisting: An Applicability Statement for SMTP. Proposed standard.

Kufandirimbwa, O. and Gotora, R. (2012). Spam Detection Using Artificial Neural Networks (Perceptron Learning Rule). Online Journal of Physical and Environmental Science Research, 1(2):22–29.

Lertnattee, V. and Theeramunkong, T. (2004). Analysis of inverse class frequency in centroid-based text classification. volume 2, pages 1171–1176. IEEE.

Lieb, R. (2002). Make Spammers Pay Before You Do.

Liu, C. and Stamm, S. (2007). Fighting Unicode-obfuscated spam. In Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, pages 45–59. ACM.

Lowd, D. and Meek, C. (2005). Good Word Attacks on Statistical Spam Filters. In Proceedings of the Second Conference on Email and Anti-Spam (CEAS), pages 1–8.

Luo, X. and Zincir-Heywood, N. (2005). Comparison of a SOM based sequence analysis system and naive Bayesian classifier for spam filtering. In Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, volume 4, pages 2571–2576.

Malathi, R. (2011). Email Spam Filter using Supervised Learning with Bayesian Neural Network. Computer Science, HH The Rajah’s College, Pudukkottai-622, 1:89–100.

Mason, J. (2009). Filtering Spam With SpamAssassin.

MathWorks (2014). Parallel Computing Toolbox for Matlab R2014a - User’s Guide.

McAfee and ICF International (2009). The Carbon Footprint of Email Spam Report. Technical report.

Metsis, V., Androutsopoulos, I., and Paliouras, G. (2006). Spam Filtering with Naive Bayes - Which Naive Bayes? In CEAS 2006 - Third Conference on Email and Anti-Spam, pages 27–28, Mountain View, California, USA.

Metz, C. E. (1978). Basic principles of ROC analysis. In Seminars in Nuclear Medicine, volume 8, pages 283–298. Elsevier.

Meyer, T. A. and Whateley, B. (2004). SpamBayes: Effective open-source, Bayesian based, email classification system. In CEAS. Citeseer.

Mojdeh, M. and Cormack, G. V. (2008). Semi-supervised spam filtering: Does it work? In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 745–746, Singapore. ACM.

Narisawa, K., Bannai, H., Hatano, K., and Takeda, M. (2007). Unsupervised spam detection based on string alienness measures. In Discovery Science, pages 161–172. Springer-Verlag Berlin Heidelberg.

Resnick, P., editor (2008). RFC 5322 - Internet Message Format. Draft standard.

Pfahringer, B. (2006). A semi-supervised spam mail detector. pages 1–5, Berlin, Germany.

Pitsillidis, A., Levchenko, K., Kreibich, C., Kanich, C., Voelker, G. M., Paxson, V., Weaver, N., and Savage, S. (2010). Botnet judo: Fighting spam with itself. In Symposium on Network and Distributed System Security (NDSS), pages 1–19.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

Postini, Inc (2004). The shifting tactics of spammers: What you need to know about new email threats. White paper.

Qian, F., Pathak, A., Hu, Y. C., Mao, Z. M., and Xie, Y. (2010). A Case for Unsupervised-Learning-based Spam Filtering. volume 38, pages 367–368. ACM.

Ramachandran, A. and Feamster, N. (2006). Understanding the network-level behavior of spammers. In ACM SIGCOMM Computer Communication Review, volume 36, pages 291–302.

Rao, J. M. and Reiley, D. H. (2012). The Economics of Spam. Journal of Economic Perspectives, 26(3):87–110.

Rojas, R. (1996). Kohonen Networks. In Neural Networks: A Systematic Introduction, pages 391–412. Springer-Verlag, Berlin.

Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. (1998). A Bayesian Approach to Filtering Junk E-mail. AAAI Technical Report WS-98-05, Madison, Wisconsin.

Santos, I., Sanz, B., Laorden, C., Brezo, F., and Bringas, P. G. (2011). Computational Intelligence in Security for Information Systems: 4th International Conference, CISIS 2011, Held at IWANN 2011. Torremolinos-Málaga, Spain.

Sculley, D., Wachman, G., and Brodley, C. E. (2006). Spam Filtering Using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers. In TREC.

Shunli, Z. and Qingshuang, Y. (2010). Personal Spam Filter by Semi-supervised Learning. In Proceedings of the Third International Symposium on Computer Science and Computational Technology (ISCSCT ’10), pages 171–174, Jiaozuo, P. R. China.

Skillicorn, D. (2013). Other versions of the Enron data (preprocessed).

Slaby, A. (2007). ROC Analysis with Matlab. In Information Technology Interfaces, 2007. ITI 2007. 29th International Conference on, pages 191–196. IEEE.

Spammer-X, Posluns, J., and Sjouwerman, S. (2004). Inside the SPAM Cartel. Syngress - Elsevier, 1st edition.

Sprengers, M. and Heskes, T. T. (2009). The Effects of Different Bayesian Poison Methods on the Quality of the Bayesian Spam Filter ‘SpamBayes’. Bachelor thesis, Radboud University Nijmegen.

Statista (2016). Global spam volume as percentage of total e-mail traffic from 2007 to 2014.

Styler, W. (2011). The EnronSent Corpus. Technical Report 01-2011, University of Colorado at Boulder Institute of Cognitive Science.

Suárez Araujo, C. P., García Báez, P., and Hernández Trujillo, Y. (2010). Neural Computation Methods in the Determination of Fungicides. In Carisse, O., editor, Fungicides. INTECH Open Access Publisher.

Subramaniam, T., Jalab, H. A., and Taqa, A. Y. (2010). Overview of textual anti-spam filtering techniques. International Journal of the Physical Science, 5(12):1869–1882.

Tan, H. S. and George, S. E. (2004). Investigating Learning Parameters in a Standard 2-D SOM Model to Select Good Maps and Avoid Poor Ones. In Australasian Joint Conference on Artificial Intelligence, pages 425–437. Springer.

The Apache SpamAssassin Project (2013). Index of the SpamAssassin’s Public Corpus.

The Apache SpamAssassin Project (2014). SpamAssassin v3.3.x: Tests Performed to Determine Spaminess and Haminess of a Message.

Uemura, T., Ikeda, D., and Arimura, H. (2008). Unsupervised spam detection by document complexity estimation. In Discovery Science, pages 319–331.

Vesanto, J., Himberg, J., Alhoniemi, E., and Parhankangas, J. (2000). SOM Toolbox for Matlab 5. Technical Report A57, Helsinki University of Technology.

Vrusias, B. L. and Golledge, I. (2009a). Adaptable Text Filters and Unsupervised Neural Classifiers for Spam Detection. In Proceedings of the International Workshop on Computational Intelligence in Security for Information Systems CISIS’08, volume 53 of Advances in Soft Computing, pages 195–202. Springer Berlin Heidelberg.

Vrusias, B. L. and Golledge, I. (2009b). Online Self-Organised Map Classifiers as Text Filters for Spam Email Detection. Journal of Information Assurance and Security (JIAS), 4(2):151–160.

Wang, D., Irani, D., and Pu, C. (2013). A Study on Evolution of Email Spam Over Fifteen Years. pages 1–10, Atlanta, Georgia (USA). IEEE.

Wang, D. and Zhang, H. (2013). Inverse-Category-Frequency Based Supervised Term Weighting Schemes for Text Categorization. Journal of Information Science & Engineering, 29(2):209–225.

Wittel, G. L. and Wu, S. F. (2004). On Attacking Statistical Spam Filters. In CEAS.

Xie, C., Ding, L., and Du, X. (2009). Anti-spam Filters Based on Support Vector Machines. In Advances in Computation and Intelligence. 4th International Symposium, ISICA 2009, volume 5821 of Lecture Notes in Computer Science, pages 349–357. Springer Berlin Heidelberg, Huangshi, China.

Xu, J.-M., Fumera, G., Roli, F., and Zhou, Z.-H. (2009). Training SpamAssassin with Active Semi-supervised Learning. In Proceedings of the 6th Conference on Email and Anti-Spam (CEAS’09), pages 1–8. Citeseer.

Yerazunis, W., Kato, M., Kori, M., Shibata, H., and Hackenberg, K. (2010). Keeping the Good Stuff In: Confidential Information Firewalling with the CRM114 Spam Filter & Text Classifier. White Paper for Black Hat USA, pages 1–18.

Zeimpekis, D., Kontopoulou, E. M., and Gallopoulos, E. (2011). Text to Matrix Generator (TMG).

Zhang, Y. (2012). Lecture for Chapter 2 - Data Preprocessing.

Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). Learning with local and global consistency. Advances in neural information processing systems, 16:321–328.

Zhou, D., Burges, C. J. C., and Tao, T. (2007). Transductive link spam detection. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, pages 21–28.

Žliobaitė, I., Pechenizkiy, M., and Gama, J. (2015). An overview of concept drift applications. In Japkowicz, N. and Stefanowski, J., editors, Big Data Analysis: New Algorithms for a New Society, volume 16 of Studies in Big Data, pages 91–114. Springer International Publishing.
