EMBEDDED FEATURE SELECTION FOR SPAM AND PHISHING
FILTERING USING SUPPORT VECTOR MACHINES
Sebastián Maldonado (1) and Gastón L'Huillier (2)
(1) Faculty of Engineering and Applied Sciences, Universidad de los Andes, Av. San Carlos de Apoquindo 2200, Las Condes, Santiago, Chile
(2) Groupon, Inc., Palo Alto, CA, U.S.A.
Keywords:
Spam and phishing filtering, Support vector machines, Feature selection, Embedded methods.
Abstract:
Today, the Internet is full of harmful and wasteful elements, such as phishing and spam messages, which must
be properly classified before reaching end-users. This issue has attracted the pattern recognition community's
attention and motivated efforts to determine which strategies achieve the best classification results. Several methods use
as many features as the data set has content-based properties, which leads to a high-dimensional classification
problem. In this context, this paper presents a feature selection approach that simultaneously determines a non-
linear classification function with minimal error and minimizes the number of features by penalizing their use
in the dual formulation of binary Support Vector Machines (SVM). The method optimizes the width of an
anisotropic RBF kernel via successive gradient descent steps, eliminating features that have low relevance
for the model. Experiments with two real-world spam and phishing data sets demonstrate that our approach
achieves the best performance compared to well-known feature selection methods, consistently using a
small number of features.
1 INTRODUCTION
One particular domain for which machine learning has been considered a key component is cyber-security, specifically the correct identification of the large number of spam messages, web spam, and spam servers that inundate Internet resources every day. It is likely that spam messages will continue to be among the most wasteful, dangerous and infectious elements on the Web, as new campaigns are occasionally instigated by spam senders (Taylor et al., 2007).
Identifying malicious emails such as spam or phishing can be framed as a binary classification task whose goal is to discriminate between the two classes of "desired" and "undesired" emails. Support Vector Machines (SVMs) (Vapnik, 1998) are an effective classification method, offering several advantages such as the absence of local minima, adequate generalization to new objects, and a representation that depends on few parameters. Furthermore, this method has proved to be very effective for spam (Tang et al., 2008) and phishing (L'Huillier et al., 2010) classification. However, this approach does not determine the importance of the features used by a classifier (Maldonado and Weber, 2009). In this paper we present a feature selection approach for binary classification using SVM, showing its potential for spam and phishing classification.
This paper is organized as follows. In Section 2
we briefly introduce spam and phishing classification.
Recent developments for feature selection using SVM
are reviewed in Section 3. Section 4 presents the pro-
posed feature selection method based on SVM. Ex-
perimental results using real-world data sets are given
in Section 5. A summary of this paper can be found
in Section 6, where we provide its main conclusions
and address future developments.
2 SPAM AND PHISHING
CLASSIFICATION
Among all counter-measures used against spam and
phishing, there are two main alternatives (Bergholz
et al., 2010): content-based classification on the one
hand and blacklisting and white-listing on the other.
In the following, the main approaches for these alter-
natives are briefly reviewed.
2.1 Content-based Classification
Spam filtering is a classical problem in machine learn-
ing, and many filtering techniques have been de-
scribed (Goodman et al., 2007). However, in terms of
content-based classification, phishing differs in many
aspects from the spam case. While most spam emails are intended to spread information about products and web sites, in phishing the interaction between a message and its receiver is more complex. End users are usually involved in a further step of interaction, such as following malicious links, filling in deceptive forms, or replying with information that is relevant for the fraudulent message to succeed.
Previous work on content-based filtering of deceptive spam or phishing emails has focused on extracting a large number of features that are then used by popular machine learning techniques for classification (Bergholz et al., 2010).
2.2 Network-based Classification
Real-Time Blacklists (RBLs) have been considered an efficient alternative for filtering spam messages, relying only on server-side features for spam sender detection. These services can be queried over the Domain Name System (DNS) protocol, which provides a powerful tool for email servers to decide whether or not to accept messages from a given host (Tang et al., 2008). These approaches are based on features extracted from network properties rather than from content-based characteristics; hence the dimensionality of the classification problem is considerably lower and the features' properties differ from those in content-based approaches. For this reason, these approaches were not considered in this paper.
3 EMBEDDED FEATURE
SELECTION FOR SVMS
There are different strategies for embedded feature se-
lection. First, feature selection can be seen as an op-
timization problem. For example, the methods pre-
sented in (Neumann et al., 2005) add an extra term
that penalizes the cardinality of the selected feature
subset to the standard cost function of SVM. By op-
timizing this modified cost function, features are selected simultaneously with model construction. Another
embedded approach is the Feature Selection Con-
caVe (FSV) (Bradley and Mangasarian, 1998), based
on the minimization of the "zero norm": $\|w\|_0 = |\{i : w_i \neq 0\}|$. Note that $\|\cdot\|_0$ is not a norm because the triangle inequality does not hold (Bradley and Mangasarian, 1998), unlike the $l_p$-norms with $p > 0$. Since the $l_0$-"norm" is non-smooth, it was approximated by a concave function:

$$\|w\|_0 \approx e^T \left(e - \exp(-\beta |w|)\right) \quad (1)$$

with an approximation parameter $\beta \in \mathbb{R}_+$ and $e = (1, \ldots, 1)^T$. The problem is finally solved by using an iterative method called the Successive Linearization Algorithm (SLA) for FSV (Bradley and Mangasarian, 1998). (Weston et al., 2003) proposed an alternative approach for zero-"norm" minimization ($l_0$-SVM) by iteratively scaling the variables, multiplying them by the absolute value of the weight vector $w$. An important drawback of these methods is that they are limited to linear classification functions (Guyon et al., 2006).
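To make the approximation in (1) concrete, the following minimal sketch (Python with NumPy; not part of the original work) compares the exact zero-"norm" with its concave surrogate for a sample weight vector. The value $\beta = 5$ is the setting suggested later in Section 4.

```python
import numpy as np

def zero_norm(w):
    """Exact 'zero norm': number of non-zero components of w."""
    return np.count_nonzero(w)

def zero_norm_approx(w, beta=5.0):
    """Concave approximation (1): e^T (e - exp(-beta * |w|))."""
    return float(np.sum(1.0 - np.exp(-beta * np.abs(w))))

w = np.array([0.0, 0.8, 0.0, -1.5, 0.02])
print(zero_norm(w))               # 3 non-zero weights
print(zero_norm_approx(w, 5.0))   # about 2.08: weights near zero contribute almost nothing
```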
Several embedded approaches consider backward
feature elimination in order to establish a ranking
of features, using SVM-based contribution measures
to evaluate their relevance. One popular method is
known as Recursive Feature Elimination (SVM-RFE)
(Guyon et al., 2009). The goal of this approach is to
find a subset of size r among n variables (r < n) which
maximizes the classifier’s performance. The feature
to be removed in each iteration is the one whose re-
moval minimizes the variation of $W^2(\alpha)$:

$$W^2(\alpha) = \sum_{i,s=1}^{m} \alpha_i \alpha_s y_i y_s K(x_i, x_s) \quad (2)$$

The scalar $W^2(\alpha)$ is a measure of the model's predictive ability and is inversely proportional to the margin. Features are eliminated by applying the following procedure:
1. Given a solution $\alpha$, calculate, for each feature p:

$$W^2_{(p)}(\alpha) = \sum_{i,s=1}^{m} \alpha_i \alpha_s y_i y_s K(x^{(p)}_i, x^{(p)}_s) \quad (3)$$

where $x^{(p)}_i$ represents the training object i with feature p removed.

2. Eliminate the feature with the smallest value of $|W^2(\alpha) - W^2_{(p)}(\alpha)|$ (see the sketch below).
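As an illustration only, the following sketch implements one elimination step of this ranking procedure using criteria (2) and (3). It assumes an RBF kernel and obtains the dual coefficients from scikit-learn's SVC; the helper names (rfe_step, w2) and the specific kernel parameterization are ours, not part of the original SVM-RFE code.

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(X, gamma):
    """Isotropic RBF kernel matrix (an assumed kernel choice for this sketch)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def w2(alpha_y, K):
    """Criterion (2): W^2(alpha) = sum_{i,s} alpha_i alpha_s y_i y_s K(x_i, x_s)."""
    return float(alpha_y @ K @ alpha_y)

def rfe_step(X, y, features, C=1.0, gamma=0.1):
    """One SVM-RFE iteration: remove the feature whose removal changes W^2 the least."""
    svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[:, features], y)
    alpha_y = np.zeros(len(y))
    alpha_y[svm.support_] = svm.dual_coef_[0]      # dual_coef_ holds alpha_i * y_i
    base = w2(alpha_y, rbf_kernel(X[:, features], gamma))
    deltas = [abs(base - w2(alpha_y, rbf_kernel(X[:, np.delete(features, p)], gamma)))
              for p in range(len(features))]
    return np.delete(features, int(np.argmin(deltas)))   # smallest variation (step 2)
```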
Another ranking method that allows kernel func-
tions was proposed in (Rakotomamonjy, 2003), which
considers a leave-one-out error bound for SVM, the radius-margin bound (Vapnik, 1998), $LOO \leq 4R^2\|w\|^2$, where $R$ denotes the radius of the smallest sphere that contains the training data. This bound is also used in (Weston et al., 2001) through the scaling factors strategy. Feature selection is performed by scaling the input parameters by a vector $\sigma \in [0, 1]^n$.
Large values of $\sigma_j$ indicate more useful features. The problem consists in choosing the best kernel of the form:

$$K_\sigma(x_i, x_s) \equiv K(\sigma * x_i, \sigma * x_s) \quad (4)$$

where $*$ is the component-wise multiplication operator. The method presented by (Weston et al., 2001) considers the gradient descent algorithm for updating $\sigma$. (Canu and Grandvalet, 2002) propose to limit the use of the attributes by constraining the scaling factors using a parameter $\sigma_0$, which controls the norm of $\sigma$.
4 THE PROPOSED METHOD FOR
EMBEDDED FEATURE
SELECTION
An embedded method for feature selection using
SVMs is proposed in this section. The main idea is
to penalize the use of features in the dual formulation
of SVMs using a gradient descent approximation for
Kernel optimization and feature elimination. The pro-
posed method attempts to find the most suitable RBF-
type Kernel function for each problem with a minimal
dimension by combining the parameters of general-
ization (using the 2-norm), goodness of fit, and fea-
ture selection (using a 0-“norm” approximation).
For this approach we use the anisotropic Gaussian kernel:

$$K(x_i, x_s, \sigma) = \exp\left(-\frac{\|\sigma * x_i - \sigma * x_s\|^2}{2}\right) \quad (5)$$

where $*$ denotes the component-wise vector product operator, defined as $a * b = (a_1 b_1, \ldots, a_n b_n)$.
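A minimal sketch of kernel (5), assuming the rows of X1 and X2 are observations and sigma is a non-negative width vector:

```python
import numpy as np

def anisotropic_rbf(X1, X2, sigma):
    """Anisotropic Gaussian kernel (5): exp(-||sigma*x_i - sigma*x_s||^2 / 2)."""
    D1 = X1 * sigma                          # component-wise scaling of every row
    D2 = X2 * sigma
    sq = (np.sum(D1 ** 2, axis=1)[:, None]
          + np.sum(D2 ** 2, axis=1)[None, :]
          - 2.0 * D1 @ D2.T)                 # squared Euclidean distances after scaling
    return np.exp(-0.5 * sq)
```

Note that setting a component $\sigma_j$ to zero removes feature j from the kernel entirely, which is what the elimination step of the proposed method exploits.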
The proposed approach (Kernel-Penalized SVM)
incorporates feature selection in the dual formulation
of SVMs. The formulation includes a penalization function $f(\sigma)$, based on the zero-"norm" approximation (1) described in Section 3, and modifies the Gaussian kernel by using an (anisotropic) width vector $\sigma$ as a decision variable. The feature penalization term should be
negative since the dual SVM is a maximization prob-
lem. The following embedded formulation of SVMs
for feature selection is proposed:
$$\max_{\alpha, \sigma} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,s=1}^{m} \alpha_i \alpha_s y_i y_s K(x_i, x_s, \sigma) - C_2 f(\sigma) \quad (6)$$

subject to

$$\sum_{i=1}^{m} \alpha_i y_i = 0,$$
$$0 \leq \alpha_i \leq C \quad \forall i \in \{1, \ldots, m\},$$
$$\sigma_j \geq 0 \quad \forall j \in \{1, \ldots, n\}.$$
Notice that the values of $\sigma$ are always considered to be positive, in contrast to the weight vector $w$ in formulation (1), since it is desirable that the kernel widths be positive values (Maldonado et al., 2011). Considering the "zero norm" approximation described in (1), $\|\sigma\|_0 \approx e^T(e - \exp(-\beta|\sigma|))$, and since $|\sigma_j| = \sigma_j \; \forall j$, it is not necessary to use the absolute value in the approximation.
The following feature penalization function is proposed, where the approximation parameter $\beta$ is also considered. In (Bradley and Mangasarian, 1998), the authors suggest setting $\beta$ to 5:

$$f(\sigma) = e^T\left(e - \exp(-\beta\sigma)\right) = \sum_{j=1}^{n} \left[1 - \exp(-\beta\sigma_j)\right] \quad (7)$$
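A one-line sketch of the penalization (7), with $\beta = 5$ as suggested:

```python
import numpy as np

def feature_penalty(sigma, beta=5.0):
    """Penalization (7): f(sigma) = sum_j [1 - exp(-beta * sigma_j)]."""
    return float(np.sum(1.0 - np.exp(-beta * np.asarray(sigma))))
```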
Since the formulation (6) is non-convex, we de-
velop an iterative algorithm for its approximation.
A 2-step methodology is proposed: first we solve
the traditional dual formulation of SVM for a fixed
anisotropic kernel width σ:
$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,s=1}^{m} \alpha_i \alpha_s y_i y_s K(x_i, x_s, \sigma) \quad (8)$$

subject to

$$\sum_{i=1}^{m} \alpha_i y_i = 0,$$
$$0 \leq \alpha_i \leq C \quad \forall i \in \{1, \ldots, m\}.$$
In the second step the algorithm solves, for a given
solution α, the following non-linear formulation:
$$\min_{\sigma} \; F(\sigma) = -\sum_{i,s=1}^{m} \alpha_i \alpha_s y_i y_s K(x_i, x_s, \sigma) + C_2 f(\sigma) \quad (9)$$

subject to

$$\sigma_j \geq 0 \quad \forall j \in \{1, \ldots, n\}.$$
The goal of formulation (9) is to find a sparse so-
lution, making zero as many components of σ as pos-
sible. We propose an iterative algorithm that updates
the anisotropic kernel variable σ, using the gradient of
the objective function, and eliminates the features that
are close to zero (below a given threshold ε). The al-
gorithm solves successive gradient descent steps until
one particular scaling factor $\sigma_j$ drops below a threshold $\varepsilon$, starting from an initial solution $\sigma^0$. When this happens, attribute j is eliminated by setting $\sigma_j = 0$.
Algorithm 1: Kernel Width Updating and Feature Elimination.
1. Start with σ = σ^0;
2. flag = true; flag2 = true;
3. while (flag == true) do
4.     train SVM (formulation (8));
5.     t = 0; flag2 = true;
6.     while (flag2 == true) do
7.         σ^{t+1} = σ^t − γ ∇F(σ^t);
8.         if (‖σ^{t+1} − σ^t‖_1 < ε_0) then
9.             flag2 = false; flag = false;
10.        else
11.            if (∃ j such that 0 < σ^{t+1}_j < ε) then
12.                for all (σ^{t+1}_j < ε) do σ^{t+1}_j = 0;
13.                flag2 = false;
14.            end if
15.        end if
16.        t = t + 1;
17.    end while;
18. end while;
Then the algorithm returns to formulation (8) until convergence. It is also possible for several variables to become zero in a single iteration. The complete procedure, Kernel Width Updating and Feature Elimination, is presented in Algorithm 1.
In the seventh line the algorithm adjusts the Kernel
variables by using the gradient descent procedure, in-
corporating a gradient parameter γ. In this step the al-
gorithm computes the gradient of the objective func-
tion in formulation (9) for a given solution of SVMs
α, obtained by training an SVM classifier using for-
mulation (8). For a given feature j, the gradient of
formulation (9) is:

$$\nabla_j F(\sigma) = C_2 \beta \exp(-\beta\sigma_j) + \sum_{i,s=1}^{m} \sigma_j (x_{i,j} - x_{s,j})^2 \, \alpha_i \alpha_s y_i y_s K(x_i, x_s, \sigma) \quad (10)$$
Lines 8 and 9 of the algorithm represent the stopping criterion, which is reached when $\sigma^{t+1} \approx \sigma^t$. Lines 11 to 14 of the algorithm represent the feature elimination step. When a kernel variable $\sigma_j$ in iteration t+1 falls below a threshold $\varepsilon$, this feature is considered irrelevant and is eliminated by setting $\sigma_j = 0$. This variable will not be included in subsequent iterations of the algorithm.
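The following condensed sketch mirrors Algorithm 1 under stated assumptions: the gradient follows (10), step (8) is solved with scikit-learn's SVC on a precomputed kernel, and a projection onto σ ≥ 0 is added for numerical safety. It reuses the anisotropic_rbf helper sketched above; all function names are illustrative rather than the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def grad_F(X, alpha_y, sigma, C2, beta=5.0):
    """Gradient (10) of F(sigma) for fixed alpha, with alpha_y[i] = alpha_i * y_i."""
    K = anisotropic_rbf(X, X, sigma)
    A = np.outer(alpha_y, alpha_y)            # alpha_i alpha_s y_i y_s
    g = np.empty_like(sigma)
    for j in range(len(sigma)):
        D2 = (X[:, j][:, None] - X[:, j][None, :]) ** 2
        g[j] = C2 * beta * np.exp(-beta * sigma[j]) + sigma[j] * np.sum(A * D2 * K)
    return g

def kpsvm(X, y, C, C2, sigma0, eps, eps0, gamma_step, max_outer=50):
    """Sketch of Algorithm 1: alternate SVM training (8) and sigma updates/elimination."""
    sigma = sigma0.copy()
    for _ in range(max_outer):
        svm = SVC(C=C, kernel="precomputed").fit(anisotropic_rbf(X, X, sigma), y)
        alpha_y = np.zeros(len(y))
        alpha_y[svm.support_] = svm.dual_coef_[0]
        while True:
            new_sigma = sigma - gamma_step * grad_F(X, alpha_y, sigma, C2)
            new_sigma = np.maximum(new_sigma, 0.0)        # keep kernel widths non-negative
            if np.sum(np.abs(new_sigma - sigma)) < eps0:
                return new_sigma                           # sigma stabilized: convergence
            if np.any((new_sigma > 0) & (new_sigma < eps)):
                new_sigma[new_sigma < eps] = 0.0           # eliminate irrelevant features
                sigma = new_sigma
                break                                      # retrain SVM with reduced kernel
            sigma = new_sigma
    return sigma
```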
5 RESULTS FOR SPAM AND
PHISHING DATA SETS
We applied the proposed approach for feature selec-
tion to two data sets. We consider the following pro-
cedure for model comparison: First, model selec-
tion is performed before feature selection, obtaining
the kernel parameters d, ρ and penalty parameter C.
The best combination is selected via 10-fold cross-
validation. For the methods RFE-SVM, FSV-SVM
and Fisher Filtering a ranking is first obtained with the
training data, and model performance is then obtained
using 10-fold cross-validation for specific numbers
of attributes, depending on the size of the data set,
considering the hyper-parameters obtained during the
model selection procedure. For KP-SVM, instead, the algorithm runs using initial hyper-parameters and automatically obtains the desired number of features and the kernel shape when convergence is reached; we also compute the average cross-validation performance at intermediate steps for comparison purposes. The parameters for KP-SVM were selected beforehand according to the following values:
- Parameter $C_2$ represents the penalty for feature usage and is strongly related to C, the original regularization parameter. $C_2$ is considered the most important parameter for KP-SVM, since classification results change significantly when varying its value. We try the following values, monitoring both classification accuracy and feature usage: $C_2 = \{0, 0.5C, C, 2C\}$.
- The initial (isotropic) kernel width $\sigma^0$, the threshold $\varepsilon$ and the gradient parameter $\gamma$ are considered less influential in the final solution, according to our empirical results. We set $\sigma^0 = \frac{1}{\rho^2} \cdot e$, where $\rho$ is the isotropic kernel width obtained in a previous model selection step considering all features, and $e$ is a vector of ones whose size is the number of features currently in the solution; $\varepsilon = \frac{1}{100\rho^2}$ and $\gamma = 0.1\,\varepsilon\,\|\nabla F(\sigma^0)\|$, where $\|\nabla F(\sigma^0)\|$ represents the Euclidean norm of the first computed gradient vector. This combination of parameters guarantees both a sufficiently small $\varepsilon$, which avoids the removal of relevant features, and an adequate update of the kernel variables, controlled by the magnitude of the components of $\nabla F(\sigma)$. This choice avoids strong fluctuations of the kernel variables and negative widths, especially during the first iterations of the algorithm (a sketch of these settings is given below).
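A small sketch of these default settings, where grad_fn is assumed to be a gradient oracle such as the grad_F helper sketched in Section 4, evaluated with the alphas of an initial SVM fit; the function name is illustrative.

```python
import numpy as np

def kpsvm_defaults(rho, n_features, grad_fn):
    """Defaults described above: sigma^0 = (1/rho^2)*e, eps = 1/(100 rho^2),
    gamma = 0.1 * eps * ||grad F(sigma^0)||."""
    sigma0 = np.full(n_features, 1.0 / rho ** 2)   # isotropic start from model selection
    eps = 1.0 / (100.0 * rho ** 2)                 # feature elimination threshold
    gamma = 0.1 * eps * np.linalg.norm(grad_fn(sigma0))
    return sigma0, eps, gamma
```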
5.1 Description of Data Sets
In this subsection we briefly describe the different
data sets mentioned above.
Spambase Data Set (Spam). The Spambase Data
set from the UCI data repository (Asuncion and
Newman, 2007) presents 57 features and 4,601 instances (1,813 emails labeled as spam and 2,788 "ham" emails, where "ham" denotes regular messages that are neither spam nor phishing). The data set was created by Mark Hopkins, Erik Reeber, George Forman and Jaap Suermondt at Hewlett-Packard Labs.
Most of the features indicate whether a particular word or character occurs frequently in the email. The data set presents 48 continuous attributes representing the percentage of words in the email that match a particular word, 6 continuous attributes representing the percentage of characters in the email that match a particular character, the average length of uninterrupted sequences of capital letters, the length of the longest uninterrupted sequence of capital letters, and the total number of capital letters in the email.
The predictive variables were scaled between 0 and 1.
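As an illustrative way to reproduce this preprocessing, the sketch below fetches Spambase through scikit-learn's OpenML interface and scales the predictors to [0, 1]; the OpenML name "spambase" and the label encoding are assumptions, since the original work used the UCI distribution directly.

```python
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import MinMaxScaler

# Fetch Spambase (57 features, 4,601 instances); the OpenML dataset name is assumed.
X, y = fetch_openml("spambase", version=1, return_X_y=True, as_frame=False)
X = MinMaxScaler().fit_transform(X)   # scale predictive variables between 0 and 1
y = (y == "1").astype(int) * 2 - 1    # assumed encoding: spam -> +1, ham -> -1
```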
Phishing Data Set (Phishing). The phishing corpus used to test the proposed methodology was an English-language phishing email corpus built using Jose Nazario's phishing corpus (available at http://bit.ly/jnazariophishing, accessed November 02, 2011) and the SPAMASSASSIN ham collection. The phishing corpus consists of 4,450 emails manually retrieved from November 27, 2004 to August 7, 2007.
The ham corpus was built using the Spamassassin collection from the Apache SPAMASSASSIN Project (available at http://spamassassin.apache.org/publiccorpus/, accessed November 02, 2011), based on a collection of 6,951 ham email messages. Both phishing and ham messages are available in UNIX mbox format. All features were extracted according to (L'Huillier et al., 2010).
5.2 Results using Kernel-penalized
Feature Selection
First we compare the results of the best model found
using the described model selection procedure for the
three different kernel functions: linear, polynomial,
and Gaussian kernel. The following sets of values for the parameters (penalty parameter C, degree of the polynomial function d, and Gaussian kernel width ρ) were used (Maldonado and Weber, 2009); a model selection sketch over these grids is given after the list:
C = {0.1, 0.5, 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 1000}
d = {2, 3, 4, 5, 6, 7, 8, 9}
ρ = {0.1, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 100}.
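A hedged sketch of this model selection step with scikit-learn's GridSearchCV over the three kernels and the grids above; the mapping from the width ρ to scikit-learn's gamma parameter is an assumption about the kernel parameterization.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

C_grid = [0.1, 0.5, 1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 1000]
d_grid = [2, 3, 4, 5, 6, 7, 8, 9]
rho_grid = [0.1, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 100]

param_grid = [
    {"kernel": ["linear"], "C": C_grid},
    {"kernel": ["poly"], "C": C_grid, "degree": d_grid},
    # gamma = 1/(2*rho^2) is one common way to map a width rho to sklearn's RBF; assumed.
    {"kernel": ["rbf"], "C": C_grid, "gamma": [1.0 / (2 * r ** 2) for r in rho_grid]},
]

search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)    # X, y as prepared above
print(search.best_params_, search.best_score_)
```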
Best cross-validation results were achieved for
both data sets using the Gaussian Kernel. Then we
compared the classification performance of the differ-
ent ranking criteria for feature selection by plotting
the mean test accuracy for an increasing number of
ranked features used for learning. Figures 1 and 2
show the results for each data set, respectively. The proposed KP-SVM approach provides information only up to the point where its stopping criterion is reached.
Figure 1: Mean of test accuracy for Spam vs. the number
of ranked variables used for training.
Figure 2: Mean of test accuracy for Phishing vs. the number
of ranked variables used for training.
These experiments underline that the proposed ap-
proach, KP-SVM, outperforms other feature selec-
tion methods in terms of classification performance
for a small number of features in both data sets used.
Another important remark is that the best classification performance for KP-SVM is achieved with $C_2 = C$ for the Spam data set and $C_2 = 0.5C$ for the
Phishing data set. For both data sets, the use of feature penalization outperforms the model obtained using $C_2 = 0$, which can be considered a variant of the ARD model presented in (Chapelle et al., 2002). This fact demonstrates the importance of feature selection in relatively high-dimensional data sets, such as the ones presented in this work.
6 CONCLUSIONS
In this work we present an embedded approach for
feature selection using SVM applied to phishing and
spam classification. A comparison with other feature
selection methods shows its advantages:
- It outperforms other techniques in terms of classification accuracy.
- It is not necessary to set the number of features to be selected a priori, unlike other feature selection approaches. The algorithm determines the optimal number of features according to the regularization parameter $C_2$.
- It can be used with other kernel functions, such as linear and polynomial kernels.
Even if several parameters have to be tuned, the computational effort can be reduced since the feature subset is obtained automatically, reducing computational time by avoiding a validation step for finding an adequate number of features.
Future work has to be done in various directions.
First, we consider the extension to highly imbalanced
data sets, a very relevant topic in phishing and spam
classification, and in pattern recognition in general.
Furthermore, the current scenario for spam and phish-
ing classification suggests the extension of the pro-
posed embedded feature selection technique to very
large databases as an important research opportunity.
ACKNOWLEDGEMENTS
Support from the Chilean "Instituto Sistemas Complejos de Ingeniería" (ICM: P-05-004-F) is gratefully acknowledged.
REFERENCES
Asuncion, A. and Newman, D. (2007). UCI machine learn-
ing repository.
Bergholz, A., Beer, J. D., Glahn, S., Moens, M.-F., Paass,
G., and Strobel, S. (2010). New filtering approaches
for phishing email. Journal of Computer Security,
18(1):7–35.
Bradley, P. and Mangasarian, O. (1998). Feature selection via concave minimization and support vector machines. In Int. Conference on Machine Learning, pages 82–90.
Canu, S. and Grandvalet, Y. (2002). Adaptive scaling for
feature selection in SVMs. Advances in Neural Infor-
mation Processing Systems, 15:553–560.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee,
S. (2002). Choosing multiple parameters for support
vector machines. Machine Learning, 46:131–159.
Goodman, J., Cormack, G. V., and Heckerman, D. (2007).
Spam and the ongoing battle for the inbox. Commun.
ACM, 50(2):24–33.
Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L. A.
(2006). Feature extraction, foundations and applica-
tions. Springer, Berlin.
Guyon, I., Saffari, A., Dror, G., and Cawley, G. (2009). Model selection: Beyond the Bayesian/frequentist divide. Journal of Machine Learning Research, 11:61–87.
L’Huillier, G., Hevia, A., Weber, R., and Rios, S. (2010).
Latent semantic analysis and keyword extraction for
phishing classification. In ISI’10: Proceedings of the
IEEE International Conference on Intelligence and
Security Informatics, pages 129–131, Vancouver, BC,
Canada. IEEE.
Maldonado, S. and Weber, R. (2009). A wrapper method
for feature selection using support vector machines.
Information Sciences, 179:2208–2217.
Maldonado, S., Weber, R., and Basak, J. (2011). Kernel-
penalized SVM for feature selection. Information Sci-
ences, 181(1):115–128.
Neumann, J., Schnörr, C., and Steidl, G. (2005). Combined SVM-based feature selection and classification. Machine Learning, 61:129–150.
Rakotomamonjy, A. (2003). Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3:1357–1370.
Tang, Y., Krasser, S., Alperovitch, D., and Judge, P. (2008).
Spam sender detection with classification modeling on
highly imbalanced mail server behavior data. In Pro-
ceedings of the International Conference on Artificial
Intelligence and Pattern Recognition, AIPR’08, pages
174–180. ISRST.
Taylor, B., Fingal, D., and Aberdeen, D. (2007). The war against spam: A report from the front line. In NIPS 2007 Workshop on Machine Learning in Adversarial Environments for Computer Security.
Vapnik, V. (1998). Statistical Learning Theory. John Wiley
and Sons.
Weston, J., Elisseeff, A., Schölkopf, B., and Tipping, M. (2003). The use of zero-norm with linear models and kernel methods. Journal of Machine Learning Research, 3:1439–1461.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., and Vapnik, V. (2001). Feature selection for SVMs. In Advances in Neural Information Processing Systems 13.