
Intelligent methods are a relatively new trend in 
spam detection. They may eliminate the disadvantages 
of traditional methods. Intelligent methods use 
statistical and machine learning algorithms. These 
algorithms can classify mail into several 
categories using statistical or machine learning 
models built beforehand from previously labeled 
messages (Yang, 1999).  
For such a system to work properly, it must be 
trained on a set of e-mails that have already been 
classified as spam or legitimate messages. 
The result of this training is a model that is then 
used to classify new mail. At present the most 
popular intelligent method for spam detection is 
the Naïve Bayes method (Sahami et al., 1998). It has 
been implemented and is used successfully in 
several spam-detection systems 
(Apache, 2004a; Farmer, 2004). 
Intelligent methods have several advantages over 
the traditional ones. They do not 
depend on external knowledge databases and do not 
need regular updates. They do not rely on features 
specific to any particular language, so they are 
multilingual. They can adjust their models 
using new samples of spam without the 
administrator’s assistance, and they can build 
personal filtering models. 
Nevertheless, despite their effectiveness, 
these methods are not widely used in 
spam-detection systems at the enterprise level for 
several reasons. First, most intelligent methods 
are not reliable enough at recognizing legitimate 
mail and have a rather high rate of false-positive 
errors. In addition, they have higher hardware 
requirements because they are based on 
computationally expensive algorithms.  
The aim of our research is to offer a 
comprehensive e-mail classification solution for 
enterprise-level systems, based on the 
intelligent analysis of messages. The solution should 
retain the advantages of intelligent methods, such as 
personalization and a high spam detection rate with 
few false-positive errors. At the same time 
the system should be efficient enough to be 
used on enterprise-level mail servers. 
2 OUR SOLUTION 
Our solution is based on an intelligent classification 
algorithm that achieves the necessary quality, on 
the one hand, and on a multi-agent architecture that 
provides the necessary efficiency, on the other.  
To solve the classification problem we 
use a statistical method based on support vector 
machines (SVM) (Scholkopf & Smola, 2000; 
Vapnik, 1998). This method has previously been 
applied to text categorization (Joachims, 1998). 
Applying SVM to spam detection requires solving 
two problems: selecting a proper kernel function 
and finding an appropriate representation of 
e-mails as feature vectors. 
We have selected the following representation 
for electronic messages. The feature set is defined 
as the set of all words that appear across the 
analyzed messages more than a predetermined number 
of times. The feature set is then reduced by 
eliminating a set of predefined stop-words. 
In addition, the feature set is expanded with 
features defined for the file extensions of all files 
attached to the analyzed messages (Yang & 
Pedersen, 1997). 
 Each message is thus represented as a vector over 
the feature set. Each element of the vector is the 
number of occurrences of a particular feature in the 
message, normalized by the number of features in 
the message. 
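The representation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the threshold value MIN_COUNT, the "ext:" prefix for attachment features and all function names are hypothetical, and the sketch assumes that counts are normalized by the total number of feature occurrences in the message.

```python
import re
from collections import Counter

# Hypothetical values: the paper does not list its stop-words or threshold.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}
MIN_COUNT = 1  # keep words seen more than this many times over all messages


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def attachment_features(filenames):
    """Turn attachment names into features such as 'ext:exe'."""
    return ["ext:" + name.rsplit(".", 1)[1].lower()
            for name in filenames if "." in name]


def build_feature_set(messages):
    """messages is a list of (body_text, attachment_filenames) pairs."""
    word_counts = Counter()
    extensions = set()
    for body, attachments in messages:
        word_counts.update(tokenize(body))
        extensions.update(attachment_features(attachments))
    # Frequent words minus stop-words, plus one feature per file extension.
    words = {w for w, c in word_counts.items()
             if c > MIN_COUNT and w not in STOP_WORDS}
    return sorted(words | extensions)


def to_vector(message, feature_set):
    """Count each feature in the message and normalize the counts."""
    body, attachments = message
    index = {f: i for i, f in enumerate(feature_set)}
    tokens = tokenize(body) + attachment_features(attachments)
    counts = Counter(t for t in tokens if t in index)
    total = sum(counts.values())  # assumed normalization constant
    vector = [0.0] * len(feature_set)
    for feature, count in counts.items():
        vector[index[feature]] = count / total
    return vector


messages = [
    ("Buy cheap pills now, buy now", ["offer.EXE"]),
    ("Meeting notes: buy lunch now", []),
]
feature_set = build_feature_set(messages)
vectors = [to_vector(m, feature_set) for m in messages]
```

A message that contains none of the selected features maps to the all-zero vector, so no division by zero occurs.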
We carried out several experiments with 
various standard kernel functions and 
found that the RBF kernel gives 
good results: it provides high accuracy and 
acceptable efficiency of the algorithm. 
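For reference, the standard RBF (Gaussian) kernel of two feature vectors x and y is K(x, y) = exp(-gamma * ||x - y||^2). The sketch below computes it in plain Python. The function name and the default value of gamma are assumptions made for illustration; the text does not state which parameter values were used in the experiments.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum similarity of 1.0;
# the value decays towards 0 as the vectors move apart.
same = rbf_kernel([0.4, 0.2, 0.4], [0.4, 0.2, 0.4])
apart = rbf_kernel([1.0, 0.0], [0.0, 0.0], gamma=1.0)
```

The kernel depends only on the distance between the two vectors, so larger values of gamma make the similarity fall off more quickly.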
In addition, the solution should meet the following 
basic requirements: high efficiency; enterprise-level 
operation; the ability to take into account the 
personal features of each user’s correspondence; 
platform independence; scalability; safety and 
privacy. These requirements 
led us to a multi-agent architecture for the system. 
The general architecture of the system is shown in 
Figure 1.  
The central communication node of the system 
consists of one or several web servers. It provides 
the communication environment for the training and 
classifying agents, maintains the shared vocabulary, 
converts messages to feature sets and provides a GUI 
for users. The communication node stores the shared 
vocabulary, temporary feature vectors and some 
additional user information in a database. All 
time-consuming operations, such as preprocessing and 
downloading messages, training user models and 
classification, are delegated to the corresponding 
agents. 
The training agent is a process that analyses a 
user’s messages and builds the user’s personal model 
on the basis of this analysis. The training agent can 
be customized for different message stores. 
In the current version it is located at the centralized 
mail server and accesses personal data using the IMAP 
protocol. An alternative would be a personal 
agent on the user’s workstation that uses the local 
mail storage in the user’s personal folders. The common 
training workflow is as follows. 
A user initiates the training procedure through the 
web-based interface. 
 
ENTERPRISE ANTI-SPAM SOLUTION BASED ON MACHINE LEARNING APPROACH