Filtering spam at e-mail server level with improved CRM114

Víctor Méndez, Julio Cesar Hernandez, Jesus Carretero, Felix García

2004

Abstract

Security managers and network engineers are increasingly required to deploy corporate spam-filtering services. End users do not want to interact with spam-classification applications, so network engineers usually have to implement and manage the spam-filtering system at the e-mail server. Because of the processing speed required at the server level, the available options are largely limited to black-list/white-list applications. This is why most applications based on AI techniques run only on the client side, particularly those based on the Naïve Bayes scheme, which has proved to be one of the most successful approaches to fighting spam but is currently slower than other techniques and still unable to process the high volume of e-mail traffic expected at a mail server. However, spam mutates, and spammers' techniques have quickly evolved to pass traditional black/white-list applications easily, so there is a compelling need for more advanced techniques at the server level, notably those based on the Naïve Bayes algorithm. This article explores this possibility and concludes that simple improvements to a well-known Naïve Bayes technique (CRM114 [2]), following some ideas suggested in [8], could make this algorithm much faster and significantly better, so that, thanks to the gain in speed, it could be used at the server level.

References

  1. Tom M. Mitchell. Machine Learning. McGraw-Hill. ISBN 0-07-042807-7.
  2. William S. Yerazunis. Sparse Binary Polynomial Hashing and the CRM114 Discriminator. MER Labs, Cambridge, MA, 2003, and 2003 Cambridge Spam Conference Proceedings. http://crm114.sourceforge.net/
  3. Paul Graham. A Plan for Spam. 2003 Cambridge Spam Conference Proceedings. http://paulgraham.com/spam.html
  4. Paul Graham. Better Bayesian Filtering. 2003 Cambridge Spam Conference Proceedings. http://paulgraham.com/better.html
  5. Jason D. M. Rennie and Tommi Jaakkola. Automatic Feature Induction for Text Classification. MIT, AI Labs. Abstract Book, 2002, and 2003 Spam Conference. http://www.ai.mit.edu/jrennie/spamconference/
  6. Matt Sergeant. Internet Level Spam Detection and SpamAssassin 2.50. 2003 Cambridge Spam Conference Proceedings. http://axkit.org/docs/presentations/spam/
  7. Teodor Zlatanov. Spam Analysis in Gnus with spam. 2003 Cambridge Spam Conference Proceedings. http://lifelogs.com/spam/spam.html
  8. Brian Burton. SpamProbe: Bayesian Spam Filtering Tweaks. 2003 Cambridge Spam Conference Proceedings. http://spamprobe.sourceforge.net/index.html
  9. John Graham-Cumming. The Spammers' Compendium. 2003 Cambridge Spam Conference Proceedings. http://popfile.sourceforge.net
  10. Kristian Eide. Winning the War on Spam: Comparison of Bayesian Spam Filters. 2003.
  11. UNAM public spam set, 2002-2003. http://www.seguridad.unam.mx/Servicios/spam/spam/
  12. From a call for donations for this specific use at the Universidad Carlos III de Madrid, 2003.
  13. Personal communication with Juan Carlos Martin, Security and Network Manager of EspacioIT, which has over 3,000 mail users across different domains and mail servers. October 2003.
  14. Carreras and Marquez. Boosting Trees for Antispam Email Filtering. TALP Research Center, LSI Department, Universitat Politecnica de Catalunya, 2001.
  15. L. F. Cranor and B. A. LaMacchia. Spam! Communications of the ACM, 1998.
  16. Sholom M. Weiss and others. Maximizing text-mining performance. IEEE Intelligent Systems, 1999.
  17. Joachims. Text categorization with support vector machines. Proc. 10th Eur. Conf. Machine Learning, 1998.
  18. R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 2000.
  19. Yang and Liu. A re-examination of text categorization methods. Proc. 22nd ACM SIGIR Conference, 1999.
  20. W. Cohen. Learning Rules for Classifying Mail. AAAI Spring Symposium on Machine Learning in Information Access, 1996.
  21. SpamAssassin public corpus. http://spamassassin.org/publiccorpus/ (questions: public-corpus AT jmason dot org)

Paper Citation


in Harvard Style

Méndez V., Cesar Hernandez J., Carretero J. and García F. (2004). Filtering spam at e-mail server level with improved CRM114. In Proceedings of the 2nd International Workshop on Security in Information Systems - Volume 1: WOSIS, (ICEIS 2004) ISBN 972-8865-07-4, pages 207-216. DOI: 10.5220/0002674702070216


in Bibtex Style

@conference{wosis04,
author={Víctor Méndez and Julio Cesar Hernandez and Jesus Carretero and Felix García},
title={Filtering spam at e-mail server level with improved CRM114},
booktitle={Proceedings of the 2nd International Workshop on Security in Information Systems - Volume 1: WOSIS, (ICEIS 2004)},
year={2004},
pages={207-216},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002674702070216},
isbn={972-8865-07-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Workshop on Security in Information Systems - Volume 1: WOSIS, (ICEIS 2004)
TI - Filtering spam at e-mail server level with improved CRM114
SN - 972-8865-07-4
AU - Méndez V.
AU - Cesar Hernandez J.
AU - Carretero J.
AU - García F.
PY - 2004
SP - 207
EP - 216
DO - 10.5220/0002674702070216