Self-Organizing Maps in the Design of Anti-spam Filters - A Proposal based on Thematic Categories

Ylermi Cabrera-León, Patricio García Báez, Carmen Paz Suárez-Araujo

Abstract

Spam, or unsolicited messages sent in bulk, is one of the threats that affect email and other media. Its high volume causes substantial losses of time and money. This paper presents a solution to the problem: a hybrid anti-spam filter based on unsupervised Artificial Neural Networks (ANNs). It consists of two steps, preprocessing and processing, each based on a different computation model: programmed and neural (using a Kohonen SOM), respectively. The system was optimized using a data corpus composed of ham from “Enron Email” and spam from two different sources: traditional (user’s inbox) and spamtrap-honeypot. The results show that thematic categories can be found in both spam and ham words. A total of 1260 system configurations were analyzed, comparing their quality and performance using the most common metrics. All of them achieved AUC > 0.90, and the best 204 achieved AUC > 0.95, despite using just 13 attributes for the input vectors of the SOM, one for each thematic category. Results were similar to those obtained by other researchers on the same corpus, even though they used different Machine Learning (ML) methods and a number of attributes several orders of magnitude greater. The system was further tested with datasets not used during design, obtaining 0.77 < AUC < 0.96 with normalized data.
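The two-step idea described above (a programmed preprocessing step that yields one attribute per thematic category, followed by a Kohonen SOM) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the word-to-category lexicon, the map size, the learning-rate and neighborhood schedules, and all function names are assumptions made only for illustration.

```python
import math
import random

N_CATEGORIES = 13  # one attribute per thematic category, as stated in the abstract


def category_vector(tokens, lexicon):
    """Preprocessing sketch: count tokens per category, then normalize to unit sum.

    `lexicon` is a hypothetical word -> category-index mapping.
    """
    counts = [0.0] * N_CATEGORIES
    for tok in tokens:
        cat = lexicon.get(tok)
        if cat is not None:
            counts[cat] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def bmu(weights, x):
    """Index of the best-matching unit (smallest squared Euclidean distance)."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)),
    )


def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighborhood function."""
    rng = random.Random(seed)
    weights = [[rng.random() for _ in range(N_CATEGORIES)]
               for _ in range(rows * cols)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.3)  # shrinking neighborhood radius
        for x in data:
            br, bc = divmod(bmu(weights, x), cols)
            for i, w in enumerate(weights):
                r, c = divmod(i, cols)
                d2 = (r - br) ** 2 + (c - bc) ** 2  # grid distance to the BMU
                h = math.exp(-d2 / (2.0 * sigma ** 2))
                for k in range(N_CATEGORIES):
                    w[k] += lr * h * (x[k] - w[k])
    return weights


# Toy usage with a made-up lexicon (category indices are arbitrary).
toy_lexicon = {"pills": 0, "loan": 1, "meeting": 2, "report": 2}
spam_vec = category_vector(["cheap", "pills", "loan", "pills"], toy_lexicon)
ham_vec = category_vector(["meeting", "report", "attached"], toy_lexicon)
som = train_som([spam_vec, ham_vec])
print(bmu(som, spam_vec), bmu(som, ham_vec))
```

In a filter along these lines, each map unit would afterwards be labeled as spam or ham from the training messages it attracts, and a new message would take the label of its best-matching unit; that labeling step is likewise an assumption here and is not detailed in the abstract.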



Paper Citation


in Harvard Style

Cabrera-León Y., García Báez P. and Suárez-Araujo C. (2016). Self-Organizing Maps in the Design of Anti-spam Filters - A Proposal based on Thematic Categories. In Proceedings of the 8th International Joint Conference on Computational Intelligence - Volume 3: NCTA, (IJCCI 2016) ISBN 978-989-758-201-1, pages 21-32. DOI: 10.5220/0006041400210032


in Bibtex Style

@conference{ncta16,
author={Ylermi Cabrera-León and Patricio García Báez and Carmen Paz Suárez-Araujo},
title={Self-Organizing Maps in the Design of Anti-spam Filters - A Proposal based on Thematic Categories},
booktitle={Proceedings of the 8th International Joint Conference on Computational Intelligence - Volume 3: NCTA, (IJCCI 2016)},
year={2016},
pages={21-32},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006041400210032},
isbn={978-989-758-201-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Computational Intelligence - Volume 3: NCTA, (IJCCI 2016)
TI - Self-Organizing Maps in the Design of Anti-spam Filters - A Proposal based on Thematic Categories
SN - 978-989-758-201-1
AU - Cabrera-León Y.
AU - García Báez P.
AU - Suárez-Araujo C.
PY - 2016
SP - 21
EP - 32
DO - 10.5220/0006041400210032