ATTRIBUTE CONSTRUCTION FOR E-MAIL FOLDERING BY USING WRAPPERED FORWARD GREEDY SEARCH

Pablo Bermejo, José A. Gámez, José M. Puerta

2007

Abstract

E-mail classification is one of the outstanding tasks in text mining; however, most efforts on this topic have been devoted to detecting spam or junk e-mail, that is, a classification problem with only two possible classes: spam and not-spam. In this paper we deal with a different e-mail classification problem known as e-mail foldering, which consists of classifying incoming mail into the folders previously created by the user. This task has received less attention and is quite complex due to the (usually large) cardinality of the class variable (the number of folders). We try to improve classification accuracy by looking for new attributes derived from the existing ones using a data-driven approach. The attribute is constructed by taking into account the type of classifier to be used later, following a wrapper approach guided by a forward greedy search. The experiments carried out show that in all cases the accuracy of the classifier improves when the new attribute is added to the original ones.
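
The wrapper idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the original attributes are combined or which classifier is wrapped, so everything here is an assumption made for illustration: the new attribute is taken to be the sum of the selected original attributes (e.g. term counts), the wrapped classifier is a toy nearest-centroid model scored by leave-one-out accuracy, and the names `centroid_accuracy` and `construct_attribute` are hypothetical.

```python
from collections import defaultdict


def centroid_accuracy(rows, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumption: this toy model only stands in for whatever classifier
    the wrapper is built around; it is not the one used in the paper.
    """
    correct = 0
    for i in range(len(rows)):
        sums, counts = {}, defaultdict(int)
        # Build per-class attribute sums from every row except row i.
        for j, (row, label) in enumerate(zip(rows, labels)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(row))
            for k, value in enumerate(row):
                acc[k] += value
            counts[label] += 1

        def dist(label):
            # Squared Euclidean distance from row i to the class centroid.
            return sum((s / counts[label] - v) ** 2
                       for s, v in zip(sums[label], rows[i]))

        predicted = min(sorted(sums), key=dist)
        correct += predicted == labels[i]
    return correct / len(rows)


def construct_attribute(rows, labels, evaluate=centroid_accuracy):
    """Forward greedy search for the attributes that form one new attribute.

    Assumption: the constructed attribute is the sum of the chosen
    original attributes. Each step adds the attribute whose inclusion
    most improves the wrapped score; the search stops when nothing helps.
    """
    n_attrs = len(rows[0])
    chosen = []
    best = evaluate(rows, labels)  # baseline: original attributes only
    while True:
        step_best, step_attr = best, None
        for a in range(n_attrs):
            if a in chosen:
                continue
            candidate = chosen + [a]
            # Original attributes plus the candidate constructed attribute.
            extended = [row + [sum(row[c] for c in candidate)] for row in rows]
            score = evaluate(extended, labels)
            if score > step_best:
                step_best, step_attr = score, a
        if step_attr is None:
            return chosen, best
        chosen.append(step_attr)
        best = step_best
```

Calling `construct_attribute(rows, labels)` with `rows` as a list of equal-length numeric lists and `labels` as the folder names returns the indices of the combined attributes together with the wrapped score reached. Because the score is measured with the classifier itself, the search is a wrapper rather than a filter, at the cost of one classifier evaluation per candidate at every step.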

Paper Citation


in Harvard Style

Bermejo, P., Gámez, J. A. and Puerta, J. M. (2007). ATTRIBUTE CONSTRUCTION FOR E-MAIL FOLDERING BY USING WRAPPERED FORWARD GREEDY SEARCH. In Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 978-972-8865-89-4, pages 247-252. DOI: 10.5220/0002376902470252


in Bibtex Style

@conference{iceis07,
author={Pablo Bermejo and José A. Gámez and José M. Puerta},
title={ATTRIBUTE CONSTRUCTION FOR E-MAIL FOLDERING BY USING WRAPPERED FORWARD GREEDY SEARCH},
booktitle={Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2007},
pages={247-252},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002376902470252},
isbn={978-972-8865-89-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Ninth International Conference on Enterprise Information Systems - Volume 2: ICEIS
TI - ATTRIBUTE CONSTRUCTION FOR E-MAIL FOLDERING BY USING WRAPPERED FORWARD GREEDY SEARCH
SN - 978-972-8865-89-4
AU - Bermejo, P.
AU - Gámez, J. A.
AU - Puerta, J. M.
PY - 2007
SP - 247
EP - 252
DO - 10.5220/0002376902470252