Combining Clustering and Classification Approaches for Reducing the Effort of Automatic Tweets Classification

Elias de Oliveira, Henrique Gomes Basoni, Marcos Rodrigues Saúde, Patrick Marques Ciarelli

2014

Abstract

The classification problem has gained new importance with the growing value attributed to social media such as Twitter. The huge number of short documents to be organized into subjects challenges the resources and techniques that have been used so far. Furthermore, today more than ever, personalization is the most important feature that a system needs to exhibit. The goal of many online systems, which are available in many areas, is to address the needs or desires of each individual user. To achieve this goal, these systems need to be more flexible and faster in order to adapt to the user’s needs. In this work, we explore a variety of techniques with the aim of better classifying a large Twitter data set according to a user’s goal. We propose a methodology in which we cascade an unsupervised technique followed by a supervised one. For the unsupervised technique we use standard clustering algorithms, and for the supervised technique we propose the use of a kNN algorithm and a Centroid-Based Classifier to perform the experiments. The results are promising because we reduced the amount of work to be done by the specialists and, in addition, we were able to mimic the human assessment decisions with an F1-measure of 0.7907.
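The cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy leader clustering stands in for the standard clustering algorithms the abstract mentions, and the sample tweets, the similarity threshold, the labels, and every function name are assumptions made up for illustration. The idea shown is that a specialist labels only one representative per cluster, the label is propagated to the rest of the cluster, and a centroid-based classifier built from those labels handles new tweets.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words term-frequency vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def leader_cluster(vectors, threshold=0.3):
    """Unsupervised step (assumed stand-in): each vector joins the first
    cluster whose leader is similar enough, otherwise it starts a new one."""
    leaders, assignment = [], []
    for i, v in enumerate(vectors):
        for cid, lead in enumerate(leaders):
            if cosine(v, vectors[lead]) >= threshold:
                assignment.append(cid)
                break
        else:
            leaders.append(i)
            assignment.append(len(leaders) - 1)
    return leaders, assignment

def centroids_by_label(vectors, labels):
    """Supervised step: one centroid per label. Summed vectors are used
    because cosine similarity ignores scale, so sum and mean rank alike."""
    sums = defaultdict(Counter)
    for v, y in zip(vectors, labels):
        sums[y].update(v)
    return dict(sums)

def classify(text, centroids):
    """Assign the label whose centroid is most similar to the new text."""
    v = vectorize(text)
    return max(centroids, key=lambda y: cosine(v, centroids[y]))

# Hypothetical sample data, not taken from the paper.
tweets = [
    "goal scored in the football match",
    "football match ends with late goal",
    "election debate on tax policy",
    "new tax policy dominates election",
]
vectors = [vectorize(t) for t in tweets]
leaders, assignment = leader_cluster(vectors)

# The specialist labels one tweet per cluster instead of all four.
leader_labels = {0: "sports", 1: "politics"}
labels = [leader_labels[c] for c in assignment]

centroids = centroids_by_label(vectors, labels)
print(classify("late goal wins the match", centroids))
print(classify("tax debate tonight", centroids))
```

In this toy run the specialist labels two tweets rather than four, which is the kind of effort reduction the abstract refers to; a kNN classifier could replace `classify` without changing the rest of the pipeline.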



Paper Citation


in Harvard Style

de Oliveira E., Gomes Basoni H., Rodrigues Saúde M. and Marques Ciarelli P. (2014). Combining Clustering and Classification Approaches for Reducing the Effort of Automatic Tweets Classification. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 465-472. DOI: 10.5220/0005159304650472


in Bibtex Style

@conference{kdir14,
author={Elias de Oliveira and Henrique Gomes Basoni and Marcos Rodrigues Saúde and Patrick Marques Ciarelli},
title={Combining Clustering and Classification Approaches for Reducing the Effort of Automatic Tweets Classification},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={465-472},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005159304650472},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - Combining Clustering and Classification Approaches for Reducing the Effort of Automatic Tweets Classification
SN - 978-989-758-048-2
AU - de Oliveira E.
AU - Gomes Basoni H.
AU - Rodrigues Saúde M.
AU - Marques Ciarelli P.
PY - 2014
SP - 465
EP - 472
DO - 10.5220/0005159304650472