Clustering and Classifying Text Documents - A Revisit to Tagging Integration Methods

Elisabete Cunha, Álvaro Figueira, Óscar Mealha

2013

Abstract

In this paper we analyze and discuss two methods that are based on the traditional k-means for document clustering and that integrate social tags into the process. The first one integrates tags directly into a Vector Space Model, and the second one uses tags to select the initial seeds. We created a predictive model for the impact of the tags’ integration in both models, and compared the two methods using the traditional k-means++ and the novel k-C algorithm. To compare the results, we propose a new internal measure that allows the computation of cluster compactness. The experimental results indicate that the careful selection of seeds in the k-C algorithm yields better results than those obtained with k-means++, both with and without integration of tags.
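
As a rough illustration of the first approach (tags integrated into the Vector Space Model) combined with k-means++ seeding, the sketch below appends tag weights to each document's term vector and then clusters the result. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `tag_weight` parameter, and the toy data are hypothetical, the k-C algorithm is not reproduced because the abstract does not specify it, and `compactness` is a generic mean squared distance to the centroid rather than the internal measure proposed in the paper.

```python
import bisect
import itertools
import random


def integrate_tags(term_vec, tag_vec, tag_weight=0.5):
    """Append tag weights to a term vector (tag_weight is a hypothetical knob)."""
    return list(term_vec) + [tag_weight * t for t in tag_vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding: draw each new seed with probability proportional
    to its squared distance from the nearest seed chosen so far."""
    seeds = [list(rng.choice(points))]
    while len(seeds) < k:
        d2 = [min(sq_dist(p, s) for s in seeds) for p in points]
        cum = list(itertools.accumulate(d2))
        if cum[-1] == 0:  # every point coincides with a seed
            seeds.append(list(rng.choice(points)))
            continue
        r = rng.random() * cum[-1]
        idx = min(bisect.bisect_right(cum, r), len(points) - 1)
        seeds.append(list(points[idx]))
    return seeds


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, seed=0, iters=20):
    """Plain Lloyd iterations started from k-means++ seeds."""
    rng = random.Random(seed)
    centroids = kmeanspp_seeds(points, k, rng)
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return assign(points, centroids), centroids


def compactness(points, labels, centroids):
    """Generic compactness: mean squared distance to the assigned centroid."""
    total = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return total / len(points)


# Toy data (hypothetical): two term dimensions and two tag dimensions.
terms = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tags = [[1, 0], [1, 0], [0, 1], [0, 1]]
docs = [integrate_tags(t, g) for t, g in zip(terms, tags)]

labels, centroids = kmeans(docs, k=2)
score = compactness(docs, labels, centroids)
```

Setting `tag_weight` to 0 reduces the representation to the terms alone, which gives a simple way to compare clusterings with and without tag integration on the same data.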



Paper Citation


in Harvard Style

Cunha E., Figueira Á. and Mealha Ó. (2013). Clustering and Classifying Text Documents - A Revisit to Tagging Integration Methods. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013) ISBN 978-989-8565-75-4, pages 160-168. DOI: 10.5220/0004545201600168


in Bibtex Style

@conference{kdir13,
author={Elisabete Cunha and Álvaro Figueira and Óscar Mealha},
title={Clustering and Classifying Text Documents - A Revisit to Tagging Integration Methods},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013)},
year={2013},
pages={160-168},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004545201600168},
isbn={978-989-8565-75-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013)
TI - Clustering and Classifying Text Documents - A Revisit to Tagging Integration Methods
SN - 978-989-8565-75-4
AU - Cunha E.
AU - Figueira Á.
AU - Mealha Ó.
PY - 2013
SP - 160
EP - 168
DO - 10.5220/0004545201600168