Contextual Latent Semantic Networks used for Document Classification

Ondrej Hava, Miroslav Skrbek, Pavel Kordik

2012

Abstract

Widely used document classifiers are built on a bag-of-words representation of documents. Latent semantic analysis based on singular value decomposition is often employed to reduce the dimensionality of this representation. This approach ignores word order, which can improve the quality of a classifier when it is taken into account. We propose a language-independent method that records the context of a particular word in a context network, using the outputs of latent semantic analysis. The context networks of individual words are combined into a single network that represents a document. A new document is classified according to the similarity between its network and the networks of the training documents. Experiments show that the proposed classifier performs better than common classifiers, especially when the preceding dimensionality reduction is substantial.
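The baseline that the abstract starts from can be illustrated with a small sketch. It is not the authors' context-network method; it only shows, under simple assumptions, a bag-of-words matrix reduced by a truncated singular value decomposition, and why such a representation cannot distinguish documents that differ only in word order. The helper names `bag_of_words` and `lsa_project`, the whitespace tokenization, and the toy documents are all hypothetical choices made for illustration.

```python
import numpy as np

def bag_of_words(docs):
    """Return (vocabulary, term-by-document count matrix) for whitespace-tokenized docs."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            counts[index[w], j] += 1
    return vocab, counts

def lsa_project(counts, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    u, s, vt = np.linalg.svd(counts, full_matrices=False)
    # Scale the first k right singular vectors by their singular values;
    # after transposing, each row is one document in latent coordinates.
    return (s[:k, None] * vt[:k, :]).T

# Toy corpus (an assumption for illustration, not data from the paper).
docs = [
    "dog bites man",
    "man bites dog",        # same words as the first document, different order
    "stocks fall sharply",
]

vocab, counts = bag_of_words(docs)
doc_vectors = lsa_project(counts, k=2)

# The first two documents have identical bag-of-words columns,
# so their latent vectors coincide as well.
same_bow = np.array_equal(counts[:, 0], counts[:, 1])
same_latent = np.allclose(doc_vectors[0], doc_vectors[1])
print(same_bow, same_latent)
```

In this sketch the two reordered sentences collapse onto the same point both before and after the reduction, which is the loss of word-order information that the proposed context networks are meant to address.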



Paper Citation


in Harvard Style

Hava O., Skrbek M. and Kordik P. (2012). Contextual Latent Semantic Networks used for Document Classification. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012) ISBN 978-989-8565-29-7, pages 425-430. DOI: 10.5220/0004109304250430


in Bibtex Style

@conference{sstm12,
author={Ondrej Hava and Miroslav Skrbek and Pavel Kordik},
title={Contextual Latent Semantic Networks used for Document Classification},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012)},
year={2012},
pages={425-430},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004109304250430},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012)
TI - Contextual Latent Semantic Networks used for Document Classification
SN - 978-989-8565-29-7
AU - Hava O.
AU - Skrbek M.
AU - Kordik P.
PY - 2012
SP - 425
EP - 430
DO - 10.5220/0004109304250430