Using Word Sense as a Latent Variable in LDA Can Improve Topic Modeling

Yunqing Xia, Guoyu Tang, Erik Cambria, Huan Zhao, Thomas Fang Zheng

2014

Abstract

Since it was first proposed, LDA has been successfully used to model text documents. To date, words have been the standard features for inducing latent topics, which are then used in document representation. Observation of documents indicates that polysemous words can make the latent topics less discriminative, resulting in less accurate document representation. We therefore argue that semantically deterministic word senses can improve the quality of the latent topics. In this work, we propose a series of word-sense-aware LDA models that use word sense as an extra latent variable in topic induction. Preliminary experiments on document clustering on benchmark datasets show that word sense can indeed improve topic modeling.
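The abstract does not specify the authors' models, but the general idea of word sense as an extra latent variable can be illustrated with a toy collapsed Gibbs sampler for a generative chain topic → sense → word, where each word is restricted to its candidate senses from a sense inventory. This is a minimal sketch under those assumptions; the function name, count tables, sense inventory, and hyperparameters are all illustrative, not the paper's actual models.

```python
import random
from collections import defaultdict

def sense_lda_gibbs(docs, senses, K, S, V, iters=50,
                    alpha=0.1, beta=0.1, gamma=0.1, seed=0):
    """Toy collapsed Gibbs sampler for a sense-aware LDA sketch.

    Generative assumption: topic z -> sense s -> word w.
    docs:   list of documents, each a list of word ids
    senses: dict mapping word id -> list of candidate sense ids
    K, S, V: number of topics, senses, vocabulary size
    Returns the collapsed count tables (doc-topic, topic-sense, sense-word).
    """
    rng = random.Random(seed)
    n_dz = defaultdict(int)  # doc-topic counts
    n_zs = defaultdict(int)  # topic-sense counts
    n_sw = defaultdict(int)  # sense-word counts
    n_z = defaultdict(int)   # tokens per topic
    n_s = defaultdict(int)   # tokens per sense
    tokens = [(d, w) for d, doc in enumerate(docs) for w in doc]
    assign = []              # current (topic, sense) per token

    # random initialization, respecting each word's sense inventory
    for d, w in tokens:
        z = rng.randrange(K)
        s = rng.choice(senses[w])
        assign.append((z, s))
        n_dz[d, z] += 1; n_zs[z, s] += 1; n_sw[s, w] += 1
        n_z[z] += 1; n_s[s] += 1

    for _ in range(iters):
        for i, (d, w) in enumerate(tokens):
            z, s = assign[i]
            # remove token i from the counts
            n_dz[d, z] -= 1; n_zs[z, s] -= 1; n_sw[s, w] -= 1
            n_z[z] -= 1; n_s[s] -= 1
            # joint conditional over (topic, sense) pairs for this word
            cand, weights = [], []
            for zz in range(K):
                for ss in senses[w]:
                    p = ((n_dz[d, zz] + alpha)
                         * (n_zs[zz, ss] + beta) / (n_z[zz] + S * beta)
                         * (n_sw[ss, w] + gamma) / (n_s[ss] + V * gamma))
                    cand.append((zz, ss)); weights.append(p)
            z, s = rng.choices(cand, weights=weights)[0]
            assign[i] = (z, s)
            n_dz[d, z] += 1; n_zs[z, s] += 1; n_sw[s, w] += 1
            n_z[z] += 1; n_s[s] += 1
    return n_dz, n_zs, n_sw

# Tiny illustrative run: 4 word types, word 1 is "polysemous" (two senses).
docs = [[0, 1, 0, 1], [2, 3, 2, 3]]
senses = {0: [0], 1: [0, 1], 2: [1], 3: [1]}
n_dz, n_zs, n_sw = sense_lda_gibbs(docs, senses, K=2, S=2, V=4, iters=20)
```

Sampling the (topic, sense) pair jointly keeps the chain simple; the key difference from plain LDA is that the topic-word distribution is factored through the sense layer, so occurrences of a polysemous word can land in different senses and thus stop blurring the topics together.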

References

  1. Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022.
  2. Boyd-Graber, J. L., Blei, D. M., and Zhu, X. (2007). A topic model for word sense disambiguation. In EMNLP-CoNLL, pages 1024-1033. ACL.
  3. Chemudugunta, C., Smyth, P., and Steyvers, M. (2008). Combining concept hierarchies and statistical topic models. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 1469-1470, New York, NY, USA. ACM.
  4. Dietz, L., Bickel, S., and Scheffer, T. (2007). Unsupervised prediction of citation influences. In Proceedings of the 24th International Conference on Machine Learning, pages 233-240.
  5. Gabrilovich, E. and Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI '07, pages 1606-1611, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  6. Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. PNAS, 101(suppl. 1):5228-5235.
  7. Guo, W. and Diab, M. (2011). Semantic topic models: combining word distributional statistics and dictionary definitions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 552-561, Stroudsburg, PA, USA. Association for Computational Linguistics.
  8. Hotho, A., Staab, S., and Stumme, G. (2003). WordNet improves text document clustering. In Proc. of the SIGIR 2003 Semantic Web Workshop, pages 541-544.
  9. Huang, H.-H. and Kuo, Y.-H. (2010). Cross-lingual document representation and semantic similarity measure: a fuzzy set and rough set based approach. Trans. Fuzzy Sys., 18(6):1098-1111.
  10. Kong, J. and Graff, D. (2005). TDT4 multilingual broadcast news speech corpus. Linguistic Data Consortium, http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp.
  11. Lewis, D. D. (1997). Reuters-21578 text categorization test collection, distribution 1.0. http://www.research.att.com/~lewis/reuters21578.html.
  12. Steinbach, M., Karypis, G., and Kumar, V. (2000). A comparison of document clustering techniques. In KDD Workshop on Text Mining.
  13. Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2004). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101.
  14. Wang, X., McCallum, A., and Wei, X. (2007). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 2007 Seventh IEEE International Conference on Data Mining, ICDM '07, pages 697-702, Washington, DC, USA. IEEE Computer Society.
  15. Yao, X. and Van Durme, B. (2011). Nonparametric Bayesian word sense induction. In Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing, pages 10-14. Association for Computational Linguistics.


Paper Citation


in Harvard Style

Xia Y., Tang G., Zhao H., Cambria E. and Fang Zheng T. (2014). Using Word Sense as a Latent Variable in LDA Can Improve Topic Modeling. In Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-758-015-4, pages 532-537. DOI: 10.5220/0004889705320537


in Bibtex Style

@conference{icaart14,
author={Yunqing Xia and Guoyu Tang and Huan Zhao and Erik Cambria and Thomas Fang Zheng},
title={Using Word Sense as a Latent Variable in LDA Can Improve Topic Modeling},
booktitle={Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2014},
pages={532-537},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004889705320537},
isbn={978-989-758-015-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - Using Word Sense as a Latent Variable in LDA Can Improve Topic Modeling
SN - 978-989-758-015-4
AU - Xia Y.
AU - Tang G.
AU - Zhao H.
AU - Cambria E.
AU - Fang Zheng T.
PY - 2014
SP - 532
EP - 537
DO - 10.5220/0004889705320537