Chinese Word Similarity Computation based on Automatically Acquired Knowledge

Yuteng Zhang, Wenpeng Lu, Hao Wu

Abstract

This paper describes our methods for Chinese word similarity computation based on automatically acquired knowledge, developed for NLPCC-ICCPOL 2016 Task 3. All of the methods use off-the-shelf tools and data, which makes them easy to replicate. We use the Sogou corpus to train word vectors for Chinese words and use Baidu to obtain Web page counts for word pairs. Both the word vectors and the Web page counts can be acquired automatically. None of our methods uses a dictionary or manually annotated knowledge, which avoids heavy human labor. Of the four submitted results, three systems achieve similar Spearman correlation coefficients (0.327 with word vectors, 0.328 with word vectors and PMI, 0.314 with word vectors and Dice). In addition, when all English letters are converted to lowercase, the best performance of our methods improves to 0.372, obtained with word vectors and Dice. All of the comparative methods and experiments are described in the paper.
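
The abstract names three similarity signals (word vectors, PMI, and Dice over Web page counts) without giving their formulas. The following is a minimal sketch using the standard textbook definitions, which may differ from the exact variants used in the paper. The function names, the `total_pages` parameter (an estimate of the search engine's index size), and the way counts are obtained are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two word vectors given as equal-length sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for a zero vector; treat as no similarity
    return dot / (norm_u * norm_v)


def pmi(count_x, count_y, count_xy, total_pages):
    """Pointwise mutual information estimated from page counts.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with each probability
    estimated as count / total_pages (total_pages is an assumed constant).
    """
    if min(count_x, count_y, count_xy) <= 0:
        return 0.0  # no evidence of co-occurrence
    return math.log2(count_xy * total_pages / (count_x * count_y))


def dice(count_x, count_y, count_xy):
    """Dice coefficient estimated from page counts: 2 * |x and y| / (|x| + |y|)."""
    if count_x + count_y == 0:
        return 0.0
    return 2.0 * count_xy / (count_x + count_y)


def normalize(word):
    """Lowercase English letters; Chinese characters are left unchanged."""
    return word.lower()
```

How the vector score and the count-based score are combined is not stated in the abstract, so the sketch leaves that step out. The `normalize` helper mirrors the lowercasing step that the abstract reports as improving the best result.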



Paper Citation


in Harvard Style

Zhang Y., Lu W. and Wu H. (2016). Chinese Word Similarity Computation based on Automatically Acquired Knowledge. In ISME 2016 - Information Science and Management Engineering IV - Volume 1: ISME, ISBN 978-989-758-208-0, pages 48-52. DOI: 10.5220/0006443500480052


in Bibtex Style

@conference{isme16,
author={Yuteng Zhang and Wenpeng Lu and Hao Wu},
title={Chinese Word Similarity Computation based on Automatically Acquired Knowledge},
booktitle={ISME 2016 - Information Science and Management Engineering IV - Volume 1: ISME},
year={2016},
pages={48-52},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006443500480052},
isbn={978-989-758-208-0},
}


in EndNote Style

TY - CONF
JO - ISME 2016 - Information Science and Management Engineering IV - Volume 1: ISME
TI - Chinese Word Similarity Computation based on Automatically Acquired Knowledge
SN - 978-989-758-208-0
AU - Zhang Y.
AU - Lu W.
AU - Wu H.
PY - 2016
SP - 48
EP - 52
DO - 10.5220/0006443500480052