A Hierarchical Book Representation of Word Embeddings for Effective Semantic Clustering and Search

Avi Bleiweiss

2017

Abstract

Semantic word embeddings have been shown to cluster in space according to linguistic similarities that can be quantifiably captured using simple vector arithmetic. Recently, methods for learning distributed word vectors have progressively enabled neural language models to compute compositional vector representations for phrases of variable length. However, these methods remain limited in expressing more general relatedness between instances of a larger, non-uniformly sized body of text. In this work, we propose a formulation that combines a word-vector set of variable cardinality, representing a verse or a sentence, with an iterative distance metric that evaluates similarity between pairs of non-conforming verse matrices. In contrast to baselines characterized by a bag of features, our model preserves word order and is better suited to semantic matching at any of the verse, chapter, and book levels. Using our framework to train word vectors, we analyzed the clustering of bible books, exploring multidimensional scaling for visualization, and experimented with book searches of both contiguous and out-of-order parts of verses. We report robust results that support our intuition for measuring book-to-book and verse-to-book similarity.
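To make the core idea concrete, the sketch below shows one plausible reading of the abstract: each verse becomes a matrix of word vectors whose row count varies with verse length, and similarity between two non-conforming matrices is computed by iterating over pairwise word similarities. The paper's exact iterative distance metric is not reproduced here; as a stand-in, this sketch uses a symmetrized average of best-match cosine similarities, and the embedding table and 50-dimensional vectors are illustrative assumptions.

import numpy as np

def verse_matrix(words, embeddings, dim=50):
    """Stack word vectors in verse order; rows = words, columns = dimensions.
    Words missing from the vocabulary are skipped (an assumption, not the
    paper's stated policy)."""
    rows = [embeddings[w] for w in words if w in embeddings]
    return np.vstack(rows) if rows else np.zeros((0, dim))

def verse_similarity(A, B):
    """Similarity between two verse matrices with different row counts.
    Stand-in metric: symmetrized mean of best-match cosine similarities."""
    if A.shape[0] == 0 or B.shape[0] == 0:
        return 0.0
    # Normalize rows so the matrix product yields cosine similarities.
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    S = A @ B.T  # pairwise cosine matrix, shape |A| x |B|
    return 0.5 * (S.max(axis=1).mean() + S.max(axis=0).mean())

# Toy usage: random vectors stand in for trained word embeddings.
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in
       ["in", "the", "beginning", "god", "created", "heaven", "earth"]}
v1 = verse_matrix("in the beginning god created".split(), emb)
v2 = verse_matrix("god created the heaven and the earth".split(), emb)
print(round(verse_similarity(v1, v2), 3))

Because the metric reduces to averaging over word pairs, it applies unchanged when the two operands are a verse and the concatenated verses of a chapter or book, which is consistent with the hierarchical matching the abstract describes.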



Paper Citation


in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - A Hierarchical Book Representation of Word Embeddings for Effective Semantic Clustering and Search
SN - 978-989-758-220-2
AU - Bleiweiss A.
PY - 2017
SP - 154
EP - 163
DO - 10.5220/0006192701540163


in Harvard Style

Bleiweiss A. (2017). A Hierarchical Book Representation of Word Embeddings for Effective Semantic Clustering and Search. In Proceedings of the 9th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-220-2, pages 154-163. DOI: 10.5220/0006192701540163


in Bibtex Style

@conference{icaart17,
author={Avi Bleiweiss},
title={A Hierarchical Book Representation of Word Embeddings for Effective Semantic Clustering and Search},
booktitle={Proceedings of the 9th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2017},
pages={154-163},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006192701540163},
isbn={978-989-758-220-2},
}