Language Model and Clustering based Information Retrieval

Irene Giakoumi, Christos Makris, Yiannis Plegas

2015

Abstract

In this paper, we describe two novel frameworks for improving search results. Both organize relevant documents into clusters using a new soft clustering method and language models. The first framework is query-independent: it forms clusters using only the inter-document lexical or semantic similarities, and it also tries to locate duplicated content inside the formed clusters. The second framework is query-dependent and uses a query expansion technique to form the clusters. The experimental evaluation shows that the proposed method performs well in the majority of cases.


Paper Citation


in Harvard Style

Giakoumi I., Makris C. and Plegas Y. (2015). Language Model and Clustering based Information Retrieval. In Proceedings of the 11th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-758-106-9, pages 479-486. DOI: 10.5220/0005409804790486


in Bibtex Style

@conference{webist15,
author={Irene Giakoumi and Christos Makris and Yiannis Plegas},
title={Language Model and Clustering based Information Retrieval},
booktitle={Proceedings of the 11th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2015},
pages={479-486},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005409804790486},
isbn={978-989-758-106-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - Language Model and Clustering based Information Retrieval
SN - 978-989-758-106-9
AU - Giakoumi I.
AU - Makris C.
AU - Plegas Y.
PY - 2015
SP - 479
EP - 486
DO - 10.5220/0005409804790486