ON USING THE NORMALIZED COMPRESSION DISTANCE TO CLUSTER WEB SEARCH RESULTS

Alexandra Cernian, Liliana Dobrica, Dorin Carstoiu, Valentin Sgarciu

2010

Abstract

Current Web search engines return long ranked lists of documents, which users must sift through to find the relevant ones. This paper introduces a new approach to clustering Web search results, based on the notion of clustering by compression. Compression algorithms make it possible to define a similarity measure based on the amount of information two objects share, and clustering methods can then group similar data without any prior knowledge. The clustering by compression procedure relies on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files. Our goal is to apply the clustering by compression algorithm to the documents returned by a Web search engine in response to a user query.
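As an illustration of how a distance can be computed from compressed lengths alone, the sketch below uses the usual definition from the clustering by compression literature, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. The choice of zlib as the compressor, the function names, and the sample snippets are assumptions made for this example; the abstract does not say which compressor or data the authors used.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # compressed length of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(docs: list[bytes]) -> list[list[float]]:
    """Pairwise NCD values, the typical input to a clustering step."""
    return [[ncd(a, b) for b in docs] for a in docs]


# Hypothetical stand-ins for search-result snippets.
snippets = [
    b"the quick brown fox jumps over the lazy dog " * 20,
    b"the quick brown fox jumps over the lazy cat " * 20,
    b"compression based clustering of web search results " * 20,
]
matrix = distance_matrix(snippets)
```

Smaller values indicate more shared information. With a real compressor the result is only an approximation: the distance of a document to itself is small but usually not exactly zero, and the matrix is not guaranteed to be perfectly symmetric.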



Paper Citation


in Harvard Style

Cernian A., Dobrica L., Carstoiu D. and Sgarciu V. (2010). ON USING THE NORMALIZED COMPRESSION DISTANCE TO CLUSTER WEB SEARCH RESULTS. In Proceedings of the 5th International Conference on Software and Data Technologies - Volume 1: ICSOFT, ISBN 978-989-8425-22-5, pages 293-298. DOI: 10.5220/0002926102930298


in Bibtex Style

@conference{icsoft10,
author={Alexandra Cernian and Liliana Dobrica and Dorin Carstoiu and Valentin Sgarciu},
title={ON USING THE NORMALIZED COMPRESSION DISTANCE TO CLUSTER WEB SEARCH RESULTS},
booktitle={Proceedings of the 5th International Conference on Software and Data Technologies - Volume 1: ICSOFT},
year={2010},
pages={293-298},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002926102930298},
isbn={978-989-8425-22-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Software and Data Technologies - Volume 1: ICSOFT
TI - ON USING THE NORMALIZED COMPRESSION DISTANCE TO CLUSTER WEB SEARCH RESULTS
SN - 978-989-8425-22-5
AU - Cernian A.
AU - Dobrica L.
AU - Carstoiu D.
AU - Sgarciu V.
PY - 2010
SP - 293
EP - 298
DO - 10.5220/0002926102930298