Data Quality and Sparsity Issues in Collaborative Filtering on Web Logs

Miha Grcǎr, Dunja Mladenič, Marko Grobelnik



In this paper, we present our experience in applying collaborative filtering to real-life corporate data, with a focus on data quality and sparsity. The quality of collaborative filtering recommendations depends heavily on the quality of the data used to identify users’ preferences. To understand how highly sparse, server-side collected data affects the accuracy of collaborative filtering, we ran a series of experiments on publicly available datasets and on a real-life corporate dataset that does not fit the profile of ideal data for collaborative filtering. We also experimentally compared two standard similarity measures (Pearson correlation and cosine similarity) used by the k-Nearest Neighbor classifier. Depending on the dataset, one outperforms the other, but no consistent difference can be claimed.
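The two measures compared in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulas, so this makes several assumptions: each user is a sparse dictionary mapping items to ratings, Pearson correlation is computed over co-rated items only, and cosine similarity treats missing ratings as zero. The function names and the toy data are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse rating dicts (item -> rating).

    Assumption: items a user has not rated count as zero.
    """
    common = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation restricted to items rated by both users.

    Assumption: returns 0.0 when fewer than two items overlap or when
    either user's co-rated ratings have zero variance.
    """
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def nearest_neighbors(target, users, k, similarity):
    """Return the k users most similar to `target` under `similarity`."""
    scored = [
        (similarity(users[target], ratings), name)
        for name, ratings in users.items()
        if name != target
    ]
    # Highest similarity first; ties broken by user name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical toy data: u4 shares no items with u1, which is the common
# case in highly sparse data.
users = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2, "c": 3},
    "u3": {"a": 1, "b": 5},
    "u4": {"d": 4},
}

pearson_neighbors = nearest_neighbors("u1", users, 2, pearson_correlation)
cosine_neighbors = nearest_neighbors("u1", users, 2, cosine_similarity)
```

On this toy data the two measures select different neighbor sets for `u1`: Pearson treats `u3` as negatively correlated and ranks the non-overlapping `u4` above it, while cosine similarity ranks `u3` above `u4`. When two users share few or no items, both measures have little to work with, which is one way sparsity can affect neighbor selection.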



Paper Citation

in Harvard Style

Grcǎr M., Mladenič D. and Grobelnik M. (2005). Data Quality and Sparsity Issues in Collaborative Filtering on Web Logs. In Proceedings of the 1st International Workshop on Web Personalisation, Recommender Systems and Intelligent User Interfaces - Volume 1: WPRSIUI, (ICETE 2005) ISBN 972-8865-38-4, pages 89-97. DOI: 10.5220/0001421500890097
