Web Page Classification Based on Web Page Size and Hyperlinks and Website Hyperlink Structure

Denis L. Nkweteyim

2005

Abstract

This paper presents a new metric, Page Rank × Inverse Links-to-Word count Ratio (PR × ILW), for classifying web pages as content or navigation pages. The metric combines two page-level quantities, web page size and the number of hyperlinks on the page, with the page rank of the page, which is derived from the website's hyperlink topology. We present a theoretical basis for the metric and the results of a web page classification study, which show that PR × ILW, when combined with the links-to-word count ratio of web pages, accurately classifies pages into the two categories.
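
The abstract names the metric but does not spell out its formula. The sketch below reads the name literally, taking the links-to-word count ratio as (number of links) / (number of words) and PR × ILW as the page rank multiplied by the inverse of that ratio. This reading, the function names, and the handling of pages without links are assumptions made for illustration, not the paper's definitions; the page rank value is assumed to be computed elsewhere.

```python
import math


def links_to_word_ratio(num_links: int, num_words: int) -> float:
    """Links-to-word count ratio (LW) of a page: links per word.

    Assumption: page size is measured as a word count.
    """
    if num_words <= 0:
        raise ValueError("num_words must be positive")
    if num_links < 0:
        raise ValueError("num_links must be non-negative")
    return num_links / num_words


def pr_times_ilw(page_rank: float, num_links: int, num_words: int) -> float:
    """PR x ILW under the literal reading: page_rank * (words / links).

    Assumption: a page with no links gets an infinite score, since its
    links-to-word ratio is zero and the inverse is undefined.
    """
    lw = links_to_word_ratio(num_links, num_words)
    if lw == 0:
        return math.inf
    return page_rank / lw


if __name__ == "__main__":
    # Hypothetical pages: a long article with few links, and a link-heavy menu.
    print(pr_times_ilw(page_rank=0.5, num_links=10, num_words=200))
    print(pr_times_ilw(page_rank=0.5, num_links=50, num_words=100))
```

Under this reading, a text-heavy page with few links scores higher than a link-dense page with the same page rank, which matches the content versus navigation distinction the abstract describes. The thresholds that separate the two classes are not given in the abstract and are left out here.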



Paper Citation


in Harvard Style

Nkweteyim, D. L. (2005). Web Page Classification Based on Web Page Size and Hyperlinks and Website Hyperlink Structure. In Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005) ISBN 972-8865-28-7, pages 225-233. DOI: 10.5220/0002577102250233


in Bibtex Style

@conference{pris05,
author={Denis L. Nkweteyim},
title={Web Page Classification Based on Web Page Size and Hyperlinks and Website Hyperlink Structure},
booktitle={Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)},
year={2005},
pages={225-233},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002577102250233},
isbn={972-8865-28-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2005)
TI - Web Page Classification Based on Web Page Size and Hyperlinks and Website Hyperlink Structure
SN - 972-8865-28-7
AU - Nkweteyim, Denis L.
PY - 2005
SP - 225
EP - 233
DO - 10.5220/0002577102250233