CLASSIFYING WEB PAGES BY GENRE - Dealing with Unbalanced Distributions, Multiple Labels and Noise

Jane E. Mason, Michael Shepherd, Jack Duffy, Vlado Kešelj

Abstract

Web page genre classification is a potentially powerful tool for filtering the results of online searches. The goal of this research is to develop an approach to Web page genre classification that is effective not only on balanced, single-label corpora, but also on unbalanced and multi-label corpora and in the presence of noise, so that it better represents a real-world environment. The approach is based on n-gram representations of the Web pages and centroid representations of the genre classes. Experimental results compare very favorably with those of other researchers.
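The general idea of n-gram page representations and centroid class representations can be illustrated with a minimal sketch. This is not the authors' implementation: the use of character trigrams, relative-frequency profiles, cosine similarity, the optional threshold for multi-label output, and all function names and toy training texts are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Map each character n-gram in text to its relative frequency."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {g: c / total for g, c in Counter(grams).items()}


def centroid(profiles):
    """Average several page profiles into one genre (class) profile."""
    summed = Counter()
    for profile in profiles:
        summed.update(profile)  # adds the frequencies key by key
    return {g: v / len(profiles) for g, v in summed.items()}


def cosine(p, q):
    """Cosine similarity between two sparse profiles (assumed measure)."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def classify(text, centroids, threshold=None, n=3):
    """Return the closest genre, or every genre at or above threshold."""
    profile = ngram_profile(text, n)
    scores = {genre: cosine(profile, c) for genre, c in centroids.items()}
    if threshold is None:
        # Single-label case: pick the most similar centroid.
        return [max(scores, key=scores.get)]
    # Multi-label case (assumed rule): keep all sufficiently similar genres.
    return sorted(g for g, s in scores.items() if s >= threshold)


# Hypothetical toy corpus; real experiments would use labeled Web pages.
training = {
    "faq": [
        "Q: how do I reset my password? A: click reset.",
        "Q: what is the price? A: it is free.",
    ],
    "shop": [
        "add to cart, buy now, free shipping on orders",
        "buy now and save, add to cart today",
    ],
}
centroids = {
    genre: centroid([ngram_profile(t) for t in texts])
    for genre, texts in training.items()
}

print(classify("add to cart, buy now, free shipping on orders", centroids))
```

Because each genre is reduced to a single averaged profile, a class with few training pages still yields a usable centroid, and a threshold on the similarity score is one simple way to let a page receive several labels or none.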

References

  1. Crowston, K. and Williams, M. (1997). Reproduced and Emergent Genres of Communication on the World-Wide Web. In Proc. 30th Hawaii Intl. Conf. on System Sciences, pages 30-39.
  2. Finn, A. and Kushmerick, N. (2006). Learning to Classify Documents According to Genre. Journal of the American Society for Information Science and Technology, 57(11):1506-1518.
  3. Jebari, C. (2008). Refined and Incremental Centroid-based Approach for Genre Categorization of Web Pages. In Proc. 17th Intl. World Wide Web Conf.
  4. Kanaris, I. and Stamatatos, E. (2009). Learning to Recognize Webpage Genres. Information Processing & Management, 45(5):499-512.
  5. Kešelj, V., Peng, F., Cercone, N., and Thomas, T. (2003). N-gram-based Author Profiles for Authorship Attribution. In Proc. Conf. Pacific Association for Computational Linguistics, pages 255-264.
  6. Levering, R., Cutler, M., and Yu, L. (2008). Using Visual Features for Fine-Grained Genre Classification of Web Pages. In Proc. 41st Hawaii Intl. Conf. on System Sciences. IEEE Computer Society.
  7. Mason, J., Shepherd, M., and Duffy, J. (2009a). An Ngram Based Approach to Automatically Identifying Web Page Genre. In Proc. 42nd Hawaii Intl. Conf. on System Sciences.
  8. Mason, J., Shepherd, M., and Duffy, J. (2009b). Classifying Web Pages by Genre: A Distance Function Approach. In Proc. 5th Intl. Conf. on Web Information Systems and Technologies.
  9. Mason, J., Shepherd, M., and Duffy, J. (2009c). Classifying Web Pages by Genre: An n-gram Based Approach. In Proc. Intl. Conf. on Web Intelligence.
  10. Mason, J., Shepherd, M., Duffy, J., Kešelj, V., and Watters, C. (2010). An n-gram Based Approach to Multilabeled Web Page Genre Classification. In Proc. 43rd Hawaii Intl. Conf. on System Sciences.
  11. Meyer zu Eissen, S. and Stein, B. (2004). Genre Classification of Web Pages. In Proc. 27th German Conf. on Artificial Intelligence. Springer.
  12. Rosso, M. (2008). User-based identification of Web genres. Journal of the American Society for Information Science and Technology, 59(7).
  13. Santini, M. (2008). Zero, Single, or Multi? Genre of Web Pages Through the Users' Perspective. Information Processing and Management, 44(2):702-737.
  14. Shepherd, M., Watters, C., and Kennedy, A. (2004). Cybergenre: Automatic Identification of Home Pages on the Web. Journal of Web Engineering, 3(3&4):236-251.
  15. Stein, B. and Meyer zu Eissen, S. (2008). Retrieval models for genre classification. Scandinavian Journal of Information Systems, 20(1):93-119.
  16. Vidulin, V., Luštrek, M., and Gams, M. (2007). Training the Genre Classifier for Automatic Classification of Web Pages. In Proc. 29th Intl. Conf. on Information Technology Interfaces, pages 93-98.

Paper Citation


in Harvard Style

Mason J. E., Shepherd M., Duffy J. and Kešelj V. (2011). CLASSIFYING WEB PAGES BY GENRE - Dealing with Unbalanced Distributions, Multiple Labels and Noise. In Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8425-51-5, pages 589-594. DOI: 10.5220/0003343505890594


in Bibtex Style

@conference{webist11,
author={Jane E. Mason and Michael Shepherd and Jack Duffy and Vlado Kešelj},
title={CLASSIFYING WEB PAGES BY GENRE - Dealing with Unbalanced Distributions, Multiple Labels and Noise},
booktitle={Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2011},
pages={589-594},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003343505890594},
isbn={978-989-8425-51-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - CLASSIFYING WEB PAGES BY GENRE - Dealing with Unbalanced Distributions, Multiple Labels and Noise
SN - 978-989-8425-51-5
AU - Mason J. E.
AU - Shepherd M.
AU - Duffy J.
AU - Kešelj V.
PY - 2011
SP - 589
EP - 594
DO - 10.5220/0003343505890594