ONLINE WEB GENRE CLASSIFICATION, IS IT DOABLE?

Hoda Badesh, James Blustein, Anwar Alhenshiri

Abstract

This paper investigates the feasibility and effectiveness of online clustering of Web search results by genre. Although several studies have investigated the accuracy of classifying Web pages by genre, that research has focused only on off-line clustering and classification because of the large number of documents on the Web. This research investigates the feasibility of creating sets of Web pages to represent the main genres on the Web. Each genre, as identified in the work of Santini (2006), will be represented by a set of Web pages. Web search results will be compared to those sets and classified accordingly: each result will be grouped with the set of genre representatives to which it is most similar. The resulting clusters of Web search results will be presented to the user. A user study will be conducted to examine the validity and accuracy of online clustering based on Web genres.
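The comparison step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the page features or the similarity measure, so the bag-of-words vectors, the cosine similarity, the function names, the genre labels, and the toy data below are all assumptions made for illustration, not the authors' method.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercased term-frequency vector for a piece of text (assumed feature set)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profiles(genre_sets):
    """Sum the term counts of each genre's representative pages into one profile."""
    profiles = {}
    for genre, pages in genre_sets.items():
        profile = Counter()
        for page in pages:
            profile.update(bag_of_words(page))
        profiles[genre] = profile
    return profiles


def cluster_by_genre(results, profiles):
    """Group each search result under the genre whose profile it is most similar to."""
    clusters = defaultdict(list)
    for result in results:
        vector = bag_of_words(result)
        best = max(profiles, key=lambda g: cosine(vector, profiles[g]))
        clusters[best].append(result)
    return dict(clusters)


# Toy, made-up representative pages and search results (hypothetical data).
GENRE_SETS = {
    "blog": [
        "posted by admin today comments leave a reply",
        "my diary posted comments archive",
    ],
    "eshop": [
        "add to cart price checkout shipping",
        "buy now price discount cart",
    ],
}
RESULTS = [
    "leave a comments on this post posted yesterday",
    "best price laptop add to cart",
]

clusters = cluster_by_genre(RESULTS, build_profiles(GENRE_SETS))
```

Because the genre profiles are built once, ahead of time, only the vector for each incoming search result has to be computed at query time, which is what would make such a scheme plausible for online use.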

References

  1. Alhenshiri, A., Brooks, S., Watters, C., Shepherd, M., 2010. Augmenting the Visual Presentation of Web Search Results. In Proceedings of the 5th International Conference on Digital Information Management, Thunder Bay, ON, Canada, (to appear).
  2. Carpineto, C., Osinski, S., Romano, G., Weiss, D., 2009. A Survey of Web Clustering Engines. ACM Computing Surveys, vol. 41, issue 3, Article No. 17.
  3. Levering, R., Cutler, M., and Yu, L., 2008. Using Visual Features for Fine-Grained Genre Classification of Web Pages. In Proceedings of the 41st Annual Hawaii International Conference on System Sciences, Hawaii, USA, 131.
  4. Manning, C. D., Raghavan, P., Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press.
  5. Mason, J. E., Shepherd, M., Duffy, J., 2009. An N-Gram Based Approach to Automatically Identifying Web Page Genre. HICSS 2009: 1-10.
  6. Rosso, A. M., 2005. What type of page is this?: Genre as Web Descriptor. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, CO, USA, 398.
  7. Stubbe, A., Ringlstetter, C., Zheng, T., Goebel, R., 2007. Incremental Genre Classification. In Proceeding of Colloquium held in conjunction with Corpus Linguistics, Birmingham, UK.
  8. Santini, M., 2006. Interpreting Genre Evolution on the Web. In EACL 2006 Workshop: NEW TEXT - Wikis and blogs and other dynamic text sources, Trento, 32-40.
  9. Santini, M., Sharoff, S., 2009. Web Genre Benchmark Under Construction. Journal for Language Technology and Computational Linguistics (JLCL). Volume 25, Number 1 - Special Issue: Automatic Genre Identification: Issues, and Prospects.
  10. Teevan, J. 2008. How People Recall, Recognize, and Reuse Search Results. ACM Transactions on Information Systems, vol. 26, issue 4. Article No. 19.
  11. Turetken, O., & Sharda, R., 2005. Clustering-based Visual Interfaces for Presentation of Web Search Results: An Empirical Investigation. Information Systems Frontiers, 7(3), 273-297.


Paper Citation


in Harvard Style

Badesh H., Blustein J. and Alhenshiri A. (2011). ONLINE WEB GENRE CLASSIFICATION, IS IT DOABLE? In Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8425-51-5, pages 278-281. DOI: 10.5220/0003314202780281


in Bibtex Style

@conference{webist11,
author={Hoda Badesh and James Blustein and Anwar Alhenshiri},
title={ONLINE WEB GENRE CLASSIFICATION, IS IT DOABLE?},
booktitle={Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2011},
pages={278-281},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003314202780281},
isbn={978-989-8425-51-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - ONLINE WEB GENRE CLASSIFICATION, IS IT DOABLE?
SN - 978-989-8425-51-5
AU - Badesh H.
AU - Blustein J.
AU - Alhenshiri A.
PY - 2011
SP - 278
EP - 281
DO - 10.5220/0003314202780281