A SEMANTIC SCRAPING MODEL FOR WEB RESOURCES - Applying Linked Data to Web Page Screen Scraping

José Ignacio Fernández-Villamor, Jacobo Blasco-García, Carlos Á. Iglesias, Mercedes Garijo

2011

Abstract

Despite the increasing presence of Semantic Web facilities, only a limited share of the resources available on the Internet provide semantic access. Recent initiatives such as the emerging Linked Data Web are providing semantic access to available data by porting existing resources to the Semantic Web using different technologies, such as database-semantic mapping and scraping. Nevertheless, existing scraping solutions are ad hoc, complemented with graphical interfaces that speed up scraper development. This article proposes a generic framework for web scraping based on semantic technologies. The framework is structured in three levels: scraping services, semantic scraping model, and syntactic scraping. The first level provides an interface through which generic applications or intelligent agents can gather information from the web at a high level. The second level defines a semantic RDF model of the scraping process, providing a declarative approach to the scraping task. Finally, the third level provides an implementation of the RDF scraping model for specific technologies. The work has been validated in a scenario that illustrates its application to mashup technologies.
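
The three levels described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary under http://example.org/vocab#, the MAPPING structure, and the functions scrape and get_posts are hypothetical names chosen for illustration, and Python's standard XML parser stands in for a real HTML scraper, so the input is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary for illustration only; not the paper's RDF model.
EX = "http://example.org/vocab#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Level 2 (semantic scraping model): a declarative description of which
# page fragments map to which resource type and properties.
MAPPING = {
    "selector": ".//div[@class='post']",
    "type": EX + "Post",
    "properties": {
        EX + "title": "./h2",
        EX + "body": "./p",
    },
}


def scrape(xhtml, mapping, base):
    """Level 3 (syntactic scraping): interpret the mapping against a page.

    Returns a list of (subject, predicate, object) triples.
    """
    root = ET.fromstring(xhtml)
    triples = []
    for index, node in enumerate(root.findall(mapping["selector"])):
        subject = f"{base}#fragment{index}"
        triples.append((subject, RDF_TYPE, mapping["type"]))
        for predicate, selector in mapping["properties"].items():
            child = node.find(selector)
            if child is not None and child.text:
                triples.append((subject, predicate, child.text.strip()))
    return triples


def get_posts(xhtml, base="http://example.org/page"):
    """Level 1 (scraping service): a high-level call that hides the mapping."""
    return scrape(xhtml, MAPPING, base)


PAGE = """<html><body>
<div class="post"><h2>First post</h2><p>Hello</p></div>
<div class="post"><h2>Second post</h2><p>World</p></div>
</body></html>"""

triples = get_posts(PAGE)
for triple in triples:
    print(triple)
```

Because the mapping is plain data, supporting a different site in this sketch means changing MAPPING rather than the extraction code, which is the kind of separation a declarative approach aims for.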


Paper Citation


in Harvard Style

Fernández-Villamor J. I., Blasco-García J., Iglesias C. Á. and Garijo M. (2011). A SEMANTIC SCRAPING MODEL FOR WEB RESOURCES - Applying Linked Data to Web Page Screen Scraping. In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-8425-41-6, pages 451-456. DOI: 10.5220/0003185704510456


in Bibtex Style

@conference{icaart11,
author={José Ignacio Fernández-Villamor and Jacobo Blasco-García and Carlos Á. Iglesias and Mercedes Garijo},
title={A SEMANTIC SCRAPING MODEL FOR WEB RESOURCES - Applying Linked Data to Web Page Screen Scraping},
booktitle={Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2011},
pages={451-456},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003185704510456},
isbn={978-989-8425-41-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - A SEMANTIC SCRAPING MODEL FOR WEB RESOURCES - Applying Linked Data to Web Page Screen Scraping
SN - 978-989-8425-41-6
AU - Fernández-Villamor J. I.
AU - Blasco-García J.
AU - Iglesias C. Á.
AU - Garijo M.
PY - 2011
SP - 451
EP - 456
DO - 10.5220/0003185704510456