DESIGN OF AUTOMATICALLY ADAPTABLE WEB WRAPPERS

Emilio Ferrara, Robert Baumgartner

Abstract

The huge amount of information distributed through the Web motivates the study of techniques for extracting relevant data efficiently and reliably. Both academia and enterprises have developed several approaches to Web data extraction, for example using techniques from artificial intelligence or machine learning. Some commonly adopted procedures, namely wrappers, ensure a high degree of precision in the information extracted from Web pages and, at the same time, must prove robust so as not to compromise the quality and reliability of the data themselves. In this paper we focus on some experimental aspects related to the robustness of the data extraction process and the possibility of automatically adapting wrappers. We discuss the implementation of algorithms for finding similarities between two different versions of a Web page, in order to handle modifications, avoid the failure of data extraction tasks, and ensure the reliability of the extracted information. Our purpose is to evaluate the performance, advantages, and drawbacks of our novel system of automatic wrapper adaptation.
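
The abstract does not say which similarity algorithm the system implements. As a rough illustration of how two versions of a page might be compared, the sketch below assumes a simple tree matching scheme, a classic dynamic-programming technique that counts the largest set of node pairs that can be matched while preserving parent-child relations and sibling order. The `Node` class, the function names, and the normalization by average tree size are hypothetical choices made for this example, not details taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal stand-in for a DOM element: a tag name and ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of a maximum order-preserving matching between trees a and b."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them is matched either
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i subtrees of a
    # and the first j subtrees of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 counts the matched roots


def size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Matching size divided by the average size of the two trees (0.0 to 1.0)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two hypothetical versions of the same page: the newer one gained
# a <span> inside the <div> and a third <li> in the list.
old = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p")]),
    Node("ul", [Node("li"), Node("li")]),
])])
new = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("span"), Node("p")]),
    Node("ul", [Node("li"), Node("li"), Node("li")]),
])])

print(simple_tree_matching(old, new))   # 8 matched nodes
print(round(similarity(old, new), 3))   # 0.889
```

In this toy case every node of the old version still finds a counterpart in the new one, so a wrapper relying on such a score could relocate its extraction targets even though the page structure has changed. A real system would also need a threshold below which the match is rejected; how that threshold is chosen is outside what the abstract states.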


Paper Citation


in Harvard Style

Ferrara E. and Baumgartner R. (2011). DESIGN OF AUTOMATICALLY ADAPTABLE WEB WRAPPERS. In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-8425-40-9, pages 211-217. DOI: 10.5220/0003131802110217


in Bibtex Style

@conference{icaart11,
author={Emilio Ferrara and Robert Baumgartner},
title={DESIGN OF AUTOMATICALLY ADAPTABLE WEB WRAPPERS},
booktitle={Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2011},
pages={211-217},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003131802110217},
isbn={978-989-8425-40-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - DESIGN OF AUTOMATICALLY ADAPTABLE WEB WRAPPERS
SN - 978-989-8425-40-9
AU - Ferrara E.
AU - Baumgartner R.
PY - 2011
SP - 211
EP - 217
DO - 10.5220/0003131802110217