EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo

Abstract

Many large web sites contain highly valuable information. Their pages are dynamically generated by scripts that retrieve data from a back-end database and embed them into HTML templates. Based on this observation, several techniques have been developed to automatically extract data from a set of structurally homogeneous pages. These tools represent a step towards the automatic extraction of data from large web sites, but currently their input sample pages have to be collected manually. To scale the data extraction process, this task should be automated as well. We present techniques for automatically gathering structurally similar pages from large web sites. We have developed an algorithm that takes one sample page as input and crawls the site to find pages similar in structure to the given page. The collected pages can feed an automatic wrapper generator to extract data. Experiments conducted on real-life web sites gave us encouraging results.
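
To make the notion of "pages similar in structure" concrete, the following is a minimal illustrative sketch and not the algorithm of the paper, which the abstract does not detail. It assumes, purely for illustration, that a page's structure can be approximated by the set of its root-to-element tag paths, and that two pages are compared by the Jaccard similarity of those sets. The names `tag_paths`, `structural_similarity`, `select_similar` and the 0.8 threshold are all hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths found in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths of an HTML document (assumed structure proxy)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' tag-path sets (0.0 to 1.0)."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_similar(sample_html, candidates, threshold=0.8):
    """Return URLs of candidate pages whose structure resembles the sample.

    `candidates` maps URL -> HTML text; the threshold is an arbitrary assumption.
    """
    return [url for url, html in candidates.items()
            if structural_similarity(sample_html, html) >= threshold]
```

In a crawler, a filter of this kind would be applied to pages reached from the sample page, and only those above the threshold would be passed on to the wrapper generator. Deciding which links to follow efficiently is a separate problem that this sketch does not address.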


Paper Citation


in Harvard Style

Blanco L., Crescenzi V. and Merialdo P. (2005). EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP. In Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 972-8865-20-1, pages 247-254. DOI: 10.5220/0001234202470254


in Bibtex Style

@conference{webist05,
author={Lorenzo Blanco and Valter Crescenzi and Paolo Merialdo},
title={EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP},
booktitle={Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2005},
pages={247-254},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001234202470254},
isbn={972-8865-20-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the First International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - EFFICIENTLY LOCATING COLLECTIONS OF WEB PAGES TO WRAP
SN - 972-8865-20-1
AU - Blanco L.
AU - Crescenzi V.
AU - Merialdo P.
PY - 2005
SP - 247
EP - 254
DO - 10.5220/0001234202470254