SEMI-SUPERVISED INFORMATION EXTRACTION FROM VARIABLE-LENGTH WEB-PAGE LISTS

Daniel Nikovski, Alan Esenther, Akihiro Baba

2009

Abstract

We propose two methods for constructing automated programs that extract information from a class of web pages that is very common and of high practical significance: variable-length lists of records with identical structure. Whereas most existing methods require multiple example instances of the target web page in order to construct extraction rules, our algorithms require only a single example instance. The first method analyzes the document object model (DOM) tree of the web page to identify repeated structure that includes all of the specified data fields of interest. The second method provides an interactive way of discovering the list node of the DOM tree by visualizing the correspondence between portions of XPath expressions and visual elements in the web page. Both methods construct extraction rules in the form of XPath expressions, which eases deployment and integration with other information systems.
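
The general idea of deriving XPath extraction rules from a single example record can be illustrated with a short sketch. This is not the algorithm from the paper, whose details the abstract does not give. It is a simplified stand-in that assumes the page is well-formed XHTML, that the user supplies the text of at least two fields of one record, and that each field sits in its own element. The record node is taken to be the lowest common ancestor of the example fields, and its parent is taken to be the list node. The names PAGE, learn_rules and extract are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: a variable-length list of identically structured records.
PAGE = """
<html><body>
  <h1>Products</h1>
  <table>
    <tr><td>Widget</td><td>4.50</td></tr>
    <tr><td>Gadget</td><td>7.25</td></tr>
    <tr><td>Doohickey</td><td>1.00</td></tr>
  </table>
</body></html>
"""

def path_from_root(node, parents):
    """Return the chain of elements from the root down to node."""
    chain = [node]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain[::-1]

def step(node, parents):
    """Positional XPath step for node among same-tag siblings, e.g. 'td[2]'."""
    siblings = [s for s in parents[node] if s.tag == node.tag]
    return f"{node.tag}[{siblings.index(node) + 1}]"

def learn_rules(root, example_values):
    """Infer (record_xpath, field_xpaths) from the fields of one example record."""
    parents = {child: parent for parent in root.iter() for child in parent}
    # Locate the first element whose text matches each example value.
    fields = [next(e for e in root.iter() if (e.text or "").strip() == v)
              for v in example_values]
    chains = [path_from_root(f, parents) for f in fields]
    # Length of the common prefix of all chains; its last element is the
    # lowest common ancestor, assumed here to be the record node.
    common = 0
    while (all(len(c) > common for c in chains)
           and len({id(c[common]) for c in chains}) == 1):
        common += 1
    record = chains[0][common - 1]
    # Positional steps pin down the list node; the record step has no
    # position, so it matches every record in the list.
    list_steps = [step(n, parents) for n in chains[0][1:common - 1]]
    record_xpath = "/".join(["."] + list_steps + [record.tag])
    field_xpaths = ["/".join(["."] + [step(n, parents) for n in c[common:]])
                    for c in chains]
    return record_xpath, field_xpaths

def extract(root, record_xpath, field_xpaths):
    """Apply the learned rules; skip nodes that lack any of the fields."""
    rows = []
    for rec in root.findall(record_xpath):
        cells = [rec.find(fx) for fx in field_xpaths]
        if all(cell is not None for cell in cells):
            rows.append([(cell.text or "").strip() for cell in cells])
    return rows

root = ET.fromstring(PAGE)
record_xpath, field_xpaths = learn_rules(root, ["Widget", "4.50"])
print(record_xpath)   # ./body[1]/table[1]/tr
print(field_xpaths)   # ['./td[1]', './td[2]']
print(extract(root, record_xpath, field_xpaths))
```

Because the record step carries no positional index, the same rule keeps working when the list grows or shrinks, which is the property that matters for variable-length lists. A real page would first need an HTML parser that tolerates malformed markup; the standard-library XML parser used here accepts only well-formed input.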



Paper Citation


in Harvard Style

Nikovski D., Esenther A. and Baba A. (2009). SEMI-SUPERVISED INFORMATION EXTRACTION FROM VARIABLE-LENGTH WEB-PAGE LISTS. In Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8111-84-5, pages 261-266. DOI: 10.5220/0001858402610266


in Bibtex Style

@conference{iceis09,
author={Daniel Nikovski and Alan Esenther and Akihiro Baba},
title={SEMI-SUPERVISED INFORMATION EXTRACTION FROM VARIABLE-LENGTH WEB-PAGE LISTS},
booktitle={Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2009},
pages={261-266},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001858402610266},
isbn={978-989-8111-84-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - SEMI-SUPERVISED INFORMATION EXTRACTION FROM VARIABLE-LENGTH WEB-PAGE LISTS
SN - 978-989-8111-84-5
AU - Nikovski D.
AU - Esenther A.
AU - Baba A.
PY - 2009
SP - 261
EP - 266
DO - 10.5220/0001858402610266