A TREE LEARNING APPROACH TO WEB DOCUMENT SECTIONAL HIERARCHY EXTRACTION

F. Canan Pembe, Tunga Güngör

2010

Abstract

Documents are increasingly available in electronic form due to the widespread use of the Internet. Hypertext Markup Language (HTML), which is mostly concerned with the presentation of documents, is still the most commonly used format on the Web, despite the appearance of semantically richer markup languages such as XML. Effective processing of Web documents has several uses, such as displaying content on small-screen devices and summarization. In this paper, we investigate the problem of identifying the sectional hierarchy of a given HTML document together with the headings in the document. We propose and evaluate a learning approach based on Support Vector Machines that is suited to tree representations.



Paper Citation


in Harvard Style

Canan Pembe F. and Güngör T. (2010). A TREE LEARNING APPROACH TO WEB DOCUMENT SECTIONAL HIERARCHY EXTRACTION. In Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-674-021-4, pages 447-450. DOI: 10.5220/0002590004470450


in Bibtex Style

@conference{icaart10,
author={F. Canan Pembe and Tunga Güngör},
title={A TREE LEARNING APPROACH TO WEB DOCUMENT SECTIONAL HIERARCHY EXTRACTION},
booktitle={Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2010},
pages={447-450},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002590004470450},
isbn={978-989-674-021-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - A TREE LEARNING APPROACH TO WEB DOCUMENT SECTIONAL HIERARCHY EXTRACTION
SN - 978-989-674-021-4
AU - Canan Pembe F.
AU - Güngör T.
PY - 2010
SP - 447
EP - 450
DO - 10.5220/0002590004470450