A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION

Filippo Geraci, Marco Maggini

Abstract

Nowadays most Web pages are automatically assembled by content management systems or editing tools that apply a fixed template to give a uniform structure to all the documents belonging to the same site. The template usually contains side information, such as richer graphics, navigation bars and menus, banners, and advertisements, that is meant to improve the users' browsing experience but may hinder tools for the automatic processing of Web documents. In this paper, we present a novel template removal technique that exploits a sequence alignment algorithm from bioinformatics and is able to automatically extract the template from a fairly small sample of pages from the same site. The algorithm detects the common structure of HTML tags among pairs of pages and merges the partial hypotheses using a binary tree consensus schema. The experimental results show that the algorithm attains good precision and recall in retrieving the real template structure using just 16 sample pages from the site. Moreover, the positive impact of the template removal technique is shown on a Web page clustering task.
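
The pairwise alignment and binary tree merging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regex-based tag tokenizer, the scoring values (match +1, mismatch -1, gap -1), the rule that only exactly matched tags survive into the consensus, and the function names `tag_sequence`, `align_common`, and `template_consensus` are all assumptions made for illustration. The alignment itself is plain Needleman-Wunsch global alignment applied to tag sequences.

```python
import re
from typing import List, Sequence

# Assumption: a page is reduced to its opening/closing tag names only.
TAG_RE = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*")


def tag_sequence(html: str) -> List[str]:
    """Return the tag names of a page in order, e.g. ['html', 'p', '/p']."""
    return [m.group(0)[1:].lower() for m in TAG_RE.finditer(html)]


def align_common(a: Sequence[str], b: Sequence[str],
                 match: int = 1, mismatch: int = -1, gap: int = -1) -> List[str]:
    """Globally align two tag sequences (Needleman-Wunsch) and return
    the tags that the alignment pairs as exact matches."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell, collecting matched tags.
    common: List[str] = []
    i, j = n, m
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                common.append(a[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    common.reverse()
    return common


def template_consensus(seqs: Sequence[Sequence[str]]) -> List[str]:
    """Merge pairwise hypotheses level by level, as in a binary tree."""
    if not seqs:
        raise ValueError("at least one tag sequence is required")
    level = [list(s) for s in seqs]
    while len(level) > 1:
        merged = [align_common(level[k], level[k + 1])
                  for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an odd sequence is carried to the next level
            merged.append(level[-1])
        level = merged
    return level[0]
```

Calling `template_consensus([tag_sequence(p) for p in pages])` on a list of HTML strings from one site returns the tag sequence shared by all of them under these assumptions. With 16 sample pages, the tree has four levels.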

Paper Citation


in Harvard Style

Geraci F. and Maggini M. (2011). A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION. In - KDIR, (IC3K 2011) ISBN , pages 0-0. DOI: 10.5220/0003712801210128


in Bibtex Style

@conference{kdir11,
author={Filippo Geraci and Marco Maggini},
title={A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION},
booktitle={ - KDIR, (IC3K 2011)},
year={2011},
pages={},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003712801210128},
isbn={},
}


in EndNote Style

TY - CONF
JO - - KDIR, (IC3K 2011)
TI - A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION
SN -
AU - Geraci F.
AU - Maggini M.
PY - 2011
SP - 0
EP - 0
DO - 10.5220/0003712801210128