A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION

Authors: Filippo Geraci 1 and Marco Maggini 2

Affiliations: 1 CNR, Italy ; 2 Università di Siena, Italy

Keyword(s): Web information retrieval, Web template detection.

Abstract: Nowadays most Web pages are automatically assembled by content management systems or editing tools that apply a fixed template to give a uniform structure to all the documents belonging to the same site. The template usually contains side information, such as improved graphics, navigation bars and menus, banners, and advertisements, that is intended to improve the users’ browsing experience but may hinder tools for the automatic processing of Web documents. In this paper, we present a novel template removal technique that exploits a sequence alignment algorithm from bioinformatics and is able to automatically extract the template from a fairly small sample of pages from the same site. The algorithm detects the common structure of HTML tags among pairs of pages and merges the partial hypotheses using a binary tree consensus schema. The experimental results show that the algorithm attains good precision and recall in retrieving the real template structure using just 16 sample pages from the site. Moreover, the positive impact of the template removal technique is shown on a Web page clustering task.
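The pairwise alignment and binary tree merging described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the alignment algorithm or its scoring, so a plain Needleman-Wunsch global alignment with unit scores is assumed here, and the names `tag_sequence`, `align_common`, and `consensus_template` are hypothetical.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects opening and closing tags in document order, ignoring text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def tag_sequence(html):
    """Reduce a page to its sequence of HTML tags (hypothetical helper)."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def align_common(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two tag sequences (Needleman-Wunsch, assumed scoring)
    and return the tags that are matched in both, i.e. a template hypothesis."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, keeping only exact matches.
    common, i, j = [], n, m
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            if a[i - 1] == b[j - 1]:
                common.append(a[i - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    common.reverse()
    return common


def consensus_template(sequences):
    """Merge pairwise hypotheses level by level, as in a binary tree."""
    level = list(sequences)
    while len(level) > 1:
        merged = [align_common(level[k], level[k + 1])
                  for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired sequence moves up unchanged
            merged.append(level[-1])
        level = merged
    return level[0]
```

With 16 sample pages, as in the reported experiments, such a tree has four merge levels (16, 8, 4, 2, 1 sequences). Keeping only exactly matched tags at each level is a simplifying choice made for this sketch; the paper's consensus step may weigh partial hypotheses differently.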

Paper citation in several formats:
Geraci F. and Maggini M. (2011). A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION. In - KDIR, (IC3K 2011) ISBN , pages 0-0. DOI: 10.5220/0003712801210128

@conference{kdir11,
author={Filippo Geraci and Marco Maggini},
title={A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION},
booktitle={ - KDIR, (IC3K 2011)},
year={2011},
pages={},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003712801210128},
isbn={},
}

TY - CONF

JO - - KDIR, (IC3K 2011)
TI - A MULTI-SEQUENCE ALIGNMENT ALGORITHM FOR WEB TEMPLATE DETECTION
SN -
AU - Geraci F.
AU - Maggini M.
PY - 2011
SP - 0
EP - 0
DO - 10.5220/0003712801210128
