Authors:
Barkol Omer; Bergman Ruth and Golan Shahar
Affiliation:
HP Labs, Israel
Keyword(s):
Frequent trees, Tree edit distance, RTDM, DOM, Web mining, Web data records.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Soft Computing; Symbolic Systems; Web Mining
Abstract:
The importance of recognizing repeating structures in web applications has generated a large body of work on algorithms for mining the HTML Document Object Model (DOM). A restricted tree edit distance metric, called the Restricted Top Down Metric (RTDM), is both highly suitable for DOM mining and less computationally expensive than the general tree edit distance. Given two trees of sizes n_1 and n_2, current methods take O(n_1 · n_2) time to compute RTDM. Consider, however, looking for patterns that form subtrees within a web page with n elements. The RTDM must be computed for all pairs of subtrees, and the running time becomes O(n^4). This paper proposes a new algorithm that computes the distance between all the subtrees in a tree in O(n^2) time, which enables us to obtain better quality as well as better performance on a DOM mining task. In addition, we propose a new tree edit distance, SiSTeR (Similar Sibling Trees aware RTDM). This variant of RTDM handles the case where repeated (very similar) subtrees appear in different quantities in two trees that should nevertheless be considered similar.
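The idea of computing distances between all subtrees of one tree in a single pass can be illustrated with a minimal sketch. This is not the authors' algorithm or the exact RTDM definition. It assumes a simplified top-down cost model: nodes with different labels cost the combined size of both subtrees, and children of equally labeled nodes are aligned left to right, where dropping or adding a child costs its subtree size. The names `Node`, `align_children` and `all_subtree_distances` are hypothetical. Because every pair of nodes is filled in once and each child alignment only reads entries already in the table, the total work under these assumptions is quadratic in the number of nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A DOM-like tree node: a tag label plus ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)


def postorder(root: Node) -> List[Node]:
    """Return all nodes so that children always precede their parent."""
    out: List[Node] = []

    def visit(node: Node) -> None:
        for child in node.children:
            visit(child)
        out.append(node)

    visit(root)
    return out


def align_children(a, b, size, dist) -> int:
    """Sequence edit distance over two child lists (assumed cost model)."""
    prev = [0] * (len(b) + 1)
    for j, cb in enumerate(b, 1):
        prev[j] = prev[j - 1] + size[id(cb)]
    for ca in a:
        cur = [prev[0] + size[id(ca)]]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + size[id(ca)],               # drop subtree ca
                cur[j - 1] + size[id(cb)],            # add subtree cb
                prev[j - 1] + dist[(id(ca), id(cb))], # match ca with cb
            ))
        prev = cur
    return prev[-1]


def all_subtree_distances(root: Node) -> Dict[Tuple[int, int], int]:
    """Fill a table of distances for every ordered pair of subtrees."""
    nodes = postorder(root)
    size: Dict[int, int] = {}
    for n in nodes:
        size[id(n)] = 1 + sum(size[id(c)] for c in n.children)

    dist: Dict[Tuple[int, int], int] = {}
    for u in nodes:          # children of u are processed before u
        for v in nodes:      # children of v are processed before v
            if u.label != v.label:
                d = size[id(u)] + size[id(v)]
            else:
                d = align_children(u.children, v.children, size, dist)
            dist[(id(u), id(v))] = d
    return dist
```

In this sketch the table is keyed by `id(node)`, so the tree must stay alive while the table is in use. Sibling subtrees with a small distance, such as repeated list items, would be the candidates for repeated records.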