Authors:
Ermelinda Oro; Francesco Riccetti and Massimo Ruffolo
Affiliation:
University of Calabria, Italy
Keyword(s):
Information extraction, Web wrapping, PDF wrapping, Spatial reasoning, Grammars, Chart parsing.
Related Ontology Subjects/Areas/Topics:
Agents; Artificial Intelligence; Biomedical Engineering; Biomedical Signal Processing; Data Manipulation; Health Engineering and Technology Applications; Human-Computer Interaction; Methodologies and Methods; Neurocomputing; Neurotechnology, Electronics and Informatics; Pattern Recognition; Physiological Computing Systems; Sensor Networks; Soft Computing; Vision and Perception; Web Information Systems and Technologies; Web Intelligence
Abstract:
In recent years, the importance of accessing and acquiring information made available by Web (HTML) pages and business (PDF) documents has grown considerably. In this paper we present a textual query language, named ViQueL, whose main feature is the ability to identify and extract relevant information from HTML and PDF documents on the basis of their visual appearance, using easy-to-write queries. The proposed language is founded on spatial grammars, i.e., context-free grammars extended with spatial constructs. Despite its considerable expressive power, the combined complexity of ViQueL is in PTIME. Moreover, experiments show that ViQueL is reasonably efficient for real-life extraction tasks.
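The abstract does not show ViQueL syntax, so the following is only an illustrative sketch of the general idea behind a spatial grammar: a production whose right-hand side is constrained by a spatial relation between layout boxes. The `Box` type, the `east_of` relation, the production `Pair -> Label [east] Value`, and the sample coordinates are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rendered text fragment with its bounding box (hypothetical layout unit)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def east_of(a: Box, b: Box) -> bool:
    """True if b starts at or beyond a's right edge and they overlap vertically."""
    return b.x0 >= a.x1 and min(a.y1, b.y1) > max(a.y0, b.y0)


def match_pairs(boxes, is_label, is_value):
    """Apply the hypothetical production  Pair -> Label [east] Value.

    For each label box, pick the nearest value box lying to its east.
    """
    pairs = []
    for a in boxes:
        if not is_label(a):
            continue
        candidates = [b for b in boxes if is_value(b) and east_of(a, b)]
        if candidates:
            nearest = min(candidates, key=lambda b: b.x0 - a.x1)
            pairs.append((a.text, nearest.text))
    return pairs


# Made-up layout: two label/value rows and one stray value on a third row.
boxes = [
    Box("Price:", 10, 10, 50, 20),
    Box("42.00", 60, 10, 100, 20),
    Box("Name:", 10, 30, 50, 40),
    Box("Widget", 60, 30, 110, 40),
    Box("99.99", 60, 50, 100, 60),
]

pairs = match_pairs(
    boxes,
    is_label=lambda b: b.text.endswith(":"),
    is_value=lambda b: not b.text.endswith(":"),
)
print(pairs)
```

Because the rule depends only on bounding boxes, the same sketch would apply whether the boxes come from a rendered HTML page or from a PDF, which is the motivation the abstract gives for querying by visual appearance.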