Authors:
Pedro Lopes; Davide Pinto; David Campos and José Luís Oliveira
Affiliation:
Universidade de Aveiro, Portugal
Keyword(s):
Information retrieval, Web crawling, Crawler, Text processing, Multithread, Directed crawling.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Symbolic Systems
Abstract:
The Internet is becoming the primary source of knowledge. However, its disorganized evolution has brought about an exponential increase in the amount of distributed, heterogeneous information. Web crawling engines were the first tools to ease the task of finding the desired information. Nevertheless, when one is searching for quality information related to a certain scientific domain, typical search engines like Google are not enough. This is the problem that directed crawlers try to solve. Arabella is a directed web crawler that navigates through a predefined set of domains searching for specific information. It includes text-processing capabilities that increase the system’s flexibility and the number of documents that can be crawled: any structured document or REST web service can be processed. These complex processes do not harm overall system performance, thanks to the multithreaded engine that was implemented, resulting in an efficient and scalable web crawler.
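
As a rough illustration of the idea described in the abstract, the following minimal sketch shows how a directed crawler might restrict itself to a predefined set of domains, extract specific information with a text pattern, and fetch pages concurrently with a thread pool. It is not Arabella's implementation: the function `directed_crawl`, its parameters, the in-memory `FAKE_WEB`, and the `ID-` pattern are all hypothetical names chosen for this example, and the fetch function is injected so the sketch runs without network access.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

# Naive link extraction, sufficient for this sketch (assumption: double-quoted hrefs).
LINK_RE = re.compile(r'href="([^"]+)"')


def directed_crawl(seeds, allowed_domains, fetch, pattern,
                   max_workers=4, max_pages=100):
    """Crawl level by level, staying inside allowed_domains.

    fetch   -- callable taking a URL and returning the page text (hypothetical hook)
    pattern -- compiled regex describing the information being searched for
    Returns a dict mapping each URL to the list of pattern matches found there.
    """
    seen = set(seeds)
    frontier = list(seeds)
    matches = {}
    visited = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier and visited < max_pages:
            batch = frontier[: max_pages - visited]
            visited += len(batch)
            # Fetch the whole level concurrently; map() preserves input order.
            pages = list(pool.map(fetch, batch))
            frontier = []
            for url, text in zip(batch, pages):
                found = pattern.findall(text)
                if found:
                    matches[url] = found
                for href in LINK_RE.findall(text):
                    link = urljoin(url, href)
                    # The "directed" part: only follow links into allowed domains.
                    if urlparse(link).hostname in allowed_domains and link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return matches


# Tiny in-memory "web" standing in for real HTTP requests (illustrative data only).
FAKE_WEB = {
    "http://a.example/": '<a href="/p1">1</a> <a href="http://other.example/x">x</a>',
    "http://a.example/p1": 'record ID-42 <a href="/">home</a>',
    "http://other.example/x": "record ID-99",
}


def fake_fetch(url):
    return FAKE_WEB.get(url, "")


result = directed_crawl(["http://a.example/"], {"a.example"},
                        fake_fetch, re.compile(r"ID-(\d+)"))
```

In this sketch the page on `other.example` is never fetched because its host is outside the allowed set, so only the match on `a.example` is reported. Replacing `fake_fetch` with a real HTTP client call would be the natural next step, at which point politeness limits and error handling would also be needed.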