ARABELLA - A Directed Web Crawler

Pedro Lopes, Davide Pinto, David Campos, José Luís Oliveira

2009

Abstract

The Internet is becoming the primary source of knowledge. However, its disorganized growth has brought an exponential increase in the amount of distributed, heterogeneous information. Web crawling engines were the first answer to the problem of finding the desired information. Nevertheless, when one is searching for quality information in a specific scientific domain, general-purpose search engines such as Google are not enough. This is the problem that directed crawlers try to solve. Arabella is a directed web crawler that navigates a predefined set of domains in search of specific information. It includes text-processing capabilities that increase the system's flexibility and the number of documents that can be crawled: any structured document or REST web service can be processed. Because the engine is multithreaded, this additional processing does not harm overall system performance, resulting in an efficient and scalable web crawler.
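
The general idea of a directed, multithreaded crawler can be illustrated with a minimal Python sketch. This is not Arabella's implementation, which the abstract does not detail. The function name `directed_crawl`, its parameters, the regex-based link and text extraction, and the caller-supplied `fetch` function are all assumptions made for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

# Simplistic link extractor (assumption: links appear as href="..." attributes).
LINK_RE = re.compile(r'href="([^"]+)"')


def directed_crawl(seeds, allowed_domains, pattern, fetch,
                   max_workers=4, max_pages=100):
    """Crawl level by level from `seeds`, never leaving `allowed_domains`.

    `fetch(url)` is a caller-supplied function returning the page text,
    or None on failure. `pattern` is a compiled regex describing the
    information being searched for. Returns {url: [matches]} for every
    visited page that contains at least one match.
    """
    allowed = set(allowed_domains)
    seen = set(seeds)
    frontier = list(seeds)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # Download the whole frontier in parallel worker threads.
            pages = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, text in zip(frontier, pages):
                if text is None:
                    continue
                # Text processing step: keep only the targeted information.
                found = pattern.findall(text)
                if found:
                    results[url] = found
                # Follow only links that stay inside the predefined domains.
                for href in LINK_RE.findall(text):
                    link = urljoin(url, href)
                    host = urlparse(link).hostname
                    if (host in allowed and link not in seen
                            and len(seen) < max_pages):
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return results
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP library. In practice it could wrap an HTTP client call, and for a REST web service it could return the response body for the same pattern matching.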


Paper Citation


in Harvard Style

Lopes P., Pinto D., Campos D. and Oliveira J. L. (2009). ARABELLA - A Directed Web Crawler. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009), ISBN 978-989-674-011-5, pages 270-273. DOI: 10.5220/0002291602700273


in Bibtex Style

@conference{kdir09,
author={Pedro Lopes and Davide Pinto and David Campos and José Luís Oliveira},
title={ARABELLA - A Directed Web Crawler},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},
year={2009},
pages={270-273},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002291602700273},
isbn={978-989-674-011-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
TI - ARABELLA - A Directed Web Crawler
SN - 978-989-674-011-5
AU - Lopes P.
AU - Pinto D.
AU - Campos D.
AU - Oliveira J. L.
PY - 2009
SP - 270
EP - 273
DO - 10.5220/0002291602700273