Authors: Shervin Daneshpajouh 1 ; Mojtaba Mohammadi Nasiri 1 and Mohammad Ghodsi 2

Affiliations: 1 Sharif University of Technology, Iran, Islamic Republic of ; 2 Sharif University of Technology; School of Computer Science, Institute for Studies in Theoretical Physics and Mathematics, Iran, Islamic Republic of

ISBN: 978-989-8111-27-2

Keyword(s): Crawling, Communities, Seed Quality Metric, Crawl Quality Metric, HITS, Web Graph, Hyperlink Analysis.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Knowledge Discovery and Information Retrieval ; Knowledge-Based Systems ; Searching and Browsing ; Social Information Systems ; Society, e-Business and e-Government ; Soft Computing ; Symbolic Systems ; Web Information Systems and Technologies ; Web Interfaces and Applications ; Web Mining

Abstract: We present a new, fast algorithm for generating the seed set for web crawlers. A typical crawler starts from a fixed set of URLs, such as DMOZ links, and then continues crawling from the URLs found in those pages. A crawler should download as many good pages as possible in as few iterations as possible, where crawled pages are considered good if they have high PageRank and come from different communities. Our algorithm is based on the HITS algorithm and generates a crawler's seed set in O(n) running time. Starting from the generated seed set, a crawler can download high-quality web pages from different communities in fewer iterations.
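
The abstract names HITS as the basis of the seed generator but does not give the procedure itself. As a rough illustration only, the following is a minimal sketch of plain HITS power iteration on a small link graph, followed by a naive seed choice. The function names `hits` and `pick_seeds`, the dictionary graph representation, and the rule of taking the top-k hubs as seeds are all assumptions made for this example; they are not the authors' O(n) algorithm, and the community-separation step described in the abstract is not shown.

```python
from math import sqrt


def hits(graph, iterations=50):
    """Plain HITS. graph maps each page to the list of pages it links to."""
    # Collect every page that appears as a source or as a link target.
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {p: v / norm for p, v in new_auth.items()}

        # Hub score: sum of authority scores of the pages linked to.
        new_hub = {p: sum(auth[d] for d in graph.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {p: v / norm for p, v in new_hub.items()}

    return hub, auth


def pick_seeds(graph, k):
    """Hypothetical seed rule: the k pages with the highest hub score."""
    hub, _ = hits(graph)
    # Ties are broken by page name so the result is deterministic.
    return sorted(hub, key=lambda p: (-hub[p], p))[:k]


if __name__ == "__main__":
    toy_graph = {"a": ["x", "y", "z"], "b": ["x", "y"], "c": ["z"]}
    print(pick_seeds(toy_graph, 2))
```

Hubs are used as seeds here because a good hub links to many authoritative pages, so a crawler starting there reaches them in one step. Any real seed generator would also need to spread seeds across communities, which this sketch does not attempt.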

CC BY-NC-ND 4.0

Paper citation in several formats:
Daneshpajouh S.; Mohammadi Nasiri M. and Ghodsi M. (2008). A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET. In Proceedings of the Fourth International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-8111-27-2, pages 98-105. DOI: 10.5220/0001527400980105

@conference{webist08,
author={Shervin Daneshpajouh and Mojtaba {Mohammadi Nasiri} and Mohammad Ghodsi},
title={A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET},
booktitle={Proceedings of the Fourth International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2008},
pages={98-105},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001527400980105},
isbn={978-989-8111-27-2},
}

TY - CONF
JO - Proceedings of the Fourth International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET
SN - 978-989-8111-27-2
AU - Daneshpajouh, S.
AU - Mohammadi Nasiri, M.
AU - Ghodsi, M.
PY - 2008
SP - 98
EP - 105
DO - 10.5220/0001527400980105