A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping

Stefan Huber, Fabio Knoll, Mario Döller

2022

Abstract

Web scraping is a widely used technique for extracting unstructured data from different websites and transforming it into a unified, structured form. Due to the nature of the WWW, long-term and continuous web scraping is a volatile and error-prone endeavor, and setting up a reliable extraction procedure comes with various challenges. In this paper, a system design and implementation for a pipeline-oriented approach to web scraping is proposed. The main goal of the proposal is to establish fault-tolerant execution of web scraping tasks with proper error handling strategies in place. As errors are prevalent in web scraping, logging and error replication procedures are part of the processing pipeline. These mechanisms allow web scraper implementations to be adapted effectively to evolving website targets. An implementation of the system was evaluated in a real-world case study, in which thousands of web pages were scraped and processed on a daily basis. The results indicated that the system allows reliable, long-term web scraping endeavors to be operated effectively.
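
The idea of a pipeline with built-in logging and error replication can be illustrated with a minimal sketch. It is not the authors' implementation: the names `run_pipeline`, `FailedItem`, `extract_price` and `normalize_price` are hypothetical, and the assumption is simply that each page passes through a list of stages, that a failure in one page must not stop the others, and that the failing stage's input is stored so the error can be reproduced later.

```python
import logging
from dataclasses import dataclass, field
from typing import Any, Callable

logger = logging.getLogger("scrape_pipeline")

# A stage is any callable that takes the previous stage's output.
Stage = Callable[[Any], Any]


@dataclass
class FailedItem:
    """Everything needed to reproduce a failure later (hypothetical record)."""
    item_id: str
    stage_name: str
    stage_input: Any
    error: str


@dataclass
class PipelineResult:
    succeeded: dict = field(default_factory=dict)
    failed: list = field(default_factory=list)


def run_pipeline(items: dict, stages: list) -> PipelineResult:
    """Run every item through all stages; isolate and record failures."""
    result = PipelineResult()
    for item_id, payload in items.items():
        current = payload
        try:
            for stage in stages:
                stage_input = current  # kept so the failure can be replayed
                current = stage(stage_input)
        except Exception as exc:
            logger.warning("item %s failed in %s: %r", item_id, stage.__name__, exc)
            result.failed.append(
                FailedItem(item_id, stage.__name__, stage_input, repr(exc))
            )
            continue  # fault tolerance: carry on with the next item
        result.succeeded[item_id] = current
    return result


# Two illustrative stages (assumed markup and price format).
def extract_price(html: str) -> str:
    marker = '<span class="price">'
    start = html.index(marker) + len(marker)  # ValueError if markup changed
    return html[start:html.index("</span>", start)]


def normalize_price(raw: str) -> float:
    return float(raw.replace("€", "").replace(",", ".").strip())


pages = {
    "a": '<span class="price">12,50 €</span>',
    "b": '<div class="cost">12,50 €</div>',  # simulates a changed website
}
result = run_pipeline(pages, [extract_price, normalize_price])
```

In this sketch, page "a" is processed normally, while page "b" ends up in `result.failed` together with the exact input that broke `extract_price`. Once the extractor has been adapted to the new markup, that stored input can be fed to it again to confirm the fix.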


Paper Citation


in Harvard Style

Huber S., Knoll F. and Döller M. (2022). A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping. In Proceedings of the 17th International Conference on Software Technologies - Volume 1: ICSOFT, ISBN 978-989-758-588-3, pages 441-448. DOI: 10.5220/0011275100003266


in Bibtex Style

@conference{icsoft22,
author={Stefan Huber and Fabio Knoll and Mario Döller},
title={A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping},
booktitle={Proceedings of the 17th International Conference on Software Technologies - Volume 1: ICSOFT},
year={2022},
pages={441--448},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011275100003266},
isbn={978-989-758-588-3},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 17th International Conference on Software Technologies - Volume 1: ICSOFT
TI - A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping
SN - 978-989-758-588-3
AU - Huber S.
AU - Knoll F.
AU - Döller M.
PY - 2022
SP - 441
EP - 448
DO - 10.5220/0011275100003266