A Pipeline-oriented Processing Approach to Continuous and Long-term

Web Scraping

Stefan Huber

, Fabio Knoll

and Mario D

oller

University of Applied Sciences Kufstein, Andreas Hofer-Straße 7, 6330 Kufstein, Austria

Keywords:

Web Scraping, Data Pipelines, Fault-tolerant Execution.

Abstract:

Web scraping is a widely-used technique to extract unstructured data from different websites and transform it

into a uniﬁed and structured form. Due to the nature of the WWW, long-term and continuous web scraping

is a volatile and error-prone endeavor. The setup of a reliable extraction procedure comes along with various

challenges. In this paper, a system design and implementation for a pipeline-oriented approach to web scraping

is proposed. The main goal of the proposal is to establish a fault-tolerant execution of web scraping tasks

with proper error handling strategies set in place. As errors are prevalent in web scraping, logging and error

replication procedures are part of the processing pipeline. These mechanisms allow for effectively adapting

web scraper implementations to evolving website targets. An implementation of the system was evaluated in a

real-world case study, where thousands of web pages were scraped and processed on a daily basis. The results

indicated that the system allows for effectively operating reliable and long-term web scraping endeavors.

1 INTRODUCTION

Knowledge workers typically spend a substantial part

of their working time with searching and processing

information published on the WWW. As new infor-

mation is published frequently, clerks have to regu-

larly monitor all relevant sources and extract impor-

tant information as daily routines. These routines

are cumbersome, time-consuming and potentially de-

crease work productivity. Furthermore, the increasing

amount of data is overwhelming for manual process-

ing and important information might be missed.

A widely-used technique to automatically ex-

tract unstructured data from the WWW and trans-

form it into a uniﬁed and structured form is web

scraping (Saurkar et al., 2018). A web scraper is

a program, which is speciﬁcally targeted to extract

data from certain web pages, by making use of the

HTML-structure, CSS-selectors, XPATH-expressions

and regular expressions (Khder, 2021). Web scrap-

ers are commonly built by employing speciﬁc tools,

such as Scrapy or BeatifulSoup (Mitchell, 2018). The

drawback of these approaches is, its static behavior in

case of changes in the structure of targeted web pages.

https://orcid.org/0000-0001-7229-9740

https://orcid.org/0000-0002-9368-8056

https://orcid.org/0000-0002-9716-564X

As websites are not primarily built to being

scraped, implementing a web scraper comes along

with several challenges and is considered an iterative

and lengthy process (Landers et al., 2016). Typically,

aspects such as pagination, dynamic content or access

barriers are common obstacles (Meschenmoser et al.,

2016) which need to be handled. Especially, due to

networking issues or adaptations in the website struc-

ture, failures in the extraction process, should be ex-

pected in continuous and long-term web scraping.

State of the art pipeline-oriented data process-

ing architectures, such as stream processing (Noghabi

et al., 2017; Lin et al., 2016) or ETL-pipelines (Simit-

sis et al., 2010) offer desirable execution attributes,

which would enhance the reliability of web scrap-

ing tasks. Inspired by these ﬁndings, in this paper,

a pipeline-oriented system design for web scraping is

presented and also evaluated on a real-world use case.

The paper start with an overview of challenges in

web scraping in Section 2. Section 3 derives require-

ments for a reliable web scraping system. Section 4

describes the proposed software design and Section 5

introduces a case study for evaluating the proposed

architecture in a real-world use case. In Section 6 the

approach is reﬂected in the form of a discussion. The

paper is ended by an overview of related work in Sec-

tion 7 and conclusions in Section 8.

Huber, S., Knoll, F. and Döller, M.

A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping.

DOI: 10.5220/0011275100003266

In Proceedings of the 17th International Conference on Software Technologies (ICSOFT 2022), pages 441-448

ISBN: 978-989-758-588-3; ISSN: 2184-2833

441

2 CHALLENGES FOR RELIABLE

WEB SCRAPING

A web scraper is typically targeted to extract data

from a speciﬁc web page structure (Dongo et al.,

2020). Therefore, a sample set of web pages, from

a distinct source, are selected to form a test dataset

(Landers et al., 2016) for the implementation. For

the extraction procedure it is required, that prospec-

tive web pages, generally conform to the structure of

the web pages in the test dataset.

Studies on the evolution of the WWW have shown

that from a content perspective web pages undergo

changes at a rather rapid turnover rate (Ntoulas et al.,

2004). Additionally, from a structural perspective, the

rate of change, when considering the markup (Fetterly

et al., 2004; Adar et al., 2009), is also rather frequent.

Therefore, when considering long-term and continu-

ous web scraping endeavors, the real or current state

of a web page may not reﬂect the structure in the used

testbed dataset.

Deviations of the current web page structure with

respect to the testbed structure could lead to failures

in the extraction process. These type of failures we

termed representational issues. In addition, as web

scrapers are operating over the Internet, failures due

to networking issues, need to be considered for a

reliable extraction procedure. Table 1 summarizes

the types of issues and the corresponding mitigation

strategies. The rest of this section describes the issues

in more detail.

We classiﬁed structural deviations from the expec-

tation as either major or minor. A major structural

change would typically result after a redesign of the

website. Such changes are rather seldom and most

likely would require a complete rewrite of the web

scraper. Minor structural changes on the other hand

could happen more often, especially on popular web-

sites. These changes result from adding new features

(e.g., user comments) to a website or by restructured

parts of a website (e.g., based on the results of A/B

testing). As the basic structure of the website stays

the same, such changes would only require an adap-

tation of the web scraper.

The information presented inside web pages is

typically created by humans. Depending on the val-

idation capabilities of the input systems, some entry

ﬁelds might be omitted (e.g., a missing e-mail ad-

dress) or provided with idiosyncratic formatting (e.g.,

date or phone number formatting) by content editors.

This may lead to missing or ill-formatted data. A web

scraper could integrate a certain tolerance for pars-

ing ill-formatted inputs. For missing data, also default

values could be used. As unpredictable formatting is-

sues could arise over time, a continuous adaptation of

the scraper might be necessary. In case of a lower

importance of a speciﬁc data input, an ill-formatted

value might even be ignored by the web scraper.

Finally, as web scrapers are operating over the

Internet various network issues may occur. For in-

stance, these can be internal or external network out-

ages, temporary unavailability, SSL connection er-

rors, blockages of various causes, website mainte-

nance or network timeouts due to congestion. A web-

site could even be moved to another location or in-

troduce an authentication mechanism. On an abstract

level, we distinguished networking issues into tempo-

rary and non-temporary. The former type could po-

tentially be resolved either by a delayed retry or by a

high timeout threshold (Liu and Menczer, 2011). The

later type would require adaptation of the web scraper

implementation.

3 REQUIREMENTS FOR

RELIABLE WEB SCRAPING

In consideration of the challenges introduced in the

previous section, several requirements have been de-

rived, which must be fulﬁlled in order to achieve a

reliable web scraping system.

3.1 Fault-tolerance

The system should be tolerant to errors, which occur

within the execution of a web scraping task. An error

must not lead to the termination of the entire program

ﬂow, but only skip the web scraping task in which the

error occurred. For instance, a formatting issue on a

single web page that leads to a parsing error should

not affect the scraping procedure of the whole web-

site.

3.2 Error-handling Strategies

For a web scraping system, which is daily operating

on thousands of web pages, errors should be expected.

The system should be able to identify and classify er-

rors. Based on the error type, different error-handling

or mitigation strategies could be executed. E.g., a

temporary networking issue, due to a maintenance

downtime of a website, should be detectable as a type

of error by the system. A correctly conﬁgured web-

server returns an HTTP 503 status code in such a sit-

uation. An error-handling strategy, in such a case,

could be a retry of the web scraping task after a cer-

tain timeout.

ICSOFT 2022 - 17th International Conference on Software Technologies

442

Table 1: Types of web scraping issues and mitigation strategies.

web scraping issue mitigation strategy

representational issue major structural issue web scraper rewrite

minor structural issue web scraper adaptation

ill-formatting issue tolerance, ignore or adapt

networking issue temporal delayed retry or timeout threshold adaptation

non-temporal web scraper adaptation

3.3 Monitoring and Logging

The system should include an effective monitoring

procedure, which would on the one hand allow to

check the successfully executed web scraping tasks

and on the other hand logs tasks which resulted in an

error. Moreover, in the case of an error a correspond-

ing error context should be stored. This data would

consist of the web page, if it was accessible, and cor-

responding contextual meta data, such as the access

time, HTTP headers and status codes. The error con-

text, which produced the error, should be stored in a

form that can be used as input to a test case for the

speciﬁc web scraper implementation. This would al-

low the maintainers of the system to replicate the error

and adjust the web scraping task. Also, an analysis of

the error could result in adapting the error-handling

strategies of the system. A new type of error could

result in the identiﬁcation of a new mitigation strat-

egy. As a whole, the monitoring data should be us-

able to continuously improve the system and adapt to

changes in the web scraping targets.

4 SYSTEM DESIGN

The system design was derived from the previously

introduced requirements and is depicted in Fig. 1.

This section describes design principles and compo-

nents of the system.

4.1 Job Granularity

A web scraping task for an entire website is divided

into several ﬁne-granular job implementations. The

granularity of a job implementation is guided by the

ability of error detection and maintainability. An anal-

ogy could be derived from end-to-end testing of web-

applications. The page object design pattern (Leotta

et al., 2013) is an empirically validated concept for

maintainable end-to-end test suites, as such test suites

face similar issues as web scrapers regarding struc-

tural changes of websites.

A popular and frequently used design pattern in

web development is the master-detail pattern (Molina

et al., 2002). For a website that implements this pat-

tern, one job could be developed for scraping the links

to the detail pages from the master page and another

job could be developed for scraping the data from

the individual detail pages. Additional complexities,

such as downloading and extracting data from linked

and related documents, could be also implemented by

encapsulating each sub-task in its own corresponding

job implementation.

A ﬁne-granular division of an entire web scraping

task into jobs has several advantages. It ensures pre-

cise root cause identiﬁcation of an error, since it is evi-

dent in which part of the task the error has occurred. It

enables the desired property of fault-tolerance. In the

case of a scraping error on a single web page (e.g., due

to a format parsing issue), only a single job fails and

not the entire scraping task for the website. Also, in

consideration of the development workﬂow of a large

web scraping endeavor, each job can be developed

and tested in isolation.

4.2 Event-driven Workﬂow

Jobs are implemented loosely coupled with no direct

dependencies to other jobs. A job triggers a generic

event after its execution. The corresponding process-

ing result of the job is attached as payload to the event

and processed by the Pipeline Flow Manager compo-

nent of the system.

This components couples the different jobs as an

event-driven workﬂow. Based on the job events, cor-

responding successor jobs are selected and started.

The selection of successor jobs is based on the pay-

load of the preceding job.

The ﬂow logic of the jobs is deﬁned based on

the event-condition-action (ECA) metaphor (Kappel

et al., 2000). This approach allows for adapting and

extending the execution pipeline without modifying

existing job implementations.

4.3 Parallelization

The job execution is handled in the form of a FIFO

job queue. Worker processes on the corresponding

machine take open jobs from the queue for process-

A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping

443

Figure 1: Overview of the system design.

ing. As jobs are loosely-coupled, worker processes

can operate in parallel. The processing time of the en-

tire pipeline can be regulated by the amount of work-

ers in place. Depending on the workload of the web

scraping endeavor, jobs can be distributed on different

machines.

4.4 Error Handling

To keep the job implementations lightweight and to

avoid the repetition of implementing similar error

handling procedures in different jobs, the entire error

handling logic is organized in the separate Pipeline

Exception Manager component. An error inside a

web scraping job should raise an exception based on

a deﬁned exception hierarchy. The handling of a

speciﬁc exception happens then by an error handling

strategy selected in the Pipeline Exception Manager.

Similar to the Pipeline Flow Manager, a conﬁg-

uration in the form of ECA-rules allows the system

to select a corresponding error handling strategy. De-

pending on the type of error, either an automatic or

a manual recovery can be carried out. An automatic

recovery will create a new job corresponding to the

exception and will dispatch this job via the Pipeline

Flow Manager.

A generic form of a error handling strategy is a

delayed retry of the same job, which raised the ex-

ception. This strategy could be employed to handle

temporary issues. Also, more sophisticated strategies

are conceivable, which e.g. adapt the inputs to a cer-

tain job or select an alternative job, which might be

able to handle the task more appropriately.

For more severe issues, which cannot be resolved

autonomously, the system will notify a correspond-

ing user. Serious errors can often only be handled by

system administrators or developers, whereas more

harmless errors can also be solved by clerks working

with the system. Ideally, there is a deﬁned hierarchy

in the organization for handling errors. For the case

study in Section 5.1, such an organizational setting is

described.

As additional metadata to a raised exception a

PipelineExceptionContext is attached. This con-

text is speciﬁc for each exception type and provides

data for debugging and reproducing the error. E.g.,

for a speciﬁc scraping job which produced an error,

the entire web page is stored as part of the error con-

text. The stored website can be used to generate in-

put for corresponding test cases to adapt or extend the

web scraper job implementation.

5 EVALUATION

The proposed system was evaluated in a real-world

case study (Section 5.1). Implementation details

(Section 5.2) and results on the error handling (Sec-

tion 5.3) are described in this section.

5.1 Case Study

The case study was developed with a company op-

erating in the construction business within the EU.

The company is regularly participating in bidding

procedures for new projects. Within the EU, public

institutions must select suppliers in the form of an

open competitive bidding procedure if the procure-

ment value exceeds a certain threshold.

Particularly in the construction sector, for smaller

projects, municipalities or companies also use a

bidding procedures to select appropriate suppliers.

Therefore, tenders are published by companies and

governments on different platforms and in different

formats on the WWW to advertise for potential bid-

ders.

A common platform for the publication of tenders

is TED

. However, many states, cities or municipali-

ties offer their own portals for publishing such tenders

and within the various platforms tenders are published

in highly heterogeneous formats.

https://ted.europa.eu

ICSOFT 2022 - 17th International Conference on Software Technologies

444

Figure 2: Overview of job processing pipeline of the case study.

Clerks working in the sales department of the

company screen sources on the WWW as a daily

routine. From the search interfaces of the portals,

they collect relevant tenders for the different depart-

ments in the company. The corresponding data is ex-

tracted and entered into a company internal informa-

tion system. This is a rather cumbersome and time-

consuming endeavor for the clerks.

To automatize the whole process of collecting,

extracting, classifying and storing tenders from the

WWW, an implementation of the proposed system

design was set in place. An overview of the respec-

tive processing pipeline is given in Fig. 2. In addition

to the web scraping jobs, also two machine learning

based jobs for doublet detection and tender classiﬁca-

tion are part of the pipeline, this paper only focuses

on web scraping aspects. A terminal job, which ex-

ports positively classiﬁed tenders into the company’s

information system, is also part of the pipeline.

Within the case study, tenders from 16 different

web portals are continuously scraped. This results

in up to 2000 web pages which are scraped and pro-

cessed on a daily basis.

5.2 Implementation Details

The pipeline for the case study was implemented in

the programming language PHP using the Laravel

web framework. The framework offers a lightweight

and ﬂexible queue and job implementation, which fa-

cilitates the basis for the proposed pipeline-oriented

system design.

Among the websites to be scraped for the case

study, many implemented the master-detail design

pattern, which resulted in consistent pipeline struc-

tures. For each master page, a corresponding master

job was developed. The master job extracts primar-

ily the links to the detail pages. Similarly, a separate

job was developed for the detail pages, which then

extracts detailed data about the tenders.

https://laravel.com

The structure of the various web pages varied

widely, ranging from unstructured and rather poorly

organized HTML pages to well-structured XML or

JSON ﬁles. Some websites are also implemented

with JavaScript rendered content and reloaded parts of

the web page dynamically. Therefore such web page

were rendered inside the headless web browser engine

Puppeteer

before starting the extraction procedure.

5.3 Error Handling Strategies

Within the case study automatic and manual proce-

dures for error handling were set in place. In addition

to the actual error handling procedures, an organiza-

tional structure in the form of hierarchical roles was

established. The roles clerk, administrator and devel-

oper were deﬁned.

As an automatic error handling procedure a de-

layed retry was used for jobs, which failed due to net-

work issues. A job is retried twice, ﬁrstly with a delay

of one hour and secondly with a delay of ﬁve hours. If

the second retry failed, the failure is notiﬁed to an ad-

ministrator, which tries to analyze and ﬁx the problem

or escalate the issue to a developer.

Furthermore, for other types of errors, such as any

kind of representational or unclassiﬁed errors, an or-

ganizational procedure is deﬁned. On a daily basis,

the exceptions for each website are counted and if

the exceptions are below a deﬁned threshold, which

is conﬁgured for each website, the exceptions are no-

tiﬁed to clerks. The clerks can check the web pages,

which produced the error, and either extract the data

manually or ignore the web page. The web pages are

also logged and can be analyzed by the developer at a

later stage.

If the exception count is above a certain threshold,

an administrator is notiﬁed. A high amount of excep-

tions for a certain website indicates a systematic error.

The administrator can analyze the problem and either

trigger a manual retry or escalate the problem to the

https://pptr.dev/

A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping

445

developer. A developer could do a deeper analysis

of the problem, based on the stored error context and

could integrate necessary changes to adapt the imple-

mentation.

5.4 Web Scraping Results

For demonstrating the operation of the system a typi-

cal 5 week period from June 2021 to July 2021 is doc-

umented and respective results are displayed in Fig. 3,

Fig. 4 and Fig. 5. An administrator would monitor the

system with similar diagrams.

The daily amount of successfully scraped web

pages is displayed in Fig. 3. The largest amount

of successful scrapings typically happens on Fridays.

On weekends only a few tenders are published, which

results in a low amount of scraped web pages. Fig. 4

gives an overview of the amount of daily scraping er-

rors. In the following list, notable incidents are sum-

marized:

• At the beginning of Fig. 4 a higher error count

was observable. This was caused by a redesign of

a target website. The web scraper needed to be

largely rewritten, which took several days of iter-

ative implementation and testing work. This re-

sulted in a higher amount of successfully scraped

web pages after the rewrite of the web scraper, as

unsuccessfully scraped tenders from the past have

been caught up with the updated job implementa-

tion.

• On 07/13 a new type of tender was published on

one of the websites. The new type had a different

formatting for the tender title on the web page.

More than 20 tenders could not be successfully

scraped based on this formatting change. The im-

plementation of the web scraping job could be

adapted promptly by a developer and this type of

error was solved.

• For most days several errors with unsystematic

causes happen. Primarily, the errors originate

from ill-formatted or missing data. Such errors

often fall into the unclassiﬁed error category. If

there is any pattern detectable in such errors a

developer could adapt the respective implementa-

tion. Also a few networking errors did arise, such

as network timeouts or page not found errors dur-

ing the observed period.

The distribution of the different error types is dis-

played in Fig. 5. The majority of errors belong to the

class of representational errors. These result from in-

puts to web scrapers, which don’t comply with the ex-

pected representation. Within the 5 week period, net-

work errors occurred only at a rather low frequency.

All errors which let a web scraper fail in an unex-

pected way are declared as unclassiﬁed errors. These

types of errors require an investigation by the devel-

opers.

6 DISCUSSION

The proposed system was implemented and used

within a real-world use case. It was demonstrated

that the system is capable of coping with chal-

lenges that typically arise in continuous and long-term

web scraping endeavors. Speciﬁcally, the pipeline-

oriented approach and the ﬁne-grained encapsulation

of sub-tasks into jobs has many merits for the domain

of web scraping.

The system design is based on the assumption that

scraping errors will happen continuously, as the web

scraping targets are evolving. The proposed system

includes proper logging and replication facilities to

enable maintainers of the system to effectively adapt

web scraper implementations to changes in the web

scraping target.

In addition to the technical aspects it is important

to have an organizational structure deﬁned to handle

errors in the web scraping process, which cannot be

solved automatically by the system. For the use case,

it turned out beneﬁciary to also involve non-technical

clerks. For low-severity and seldom issues, such as an

ill-formatted date, a clerk could manually extract the

data and ﬁx the issue quickly. The requirement for

such an approach is, that the exact web page which

produced the error can be pinpointed to the clerk,

which the system’s logging facility is capable of.

Some web pages require the execution of

JavaScript to be rendered properly. For web scrap-

ing this requires rendering the web page using a head-

less browser. A headless browser can be conﬁgured to

block the download of content, which is not required

for the scraping task. This lowers the overhead for the

server and speeds up the scraping process and should

be considered a good practice for web scraping.

The code for web scraper implementations tend to

get rather complex in terms of execution ﬂows and

conditional branching. Encapsulating each type of

web page into it’s own job implementation turned out

to be helpful to keep the code units manageable and

free of unintended side-effects.

For the web scrapers which were implemented in

the case study, the code complexity particularly in-

creased, when text anchors were required for iden-

tifying the relevant parts to extract. Depending on

the structure of web pages CSS-selectors or XPATH-

expressions are sometimes not capable of identifying

ICSOFT 2022 - 17th International Conference on Software Technologies

446

Figure 3: Scraped web pages.

Figure 4: Error count.

Figure 5: Error types.

the part of interest for extraction. Text anchors also

turned out to be rather unreliable, especially for con-

tinuous and long-term usage. Therefore as another

good practice the avoidance of text anchors should

be a high priority, although for some subtle scraping

tasks, text anchors are necessary.

7 RELATED WORK

Web scraping is employed in many use cases to col-

lect and aggregate data from various sources from the

WWW. In a case study for gathering biomedical data,

the authors of (Glez-Pe

na et al., 2014) demonstrated

the usefulness of web scraping to extract data from

different sources. Also in the works of (Bonifacio

et al., 2015) and (Kunang et al., 2018) it was demon-

strated how the application of web scraping for gath-

ering climate and weather data from different sources

could increase the productivity of researchers in the

ﬁeld. Another example of the effectiveness of web

scraping was demonstrated by the Vigi4Med (Audeh

et al., 2017) tool. The tool scrapes forum discus-

sions from more than 22 different medical websites.

The resulting data is used as a complementary source

to study adverse drug reactions. Although all pre-

sented approaches successfully employed web scrap-

ing in a speciﬁc context, the described implementa-

tions follow a rather ad-hoc approach. Issues which

may arise in a long-term and continuous application

of web scraping have not been addressed by the im-

plementations.

Establishing a pipeline-oriented data processing

system involves various challenges in the area of data

cleansing, data collection and data analytics (Pervaiz

et al., 2019), also challenges concerning infrastruc-

ture, organizational barriers and data quality are

prevalent (Munappy et al., 2020). However, desirable

properties such as fault-tolerance, auditability or

error handling and mitigation strategies emerge from

the application of the pipeline paradigm. Especially,

in the context of long-term and reliable web scraping,

such properties are inevitable.

The concept of event-driven pipeline processing

has been successfully employed in the domain of

computer vision workﬂows by the EPypes architec-

ture (Semeniuta and Falkman, 2019). The architec-

ture makes use of the event-driven paradigm in the

context of distributed environments in order to sup-

port evolving computational graphs without sacriﬁc-

ing developer ﬂexibility.

Triggerﬂow (L

opez et al., 2020) is another system,

which makes use of event-based control ﬂow in order

to orchestrate serverless workﬂows. Extensibility and

ﬂexibility are key properties named as advantages of

the Triggerﬂow architecture.

Also in the domain of classical workﬂow manage-

ment systems (WfMS) event-driven processing has

been applied in order to adapt according to evolving

needs. In (Goh et al., 2001) ECA-rules have been pro-

posed to modify workﬂow executions at runtime to

handle deviations in business processes.

8 CONCLUSION

In this paper, a pipeline-oriented system design to the

rather error-prone and volatile endeavor of long-term

and continuous web scraping was envisioned. A web

scraper is typically built based on the assumption that

the web page, which needs to be scraped, conforms

to an expected structure. Due to the fact that websites

are changing rather frequently (Ntoulas et al., 2004;

Fetterly et al., 2004; Adar et al., 2009), web scraper

implementations need to be adapted alongside these

changes.

Our proposed system design is based on the as-

sumption that failures in web scraping are expected.

Fine-granular job implementations and proper log-

ging for reproducing errors allow developers to ef-

fectively adapt implementations to changing scraping

targets. In a real-world case study, we could demon-

strate that our proposed system allows for effectively

operating reliable and long-term web scraping en-

deavors.

A Pipeline-oriented Processing Approach to Continuous and Long-term Web Scraping

447

REFERENCES

Adar, E., Teevan, J., Dumais, S. T., and Elsas, J. L. (2009).

The web changes everything: understanding the dy-

namics of web content. In Proceedings of the Second

ACM International Conference on Web Search and

Data Mining, pages 282–291.

Audeh, B., Beigbeder, M., Zimmermann, A., Jaillon, P., and

Bousquet, C. (2017). Vigi4med scraper: a framework

for web forum structured data extraction and semantic

representation. PloS one, 12(1):e0169658.

Bonifacio, C., Barchyn, T. E., Hugenholtz, C. H., and Kien-

zle, S. W. (2015). Ccdst: A free canadian climate data

scraping tool. Computers & Geosciences, 75:13–16.

Dongo, I., Cadinale, Y., Aguilera, A., Mart

ınez, F., Quin-

tero, Y., and Barrios, S. (2020). Web scraping versus

twitter api: A comparison for a credibility analysis. In

Proceedings of the 22nd International Conference on

Information Integration and Web-Based Applications

& Services, iiWAS ’20, page 263–273, New York, NY,

USA. Association for Computing Machinery.

Fetterly, D., Manasse, M., Najork, M., and Wiener, J. L.

(2004). A large-scale study of the evolution of

web pages. Software: Practice and Experience,

34(2):213–237.

Glez-Pe

na, D., Lourenc¸o, A., L

opez-Fern

andez, H.,

Reboiro-Jato, M., and Fdez-Riverola, F. (2014). Web

scraping technologies in an api world. Brieﬁngs in

bioinformatics, 15(5):788–797.

Goh, A., Koh, Y.-K., and Domazet, D. S. (2001). Eca rule-

based support for workﬂows. Artiﬁcial intelligence in

engineering, 15(1):37–46.

Kappel, G., Rausch-Schott, S., and Retschitzegger, W.

(2000). A framework for workﬂow management sys-

tems based on objects, rules and roles. ACM Comput-

ing Surveys (CSUR), 32(1es):27–es.

Khder, M. A. (2021). Web scraping or web crawling: State

of art, techniques, approaches and application. Inter-

national Journal of Advances in Soft Computing & Its

Applications, 13(3).

Kunang, Y. N., Purnamasari, S. D., et al. (2018). Web scrap-

ing techniques to collect weather data in south sumat-

era. In 2018 International Conference on Electrical

Engineering and Computer Science (ICECOS), pages

385–390. IEEE.

Landers, R. N., Brusso, R. C., Cavanaugh, K. J., and Coll-

mus, A. B. (2016). A primer on theory-driven web

scraping: Automatic extraction of big data from the

internet for use in psychological research. Psycholog-

ical methods, 21(4):475.

Leotta, M., Clerissi, D., Ricca, F., and Spadaro, C. (2013).

Improving test suites maintainability with the page ob-

ject pattern: An industrial case study. In 2013 IEEE

Sixth International Conference on Software Testing,

Veriﬁcation and Validation Workshops, pages 108–

113. IEEE.

Lin, W., Qian, Z., Xu, J., Yang, S., Zhou, J., and Zhou, L.

(2016). Streamscope: continuous reliable distributed

processing of big data streams. In 13th {USENIX}

Symposium on Networked Systems Design and Imple-

mentation ({NSDI} 16), pages 439–453.

Liu, B. and Menczer, F. (2011). Web Crawling, pages 311–

362. Springer Berlin Heidelberg, Berlin, Heidelberg.

opez, P. G., Arjona, A., Samp

e, J., Slominski, A., and Vil-

lard, L. (2020). Triggerﬂow: trigger-based orchestra-

tion of serverless workﬂows. In Proceedings of the

14th ACM International Conference on Distributed

and Event-based Systems, pages 3–14.

Meschenmoser, P., Meuschke, N., Hotz, M., and Gipp, B.

(2016). Scraping scientiﬁc web repositories: Chal-

lenges and solutions for automated content extraction.

D-Lib Magazine, 22(9/10):15.

Mitchell, R. (2018). Web scraping with Python: Collecting

more data from the modern web. ” O’Reilly Media,

Inc.”.

Molina, P. J., Meli

a, S., and Pastor, O. (2002). User inter-

face conceptual patterns. In International Workshop

on Design, Speciﬁcation, and Veriﬁcation of Interac-

tive Systems, pages 159–172. Springer.

Munappy, A. R., Bosch, J., and Olsson, H. H. (2020). Data

pipeline management in practice: Challenges and op-

portunities. In International Conference on Product-

Focused Software Process Improvement, pages 168–

184. Springer.

Noghabi, S. A., Paramasivam, K., Pan, Y., Ramesh,

N., Bringhurst, J., Gupta, I., and Campbell, R. H.

(2017). Samza: stateful scalable stream processing

at linkedin. Proceedings of the VLDB Endowment,

10(12):1634–1645.

Ntoulas, A., Cho, J., and Olston, C. (2004). What’s new

on the web? the evolution of the web from a search

engine perspective. In Proceedings of the 13th inter-

national conference on World Wide Web, pages 1–12.

Pervaiz, F., Vashistha, A., and Anderson, R. (2019). Exam-

ining the challenges in development data pipeline. In

Proceedings of the 2nd ACM SIGCAS Conference on

Computing and Sustainable Societies, pages 13–21.

Saurkar, A. V., Pathare, K. G., and Gode, S. A. (2018). An

overview on web scraping techniques and tools. Inter-

national Journal on Future Revolution in Computer

Science & Communication Engineering, 4(4):363–

367.

Semeniuta, O. and Falkman, P. (2019). Epypes: a

framework for building event-driven data processing

pipelines. PeerJ Computer Science, 5:e176.

Simitsis, A., Wilkinson, K., Dayal, U., and Castellanos, M.

(2010). Optimizing etl workﬂows for fault-tolerance.

In 2010 IEEE 26th International Conference on Data

Engineering (ICDE 2010), pages 385–396. IEEE.

ICSOFT 2022 - 17th International Conference on Software Technologies

448