Capability-based Scheduling of Scientific Workflows in the Cloud
Michel Krämer 1,2 (https://orcid.org/0000-0003-2775-5844)
1 Fraunhofer Institute for Computer Graphics Research IGD, Fraunhoferstr. 5, 64283 Darmstadt, Germany
2 Technical University of Darmstadt, 64289 Darmstadt, Germany
Keywords:
Scientific Workflow Management Systems, Cloud Computing, Distributed Systems, Task Scheduling.
Abstract:
We present a distributed task scheduling algorithm and a software architecture for a system executing scientific
workflows in the Cloud. The main challenges we address are (i) capability-based scheduling, which means
that individual workflow tasks may require specific capabilities from highly heterogeneous compute machines
in the Cloud, (ii) a dynamic environment where resources can be added and removed on demand, (iii) scal-
ability in terms of scientific workflows consisting of hundreds of thousands of tasks, and (iv) fault tolerance
because in the Cloud, faults can happen at any time. Our software architecture consists of loosely coupled
components communicating with each other through an event bus and a shared database. Workflow graphs are
converted to process chains that can be scheduled independently. Our scheduling algorithm collects distinct
required capability sets for the process chains, asks the agents which of these sets they can manage, and then
assigns process chains accordingly. We present the results of four experiments we conducted to evaluate if our
approach meets the aforementioned challenges. We finish the paper with a discussion, conclusions, and future
research opportunities. An implementation of our algorithm and software architecture is publicly available
with the open-source workflow management system “Steep”.
1 INTRODUCTION
Scientific workflow management systems are used in
a wide range of areas including (but not limited to)
Bioinformatics (Oinn et al., 2004), Geology (Graves
et al., 2011), Geoinformatics (Krämer, 2018), and As-
tronomy (Berriman et al., 2004) to automate the pro-
cessing of very large data sets. A scientific workflow
is typically represented by a directed acyclic graph
that describes how an input data set is processed by
certain tasks in a given order to produce a desired out-
come. Such workflows can become very large with
hundreds up to several thousands of tasks processing
data volumes ranging from gigabytes to terabytes.
Distributed computing infrastructures such as the
Grid (Foster and Kesselman, 1998), or the Cloud
(Mell and Grance, 2011) combine the computational
power and storage resources of a large number of in-
dependent machines. They provide a well-suited en-
vironment for the execution of scientific workflows.
In order to make best use of available resources, the
workflow tasks have to be assigned to the machines
in a smart way. While the general task scheduling
problem is known to be NP-complete (Ullman, 1975;
Johnson and Garey, 1979), the topic is of high inter-
est to the research community and several approaches
with varying aims and requirements have been pub-
lished to find practical solutions for the Grid and the
Cloud (Hemamalini, 2012; Singh and Chana, 2016).
In this paper, we present a distributed task
scheduling algorithm and a corresponding software
architecture for a scientific workflow management
system that specifically targets the Cloud. The main
challenge here is that, on the one hand, machines are
highly heterogeneous in terms of hardware, number
of virtual CPUs, main memory, and available stor-
age, but also with regard to installed software, drivers,
and operating systems. On the other hand, the differ-
ent tasks in a scientific workflow also have require-
ments. A compute-intensive task might need a min-
imum number of CPUs or even a graphics process-
ing unit (GPU), whereas another task might require a
large amount of main memory, and a third one needs
a specific software to be installed. In other words, the
machines have certain capabilities and the tasks have
requirements regarding these capabilities (or required
capabilities). This has to be considered during task
scheduling. As we show in Section 2, this concept has
not been fully covered by existing approaches yet.
In addition to the heterogeneity of machines, the
topology of a Cloud is highly dynamic as new com-
pute and storage resources can be added on demand
and removed at any time. This property is often used
to scale a distributed application up when needed (e.g.
to manage peak load or to speed up processing) and
later down again to save resources and, in conse-
quence, money. Of course, scaling up only makes
sense if work can actually be distributed, which is typ-
ically the case when a workflow is very large and con-
tains many tasks that could potentially be executed in
parallel.
Also, it is known that in a distributed environment
(and in a Cloud in particular), faults such as crashed
machines, network timeouts, or missing messages can
happen at any time (Chircu, 2018). This highly af-
fects the execution of scientific workflows, which of-
ten take several hours or even days to complete.
1.1 Challenges and Requirements
To summarise the above, a scientific workflow man-
agement system running in the Cloud has to deal with
at least the following major challenges:
Capability-based Scheduling. Workflow tasks re-
quire different capabilities from the machines but, in
contrast, the infrastructure is highly heterogeneous.
Dynamic Environment. The execution environment
is highly dynamic and new compute resources can be
added and removed on demand.
Scalability. Scientific workflows can become very
large and may contain hundreds of thousands of tasks.
Fault Tolerance. In a distributed system, faults can
occur at any time.
From these challenges, we derive specific require-
ments that our scheduling algorithm and the software
architecture of our scientific workflow management
system should satisfy:
REQ 1. The algorithm should be able to assign tasks
to heterogeneous machines, while matching the ca-
pabilities the tasks need with the capabilities the ma-
chines provide.
REQ 2. Our system should not assume a static num-
ber of machines. Instead, it should horizontally scale
the workflow execution to new machines added to the
cluster and be able to handle machines being removed
(be it because a user or a service destroyed the ma-
chine or because of a fault).
REQ 3. If necessary, the execution of workflow
tasks that require capabilities currently not available
in the cluster should be postponed. The overall work-
flow execution should not be blocked. The algorithm
should continue with the remaining tasks and resched-
ule the postponed ones as soon as machines with the
required capabilities become available.
REQ 4. The system should support rapid elasticity.
This means it should automatically trigger the acqui-
sition of new machines on demand (e.g. during peak
load or when capabilities are missing).
REQ 5. The system should be scalable so it can man-
age workflows with a large number of tasks.
REQ 6. As faults can happen at any time in a dis-
tributed environment, our system should be able to
recover from those faults and automatically continue
executing workflows.
1.2 Contributions
The main contribution of this paper is our scheduling
algorithm that is able to assign workflow tasks to het-
erogeneous machines in the Cloud based on required
capability sets.
In addition, we present a software architecture of
a scientific workflow management system our algo-
rithm is embedded in. We describe a set of compo-
nents that communicate with each other through an
event bus and a database to perform task scheduling
in a scalable and fault-tolerant manner.
The remainder of this paper is structured as fol-
lows. We first analyse the state of the art in Section 2
and describe the research gap our work bridges. In
Section 3, we introduce an approach to map scien-
tific workflow graphs dynamically to individual pro-
cess chains (i.e. linear sequences of workflow tasks),
which can be treated independently by our schedul-
ing algorithm. After this, we present the software ar-
chitecture in Section 4 and finally our main contribu-
tion, the scheduling algorithm, in Section 5. In Sec-
tion 6, we also present the results of four experiments
we conducted to evaluate if our approach meets the
challenges and requirements defined above. We fin-
ish the paper in Section 7 with conclusions and future
research opportunities.
An implementation of our scheduling algorithm
and the software architecture is publicly available
with the Steep Workflow Management System, which
has recently been released under an open-source li-
cence on GitHub: https://steep-wms.github.io/
2 RELATED WORK
Task scheduling can be performed in various ways.
Many algorithms employ heuristics to optimise re-
source usage and to reduce the makespan, i.e. the time
passed between the start of the first task in a sequence
and the end of the last one. Min-Min and Max-Min
(Ibarra and Kim, 1977; Freund et al., 1998), for exam-
ple, iterate through all tasks in the sequence and cal-
culate their earliest completion time on all machines.
Min-Min schedules the task with the minimum earli-
est completion time while Max-Min selects the task
with the maximum one. This process continues until
all tasks have been processed.
The algorithm Sufferage, in contrast, is able to
reassign a task from machine M to another one if
there is a second task that would achieve better per-
formance on M (Maheswaran et al., 1999). Casanova
et al. present a heuristic called XSufferage that ex-
tends Sufferage and also considers data transfer cost
(Casanova et al., 2000). They claim that their ap-
proach leads to a shorter makespan because of pos-
sible file reuse. Gherega and Pupezescu improve
this algorithm even further and present DXSufferage,
which is based on the multi-agent paradigm (Gherega
and Pupezescu, 2011). Their approach prevents the
heuristic itself from becoming a bottleneck in the
scheduling process.
The algorithms mentioned above are optimised
for certain situations. In contrast, genetic algorithms
(GA) are able to automatically adapt to changing con-
ditions. A GA mimics the process of natural evolu-
tion. It uses historical information to select the best
mapping of tasks to machines. Good results with GA,
for example, were achieved by Hamad and Omara
who use Tournament Selection (Hamad and Omara,
2016) or by Page and Naughton whose algorithm
does not make assumptions about the characteristics
of tasks or machines (Page and Naughton, 2005).
There are other approaches that apply behaviour
known from nature to the task scheduling problem.
Ant colony optimisation algorithms (Tawfeek et al.,
2013; Li et al., 2011), for example, try to dynam-
ically adapt scheduling strategies to changing envi-
ronments. Thennarasu et al. present a scheduler that
mimics the behaviour of humpback whales to maximise work completion and to meet deadline and bud-
get constraints (Thennarasu et al., 2020).
Besides scheduling algorithms that process in-
dividual tasks, there are more complex ones that
consider the interdependencies in a scientific work-
flow and try to find optimal solutions by analysing
the directed graph in total. Blythe et al. investi-
gate the difference between task-based approaches
like the ones mentioned above and the more com-
plex workflow-based approaches (Blythe et al., 2005).
They conclude that data-intensive applications benefit
from workflow-based approaches because the work-
flow system can start to transfer data before it is used
by the tasks, which leads to optimised resource usage.
Binato et al. present such a workflow-based ap-
proach using a greedy randomized adaptive search
procedure (GRASP) (Binato et al., 2002). Their algo-
rithm creates multiple scheduling solutions iteratively
and then selects the one that is expected to perform
best. Topcuoglu et al. present two algorithms: HEFT
and CPOP (Topcuoglu et al., 2002). HEFT traverses
the complete workflow graph and calculates priorities
for individual tasks based on the number of succes-
sors, average communication costs, and average com-
putation costs. CPOP extends this and prioritises crit-
ical paths in workflow graphs.
There are a number of distributed scientific work-
flow management systems that typically implement
one or more of the algorithms mentioned above.
Examples are Pegasus (Deelman et al., 2015), Ke-
pler (Altintas et al., 2004), Taverna (Hull et al., 2006),
Galaxy (Giardine et al., 2005), Airflow (Apache Air-
flow, 2020), and Nextflow (Di Tommaso et al., 2017).
Other frameworks that can process large data sets
in the Cloud are Spark (Zaharia et al., 2010) and
Flink (Carbone et al., 2015). They are not workflow
management systems but follow a similar approach
and also need to schedule tasks from a directed graph.
2.1 Research Gap
In the scientific community, dynamically changing
environments, very large workflows, and fault tol-
erance are considered major challenges for modern
distributed scientific workflow management systems,
which have not been fully covered by existing ap-
proaches yet and therefore offer many research op-
portunities (Deelman et al., 2018). In Section 1, we
discussed these challenges and added another major
one, namely that tasks in a scientific workflow need
certain capabilities from the machines but the Cloud
is highly heterogeneous.
A system that addresses all four of these chal-
lenges needs to be designed from the ground up with
them in mind. None of the approaches, algorithms,
and systems mentioned above (and to the best of our
knowledge, no other existing work) cover all of them
in one algorithm design. In this paper, we present
such an algorithm as well as the software architec-
ture it is embedded in.
There are similarities, however, between our ap-
proach and existing ones. DXSufferage, for exam-
ple, uses the multi-agent paradigm (Gherega and Pu-
pezescu, 2011). Similar to agents, our components
are independent and communicate with each other
through an event bus. There can be multiple sched-
ulers sharing work and processing the same workflow.
In addition, we convert workflow graphs to process
chains, which group tasks with the same required ca-
pabilities and common input/output data. Just like in
XSufferage (Casanova et al., 2000), this can poten-
tially lead to better file reuse.
Note that, in this paper, we focus on the afore-
mentioned Cloud challenges. Evaluating different
scheduling heuristics to achieve the shortest make-
span is beyond the scope of this paper and remains
for future work.
Also note that our approach is not directly com-
parable to workflow-based scheduling algorithms that
consider the graph in total. Instead, we employ a hy-
brid strategy that first splits the graph into process
chains and then schedules these instead of individual
tasks.
3 WORKFLOW SCHEDULING
As described above, a scientific workflow is typi-
cally represented by a directed graph that describes
in which order certain tasks need to be applied to an
input data set to produce a desired outcome. Figure 1a
shows a simple example of such a workflow in the ex-
tended Petri Net notation proposed by van der Aalst
and van Hee (2004). In this example, an input file is
first processed by a task A. This task produces two
results. The first one is processed by task B whose
result is in turn sent to C. The second result of A is
processed by D. The outcomes of C and D are finally
processed by task E.
In order to be able to schedule such a workflow
in a distributed environment, the graph has to be
transformed to individual executable units. Our sci-
entific workflow management system follows a hy-
brid scheduling approach that applies heuristics on the
level of the workflow graph and later on the level of
individual executable units. We assume that tasks that
access the same data should be executed on the same
machine to reduce the communication overhead and
to improve file reuse. We therefore group tasks into
so-called process chains, which are linear sequential
lists (without branches and loops).
While grouping the tasks, we also need to take
the capabilities into account that they require from the
machines in the Cloud. In our implementation, capa-
bilities are user-defined strings. For example, the set
{Ubuntu, GPU} might mean that a task depends
on the Linux distribution Ubuntu as well as the presence of a graphics processing unit. We call the union of the required capabilities of all tasks in a process chain a required capability set.

Figure 1: A workflow is split into four individual process chains in three iterations. (a) Workflow graph. (b) Generated process chains.
Our system transforms workflows to process
chains in an iterative manner. In each iteration, it finds
the longest linear sequences of tasks with similar re-
quired capabilities and groups them to process chains
(see Figure 1b). For our example workflow, task A
will be put into a process chain in iteration 1. Our sys-
tem then schedules the execution of this process chain
according to the algorithm from Section 5. After the
execution has finished, the system uses the results to
produce a process chain containing B and C (assum-
ing they require similar capabilities) and another one
containing D. These process chains are then sched-
uled to be executed in parallel. The results are finally
used to generate the fourth process chain containing
task E, which is also scheduled for execution.
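To make the grouping concrete, here is a minimal sketch in Kotlin of how a process chain and its required capability set could be represented. The type and property names are illustrative assumptions and not taken from Steep's actual data model; capabilities are plain strings, as described above.

// Illustrative sketch only (not Steep's data model): a process chain is a
// linear list of tasks, and its required capability set is the union of the
// capabilities required by the individual tasks.
data class Task(val id: String, val requiredCapabilities: Set<String>)

data class ProcessChain(val tasks: List<Task>) {
    val requiredCapabilities: Set<String> =
        tasks.flatMap { it.requiredCapabilities }.toSet()
}

fun main() {
    // The chain from iteration 2 that groups tasks B and C
    val bc = ProcessChain(listOf(
        Task("B", setOf("Ubuntu")),
        Task("C", setOf("Ubuntu", "GPU"))
    ))
    println(bc.requiredCapabilities) // prints [Ubuntu, GPU]
}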
4 SOFTWARE ARCHITECTURE
Figure 2 shows the main components of our scien-
tific workflow management system: the HTTP server,
the controller, the scheduler, the agent, and the cloud
manager. Together, they form an instance of our sys-
tem. In practice, a single instance typically runs on
a separate virtual machine, but multiple instances can
also be started on the same machine. Each component
can be enabled or disabled in a given instance. That means, in a cluster, there can be instances that have all five components enabled, and others that have only an agent, for example.

Figure 2: An overview of the components in our scientific workflow management system and how they communicate with each other.
All components of all instances communicate with
each other through messages sent over an event bus.
Further, the HTTP server, the controller, and the
scheduler are able to connect to a shared database. In
the following, we describe the roles and responsibili-
ties of each component.
The HTTP server provides information about
scheduled, running, and finished workflows to clients.
Clients can also upload a new workflow. In this case,
the HTTP server puts the workflow into the database
and sends a message to one of the instances of the
controller.
The controller receives this message, loads the
workflow from the database, and starts transforming it
iteratively to process chains as described in Section 3.
Whenever it has generated new process chains, it puts
them into the database and sends a message to all in-
stances of the scheduler.
The schedulers then apply our algorithm (see Sec-
tion 5) and select agents to execute the process chains.
The schedulers load the process chains from the
database, send them via the event bus to the selected
agents for execution, and finally write the results into
the database. The schedulers also send a message
back to the controller so it can continue with the next
iteration and generate more process chains until the
workflow has been completely transformed.
In case a scheduler does not find an agent suitable
for the execution of a process chain, it sends a mes-
sage to the cloud manager (a component that interacts
with the API of the Cloud infrastructure) and asks it
to create a new agent.
Note that messages between the HTTP server, the
controller, and scheduler may get lost (e.g. because of
network failures). Due to this, the controller and the
scheduler also check the database for new workflows
and process chains respectively at a regular interval.
We found 20 seconds to be a sensible value in prac-
tice, but in our implementation, this is configurable.
This approach decouples the components from each
other and increases fault tolerance.
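As an illustration of this polling, the following Kotlin sketch triggers a lookup at a configurable interval. The function name and structure are assumptions made for this illustration; only the configurable interval and its 20-second default stem from the text above.

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.delay
import kotlinx.coroutines.isActive
import kotlinx.coroutines.launch

// Periodically trigger a scheduling lookup so that lost event-bus messages
// are compensated. The interval is configurable (20 seconds by default).
fun CoroutineScope.startPeriodicLookup(
    intervalMillis: Long = 20_000,
    lookup: suspend () -> Unit
) = launch {
    while (isActive) {
        lookup()
        delay(intervalMillis)
    }
}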
5 CAPABILITY-BASED
SCHEDULING ALGORITHM
In this section, we present the capability-based
scheduling algorithm that is executed in our sched-
uler component. We first describe the main schedul-
ing function (Section 5.1) and then discuss how our
algorithm selects candidate agents (Section 5.2).
5.1 Main Scheduling Function
Listing 1 shows the main function of our schedul-
ing algorithm. As mentioned above, the scheduler
calls this function at regular intervals and immedi-
ately after new process chains have been added to the
database.
At the beginning, our algorithm calls
findRequiredCapabilitySets(). This function per-
forms a database query to retrieve all distinct sets of
capabilities required to execute the process chains not
scheduled yet. In other words, given a capability set
R_i = {c_1, ..., c_n} for a process chain pc_i, the result of findRequiredCapabilitySets() is a set S = {R_1, ..., R_m} of distinct required capability sets.
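The following in-memory sketch in Kotlin shows the same semantics, reusing the hypothetical ProcessChain type sketched at the end of Section 3; the actual implementation performs an equivalent distinct query against the database instead.

// Semantics of findRequiredCapabilitySets(): collect the distinct required
// capability sets of all process chains that have not been scheduled yet.
// In-memory sketch only; the real system queries the database.
fun findRequiredCapabilitySets(
    unscheduled: Collection<ProcessChain>
): Set<Set<String>> =
    unscheduled.map { it.requiredCapabilities }.toSet()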
From line 3 on, our algorithm performs up to
maxLookups scheduling operations. After the regu-
lar interval or when new process chains have been
added, the function will be called with maxLookups
set to infinity. The main idea is that the function will
try to schedule as many process chains as possible un-
til it reaches a break statement. There is only one in
line 12 that is reached when there is no agent available
anymore (see below).
 1  function lookup(maxLookups):
 2      S = findRequiredCapabilitySets()
 3      for i ∈ [0, maxLookups):
 4          candidates = selectCandidates(S)
 5          if candidates == ∅:
 6              /* All agents are busy or none of them
 7                 have the required capabilities. */
 8              for R ∈ S:
 9                  if existsProcessChain(R):
10                      launch:
11                          requestAgent(R)
12              break
13          for (candidate, R) ∈ candidates:
14              pc = findProcessChain(R)
15              if pc == undefined:
16                  /* All process chains with R were
17                     executed in the meantime. */
18                  continue
19              agent = allocate(candidate)
20              if agent == undefined:
21                  /* Agent is not available any more. */
22                  continue
23              /* Execute process chain
24                 asynchronously. */
25              launch:
26                  executeProcessChain(pc, agent)
27                  deallocate(agent)
28                  /* Agent has become available.
29                     Trigger next lookup. */
30                  lookup(1)

Listing 1: The main function of our algorithm checks what capabilities are required at the moment and if there are available agents that can execute process chains with these capabilities. If so, it retrieves such process chains from the database and schedules their execution.
Inside the main for loop, the function first selects
a set of candidate agents that are able to execute at
least one of the given required capability sets from S
(line 4). The function selectCandidates() is described
in detail in Section 5.2. It returns a list of pairs of
a candidate agent and the required capability set R it
can execute.
If this list is empty (line 5), all agents are currently
busy or there is no agent that would be able to ex-
ecute at least one R S. In this case, the function
iterates over all required capability sets (line 8) and
checks if there actually is a corresponding registered
process chain in the database (line 9). This is neces-
sary because all process chains with a certain required
capability set may have already been processed since
findRequiredCapabilitySets() was called (e.g. by an-
other scheduler instance or in a preceding iteration of
the outer for loop). If there is a process chain, the
function requestAgent will be called, which asks the
cloud manager component (see Section 4) to create
a new agent that has the given required capabilities
(line 11). We use the keyword launch here to indicate
that the call to requestAgent is asynchronous, mean-
ing the algorithm does not wait for an answer.
The algorithm then leaves the outer for loop be-
cause it is unnecessary to perform any more schedul-
ing operations while none of the agents can execute
process chains (line 12). Process chains with required
capabilities none of the agents can provide will es-
sentially be postponed. As soon as the cloud manager
has created a new agent with the missing capabilities,
the lookup function will be called again and any post-
poned process chains can be scheduled.
If there are agents available that can execute pro-
cess chains with any of the required capability sets
from S, the algorithm iterates over the result of
selectCandidates() in line 13. For each pair of a can-
didate agent and the corresponding required capabil-
ity set R it can execute, the algorithm tries to find
a matching registered process chain with R in the
database. If there is none, it assumes that all process
chains with this required capability set have already
been executed in the meantime (line 15). Otherwise,
it tries to allocate the candidate agent, which means
it asks it to prepare itself for the execution of a pro-
cess chain and to not accept other requests anymore
(line 19). If the agent cannot be allocated, it was prob-
ably allocated by another scheduler instance in the
meantime since selectCandidates was called (line 20).
Otherwise, the algorithm launches the execution
of the process chain in the background and contin-
ues with the next scheduling operation. The code
block from line 25 to line 30 runs asynchronously
in a separate thread and does not block the outer for
loop. As soon as the process chain has been exe-
cuted completely in this thread, our algorithm deal-
locates the agent in line 27 so it becomes available
again. It then calls the lookup function and passes 1
for maxLookups because exactly one agent has be-
come available and therefore only one process chain
has to be scheduled.
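The allocation in lines 19 and 27 only has to guarantee that an agent accepts at most one process chain at a time, even if several scheduler instances ask it concurrently. A minimal agent-side sketch of this behaviour in Kotlin (names are assumptions, not Steep's actual implementation) could look like this:

import java.util.concurrent.atomic.AtomicBoolean

// An agent can be allocated by exactly one scheduler at a time: allocate()
// returns false if the agent is already busy; deallocate() frees it again.
class AgentAllocation {
    private val busy = AtomicBoolean(false)
    fun allocate(): Boolean = busy.compareAndSet(false, true)
    fun deallocate() { busy.set(false) }
}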
5.2 Selecting Candidate Agents
The function selectCandidates takes a set S = {R_1, ..., R_n} of required capability sets and returns a list L = {P_1, ..., P_m} of pairs P = (a, R_i) of an agent a and the matching required capability set R_i. Listing 2 shows the pseudo code.

 1  function selectCandidates(S):
 2      candidates = ∅
 3      for a ∈ Agents:
 4          send S to a and wait for response
 5          if a is available:
 6              get best R_i ∈ S from response
 7              P = (a, R_i)
 8              candidates = candidates ∪ {P}
 9      L = all P ∈ candidates with best a for each R_i
10      return L

Listing 2: Pseudo code of the function that selects agents based on their capabilities.
The function sends all required capability sets to
each agent via the event bus. The agents respond
whether they are available and which required capa-
bility set they support best. The function collects all
responses in a set of candidates. It finally selects ex-
actly one agent for each required capability set.
The decision of which available agent to select can
be implemented based on certain heuristics (e.g. the
expected earliest completion time of process chains
with the given required capability set). In our im-
plementation, we select the agent that was idle for
the longest time. In practice, this has proven to be
a heuristic that achieves good throughput and, at the
same time, prevents starvation because every agent
will be selected eventually.
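The selection step could be sketched in Kotlin as follows; the Candidate type and its fields are assumptions made for this illustration.

import java.time.Instant

// Assumed candidate record: an available agent, the required capability set
// it prefers, and the time when it last became idle.
data class Candidate(
    val agentId: String,
    val capabilitySet: Set<String>,
    val idleSince: Instant
)

// For each required capability set, keep only the agent that has been idle
// the longest. Other heuristics (e.g. Min-Min) could be plugged in here.
fun selectLongestIdle(candidates: List<Candidate>): List<Candidate> =
    candidates.groupBy { it.capabilitySet }
        .map { (_, agents) -> agents.minByOrNull { it.idleSince }!! }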
Note that some or all agents might not be avail-
able, in which case the result of selectCandidates con-
tains fewer required capability sets than S or is even
empty.
6 EVALUATION
In Section 1.1, we defined four major challenges for
the management of scientific workflows in the Cloud
and derived six requirements for our system. In order
to evaluate if our scheduling algorithm and our soft-
ware architecture meet these challenges and require-
ments, we conducted four practical experiments (one
for each challenge). In this section, we present the
results of these experiments and discuss benefits and
drawbacks of our approach.
Table 1: Maximum number of agents our system was allowed to create per required capability set.

Required capability set    Maximum number of agents
R1                         2
R2                         2
R3                         1
R4                         1
R3 + R4                    2
Total                      8
6.1 Setup
All experiments were performed in the same environ-
ment. We set up our system in a private OpenStack
Cloud and configured it so that it could access the
Cloud’s API and create further virtual machines on
demand. We deployed the full stack of components
presented in Section 4 to each virtual machine. These
components communicated with each other through a
distributed event bus. They also shared a MongoDB
database deployed to a separate virtual machine. In
the following, according to our system architecture,
we use the term agent for a virtual machine running
our system and capable of executing process chains.
We defined four types of agents with the capabil-
ity sets R1, R2, R3, R4, as well as a fifth type offering
both capability sets R3 and R4. To simulate a het-
erogeneous environment, we configured a maximum
number of agents that our system was allowed to cre-
ate per required capability set (see Table 1).
For each experiment, we collected all log files of
all instances of our system and converted them to
graphs. Figure 3 shows the results. Each of the sub-
figures—which we discuss in detail in the following
sections—depicts a timeline of a workflow run. The
lanes (from left to right) represent individual agents
and indicate when they were busy executing process
chains. Each required capability set has a different
colour (see legend in Figure 3d). The colour of the
agents and the process chains specifies what capabil-
ities they offered or required respectively. A process
chain has a start (emphasized by a darker shade of the
colour) and an end. In experiment 4, we also killed
agents on purpose. The point in time when the fault
was induced is marked by a black X.
Figure 3: Results of our experiments.
(a) 100 process chains are distributed to the agents with the correct capabilities.
(b) The system creates agents with capabilities required by 1000 process chains on demand.
(c) The system is able to handle 150000 process chains.
(d) Legend: colours for required capability sets, start and end of a process chain, and time when an agent was killed.
(e) The system is able to recover from faults and to still finish all 1000 process chains from the current workflow.

6.2 Experiments

Experiment 1: Capability-based Scheduling
(Requirement covered: REQ 1)

Our primary goal in this paper was to create a scheduling algorithm that is able to assign workflow tasks to distributed machines based on required and offered capabilities. Figure 3a shows we reached this goal.
We deployed a static number of eight agents with
different capability sets. We then sent a workflow
consisting of 100 process chains to one of the in-
stances of our system. As soon as the workflow was
saved in the database, all scheduler instances started
assigning process chains to the individual agents. The
colours in the figure show that all process chains were
correctly assigned. The workflow took 13 minutes
and 14 seconds to complete in total.
Note that our algorithm is designed to ask the
agents which required capability set they would like
to execute before assigning process chains to them
(see Section 5.2). Agents A6 and A7 were able to exe-
cute process chains requiring both R3 and R4 but they
obviously preferred to accept the ones with R3 first
before they continued with R4.
Experiment 2: Dynamic Environment
(Requirements covered: REQ 1–4)
For our second experiment, we deployed only one
agent with capability set R1. We then executed a
workflow with 1000 process chains. Figure 3b shows
the timeline of the workflow run.
As soon as the first scheduler instance found all
required capability sets in the database, it asked the
cloud manager component of our system to create
new agents. Starting a virtual machine and deploying
itself to it took our system almost three minutes. Pro-
cess chains requiring missing capabilities were post-
poned but the scheduler continued assigning the ones
with R1. As soon as the new agents had started, pro-
cess chains were assigned to them.
Note that in our implementation, the cloud man-
ager only creates one agent of a certain capability set
at a time. Also, as described earlier, we configured a
maximum number of agents per capability set. These
are the reasons why only four new agents appear at
about minute 3, two others at minute 6, and a
final one between minutes 8 and 9.
The experiment shows that our system can create
new virtual machines on demand and that the sched-
ulers make use of new resources as soon as they be-
come available.
Experiment 3: Scalability
(Requirements covered: REQ 1–5)
In order to show the scalability of our system, we
launched a workflow with 150000 process chains.
Similar to the second experiment, we started with one
agent. The other ones were automatically created by
our system on demand.
Figure 3c shows the timeline of the workflow over
more than 17 hours. The darker colour of the graph
stems from the fact that there are a lot more start
events in this workflow than in the other ones.
Again, all process chains were assigned to the
correct machines. Although the number of process
chains the system had to manage was very large, it did
not crash and kept being responsive the whole time.
Experiment 4: Fault Tolerance
(Requirements covered: all)
Our final experiment tested if our system can man-
age faults during a workflow run. Figure 3e shows
the timeline. We started with eight agents and ex-
ecuted the same workflow as in experiment 2 with
1000 process chains. At about minute 2, we started
to randomly kill agents (indicated in the figure by a
black X) by sending them a SIGKILL signal.
We killed nine agents during the workflow run.
The figure shows that each time, the system was able
to recover from the faults. It created new agents with
the missing required capabilities and started assigning
process chains to them as soon as they became avail-
able. Between minutes 2 and 6, approximately, there
was no agent that was able to execute process chains
with R3. Similarly, between minutes 9 and 13, ap-
proximately, R4 was not processed by any agent (A13
would have been able to, but as mentioned earlier, it
preferred to execute R3 first). The execution of these
process chains was postponed and resumed later.
6.3 Discussion
The results of our experiments show that our system
meets all of the challenges and requirements for the
management of scientific workflows in the Cloud de-
fined in Section 1.1.
In order to assign process chains to the correct ma-
chines with matching capabilities, our scheduler asks
each agent whether it wants to execute a process chain
with a given required capability set or not. An alter-
native approach would be to let the agents fetch the
process chains themselves whenever they are ready to
execute something. However, in this case, it would
not be possible to create agents on demand. If there
is no agent fetching process chains, nothing can be
executed. Our scheduler, on the other hand, has an
overview of all required capability sets and can ac-
quire new resources when necessary.
As described in Section 5.2, our current imple-
mentation of the scheduler chooses between multiple
available agents by comparing their idle time. Al-
though this works very well in practice, our exper-
iments have shown that this approach leaves room
for improvement. For example, in experiment 1
(Figure 3a), if the schedulers had assigned process
chains with R4 to agents A6 and A7 earlier, then the
makespan could have been reduced by approximately
one or two minutes (about 5–10%). Agents A3, A6,
and A7 would have finished earlier with R4 while
agent A4 could have processed R3 until the end.
Nevertheless, our approach allows various
scheduling strategies to be implemented since our
scheduler has a complete overview of the required
capability sets and the available agents. Different
heuristics such as Min-Min or Max-Min can be im-
plemented in the selectCandidates function without
changing the general structure of the algorithm.
Note that these heuristics would work on the level
of required capability sets and not on the individual
process chains, which differentiates our approach
from existing works. However, investigating different
heuristics and optimising the makespan was not a
goal of this paper and remains for future work.
The scalability and fault tolerance of our approach
mostly stems from the fact that we use a database to
store process chains. First, this out-of-core approach
reduces the required main memory. Only a few pro-
cess chains (plus all distinct required capability sets)
need to be kept in memory at a time, which allows
hundreds of thousands of process chains to be sched-
uled without any issue. Second, since the database
holds the remaining process chains to execute, it es-
sentially keeps the current state of the overall work-
flow execution. If one scheduler instance crashes, an-
other one can take over. Our open-source implemen-
tation even supports resuming workflows if all sched-
ulers have crashed after a restart of the whole cluster.
The database can be considered a single point of fail-
ure. If it becomes unavailable, workflow execution
cannot continue. In practice, this is not a problem be-
cause as soon as it is up again, our scheduling algo-
rithm can proceed and no information will be lost.
However, the database has an impact on scheduling
performance. Our algorithm as it is presented in Sec-
tion 5 needs to fetch the set of distinct required ca-
pability sets each time the lookup function is called.
Queries for distinct values are known to be time-
consuming, especially for large collections like in our
case. The only way to find all distinct values is to
perform a sequential scan. Most DBMS even have to
sort the collection first. There are approaches to find
approximations of distinct values (Kane et al., 2010;
Bar-Yossef et al., 2002) but our algorithm needs exact
results. Our open-source implementation therefore
caches the set of required capability sets for a config-
urable amount of time. Our algorithm is robust, and
even when this cache becomes inconsistent with the
database, it will still work correctly. It will either not
find a process chain for a given required capability set
(Listing 1, line 15), or one or more required capability
sets will be missing from the cache. In the first case,
the algorithm will just continue with the next candi-
date agent and required capability set (line 18). Re-
garding the second case, adding new process chains
will trigger a cache update, so the next scheduling op-
eration will be able to use all required capability sets.
In any case, the cache will be updated after the configurable amount of time has passed.
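Such a time-bounded cache could be sketched in Kotlin as follows; class and method names are assumptions, and the concrete structure in Steep may differ.

import java.time.Duration
import java.time.Instant

// Caches the distinct required capability sets for a configurable amount of
// time so the expensive distinct query is not executed on every lookup.
// A stale cache is tolerated by the scheduling algorithm, as described above.
class CapabilitySetCache(
    private val ttl: Duration,
    private val query: () -> Set<Set<String>>
) {
    private var cached: Set<Set<String>> = emptySet()
    private var loadedAt: Instant? = null

    fun get(): Set<Set<String>> {
        val now = Instant.now()
        val last = loadedAt
        if (last == null || Duration.between(last, now) > ttl) {
            cached = query()   // expensive distinct query against the database
            loadedAt = now
        }
        return cached
    }

    // Called when new process chains are added so the next scheduling
    // operation sees their required capability sets.
    fun invalidate() { loadedAt = null }
}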
Our experiments have revealed other places where
our system could be improved. At the moment, our
cloud manager creates only one agent per capability
set at a time. This could be parallelised in the fu-
ture to further reduce the makespan. Also, there are
small gaps visible between process chain executions
in our evaluation timelines. In experiment 4, they are
due to the fact that we killed schedulers that had just
allocated an agent and did not get the chance to deal-
locate it. In our implementation, agents automatically
deallocate themselves if they do not receive a message
from a scheduler for a certain amount of time, but un-
til then, they do not execute anything. The gaps in the
other experiments are most likely due to long database
operations but need further investigation. As can be
seen in Figure 3, these gaps are very small (only a
few milliseconds) and do not happen very often.
7 CONCLUSION
In Section 1.1, we identified four major challenges re-
garding the execution of scientific workflows in the
Cloud: capability-based scheduling, a dynamic envi-
ronment, scalability, and fault tolerance. These chal-
lenges, and in particular capability-based scheduling,
have not been fully covered in the scientific com-
munity yet. In this paper, we presented a schedul-
ing algorithm and a software architecture for a scien-
tific workflow management system that address these
challenges. In four experiments, we have shown that
our approach works and meets our requirements. We
also discussed benefits and drawbacks.
We implemented our algorithm and the software
architecture in the Steep Workflow Management Sys-
tem, which has recently been released under an open-
source licence (https://steep-wms.github.io/). We are
using Steep in various projects. One of them, for ex-
ample, deals with the processing of large point clouds
and panorama images that have been acquired with a
mobile mapping system in urban environments. The
data often covers whole cities, which makes the work-
flows particularly large with thousands of process
chains. The point clouds are processed by a service
using artificial intelligence (AI) to classify points and
to detect façades, street surfaces, etc. Since this is
a time-consuming task and the workflows often take
several days to execute, the scalability and fault toler-
ance of our system are fundamental in this project.
In addition, the AI service requires a graphics pro-
cessing unit (GPU), which is a limited resource in
the Cloud and particularly expensive. Our capability-
based scheduling algorithm helps in this respect to
distribute the workflow tasks to the correct machines
(i.e. to only use GPU machines when necessary). As
our system supports elasticity and a dynamic number
of machines, we can also scale up and down on de-
mand, only create GPU machines when needed, and
release them as soon as possible. In other words, in
this project, our system saves time and money.
Based on the work presented in this paper, there
are a number of opportunities for future research. For
example, the iterative approach to transform work-
flow graphs to process chains presented in Section 3
allows very complex workflows to be processed.
It supports workflows without a priori design-time
knowledge (Russell et al., 2016), which means the
system does not need to know the complete structure
of the workflow before the execution starts. This en-
ables us to dynamically change the structure of the
workflow during the execution as the number of in-
stances of a process chain can depend on the results of
a preceding one. The approach is also suitable to ex-
ecute workflows without a priori runtime knowledge,
meaning that the number of instances of a certain pro-
cess chain may even change while the process chain
is running. This enables cycles and recursion. Details
on this will be the subject of a future publication we
are currently working on.
Our experiments have revealed areas where our
approach and implementation can be further im-
proved. In the future, we will investigate different
scheduling heuristics such as Min-Min or Max-Min
to improve the makespan of workflows. As mentioned
above, we also discovered small gaps in the results of
our experiments. They are most likely due to the way
we implemented our system. We will investigate them
and try to find optimisations. Further, we will improve
our cloud manager so it creates multiple agents with
a given capability set in parallel.
REFERENCES
Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher,
B., and Mock, S. (2004). Kepler: an extensible sys-
tem for design and execution of scientific workflows.
In Proceedings. 16th International Conference on Sci-
entific and Statistical Database Management, 2004.,
pages 423–424. IEEE.
Apache Airflow (2020). Apache Airflow Website.
https://airflow.apache.org/. Last accessed: 2020-04-
14.
Bar-Yossef, Z., Jayram, T. S., Kumar, R., Sivakumar, D.,
and Trevisan, L. (2002). Counting distinct elements in
a data stream. In Rolim, J. D. P. and Vadhan, S., edi-
tors, Randomization and Approximation Techniques in
Computer Science, pages 1–10. Springer Berlin Hei-
delberg.
Berriman, G. B., Deelman, E., Good, J. C., Jacob, J. C.,
Katz, D. S., Kesselman, C., Laity, A. C., Prince, T. A.,
Singh, G., and Su, M.-H. (2004). Montage: a grid-
enabled engine for delivering custom science-grade
mosaics on demand. In Optimizing Scientific Return
for Astronomy through Information Technologies, vol-
ume 5493, pages 221–233. International Society for
Optics and Photonics.
Binato, S., Hery, W. J., Loewenstern, D. M., and Resende,
M. G. C. (2002). A Grasp for Job Shop Scheduling,
pages 59–79. Springer US.
Blythe, J., Jain, S., Deelman, E., Gil, Y., Vahi, K., Man-
dal, A., and Kennedy, K. (2005). Task scheduling
strategies for workflow-based applications in grids.
In Proceedings of the IEEE International Symposium
on Cluster Computing and the Grid (CCGrid), pages
759–767.
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi,
S., and Tzoumas, K. (2015). Apache Flink: Stream
and batch processing in a single engine. Bulletin of
the IEEE Computer Society Technical Committee on
Data Engineering, 36(4):28–38.
Casanova, H., Legrand, A., Zagorodnov, D., and Berman, F.
(2000). Heuristics for scheduling parameter sweep ap-
plications in grid environments. In Proceedings of the
9th Heterogeneous Computing Workshop HCW, pages
349–363.
Chircu, V. (2018). Understanding the 8 fallacies of
distributed systems. https://dzone.com/articles/
understanding-the-8-fallacies-of-distributed-syste.
Last accessed: 2020-02-18.
Deelman, E., Peterka, T., Altintas, I., Carothers, C. D., van
Dam, K. K., Moreland, K., Parashar, M., Ramakrish-
nan, L., Taufer, M., and Vetter, J. (2018). The fu-
ture of scientific workflows. The International Jour-
nal of High Performance Computing Applications,
32(1):159–175.
Deelman, E., Vahi, K., Juve, G., Rynge, M., Callaghan, S.,
Maechling, P. J., Mayani, R., Chen, W., Ferreira da
Silva, R., Livny, M., and Wenger, K. (2015). Pegasus:
a workflow management system for science automa-
tion. Future Generation Computer Systems, 46:17–
35.
Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P.,
Palumbo, E., and Notredame, C. (2017). Nextflow en-
ables reproducible computational workflows. Nature
biotechnology, 35(4):316–319.
Foster, I. and Kesselman, C., editors (1998). The Grid:
Blueprint for a New Computing Infrastructure. Mor-
gan Kaufmann Publishers Inc., San Francisco, CA,
USA.
Freund, R. F., Gherrity, M., Ambrosius, S., Campbell, M.,
Halderman, M., Hensgen, D. Z., Keith, E., Kidd, T.,
Kussow, M., Lima, J. D., Mirabile, F., Lantz, M.,
Rust, B., and Siegel, H. J. (1998). Scheduling re-
sources in multi-user heterogeneous computing envi-
ronments with SmartNet. Calhoun: The NPS Institu-
tional Archive.
Gherega, A. and Pupezescu, V. (2011). Multi-agent re-
source allocation algorithm based on the xsufferage
heuristic for distributed systems. In Proceedings of
the 13th International Symposium on Symbolic and
Numeric Algorithms for Scientific Computing, pages
313–320.
Giardine, B., Riemer, C., Hardison, R. C., Burhans, R.,
Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D.,
Albert, I., Taylor, J., Miller, W., Kent, W. J., and
Nekrutenko, A. (2005). Galaxy: a platform for inter-
active large-scale genome analysis. Genome research,
15(10):1451–1455.
Graves, R., Jordan, T. H., Callaghan, S., Deelman, E., Field,
E., Juve, G., Kesselman, C., Maechling, P., Mehta, G.,
Milner, K., Okaya, D., Small, P., and Vahi, K. (2011).
Cybershake: A physics-based seismic hazard model
for southern california. Pure and Applied Geophysics,
168(3):367–381.
Hamad, S. A. and Omara, F. A. (2016). Genetic-based
task scheduling algorithm in cloud computing envi-
ronment. International Journal of Advanced Com-
puter Science and Applications, 7(4):550–556.
Hemamalini, M. (2012). Review on grid task scheduling in
distributed heterogeneous environment. International
Journal of Computer Applications, 40(2):24–30.
Hull, D., Wolstencroft, K., Stevens, R., Goble, C., Pocock,
M. R., Li, P., and Oinn, T. (2006). Taverna: a tool for
building and running workflows of services. Nucleic
acids research, 34:W729–W732.
Ibarra, O. H. and Kim, C. E. (1977). Heuristic algorithms
for scheduling independent tasks on nonidentical pro-
cessors. Journal of the ACM, 24(2):280–289.
Johnson, D. S. and Garey, M. R. (1979). Computers
and Intractability: A guide to the theory of NP-
completeness. WH Freeman.
Kane, D. M., Nelson, J., and Woodruff, D. P. (2010). An
optimal algorithm for the distinct elements problem.
In Proceedings of the 29th ACM SIGMOD-SIGACT-
SIGART Symposium on Principles of Database Sys-
tems, PODS ’10, pages 41–52. Association for Com-
puting Machinery.
Krämer, M. (2018). A Microservice Architecture for the
Processing of Large Geospatial Data in the Cloud.
PhD thesis, Technische Universität Darmstadt.
Li, K., Xu, G., Zhao, G., Dong, Y., and Wang, D. (2011).
Cloud task scheduling based on load balancing ant
colony optimization. In Proceedings of the 6th An-
nual Chinagrid Conference, pages 3–9.
Maheswaran, M., Ali, S., Siegal, H. J., Hensgen, D., and
Freund, R. F. (1999). Dynamic matching and schedul-
ing of a class of independent tasks onto heterogeneous
computing systems. In Proceedings. Eighth Heteroge-
neous Computing Workshop (HCW’99), pages 30–44.
Mell, P. M. and Grance, T. (2011). The NIST definition of
cloud computing. Technical report, National Institute
of Standards & Technology, Gaithersburg, MD, USA.
Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M.,
Greenwood, M., Carver, T., Glover, K., Pocock,
M. R., Wipat, A., and Li, P. (2004). Taverna: a tool
for the composition and enactment of bioinformatics
workflows. Bioinformatics, 20(17):3045–3054.
Page, A. J. and Naughton, T. J. (2005). Dynamic task
scheduling using genetic algorithms for heteroge-
neous distributed computing. In Proceedings of the
19th IEEE International Parallel and Distributed Pro-
cessing Symposium.
Russell, N., van der Aalst, W. M., and ter Hofstede,
A. H. M. (2016). Workflow Patterns: The Definitive
Guide. MIT Press.
Singh, S. and Chana, I. (2016). A survey on resource
scheduling in cloud computing: Issues and challenges.
Journal of Grid Computing, 14:217–264.
Tawfeek, M. A., El-Sisi, A., Keshk, A. E., and Torkey, F. A.
(2013). Cloud task scheduling based on ant colony
optimization. In Proceedings of the 8th International
Conference on Computer Engineering Systems (IC-
CES), pages 64–69.
Thennarasu, S., Selvam, M., and Srihari, K. (2020). A
new whale optimizer for workflow scheduling in cloud
computing environment. Journal of Ambient Intelli-
gence and Humanized Computing.
Topcuoglu, H., Hariri, S., and Wu, M.-Y. (2002).
Performance-effective and low-complexity task
scheduling for heterogeneous computing. IEEE
Transactions on Parallel and Distributed Systems,
13(3):260–274.
Ullman, J. (1975). NP-complete scheduling problems.
Journal of Computer and System Sciences, 10(3):384–
393.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S.,
and Stoica, I. (2010). Spark: Cluster computing with
working sets. In Proceedings of the 2nd USENIX Con-
ference on Hot Topics in Cloud Computing. USENIX
Association.