Bi-Objective CSO for Big Data Scientific Workflows Scheduling in the Cloud: Case of LIGO Workflow

K. Bousselmi (1), S. Ben Hamida (2) and M. Rukoz (2)

(1) Paris Nanterre University, 92001 Nanterre Cedex, France
(2) Paris Dauphine University, PSL Research University, CNRS, UMR 7243, LAMSADE, 75016 Paris, France

Keywords: Scientific Workflow, Data Intensive, Cat Swarm Optimization, Multi-Objective Scheduling, LIGO.

Abstract: Scientific workflows are used to model scalable, portable, and reproducible big data analyses and scientific experiments with low development costs. To optimize their performance and ensure efficient use of data resources, scientific workflows handling big volumes of data need to be executed on scalable distributed environments such as Cloud infrastructure services. The problem of scheduling such workflows is known to be NP-complete. It aims to find an optimal mapping of tasks to computing resources and of data to storage resources in order to meet the end user's quality of service objectives, especially minimizing the overall makespan or the financial cost of the workflow. In this paper, we formulate the problem of scheduling big data scientific workflows as a bi-objective optimization problem that aims to minimize both the makespan and the cost of the workflow. The formulated problem is then resolved using our proposed Bi-Objective Cat Swarm Optimization algorithm (BiO-CSO), an extension of the bio-inspired CSO algorithm. The extension consists of adapting the algorithm to solve multi-objective discrete optimization problems. Our application case is the LIGO Inspiral workflow, a CPU and data intensive workflow used to generate and analyze gravitational waveforms from data collected during the coalescence of compact binary systems. The performance of the proposed method is then compared to that of the multi-objective Particle Swarm Optimization (PSO), which has proven effective for scientific workflow scheduling. The experimental results show that our algorithm BiO-CSO performs better than the multi-objective PSO since it provides more and better final scheduling solutions.

1 INTRODUCTION

Scientific applications handling big volumes of data usually consist of a huge number of computational tasks with data dependencies between them; such applications are often referred to as scientific workflows. They are frequently used to model complex phenomena, to analyze instrumental data, to tie together information from distributed sources, and to pursue other scientific endeavors (Deelman, 2010). Data intensive scientific workflows require from a few hours to many days to be processed in a local execution environment. In this paper, we consider the scientific application of the Laser Interferometer Gravitational Wave Observatory (LIGO, https://www.ligo.caltech.edu/), a network of gravitational-wave detectors with observatories in Livingston, LA and Hanford, WA. The observatories' mission is to detect and measure the gravitational waves predicted by Einstein's general theory of relativity, in which gravity is described as arising from the curvature of the fabric of time and space. The data collected by the different LIGO observatories amounts to terabytes per hour. To analyze this huge amount of data, scientists model their workloads using scientific workflows. The LIGO workflow may contain up to 100 sub-workflows and 200,000 jobs (Juve et al., 2013), which requires many hours to be performed in a Grid computing environment. To obtain simulation results within acceptable timeframes, large scientific workloads are executed on distributed computing infrastructures such as grids and clouds (Brewer, 2000). The Cloud Computing environment offers on-demand access to a pool of computing, storage and network resources through the Internet, which allows hosting and executing complex applications such as scientific workflows. As a subscription-based computing service, it provides a convenient platform for scientific workflows due to features like application scalability, heterogeneous resources, dynamic resource provisioning, and the pay-as-you-go cost model (Poola et al., 2014).


Scheduling data intensive scientific workflows on the Cloud environment is an NP-complete problem (Guo et al., 2012). It aims to find convenient resources for executing the workflow tasks and storing its data in order to meet the user's quality of service objectives, such as minimizing the makespan or the cost of the workflow. In the Cloud Computing environment, the overall objective of the task scheduling strategy is to guarantee the service-level agreements (SLAs) of allocated resources while making cost-efficient decisions for workload placement (Bousselmi et al., 2016b). In real-life situations, there are three main QoS constraints that need to be considered for efficient utilization of resources: (1) minimizing the execution cost of resources, (2) minimizing the execution time of workloads, and (3) reducing the energy consumption while meeting the cloud workload deadline (Singh and Chana, 2015). In our context, we focus on reducing both the cost and the makespan of the workflow, since they are the most important quality metrics for the workflow's end users (Bousselmi et al., 2016b). These metrics are conflicting, because high-performance Cloud resources with high storage capacities are often more expensive than others.

The scheduling problem is more challenging for data intensive scientific workflows, as they produce a huge amount of data that must be communicated to other tasks in order to perform their execution. Since data storage and data transfer through the Internet are charged, the cost and time needed for the execution of the overall workflow can be substantial. For instance, both Amazon S3 and EBS use fixed monthly charges for the storage of data, and variable usage charges for accessing the data. The fixed charges are $0.15 per GB-month for S3, and $0.10 per GB-month for EBS (Berriman et al., 2010). The variable charges are $0.01 per 1,000 PUT operations and $0.01 per 10,000 GET operations for S3, and $0.10 per million I/O operations for EBS (Berriman et al., 2010).

This work is an extension of an earlier conference publication (Bousselmi et al., 2016b), in which the scheduling problem of scientific workflows was formulated as a mono-objective optimization problem and typical workflows were used for the experiments. In this paper, our objective is to propose a scheduling solution for data intensive scientific workflows that considers the time and cost of data storage and transfer. The proposed extension consists of formulating our scheduling problem as a multi-objective optimization problem dedicated to Big Data scientific workflows. Indeed, we model our scheduling problem as a bi-objective optimization problem that aims to reduce the overall cost and the response time of the workflow. As an optimization technique, we propose an extension of the Cat Swarm Optimization (CSO) algorithm that allows the resolution of a multi-objective optimization problem. We chose the CSO algorithm since it has proven its efficiency compared to other evolutionary algorithms, such as the particle swarm optimization algorithm, especially in solving common NP-hard optimization problems (Chu et al., 2007). Thus, the principal contribution of our paper is a new multi-objective optimization solution that allows the scheduling of Big Data scientific workflows based on the quality metrics of time and cost.

The remainder of this paper is organised as follows. In Section 2, we summarize some works addressing the scheduling of scientific workflows in the cloud computing environment. In Section 3, we present our problem formulation as well as the modeling of the scientific workflows and the execution environment. Section 4 describes the CSO algorithm and its components. In Section 5, we detail the proposed algorithm for the scheduling of scientific workflows in the cloud. We describe our application case, namely the LIGO workflow, in Section 6. Then, we report and discuss our experimental results before concluding in the final section.

2 RELATED WORKS

With the development of cloud technology and the extensive deployment of cloud platforms, the problem of workflow scheduling in the cloud has become an important research topic. The challenge lies in the NP-hard nature of the task-resource mapping with diverse Quality of Service (QoS) requirements (Wu et al., 2015). Several approaches were proposed recently to resolve this problem considering multiple QoS metrics such as makespan, cost and energy consumption, while others focused on a single-objective problem by aggregating all the objectives into one analytical function, as in (Bousselmi et al., 2016b). (Durillo and Prodan, 2014) proposed the Multi-objective Heterogeneous Earliest Finish Time (MOHEFT) algorithm, which gives better trade-off solutions for makespan and financial cost when the results are compared with the SPEA2* algorithm (Yu et al., 2007). A multi-objective heuristic algorithm, Min-min based time and cost trade-off (MTCT), was proposed by Xu et al. in (Xu et al., 2016). The Balanced and file Reuse-Replication Scheduling (BaRRS) algorithm was proposed to select the optimal solution based on makespan and cost (Casas et al., 2017).

To reduce the cost and time of data transfer, some scheduling strategies prefer employing the same cloud provider for the data storage and the VMs, since data transfer between services of the same provider is free (Berriman et al., 2010). However, executing too many workloads on a single cloud causes workloads to interfere with each other, resulting in degraded and unpredictable performance which, in turn, discourages the users (Dillon et al., 2010). Only a few works have addressed the problem of reducing the time and cost of data transfer while executing data intensive workflows in a multi-cloud environment. In (Yao et al., 2017), the authors designed a multi-swarm multi-objective optimization algorithm (MSMOOA) for workflow scheduling in multi-cloud that optimizes three objectives (makespan, cost and energy consumption) while taking into consideration the structural characteristics of cloud data centers, but they considered only computation-intensive workflows. Wangsom et al. proposed in (Wangsom et al., 2019) a multi-objective peer-to-peer clustering algorithm for scheduling workflows where cost, makespan and data movement are considered, but this work lacks details in the reported results.

3 EXECUTION ENVIRONMENT AND PROBLEM FORMULATION

In this section, we explain how we formulate our multi-objective combinatorial optimization problem for scheduling big data scientific workflows in the cloud. We start by defining the model used for the representation of big data scientific workflows. Then, a model of our execution environment, namely cloud computing, is proposed. Finally, we present our problem formulation.

3.1 Big Data Scientific Workflows Modeling

Scientific workflows are usually modeled using Directed Acyclic Graphs (DAG). Since our objective is to handle big data workflows, we model a workflow as a DAG where tasks are represented by nodes and dependencies between them are represented by links. Dependencies typically represent the dataflow in the workflow, where the output data produced by one task is used as input of one or multiple other tasks. A single workflow task $T_i$ can be modeled using the triplet $T_i = (D, Dt^{in}, Dt^{out})$ with:

• $Dt^{in} = \{Dt^{in}_j \mid j \in [0..(i-1)]\}$ represents the amounts of data to be processed by $T_i$, where $Dt^{in}_j$ is the dataset coming from task $j$.

• $D = \{D_k \mid k = 1 \ldots r\}$ is the vector of demands of $T_i$ in terms of resource capacities, where $D_k$ is the minimum capacity of resource type $k$ required to process the task and $r$ is the total number of resource types.

• $Dt^{out} = \{Dt^{out}_l \mid l = 1, \ldots, s\}$ represents the $s$ datasets generated after the completion of $T_i$.
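To make this notation concrete, the following minimal sketch encodes the task triplet; the field names, units and example values are illustrative assumptions, not taken from the paper's implementation:

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Task:
    """A workflow task T_i = (D, Dt_in, Dt_out)."""
    task_id: int
    demands: List[float]           # D: minimum capacity required per resource type (r types)
    input_data: Dict[int, float]   # Dt_in: predecessor task id -> dataset size (assumed MB)
    output_data: List[float]       # Dt_out: sizes of the s datasets produced on completion

# A toy task consuming the outputs of tasks 0 and 1 and producing two datasets:
t2 = Task(task_id=2,
          demands=[2000.0, 4.0],            # e.g. MIPS and GB of RAM (assumed resource types)
          input_data={0: 512.0, 1: 128.0},
          output_data=[256.0, 64.0])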

3.2 Modeling the Cloud Computing Environment

The Cloud Computing environment can be modeled as a set of datacenters. Each datacenter is composed of a set of cloud clusters, where each cluster is composed of a set of physical machines, network devices and storage devices. The physical machines, which we denote PM, can be used for both computing and storage. A PM may be a dedicated computing machine, a set of virtual machines or a shared machine. In this scope, we consider only physical machines that are composed of a set of virtual machines, such that $PM = \{VM_i \mid i = 1, \ldots, m\}$. The communication link between two physical machines, $BW(PM_i, PM_j)$, is defined as the bandwidth demand or the traffic load between them. In addition, we define the quality of service metrics used to evaluate the performances of a virtual machine $VM_i$ by the vector $Q_i = (T_{exe}, C, A, R)$ where:

• $T_{exe} = \sum_{j=1}^{k} \frac{Dt^{in}_j}{T_u}$ is the execution time of a single task, assessed as the sum of the times needed to process each input data file $Dt^{in}_j$, where $T_u$ is the processing speed of the VM, assessed in million instructions per second (MIPS).

• $C = T_{exe} * C_u$ is the cost of processing a task on $VM_i$, where $C_u$ represents the cost per time unit incurred by using $VM_i$.

• $A$ is the availability rate of $VM_i$.

• $R$ is the reliability value of $VM_i$.

In this paper, we consider only the first two quality metrics, namely the execution time and the cost of $VM_i$.
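As a concrete illustration, the two retained metrics can be computed per task as in this sketch, a direct transcription of the formulas above; expressing each input file as a workload in million instructions is our assumption:

def execution_time(input_workloads_mi, mips):
    """T_exe: sum of the times needed to process each input data file Dt_in_j.

    input_workloads_mi -- workload induced by each input file, in million instructions
    mips               -- T_u, the processing speed of the VM in MIPS
    """
    return sum(w / mips for w in input_workloads_mi)

def execution_cost(t_exe, cost_per_unit):
    """C = T_exe * C_u, with C_u the cost per time unit of the VM."""
    return t_exe * cost_per_unit

t = execution_time([12000.0, 8000.0], mips=2000.0)   # 10.0 time units
c = execution_cost(t, cost_per_unit=0.002)           # 0.02 cost units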

3.3 Problem Formulation

Dispersion, heterogeneity and uncertainty of cloud resources bring challenges to resource allocation that cannot be met by traditional resource allocation policies. Resource scheduling aims to allocate appropriate resources at the right time to the right workloads, so that applications can utilize the resources effectively, which leads to the maximization of scaling advantages (Singh and Chana, 2015). In this paper, our objective is to propose a new multi-objective scheduling approach for big data scientific workflows in the Cloud. The proposed approach considers the two QoS metrics that represent, according to our understanding, the most important QoS metrics for end users, namely the makespan and the cost.

The execution of a big data workflow in the cloud implies executing each task on its allocated virtual machine while respecting the data-flow dependencies among the different tasks. At the end of a task execution, the generated data is transferred to one or several following tasks to enable their execution. These data are often massive and require additional time and cost to be transferred to other, geographically distributed virtual machines. Some final or intermediary generated data also have to be stored in dedicated storage services to be analyzed later (Yuan et al., 2012).

We denote by $X_k$ a realizable workflow schedule where each task is assigned to a specific VM. The overall makespan of the schedule $X_k$ can then be assessed by Equation 1 as the sum of the data processing and data transfer times between dependent tasks:

$$Makespan(X_k) = T_{data\text{-}processing}(X_k) + T_{data\text{-}transfer}(X_k) \quad (1)$$

We argue that the time needed for data storage in dedicated storage services (SS) should not be counted in the makespan, since these data could remain in SS services even after the workflow execution ends, for analysis purposes. Likewise, the overall cost of a schedule solution $X_k$ is assessed as the sum of the data processing cost on the allocated VMs, the overall cost of data transfer between tasks, and the cost of data storage on SS services, as follows:

$$Cost(X_k) = C_{data\text{-}processing}(X_k) + C_{data\text{-}transfer}(X_k) + C_{data\text{-}storage}(X_k) \quad (2)$$

As in (Bousselmi et al., 2016a), to assess the global quality metric values of the workflow during its execution, such as its makespan and its cost, we use aggregation functions over the different considered quality metrics, as shown in Table 1, where $N$ is the total number of tasks of the workflow and $T_{exe_i}$, $C_i$, $DT_{(i,j)}$, $d_{(i,i+1)}$ and $T_{rem}$ are respectively the data processing time of task $i$ on a specific VM, its execution cost, the data flow transmitted between the two VMs corresponding to two consecutive tasks $i$ and $j$, the throughput of the network bandwidth between two consecutive VMs, and the time during which data remains stored on a specific SS. In the case of parallel tasks of the workflow, the operators of the aggregation functions for the execution time and data transfer time metrics are replaced by the maximum operator.

To sum up, our scheduling objective can be formulated as a multi-objective optimization problem that aims to reduce both the makespan and the cost of the workflow:

$$\min \; Makespan(X_k), \qquad \min \; Cost(X_k) \quad (3)$$

The objective functions of our optimization problem are given by Equations 1 and 2. In the next section, we introduce the CSO algorithm used for the resolution of our formulated problem. A sketch of how a candidate schedule can be evaluated against these two objectives is given below.
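The sketch assumes the workflow is given as successive levels of mutually parallel tasks and applies the aggregation operators of Table 1 (sum across levels, max inside a level for time and for transfer/storage costs); all callbacks are illustrative placeholders:

def evaluate(schedule, levels, exec_time, exec_cost,
             transfer_time, transfer_cost, storage_cost):
    """Return (makespan, cost) of a schedule X_k as in Equations 1 and 2.

    schedule -- dict: task id -> VM id (the <task, VM> affectations)
    levels   -- list of lists of task ids; tasks within one level run in parallel
    """
    makespan, cost = 0.0, 0.0
    for level in levels:
        vms = [(t, schedule[t]) for t in level]
        # Equation 1: data processing time + data transfer time (max inside a level).
        makespan += max(exec_time(t, vm) for t, vm in vms)
        makespan += max(transfer_time(t, vm) for t, vm in vms)
        # Equation 2: processing cost is paid for every task; transfer and
        # storage costs use the max operator for parallel tasks (Table 1).
        cost += sum(exec_cost(t, vm) for t, vm in vms)
        cost += max(transfer_cost(t, vm) for t, vm in vms)
        cost += max(storage_cost(t) for t in level)
    return makespan, cost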

4 CAT SWARM OPTIMIZATION (CSO) ALGORITHM

Biologically inspired computing has gained importance for its immense parallelism and simplicity in computation. In recent times, many biologically motivated algorithms have been invented and are being used for handling many complex problems of the real world (Das et al., 2008). Using a biologically inspired algorithm for the scheduling of scientific workflows in the cloud has become an attractive research track due to the complexity of finding the best task-resource pairs based on the user's QoS requirements.

Recently, a family of biologically inspired algorithms known as Swarm Intelligence has attracted the attention of researchers working on bio-informatics related problems all over the world (Das et al., 2008). Algorithms belonging to this field are motivated by the collective behavior of groups of social insects (like bees, termites and wasps). The Cat Swarm Optimization (CSO) algorithm, for instance, draws on the behavior of cats to resolve complex combinatorial optimization problems. According to (Bahrami et al., 2018), cats have high alertness and curiosity about their surroundings and about moving objects in their environment, despite spending most of their time resting. This behavior helps cats find prey and hunt it down. Compared to the time dedicated to resting, they spend very little time chasing prey, in order to conserve their energy. The first optimization algorithm based on the study of cat swarms was proposed in 2007 by (Chu et al., 2007). The steps of the CSO algorithm are as follows.

Table 1: Aggregation functions for QoS metrics.

Quality metric | Sequential tasks | Parallel tasks
Data processing time | $\sum_{i=1}^{N} T_{exe_i}$ | $\max_{1 \le i \le N} T_{exe_i}$
Data transfer time | $\sum_{i=1}^{N} \frac{DT_{i,i+1}}{d_{i,i+1}}$ | $\max_{1 \le i \le N} \frac{DT_{i,i+1}}{d_{i,i+1}}$
Data processing cost | $\sum_{i=1}^{N} C_i$ | $\sum_{i=1}^{N} C_i$
Data storage cost | $\sum_{i=1}^{N} T_{rem_i} * C_i$ | $\max_{1 \le i \le N} T_{rem_i} * C_i$
Data transfer cost | $\sum_{i=1}^{N} \frac{DT_{i,i+1} * C_{ij}}{d_{i,i+1}}$ | $\max_{1 \le i \le N} \frac{DT_{i,i+1} * C_{ij}}{d_{i,i+1}}$

First, the CSO algorithm creates an initial population of N cats and defines their characteristics. Each cat represents a solution of the formulated optimization problem and its characteristics represent the parameters of this solution, namely its fitness value, its mode ("seeking" or "tracing"), its position and its velocity. Second, the cats are randomly divided into two groups: one group of cats is in "Seeking mode" and the other in "Tracing mode". All cats are evaluated according to the fitness function, and the cat with the best fitness value is retained. After that, if a cat is in "Seeking mode", the process of "Seeking mode" is applied to it; otherwise the process of "Tracing mode" is applied. Finally, if the stop condition has not been reached yet, the algorithm restarts from the second step. The optimum solution is represented by the cat having the best fitness value at the end of the algorithm. The details of the "Seeking mode" and "Tracing mode" procedures can be found in (Chu et al., 2007). Four essential parameters are to be defined and applied in the CSO algorithm (a minimal sketch of the resulting main loop follows the list below), namely:

• SMP (Seeking Memory Pool): the seeking memory pool, which defines the size of the seeking memory of each cat.

• SRD (Seeking Range of the selected Dimension): the mutative ratio for the selected dimensions.

• CDC (Counts of Dimension to Change): the number of elements of the dimension to be changed.

• SPC (Self Position Consideration): a Boolean indicating whether the current position is to be considered as a candidate position to move to.
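To illustrate how these steps fit together, here is a compact sketch of a CSO main loop for a minimization problem; the helper functions and default values are placeholders rather than the exact operators of (Chu et al., 2007):

import random

def cso(init_cat, fitness, seeking_move, tracing_move,
        n_cats=32, mr=0.1, max_iter=100):
    """Skeleton of the CSO loop: a random MR fraction of cats traces, the rest seeks."""
    cats = [init_cat() for _ in range(n_cats)]
    best = min(cats, key=fitness)
    for _ in range(max_iter):
        for i, cat in enumerate(cats):
            if random.random() < mr:
                cats[i] = tracing_move(cat, best)   # chase the best position found so far
            else:
                cats[i] = seeking_move(cat)         # local search around the current position
        best = min(cats + [best], key=fitness)
    return best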

5 PROPOSED BI-OBJECTIVE CSO (BiO-CSO) ALGORITHM

In this paper, we formulated our scheduling problem as a multi-objective optimization problem that aims to reduce both the overall makespan and cost of the workflow execution in the cloud. To resolve it, we applied the discrete version of the CSO algorithm proposed in (Bouzidi and Riffi, 2013), since our objective function uses discrete integers to model a task-resource affectation.

Table 2: Example of a scheduling solution representation.

Workflow's task | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Task 7 | Task 8
VM identifier | 4 | 1 | 2 | 3 | 4 | 3 | 2 | 5

We start by presenting the scheduling solution representation for the CSO algorithm. Then, we detail the steps of the proposed BiO-CSO algorithm and its tracing and seeking mode procedures.

5.1 Mapping between Problem Formulation and CSO

In our context, a scheduling solution is a set of N affectations <task, VM> mapping each task of the workflow to an available cloud VM that can perform it. A realizable scheduling solution should respect the order of execution of the tasks, so that two parallel tasks cannot be assigned to the same VM. We suppose that the number of available VMs is less than or equal to the number of tasks. Table 2 shows an example of a realizable scheduling solution for a workflow sample with 8 tasks. We suppose that for each task there are 5 candidate VMs that can perform it, that tasks 1, 2, 3 and 4 can be performed in parallel, and that tasks 5, 6, 7 and 8 are executed in sequence.

As our scheduling problem is multi-objective, we consider that the best scheduling solutions are those that are not outperformed on both makespan and cost by any other solution: all non-dominated scheduling solutions are retained and considered as best solutions. To find these solutions, we should first map our problem formulation onto the formalism of the CSO algorithm. In the CSO algorithm, each cat represents a solution of the considered optimization problem, and the overall objective is to find, within the population, the cat carrying the optimal solution, i.e. the cat with the best fitness value. The mapping between our problem formulation and the CSO algorithm is summarized in Table 3 and illustrated by the sketch below.


Table 3: Mapping between our problem formulation and CSO.

Optimization problem formulation | CSO algorithm
A scheduling solution | A cat
The number of workflow tasks | A cat's position dimension
<task, VM> affectations | A cat's position vector
The objective function | The fitness value of a cat
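Following this mapping, a cat's position is simply an integer vector of VM identifiers indexed by task, as in the sketch below; the helper names are illustrative, and the feasibility rule is the one stated in Section 5.1:

import random

def random_position(n_tasks, n_vms):
    """A cat's position vector: position[i] is the VM identifier assigned to task i+1."""
    return [random.randint(1, n_vms) for _ in range(n_tasks)]

def is_realizable(position, parallel_groups):
    """Two parallel tasks must not be assigned to the same VM."""
    return all(len({position[t] for t in group}) == len(group)
               for group in parallel_groups)

# The schedule of Table 2: tasks 1-4 run in parallel, tasks 5-8 in sequence.
pos = [4, 1, 2, 3, 4, 3, 2, 5]
print(is_realizable(pos, parallel_groups=[[0, 1, 2, 3]]))   # True: VMs 4, 1, 2, 3 differ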

5.2 BiO-CSO Algorithm Details

The CSO algorithm was designed for mono-objective optimization problems (Chu et al., 2007; Bouzidi and Riffi, 2013). We extended it to handle multi-objective optimization problems. First, we define a vector of best solutions that stores a fixed number of non-dominated solutions along the evolution process. Then, for each cat, we replace its fitness value by a structure that stores the two values of our objective functions (the makespan and the cost of the scheduling solution described by this cat). Also, we implement additional functions that compare newly generated solutions in order to update the vector of best solutions. The pseudo code of the proposed algorithm is given in Algorithm 1 and detailed hereafter.

Algorithm 1: BiO-CSO.

Input: MaxLoopIteration: maximum number of iterations
Variables:
CATS: population of cats (scheduling solutions)
BestParetoSet: vector of stored non-dominated solutions

1: Initialize the cats' population CATS
2: Initiate the position and velocity for each cat
3: Pick randomly a mode for each cat (CatModes)
4: for all iterations I < MaxLoopIteration do
5:   for all cat in CATS do
6:     fitnessValue = EvaluateFitness(cat) // evaluate the makespan and cost
7:     Apply BiO-Replacement(fitnessValue, BestParetoSet) // update the Pareto set
8:     if CatModes(cat) == SeekingMode then
9:       Apply SeekingModeProcedure(cat, CATS)
10:    else
11:      Apply TracingModeProcedure(cat, CATS)
12:    end if
13:    Re-pick randomly a mode for each cat (CatModes)
14:  end for
15: end for

BiO-CSO works as follows. First, the algorithm creates N cats and initializes their properties (steps 1 and 2); each cat has an M-dimensional position vector that represents a realizable scheduling solution, where M is the total number of tasks of the workflow. The values of the position vector are the identifiers of the VMs assigned respectively to each task of the workflow. The fitness value of a cat is the bi-valued vector holding the makespan and the cost of the represented scheduling solution. Second, BiO-CSO randomly chooses a sequence of cats using the MR parameter value, sets them in 'Tracing mode' and sets the rest of the cats in 'Seeking mode' (step 3). Third, it evaluates the fitness values of each cat (step 6), i.e. the makespan and cost of the scheduling solution represented by the cat, and keeps the cats with the best fitness values; several solutions may be retained since our problem is multi-objective. All non-dominated solutions are kept in a vector representing the best solutions (step 7), whose update is described below (Algorithm 2). If the cat is in 'Seeking mode', the 'Seeking mode' procedure is applied as described in Algorithm 3; otherwise, the 'Tracing mode' procedure is applied as described in Algorithm 4. Fourth, the algorithm re-selects a sequence of cats, sets them to tracing mode and puts the rest of the cats in 'Seeking mode'. Finally, if the termination condition is reached (the maximum number of iterations is reached or a desirable best solution is found), the algorithm terminates. If not, it returns to step 4.

Bi-Objective Replacement: As for most multi-objective optimization algorithms (Reyes-Sierra et al., 2006), the proposed BiO-CSO is Pareto-based. Its purpose is to generate a Pareto solution set with optimized makespan and cost. The algorithm maintains a list of best Pareto solutions that consists of a subset of the non-dominated solutions found so far. Since the number of best solutions can quickly grow very large, the size of the Pareto list is limited. Each cat is inserted into the Pareto set only if it is non-dominated, as described in Algorithm 2.

Tracing Mode and Seeking Mode Procedures: The Seeking mode corresponds to the resting state of the cats, during which they remain in continuous search of a better position. In this mode, the BiO-CSO algorithm makes multiple copies of the cat's position (the scheduling solution), then modifies each copy using the parameter CDC, which defines the number of values of the position to be changed. After that, it calculates the distance between the fitness value of each new position and that of the old one, and chooses the new position to assign to the cat according to its selection probability. In tracing mode, cats chase targets by updating their velocities and positions. The details of these procedures are given in Algorithms 3 and 4.

Algorithm 2: BiO-Replacement.

Inputs:
BestParetoSet: vector of stored non-dominated solutions
X: a cat (a solution)

1: if X is dominated by another cat Y in BestParetoSet then
2:   Replace X's parameters by Y's parameters
3: else
4:   if X is a non-dominated solution and BestParetoSet is not full then
5:     Add X to BestParetoSet
6:   else
7:     Find a solution from BestParetoSet to be replaced by the new one X
8:   end if
9: end if
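A sketch of the dominance test and archive update of Algorithm 2 for our two minimization objectives; the eviction rule of step 7 (here, dropping the most crowded member) is one reasonable instantiation, not necessarily the paper's:

def dominates(a, b):
    """a = (makespan, cost) dominates b if a is no worse on both objectives and differs."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def bio_replacement(archive, x, max_size):
    """Insert fitness pair x = (makespan, cost) into the bounded Pareto archive."""
    if any(dominates(y, x) for y in archive):
        return            # x is dominated (step 1; the cat itself is reset to Y's parameters)
    archive[:] = [y for y in archive if not dominates(x, y)]
    if len(archive) < max_size:
        archive.append(x)                      # step 5
    else:
        # Step 7, illustrative choice (assumes max_size > 1): evict the member
        # whose nearest neighbour is closest, i.e. the most crowded point.
        def crowding(i):
            return min(abs(archive[i][0] - archive[j][0]) + abs(archive[i][1] - archive[j][1])
                       for j in range(len(archive)) if j != i)
        archive[min(range(len(archive)), key=crowding)] = x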

Algorithm 3: Seeking Mode Procedure.

Inputs:
cat: a solution
SMP: number of cat copies

1: Perform SMP copies of the cat's position in PositionCopies
2: for all positions in PositionCopies do
3:   Modify the cat position
4: end for
5: for all cats corresponding to a position in PositionCopies do
6:   Compute the cat fitness
7:   Calculate the distance of the new solution to the previous one
8:   Move the cat to the new picked position with a given probability
9: end for

Algorithm 4: Tracing Mode Procedure.

Inputs: CATS: a set of solutions

1: for all cat in CATS do
2:   Update cat velocity
3:   Update cat position
4: end for
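For completeness, a sketch of how the two procedures can act on integer position vectors; the mutation and the velocity update below are our assumptions in the spirit of discrete CSO, not the paper's exact operators:

import random

def seeking_move(position, n_vms, smp=5, cdc=0.8, score=None):
    """Seeking mode: SMP copies, CDC of the dimensions mutated, one copy kept."""
    copies = []
    for _ in range(smp):
        cand = position[:]
        for i in random.sample(range(len(cand)), int(cdc * len(cand))):
            cand[i] = random.randint(1, n_vms)      # reassign task i to a random VM
        copies.append(cand)
    # The probabilistic, distance-based selection is abbreviated to a scored choice.
    return random.choice(copies) if score is None else min(copies, key=score)

def tracing_move(position, velocity, best_position, n_vms, c1=2.05):
    """Tracing mode: move the cat toward the best position found so far."""
    r1 = random.random()
    for i in range(len(position)):
        velocity[i] += r1 * c1 * (best_position[i] - position[i])
        position[i] = max(1, min(n_vms, round(position[i] + velocity[i])))
    return position, velocity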

6 BIG DATA SCIENTIFIC WORKFLOWS: CASE OF LIGO

The LIGO concept was built upon early work by many scientists to test a component of Albert Einstein's theory of relativity, the existence of gravitational waves [10]. LIGO operates two gravitational wave observatories in the USA: the LIGO Livingston Observatory in Livingston, Louisiana, and the LIGO Hanford Observatory in Washington. These sites are separated by 3,002 kilometers (1,865 miles) of straight-line distance through the earth, but 3,030 kilometers (1,883 miles) over the surface [10]. Other similar observatories were built later in other countries. The LIGO Inspiral Analysis workflow is one of the scientific workflows defined by the LIGO project to detect and analyze gravitational waves produced by various events in the universe. It is a CPU and data intensive workflow [22] used to analyze the data obtained from the coalescence of compact binary systems such as binary neutron stars and black holes. The time-frequency data from each of the three LIGO detectors is split into smaller blocks for analysis. For each block, the workflow generates a subset of waveforms belonging to the parameter space and computes the matched filter output. If a true inspiral has been detected, a trigger is generated that can be checked against triggers from the other detectors. A simplified representation of the workflow is shown in Figure 1. This workflow may contain up to 100 sub-workflows and 200,000 jobs (Juve et al., 2013). The execution of a sample instance of the LIGO Inspiral workflow with 3,981 tasks on a local machine runs for 1 day and 15 hours and uses more than 1 GB of input data and more than 2 GB of output data [9].

Figure 1: Structure of the LIGO Inspiral Workflow.


In order to establish a confident detection or measurement, a large amount of auxiliary data is acquired (including data from seismometers, microphones, etc.) and analyzed (for example, to eliminate noise) along with the strain signal that measures the passage of gravitational waves (Deelman et al., 2002). The raw data collected during experiments is a collection of continuous time series at various sample rates. The amount of data acquired and cataloged each year is on the order of tens to hundreds of terabytes (Deelman et al., 2002). These data are treated in real time by the LIGO workflow, so several instances of this workflow should run simultaneously to achieve the objectives of the project. Consequently, the execution environment should provide efficient tools and enough resources for the execution and the data storage of the LIGO workflow. We have therefore chosen this workflow as an application case for our proposed algorithm BiO-CSO, for its intensive use of computing and storage resources.

7 RESULTS AND DISCUSSION

In this section, we present the execution environment of our proposed BiO-CSO algorithm and the parameters used for our algorithm and for the PSO algorithm. Then, we present our experimental results and discuss the performance of our proposal.

7.1 Execution Environment and Algorithm Parameters

As a simulation environment, we used WorkflowSim (Chen and Deelman, 2012), on which we implemented our proposed algorithm BiO-CSO. WorkflowSim extends the CloudSim (Calheiros et al., 2011) simulation toolkit by introducing support for workflow preparation and execution, with an implementation of a stack composed of a workflow parser, a workflow engine and a job scheduler. It supports a multi-layered model of the failures and delays occurring in the various levels of workflow management systems. CloudSim is an extensible simulation toolkit that enables the modeling and simulation of Cloud computing systems and application provisioning environments (Calheiros et al., 2011). We tested the LIGO Inspiral workflow at different scales, using respectively 30, 100 and 1000 tasks. To compare the performance of our algorithm, we also implemented the multi-objective version of the Particle Swarm Optimization (PSO) algorithm. PSO is an optimization algorithm based on swarm intelligence (Tripathi et al., 2007). It is inspired by the flocking behavior of birds, which is very simple to simulate and has been found to be quite efficient in handling single-objective optimization problems. The simplicity and efficiency of PSO have motivated researchers to apply it to MOO problems since 2002 (Calheiros et al., 2011). For the MOPSO algorithm, we consider c1=c2=2, and the inertia weight 'w' varying from 0.4 at the beginning to 0.9 at the end of the execution of the algorithm. Table 4 lists the general parameters of the CSO algorithm. The initial population size of cats is set to 32.

Table 4: Generic parameters of the CSO algorithm.

Parameter | Value
SMP | 5
SRD | 20
CDC | 80%
MR | 10%
c1 | 2.05
r1 | [0,1]
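These settings can be written down directly, as in the small sketch below; treating the MOPSO inertia-weight schedule as linear is our assumption:

CSO_PARAMS = {"SMP": 5, "SRD": 20, "CDC": 0.80, "MR": 0.10, "c1": 2.05}
N_CATS = 32   # initial population size

def inertia_weight(iteration, max_iter, w_start=0.4, w_end=0.9):
    """MOPSO inertia weight 'w', varying from 0.4 to 0.9 over the run (assumed linear)."""
    return w_start + (w_end - w_start) * iteration / max_iter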

7.2 Experimental Results

The experiments were repeated 20 times for each workflow instance by varying the value of the MP parameter of CSO, and the average values of the output metrics are reported in this section. This manipulation allowed us to generate more best solutions for our considered objective functions. Thus, all non-dominated solutions obtained over each run are recorded in a general Pareto set. Figure 2 shows the Pareto sets obtained for the LIGO workflow with 30, 100 and 1000 tasks respectively, for the proposed BiO-CSO algorithm and for MOPSO. In Figure 2, it is clear that the makespan and deployment cost functions are strongly correlated with a negative slope. With respect to the Pareto front, BiO-CSO demonstrates the ability to generate a better diversity of solutions, with a more uniform distribution, than MOPSO. Indeed, the CSO algorithm was designed to avoid the problem of premature convergence thanks to its weighted position update strategy.

It can also be seen that the Pareto fronts obtained using BiO-CSO are superior for all the different scales of the LIGO workflow. In Figure 2, the non-dominated solutions generated by the BiO-CSO algorithm are better spread over the solution space than those of the MOPSO algorithm, which offers more choices to the workflow end user to make cost-efficient decisions.

Figure 2: Non-dominated solution sets generated by BiO-CSO and MOPSO for the LIGO workflow.

In Figure 3, we can see that the BiO-CSO algorithm achieves better average values for both makespan and cost for all the LIGO workflow instances. For the makespan, BiO-CSO yields improvements of 16.03%, 47.21%, and 11.23% over the MOPSO algorithm for the different scales of the LIGO workflow. The cost obtained by BiO-CSO was better by 18.41%, 27.75%, and 34.82% than that of the MOPSO algorithm for the LIGO workflow with 30, 100 and 1000 tasks respectively.

Table 5: Average values for computation and data transfer time.

Workflow | Data transfer time (BiO-CSO) | Data transfer time (MOPSO) | Computation time (BiO-CSO) | Computation time (MOPSO)
LIGO 30 | 261.91 | 295.05 | 96.87 | 109.13
LIGO 100 | 274.39 | 382.42 | 81.96 | 114.23
LIGO 1000 | 9255.56 | 11022.28 | 1895.72 | 2257.58

Table 5 reports the average values for computation and data transfer time obtained by BiO-CSO and MOPSO for the LIGO workflow with 30, 100 and 1000 tasks. It shows that our algorithm BiO-CSO achieves better average values than MOPSO for all the LIGO workflow scales, with a gain ranging from 11% to 28%. For example, for LIGO 1000, the BiO-CSO data transfer time is 16% better than the MOPSO one. It can also be seen that the data transfer time represents roughly 75% of the workflow execution time, due to the huge amount of data processed.

Figure 3: Cost and makespan average values obtained by BiO-CSO and MOPSO for the LIGO workflow.

8 CONCLUSION

In this paper, we addressed the problem of Big Data scientific workflow scheduling in the Cloud. We started by formulating the problem as a bi-objective optimization problem minimizing both the makespan and the cost of the workflow. We then proposed the BiO-CSO algorithm, an extension of CSO that deals with bi-objective optimization; the conflict between the objectives is resolved using Pareto analysis. The experiments were conducted on the LIGO workflow and discussed in comparison with an algorithm frequently used for workflow scheduling, namely PSO. From the experimental results, we conclude that BiO-CSO is able to produce better schedules based on multiple objectives. Future work will investigate the extension of BiO-CSO to deal with further objectives, such as minimizing energy consumption, and its application to different types of Big Data scientific workflows.


REFERENCES

Bahrami, M., Bozorg-Haddad, O., and Chu, X. (2018). Cat swarm optimization (CSO) algorithm. In Advanced Optimization by Nature-Inspired Algorithms, pages 9-18. Springer.

Berriman, G. B., Juve, G., Deelman, E., Regelson, M., and Plavchan, P. (2010). The application of cloud computing to astronomy: A study of cost and performance. In 2010 Sixth IEEE International Conference on e-Science Workshops, pages 1-7. IEEE.

Bousselmi, K., Brahmi, Z., and Gammoudi, M. M. (2016a). Energy efficient partitioning and scheduling approach for scientific workflows in the cloud. In 2016 IEEE International Conference on Services Computing (SCC), pages 146-154. IEEE.

Bousselmi, K., Brahmi, Z., and Gammoudi, M. M. (2016b). QoS-aware scheduling of workflows in cloud computing environments. In 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA), pages 737-745. IEEE.

Bouzidi, A. and Riffi, M. E. (2013). Discrete cat swarm optimization to resolve the traveling salesman problem. International Journal, 3(9).

Brewer, E. A. (2000). Towards robust distributed systems. In PODC, volume 7.

Calheiros, R. N., Ranjan, R., Beloglazov, A., De Rose, C. A., and Buyya, R. (2011). CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Software: Practice and Experience, 41(1):23-50.

Casas, I., Taheri, J., Ranjan, R., Wang, L., and Zomaya, A. Y. (2017). A balanced scheduler with data reuse and replication for scientific workflows in cloud computing systems. Future Generation Computer Systems, 74:168-178.

Chen, W. and Deelman, E. (2012). WorkflowSim: A toolkit for simulating scientific workflows in distributed environments. In 2012 IEEE 8th International Conference on E-Science, pages 1-8. IEEE.

Chu, S.-C., Tsai, P.-W., et al. (2007). Computational intelligence based on the behavior of cats. International Journal of Innovative Computing, Information and Control, 3(1):163-173.

Das, S., Abraham, A., and Konar, A. (2008). Swarm intelligence algorithms in bioinformatics. In Computational Intelligence in Bioinformatics, pages 113-147. Springer.

Deelman, E. (2010). Grids and clouds: Making workflow applications work in heterogeneous distributed environments. The International Journal of High Performance Computing Applications, 24(3):284-298.

Deelman, E., Kesselman, C., Mehta, G., Meshkat, L., Pearlman, L., Blackburn, K., Ehrens, P., Lazzarini, A., Williams, R., and Koranda, S. (2002). GriPhyN and LIGO, building a virtual data grid for gravitational wave scientists. In Proceedings 11th IEEE International Symposium on High Performance Distributed Computing, pages 225-234. IEEE.

Dillon, T., Wu, C., and Chang, E. (2010). Cloud computing: issues and challenges. In 2010 24th IEEE International Conference on Advanced Information Networking and Applications, pages 27-33. IEEE.

Durillo, J. J. and Prodan, R. (2014). Multi-objective workflow scheduling in Amazon EC2. Cluster Computing, 17(2):169-189.

Guo, L., Zhao, S., Shen, S., and Jiang, C. (2012). Task scheduling optimization in cloud computing based on heuristic algorithm. Journal of Networks, 7(3):547.

Juve, G., Chervenak, A., Deelman, E., Bharathi, S., Mehta, G., and Vahi, K. (2013). Characterizing and profiling scientific workflows. Future Generation Computer Systems, 29(3):682-692.

Poola, D., Garg, S. K., Buyya, R., Yang, Y., and Ramamohanarao, K. (2014). Robust scheduling of scientific workflows with deadline and budget constraints in clouds. In 2014 IEEE 28th International Conference on Advanced Information Networking and Applications, pages 858-865. IEEE.

Reyes-Sierra, M., Coello, C. C., et al. (2006). Multi-objective particle swarm optimizers: A survey of the state-of-the-art. International Journal of Computational Intelligence Research, 2(3):287-308.

Singh, S. and Chana, I. (2015). QRSF: QoS-aware resource scheduling framework in cloud computing. The Journal of Supercomputing, 71(1):241-292.

Tripathi, P. K., Bandyopadhyay, S., and Pal, S. K. (2007). Multi-objective particle swarm optimization with time variant inertia and acceleration coefficients. Information Sciences, 177(22):5033-5049.

Wangsom, P., Lavangnananda, K., and Bouvry, P. (2019). Multi-objective scheduling for scientific workflows on cloud with peer-to-peer clustering. In 2019 11th International Conference on Knowledge and Smart Technology (KST), pages 175-180. IEEE.

Wu, F., Wu, Q., and Tan, Y. (2015). Workflow scheduling in cloud: a survey. The Journal of Supercomputing, 71(9):3373-3418.

Xu, H., Yang, B., Qi, W., and Ahene, E. (2016). A multi-objective optimization approach to workflow scheduling in clouds considering fault recovery. KSII Transactions on Internet & Information Systems, 10(3).

Yao, G.-S., Ding, Y.-S., and Hao, K.-R. (2017). Multi-objective workflow scheduling in cloud system based on cooperative multi-swarm optimization algorithm. Journal of Central South University, 24(5):1050-1062.

Yu, J., Kirley, M., and Buyya, R. (2007). Multi-objective planning for workflow execution on grids. In Proceedings of the 8th IEEE/ACM International Conference on Grid Computing, pages 10-17. IEEE Computer Society.

Yuan, D., Yang, Y., Liu, X., Zhang, G., and Chen, J. (2012). A data dependency based strategy for intermediate data storage in scientific cloud workflow systems. Concurrency and Computation: Practice and Experience, 24(9):956-976.
