Unifying Data and Replica Placement for Data-intensive Services in Geographically Distributed Clouds

Ankita Atrey, Gregory Van Seghbroeck, Higinio Mora, Filip De Turck and Bruno Volckaert

IDLAB-imec, Technologie Park, Ghent University, Ghent, Belgium

University of Alicante, Alicante, Spain

Keywords:

Data Placement, Replica Placement, Geographically Distributed Clouds, Location-Based Services, Online

Social Networks, Scalability, Overlapping Clustering.

Abstract:

The increased reliance of data management applications on cloud computing technologies has rendered re-

search in identifying solutions to the data placement problem to be of paramount importance. The objective

of the classical data placement problem is to optimally partition, while also allowing for replication, the set of

data-items into distributed data centers to minimize the overall network communication cost. Despite signiﬁ-

cant advancement in data placement research, replica placement has seldom been studied in unison with data

placement. More speciﬁcally, most of the existing solutions employ a two-phase approach: 1) data placement,

followed by 2) replication. Replication should however be seen as an integral part of data placement, and

should be studied as a joint optimization problem with the latter. In this paper, we propose a uniﬁed paradigm

of data placement, called CPR, which combines data placement and replication of data-intensive services into geographically distributed clouds as a joint optimization problem. Underlying CPR is an overlapping correlation clustering algorithm capable of assigning a data-item to multiple data centers, thereby enabling us to

jointly solve data placement and replication. Experiments on a real-world trace-based online social network

dataset show that CPR is effective and scalable. Empirically, it is ≈ 35% better in efﬁcacy on the evaluated

metrics, while being up to 8 times faster in execution time when compared to state-of-the-art techniques.

1 MOTIVATION

With the emergence of Cloud computing, Big Data,

and Internet of Things (IoT), the rate at which data

is being generated is increasing exponentially (ins,

2017; gro, 2018). Although advancements in modern

hardware, cloud computing, and big data technolo-

gies have enabled development of multiple distributed

systems (such as Hadoop (White, 2012) and Apache

Spark (Zaharia et al., 2016)) that have signiﬁcantly

enriched the ﬁeld of scalable data management, ef-

fective strategies for data partitioning and placement

remain cardinal to the performance of such systems.

Consistent with the requirement of data-intensive ser-

vices to access multiple datasets within each trans-

action (Golab et al., 2014; Yu and Pan, 2015; Zhao

et al., 2016a; Zhao et al., 2016b; Shabeera et al.,

2017), specialized solutions for data placement have

recently been proposed based on hypergraph parti-

tioning (Yu and Pan, 2017) and spectral clustering on

hypergraphs (Atrey et al., 2018). This is because for

data-intensive services, traditional solutions (as em-

ployed by Hadoop or Spark) based on uniform par-

titioning of data-items using hashing may result in a

huge volume of data migrations (Golab et al., 2014;

Atrey et al., 2018), thereby leading to network con-

gestion and eventually reduced system throughput.

Online social networks (OSNs) are one of the

most prevalent instances of data-intensive services

in the real-world (Yu and Pan, 2017). Consider a

location-based OSN service as presented in Fig. 1. A

sample social network is represented using a graph

where each vertex corresponds to a user and undi-

rected edges between two vertices represent friend-

ship. In this network, users {v

} are friends

of the user v

. Similarly {v

} are friends of v

The list of all the friends of every user is also por-

trayed in a table in Fig. 1. There exists a notion of

a data-item corresponding to each user of the social

network, which represents the most recent snapshot

(e.g. proﬁle picture, videos, textual posts etc.) of

her proﬁle. As shown in Fig. 1, the data-item cor-

responding to the user v

is denoted as d(v

), that

of user v

is denoted as d(v

) and so on. Addition-

Atrey, A., Van Seghbroeck, G., Mora, H., De Turck, F. and Volckaert, B.

Unifying Data and Replica Placement for Data-intensive Services in Geographically Distributed Clouds.

DOI: 10.5220/0007613400250036

In Proceedings of the 9th International Conference on Cloud Computing and Services Science (CLOSER 2019), pages 25-36

ISBN: 978-989-758-365-0

[Figure 1 panels: a table listing each user and her list of friends; the set of data-items d(v); and three placements of the data-items across the Virginia, Tokyo, Frankfurt and Sydney data centers, labeled (1) Data Placement, (2) Replication, and (1 + 2) CPR.]

Figure 1: Location-based OSN service. The standard data placement process (in green), where the data-items are first placed in data centers (black dots) and then replicated (red dots); and the proposed CPR paradigm (in magenta), where both data placement and replication are performed in a single step.

ally, each user can register a check-in, which is as-

signed to a data center nearest (in geographical dis-

tance) to her check-in location. Each user check-in

requires retrieval of data from her friends, constituting

a data request pattern triggered by this check-in. For

example, while registering a check-in in Virginia the

user v

may want to tag/mention some of her friends.

This would require the data-items corresponding to

her friends {v

} to be available at the Virginia

data center, thereby triggering a data request for trans-

ferring data-items {d(v

),d(v

)} to Virginia.

Motivated by the use-case discussed above, the

problem of combined data placement and replication

(CPR) for data-intensive services in data centers that

are distributed geographically across the world is the

topic of research tackled in this paper.

A careful analysis of the objectives of the general-

ized data placement problem (Golab et al., 2014) indi-

cates that both data and replica placement are tightly

coupled, and should be considered as part of a single

optimization problem. In fact, data placement without

replication is a specialized instance of the generic data

placement problem. Additionally, support for repli-

cation is important to ensure fault tolerance. Thus,

replica placement or replication is an integral part of

the data placement problem. Having said that, despite

continued and rigorous advancement of the state-of-

the-art in the ﬁeld of data placement (Golab et al.,

2014; Yu and Pan, 2017; Atrey et al., 2018), to the

best of our knowledge and as detailed in Sec. 2, none

of the existing techniques are capable of jointly per-

forming data and replica placement. More speciﬁ-

cally, most of the techniques in the literature employ

an ad hoc two-phase approach – data placement fol-

lowed by replication (Fig. 1) – each independently

catering to a sub-part of the overall optimization ob-

jective, which results in solutions of inferior quality.

To bridge this gap, a uniﬁed paradigm of com-

bining data and replica placement, CPR, for data-

intensive services in geo-distributed clouds is pro-

posed in this paper. We propose an overlapping corre-

lation clustering algorithm to solve the joint optimiza-

tion problem of data and replica placement. Speciﬁ-

cally, overlapping clustering enables a data-item to be

assigned to multiple data centers, thereby facilitating

both data placement and replication in a single step.

Key contributions of this work are as follows:

• We study a novel variant of the data placement

problem, CPR, for data-intensive services in geo-

distributed data centers, which aims at unifying

data and replica placement as a joint optimization

problem (Sec. 3).

• We propose a novel algorithm using overlapping

correlation clustering on hypergraphs, which can

assign the same data-item to different data centers

(Sec. 4). This facilitates jointly solving the data

placement and replication problem, as opposed to

the existing two-phase approach employed by the

state-of-the-art. The proposed algorithm solves a

multi-objective optimization, where among others,

it facilitates optimization of latency, storage cost,

inter data center trafﬁc, and data center span.

• Through extensive experiments on a real-world

trace-based social network dataset (Sec. 5), we

show that the proposed overlapping clustering al-

gorithm is scalable and effective.

CLOSER 2019 - 9th International Conference on Cloud Computing and Services Science

2 RELATED WORK

The data placement problem, being a fundamental

question in computer science, has witnessed a huge

amount of research over the years, with techniques

being developed for different execution environments

such as: grid (Kosar and Livny, 2004; Kosar and

Livny, 2005; Ding and Lu, 2009), distributed (Cher-

venak et al., 2007; Golab et al., 2014), and cloud

computing (Yu et al., 2012; Guo and Wang, 2013; Li

et al., 2017; Ferdaus et al., 2017) environments. From

the perspective of the type of workloads, traditionally

the focus was on scientiﬁc workﬂows (Yuan et al.,

2010; Liu and Datta, 2011; Ebrahimi et al., 2015) and

relational workloads such as database joins (Golab

et al., 2014), however, of late the focus is moving to-

wards workloads arising from niche applications such

as OSN services (Jiao et al., 2014; Han et al., 2017)

and data intensive services in geo-distributed clouds

(Agarwal et al., 2010; Yu and Pan, 2015; Yu and

Pan, 2016; Zhang et al., 2016; Yu and Pan, 2017).

Since the focus of this paper is on combining data

placement and replication for data-intensive services

in geo-distributed data centers, we present a review of

existing research that overlaps with our work.

Any successful solution to the data placement

problem in geo-distributed data centers should pro-

vide two capabilities, namely – capturing and improv-

ing (1) data-item – data-item associations (i.e., the

number of times two data-items were requested to-

gether); and (2) data-item – data center associations

(i.e., the number of times a data-item was requested

at a given data center). On the one hand, a frequent

pattern mining based technique proposed by (Nish-

tala et al., 2013), and a hierarchical clustering algo-

rithm on correlations between data-items proposed

by (Zhao et al., 2016b; Zhao et al., 2016a), facil-

itate management of data-item – data-item associa-

tions. On the other hand, methods proposed by (Agar-

wal et al., 2010; Huguenin et al., 2012; Rochman

et al., 2013; Zhang et al., 2016) facilitate capturing

of data-item – data center associations. Speciﬁcally,

(Agarwal et al., 2010) proposed a system Volley that

analyzes logs of data center requests to perform au-

tomatic data placement in geographically distributed

data centers. (Rochman et al., 2013) design robust

data placement algorithms to ensure that a large frac-

tion of region speciﬁc requests is served at a lower

cost, while managing the highly dynamic nature of

user requests. (Zhang et al., 2016) propose an inte-

ger programming based algorithm for minimizing the

data communication cost while honoring the data cen-

ter storage capacities. However, these methods do not

possess both the aforementioned capabilities.

Of late, literature has seen an increased use of

hypergraph-based techniques for data placement in

geo-distributed clouds, which also constitute the cur-

rent state-of-the-art. Yu et al. (Yu and Pan, 2015;

Yu and Pan, 2017) propose data placement strategies

using hypergraph modeling and publicly available

partitioning heuristics (Catalyurek, 2011) for data-

intensive services. While hypergraph-based modeling

facilitates capturing of both data-item – data-item and

data-item – data center associations, the hypergraph

partitioning heuristics available in (Catalyurek, 2011)

facilitate these techniques to scale to large datasets.

Recently, Atrey et al. (Atrey et al., 2018) presented

a spectral clustering algorithm that employed the use

of low-rank approximations of the hypergraph lapla-

cian to obtain superior efﬁciency and scalability while

retaining the same efﬁcacy as portrayed by (Yu and

Pan, 2017). Hypergraph based partitioning solutions

(Catalyurek et al., 2007) have also been used in grid

and distributed computing environments.

There also exists research pertaining to other

aspects of geo-distributed data placement, such as

placement in multi-clouds and the design of special-

ized replication strategies. The technique proposed

by (Jiao et al., 2014) facilitates data placement in a

multi-cloud environment and solves a multi-objective

optimization to minimize the carbon footprint and in-

ter cloud trafﬁc. (Han et al., 2017) introduce an al-

gorithm to perform data migration decisions for OSN

services in a multicloud environment, which is capa-

ble of adapting to the changing data trafﬁc. Location-

aware replication strategies

capable of optimizing on

metrics such as location of geo-distributed data cen-

ters, and inter data center communication costs were

proposed by (Shankaranarayanan et al., 2014). How-

ever, none of these techniques capture both data-item

– data-item and data-item – data center associations.

Thus, we choose the techniques presented by (Yu and

Pan, 2017) and (Atrey et al., 2018) as the representa-

tive state-of-the-art methods for comparison.

Having said that, to the best of our knowledge,

none of the existing state-of-the-art methods de-

scribed above are capable of unifying data and replica

placement as a joint optimization problem. Speciﬁ-

cally, as discussed in Sec. 1 the techniques existing

in the literature employ a two-phase approach, where

replication is performed as an independent step af-

ter obtaining the data-item assignments from a data

placement algorithm. This ad hoc two-phase approach has several disadvantages, such as sub-optimal replica placement quality and decreased efficiency owing to solving two independent optimization problems instead of one. To this end, the research presented in this paper proposes a unified data placement paradigm, CPR, capable of jointly performing data and replica placement of data-intensive services into geographically distributed clouds, through a novel approach of data partitioning using overlapping correlation clustering on hypergraphs. More fundamentally, overlapping clustering allows a data-item to be assigned to multiple data centers (clusters) at the same time, thereby facilitating both data and replica placement in a single step. In other words, the proposed overlapping clustering based algorithm provides a unified solution to the combined data and replica placement problem for data-intensive services.

Footnote: The reader is referred to (Grace and Manimegalai, 2014) for a survey of replica placement algorithms.

3 PROBLEM STATEMENT

Given a set of data-items, data centers, data request

patterns, and the replication factor, the objective of

the combined (or generalized) data and replica place-

ment problem (CPR) is to intelligently place the data-

items, allowing for replication wherever applicable,

across data centers so as to minimize the overall com-

munication cost resulting from migration/replication

of data-items corresponding to different data requests.

Note that analogous to most of the techniques in

data placement literature (Golab et al., 2014; Ferdaus

et al., 2017; Zhao et al., 2016b; Yu and Pan, 2017;

Atrey et al., 2018), we consider the system work-

load represented by data request patterns to be static.

Thus, the proposed algorithm is ofﬂine, and has to be

re-executed from scratch to accommodate changes in

the system workload. Designing algorithms that can

accommodate changes in the workload in an online

manner is left as future work.

Next, we introduce some basic concepts of data

placement in the context of OSN services, followed

by a formal description of the CPR problem for data-

intensive services in geo-distributed data centers.

A location based online social network (Fig. 1)

possesses two aspects: (1) a social network connect-

ing users with their friends, and (2) a capability for

the users to register check-ins at potentially different

locations across the globe.

Definition 1 (Social Network G(V,E)). A social network with n individuals and m social ties can be denoted as a graph G(V,E), where V is the set of nodes representing the users of the social network, |V| = n, and E is the set of edges (representing friend relationships) between any two nodes, E ⊆ V × V, |E| = m.

Footnote: Migration or replication of data-items may involve additional overheads such as data-item retrieval delays, packet loss, etc. For the sake of brevity, the focus of this paper is on minimizing the communication cost alone; however, the proposed data placement algorithm is generic and not restricted in its scope by this assumption.

In the context of data placement, a data-item is an

atomic unit of data storage and transfer. Thus, for the

OSN use-case:

Deﬁnition 2 (Data-items (D).). A data-item is de-

ﬁned as the most recent snapshot of a user’s proﬁle

(e.g. proﬁle picture, posts, comments etc.). The set

D contains n data-items corresponding to each user

v ∈ V of the social network, where the data-item for a

user v is denoted as d(v).

Moving ahead, a check-in depicts a social net-

work user visiting any location in the world. Each

user check-in is composed of two parts: (1) a location

where the check-in was recorded, and (2) a data re-

quest pattern triggered by the check-in. As discussed

in Sec. 1, the location of a user check-in is decided

as the location of a data center closest (in distance) to

the actual physical location of the user check-in.

Definition 3 (Data centers ($\mathcal{L}$)). A data center constitutes a set of resources to store the data-items and perform different computational tasks on the stored data-items. Each data center is hosted at a location $L_j \in \mathcal{L}$, where $\mathcal{L}$ denotes the set of data center locations and $|\mathcal{L}| = l$.

Further, the data request corresponding to a user

check-in requires retrieval of the data-items of her

friends (Sec. 1). Usually for large scale systems such

as OSNs, the data-items (proﬁles of OSN users) are

distributed across data centers and might require mi-

gration/replication from one data center to another.

The data-items that are potential candidates for mi-

grations constitute a data request, which is formally

deﬁned as follows.

Definition 4 (Data request patterns ($\mathcal{R}$)). A data request pattern $R(v) \in \mathcal{R}$ corresponding to a check-in by a user v at the data center location $L_j$ comprises the set of data-items corresponding to all the friends of v. Mathematically, $R(v) = \{d(u) \mid u \in Adj(v)\}$. Further, R(v) denotes the set of data-items that are required to be collocated in the same data center $L_j$, and those that are not stored in $L_j$ are communicated (either migrated or replicated) from the data centers in which they are stored to $L_j$. The set of data request patterns, denoted as $\mathcal{R}$, represents the system workload.

As an example, the data request pattern for a

check-in by the user v

(Fig. 1) is denoted as R(v

) =

{d(v

),d(v

)}. Given this information, a check-in is

formally deﬁned as follows:


Definition 5 (Check-ins ($\mathcal{C}$)). A check-in is a tuple $C_k = (R(v), L_j) \in \mathcal{C}$, $\forall k$, $1 \le k \le \rho$, consisting of a data request pattern $R(v) \in \mathcal{R}$ triggered by v and a location $L_j \in \mathcal{L}$ of a data center capable of serving user requests. The set $\mathcal{C}$ contains ρ user check-ins.

In other words, the check-in C

by a user v at a

location L

signiﬁes a request for the data-items con-

tained in R(v) triggered from the data center located at

. For example, if the user v

in Fig. 1 was the ﬁrst

to register a check-in among all other users, which

was recorded at the L

=Tokyo data center, then C

(R(v

),L

), where R(v

) = {d(v

),d(v

)}.

Note that a user can register multiple check-ins

at the same location, and to better capture data-

item – data-item and data-item – data center associ-

ations, each individual check-in is treated as differ-

ent from the other. For example, if two data-items

d(v

) and d(v

) are requested together seven times

they would possess a stronger data-item – data-item

association than data-items d(v

) and d(v

) that co-

exist in data request patterns just twice. Similarly, if

a user v

visited Sydney ﬁve times the data-items in

R(v

) would possess a stronger data-item – data cen-

ter association with Sydney when compared to that

of any other data center which was visited less often.

To capture this, for each check-in by the user v

=Sydney there would be 5 different check-ins de-

noted as C

,...,C

k+5

, each composed of the data re-

quest pattern R(v

) and the location L

. Moreover,

this also substantiates the reason behind not indexing

each user check-in uniquely using data request pat-

terns R and locations L
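The counting behind these two kinds of association can be sketched as follows. This is a minimal illustration only: the three-user friendship graph, the check-in list and the names `adj`, `request_pattern`, `item_item` and `item_center` are hypothetical and are not taken from Fig. 1 or from the evaluated dataset. Repeated check-ins are deliberately kept as separate entries, so that frequently co-requested data-items and frequently visited data centers accumulate larger counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy friendship graph: user -> set of friends, i.e. Adj(v).
adj = {
    "v1": {"v2", "v3"},
    "v2": {"v1", "v3"},
    "v3": {"v1", "v2"},
}

def request_pattern(user):
    """R(v): the data-items of all friends of v, identified by user id."""
    return frozenset(adj[user])

# Each check-in is a (user, data center) pair; repeats are kept on purpose.
checkins = [("v1", "Sydney")] * 5 + [("v2", "Tokyo")] * 2

item_item = Counter()    # how often two data-items are requested together
item_center = Counter()  # how often a data-item is requested at a data center

for user, center in checkins:
    items = sorted(request_pattern(user))
    for pair in combinations(items, 2):
        item_item[pair] += 1
    for item in items:
        item_center[(item, center)] += 1
```

In this toy workload the data-items of v2 and v3 are requested together five times, all from Sydney, so both their mutual association and their association with the Sydney data center outweigh those created by the two Tokyo check-ins.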

Having deﬁned the basic concepts and their nota-

tions, we formally deﬁne the CPR problem as:

Problem. Given a set of n data-items $\mathcal{D}$ corresponding to the set of social network users V; ρ user check-ins $C_k = (R(v), L_j) \in \mathcal{C}$, $v \in V$, $L_j \in \mathcal{L}$, representing the system workload, each comprising a data request pattern R(v) originating from a data center located at $L_j$; a set of l data centers with locations in $\mathcal{L}$, with the per unit cost of outgoing traffic from each data center $\Gamma(L_j)$, $L_j \in \mathcal{L}$, the per unit storage cost of each data center $S(L_j)$, $L_j \in \mathcal{L}$, and the inter data center latency (directed) for each pair of data centers $\kappa(L_j, L_{j'})$, $L_j, L_{j'} \in \mathcal{L}$; the average number of data centers spanned by the data-items corresponding to each request pattern R(v), denoted $\mathcal{N}(R(v))$; and the replication factor r, perform combined data and replica placement to minimize the optimization objective $\mathcal{O}$, which is defined as the weighted average of $\Gamma(\cdot)$, $\kappa(\cdot,\cdot)$, $S(\cdot)$, and $\mathcal{N}(\cdot)$.

Footnote: The weights determine the relative importance of these metrics towards the overall optimization objective, and are discussed in Sec. 4.1.

4 OVERLAPPING CLUSTERING

ON HYPERGRAPHS

Given the set of data-items D and the set of user

check-ins C representing the system workload, the

ﬁrst step is to construct a hypergraph. This results

in a higher-order representation of the interaction be-

tween the data-items and the data centers in the form

of the hypergraph incidence matrix Π, and the hyperedge weight matrix $W^{\Pi}$ representing the relative

importance of the constructed hyperedges. The next

step is to partition the set of data-items D into l data

centers such that each data-item d(v) ∈ D is assigned

to r < l (allowing for replication) data centers, which

is achieved using the proposed overlapping clustering

algorithm. An overview of the proposed technique is

presented in Fig. 2.

4.1 Hypergraph Construction

Literature on data placement of data intensive services

(Yu and Pan, 2017; Atrey et al., 2018) has provided

sufﬁcient evidence in support of hypergraphs as the

most suitable choice for modeling the interactions be-

tween both data-items – data-items and data-items –

data centers. Hyperedges can model relationships between several vertices, as opposed to just a pair of vertices in traditional graphs. Being a more sophisticated construct, a hypergraph $H(V^{\Pi}, E^{\Pi})$ serves as a generalization over a graph G(V,E). With this

capability to capture multi-way relationships, hyper-

graphs provide a powerful representation to model

data-item – data-item and data-item – data center as-

sociations.

The system workload represented using user

check-ins gives rise to two types of hyperedges E

(1) hyperedges corresponding to data request patterns

R that connect all the data-items (data-item – data-

item association) in a data request triggered from a

user check-in; and (2) hyperedges R

that connect the

data-items requested in a check-in with the data center

location (data-item – data center association) where

the check-in was registered. Thus, the set of vertices in the constructed hypergraph consists of the set of data-items D and data center locations L, totaling $|V^{\Pi}| = n^{\Pi} = n + l$ vertices and $|E^{\Pi}| = m^{\Pi} = r + nl$ hyperedges. Eq. 1 formally defines these two sets.

= D ∪ L

= R ∪ R

(1)

Given that there are two different types of hyper-

edges, there are two types of weights corresponding

to either type, with each focused towards optimizing


[Figure 2 boxes: Input: Set of Data-items (𝒟), Set of User Check-ins (𝒞), User Relationships (OSN), Workload; Construct Hypergraph; Incidence Matrix (Π); Calculate Hyperedge Weights; Hyperedge Weight Matrix (W^Π); Overlapping Correlation Clustering; CPR Output (Virginia, Frankfurt).]

Figure 2: Overview of the proposed overlapping correlation clustering technique for combined data and replica placement.

a different aspect of the problem. More speciﬁcally,

the weight W

corresponds to the data request pattern

hyperedges, and its aim is to facilitate minimization of

N (R

): the average number of data centers accessed

by a data request pattern R

, which is achieved by en-

forcing the data-items that are requested together to

be placed together. On the other hand, the weights

) correspond to the data-item – data

center hyperedges, and their aim is to minimize inter

data center latency κ(L

), storage cost S(L

), and

cost of outgoing trafﬁc Γ(L

) respectively, by giving

higher priority to placing data-items at data center lo-

cations from where they have been requested more

frequently. The resultant hyperedge weight matrix

is then constructed as the weighted sum of the four

weights discussed above, which is mathematically de-

ﬁned as:

= W · (W

). (2)

where, W

is a diagonal matrix of size m

× m

, and

W is the weight vector for deciding the priorities of

the previously discussed hyperedge weighting strate-

gies

The hypergraph $H(V^{\Pi}, E^{\Pi})$ is represented using an $n^{\Pi} \times m^{\Pi}$ dimensional hypergraph incidence matrix Π, which possesses $m^{\Pi}$ hyperedges, where each hyperedge is an $n^{\Pi}$-dimensional binary column vector. It is formally defined as:

$$\forall\, he_i \in E^{\Pi}:\quad he_i = [he_{1,i}, he_{2,i}, \ldots, he_{n^{\Pi},i}]^{T}, \qquad \Pi = [he_1, he_2, \ldots, he_{m^{\Pi}}]. \tag{3}$$

An entry $he_{j,i} = 1$ indicates that the j-th vertex in the hypergraph vertex set participates in the i-th hyperedge, while $he_{j,i} = 0$ indicates otherwise.
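As an illustration of Eq. 3, the incidence matrix can be assembled from a list of hyperedges as sketched below. The helper `incidence_matrix`, the vertex names and the four toy hyperedges are assumptions made for this example; they are not the construction code or the data used in the paper.

```python
def incidence_matrix(vertices, hyperedges):
    """Return the incidence matrix as rows: pi[j][i] = 1 iff vertex j is in hyperedge i."""
    index = {v: j for j, v in enumerate(vertices)}
    pi = [[0] * len(hyperedges) for _ in vertices]
    for i, edge in enumerate(hyperedges):
        for v in edge:
            pi[index[v]][i] = 1
    return pi

# Hypothetical toy instance: three data-items and two data center locations.
vertices = ["d1", "d2", "d3", "Tokyo", "Sydney"]
hyperedges = [
    {"d1", "d2"},            # data-item - data-item hyperedge
    {"d2", "d3"},            # data-item - data-item hyperedge
    {"d1", "d2", "Tokyo"},   # data-item - data center hyperedge
    {"d2", "d3", "Sydney"},  # data-item - data center hyperedge
]
PI = incidence_matrix(vertices, hyperedges)
```

Each column of `PI` is one binary hyperedge vector, and each row records the hyperedges in which a data-item or data center location participates.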

Overlapping correlation clustering requires as input a similarity matrix denoting similarities between each vertex pair in the (hyper)graph. To this end, we construct the normalized hypergraph matrix $N^{\Pi}$, which requires two additional operations on the hypergraph incidence matrix Π. We compute two diagonal matrices, the vertex degree matrix ($D_{v\Pi}$) and the hyperedge degree matrix ($D_{he\Pi}$), of dimensionality $n^{\Pi} \times n^{\Pi}$ and $m^{\Pi} \times m^{\Pi}$ respectively. The vertex degree matrix captures the number of hyperedges each vertex of the hypergraph is a part of, while the hyperedge degree matrix measures the number of vertices contained in each hyperedge. Mathematically,

$$D_{v\Pi} = \mathrm{diag}\Big(\sum \Pi\Big). \tag{4}$$

$$D_{he\Pi} = \mathrm{diag}\Big(\sum \Pi^{T}\Big). \tag{5}$$

where $\sum X$ represents the row-wise sum of the input matrix X and $X^{T}$ represents the transpose of the matrix X.

With this, the normalized hypergraph matrix $N^{\Pi}$ is mathematically defined as:

$$N^{\Pi} = D_{v\Pi}^{-1/2} \cdot \Pi \cdot W^{\Pi} \cdot D_{he\Pi}^{-1} \cdot \Pi^{T} \cdot D_{v\Pi}^{-1/2} \tag{6}$$

where $D_{v\Pi}$ is an $n^{\Pi} \times n^{\Pi}$ diagonal vertex degree matrix, $D_{he\Pi}$ is an $m^{\Pi} \times m^{\Pi}$ diagonal hyperedge degree matrix, and $W^{\Pi}$ is an $m^{\Pi} \times m^{\Pi}$ diagonal hyperedge weight matrix. Thus, $N^{\Pi}$ becomes an $n^{\Pi} \times n^{\Pi}$ matrix.

Footnote: For additional details about the hypergraph construction and the hyperedge weight calculation steps, the reader is referred to (Yu and Pan, 2017; Atrey et al., 2018).
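A direct, unoptimized reading of Eqs. 4 to 6 is sketched below in plain Python. The function name `normalized_hypergraph_matrix`, the toy incidence matrix and the unit weights are assumptions made for illustration; the sketch further assumes that every vertex belongs to at least one hyperedge and that no hyperedge is empty, since otherwise the degree terms would be zero. A practical implementation would rely on sparse matrix routines.

```python
import math

def normalized_hypergraph_matrix(pi, weights):
    """N = Dv^(-1/2) . Pi . W . Dhe^(-1) . Pi^T . Dv^(-1/2) for a 0/1 incidence matrix."""
    n, m = len(pi), len(pi[0])
    d_v = [sum(row) for row in pi]                               # Eq. 4: vertex degrees
    d_he = [sum(pi[j][i] for j in range(n)) for i in range(m)]   # Eq. 5: hyperedge degrees
    norm = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            shared = sum(pi[a][i] * weights[i] * pi[b][i] / d_he[i] for i in range(m))
            norm[a][b] = shared / math.sqrt(d_v[a] * d_v[b])     # Eq. 6, entry (a, b)
    return norm

# Hypothetical toy input: three vertices, two hyperedges, unit hyperedge weights.
pi = [[1, 0],
      [1, 1],
      [0, 1]]
N = normalized_hypergraph_matrix(pi, weights=[1.0, 1.0])
```

Vertices that never share a hyperedge (the first and third rows here) receive a similarity of zero, while the result is symmetric by construction.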

4.2 Overlapping Correlation Clustering

We begin with a description of correlation clustering: given a complete graph with edges labeled as positive or negative, the objective is to identify a partitioning of the graph that minimizes the sum of the number of positively labeled edges cut and the number of negatively labeled edges not cut by the partition. In the current scenario, the input is a normalized hypergraph similarity matrix $N^{\Pi}$ representing the pair-wise similarity between data-items, and a set $\mathcal{L}$ of l labels representing the data center locations. The task of correlation clustering is to find a mapping $F : V^{\Pi} \to \mathcal{L}$ for partitioning the set of data-items into l data centers, that minimizes the following loss function:

$$L_{Correlate}(V^{\Pi}, F) = \sum_{\substack{(u,v) \in V^{\Pi} \times V^{\Pi} \\ F(u) = F(v)}} \big(1 - N^{\Pi}(u,v)\big) \;+ \sum_{\substack{(u,v) \in V^{\Pi} \times V^{\Pi} \\ F(u) \neq F(v)}} N^{\Pi}(u,v). \tag{7}$$
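The loss of Eq. 7 can be evaluated as in the following sketch. The function `correlate_loss` and the similarity values are hypothetical; as a simplifying assumption the diagonal pairs (u, u) are skipped, since they contribute the same constant for every mapping F and therefore do not affect which mapping is optimal.

```python
from itertools import product

def correlate_loss(sim, labels):
    """Eq. 7 over ordered pairs u != v: co-located pairs pay 1 - sim, separated pairs pay sim."""
    loss = 0.0
    for u, v in product(range(len(labels)), repeat=2):
        if u == v:
            continue  # constant term, independent of the mapping
        if labels[u] == labels[v]:
            loss += 1.0 - sim[u][v]
        else:
            loss += sim[u][v]
    return loss

# Hypothetical similarities: items 0 and 1 are strongly related, item 2 is not.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
good = correlate_loss(sim, labels=[0, 0, 1])  # related items share a data center
bad = correlate_loss(sim, labels=[0, 1, 1])   # related items are separated
```

Placing the two strongly related data-items in the same data center yields the smaller loss, which is the behavior the objective is meant to reward.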


Algorithm 1: Overlapping Clustering Algorithm.

Input: Π, $W^{\Pi}$, l, r, Φ

Output: Partitioning $P(V^{\Pi})$ of the hypergraph vertex set into l clusters allowing r replicas

1: $D_{v}(\Pi) \leftarrow \mathrm{diag}(\sum \Pi)$; $D_{he}(\Pi) \leftarrow \mathrm{diag}(\sum \Pi^{T})$

2: Compute the normalized hypergraph matrix $N^{\Pi}$ as described in Eq. 6

3: Randomly initialize the label sets of size r for each data-item $u \in V^{\Pi}$

4: while $L_{Overlap}(V^{\Pi}, F)$ decreases do

5:     for each $u \in V^{\Pi}$ do

6:         find the label set $F'$ that minimizes $L_{Overlap,u}(F' \mid F)$

7:         Update $F(u) \leftarrow F'$

8:     end for

9: end while

10: return $P(V^{\Pi})$ defined by F

As discussed in Sec. 1, the goal of overlapping clustering is to partition the set of data-items $\mathcal{D}$ into l data centers, with each data-item being assigned to more than one data center to appropriately allow for replication. To achieve this, instead of mapping each data-item to a single label (corresponding to a data center), it is mapped to a set of labels, thereby allowing each data-item to be associated with multiple data centers. Given the label set definition as the set of all subsets of data center locations $\mathcal{L}$ except the empty set, $\mathcal{L}_{+} = 2^{\mathcal{L}} \setminus \{\emptyset\}$, and a similarity function over the data-item label sets S(·), the underlying optimization objective reduces to identifying a mapping $F : V^{\Pi} \to \mathcal{L}_{+}$ under which the similarity between any pair of data-items $N^{\Pi}(u,v)$, $\forall u,v \in V^{\Pi}$, agrees as much as possible with the similarity between their corresponding label sets S(F(u),F(v)).

Similar to the loss function for correlation clustering $L_{Correlate}$, the loss function for overlapping correlation clustering is defined as:

$$L_{Overlap}(V^{\Pi}, F) = \sum_{(u,v) \in V^{\Pi} \times V^{\Pi}} \big|S(F(u),F(v)) - N^{\Pi}(u,v)\big| = \sum_{u \in V^{\Pi}} \sum_{v \in V^{\Pi} \setminus \{u\}} \big|S(F(u),F(v)) - N^{\Pi}(u,v)\big|. \tag{8}$$

where S(·) is defined as the set-intersection indicator function:

$$S(X,Y) = \begin{cases} 1, & \text{if } X \cap Y \neq \emptyset \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$

Formally, the goal of overlapping clustering is to find a mapping $F^{*}$ that minimizes $L_{Overlap}(V^{\Pi}, F)$, which is mathematically denoted as:

$$F^{*} = \operatorname*{argmin}_{F} \; L_{Overlap}(V^{\Pi}, F). \tag{10}$$
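The label-set space and the loss of Eqs. 8 and 9 can be made concrete with the sketch below. The helper names `label_sets`, `label_similarity` and `overlap_loss`, together with the similarity values and the example assignment, are assumptions for illustration. The optional `size` argument restricts the enumeration to label sets of exactly that size, which is how a fixed replication factor r could be imposed.

```python
from itertools import combinations

def label_sets(centers, size=None):
    """Non-empty subsets of `centers`; only subsets of one size if `size` is given."""
    sizes = [size] if size is not None else range(1, len(centers) + 1)
    return [frozenset(c) for k in sizes for c in combinations(centers, k)]

def label_similarity(x, y):
    """Eq. 9: 1 if the two label sets share at least one data center, else 0."""
    return 1 if x & y else 0

def overlap_loss(sim, assignment):
    """Eq. 8: sum over ordered pairs u != v of |S(F(u), F(v)) - sim(u, v)|."""
    n = len(assignment)
    return sum(
        abs(label_similarity(assignment[u], assignment[v]) - sim[u][v])
        for u in range(n) for v in range(n) if u != v
    )

centers = ["Virginia", "Tokyo", "Frankfurt", "Sydney"]
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]  # hypothetical similarity values
assignment = [frozenset({"Tokyo", "Sydney"}),
              frozenset({"Tokyo", "Virginia"}),
              frozenset({"Frankfurt", "Virginia"})]
loss = overlap_loss(sim, assignment)
```

With four data centers there are 15 non-empty label sets in total and 6 of size two. In the example assignment, the second and third data-items share Virginia although their similarity is low, and this pair accounts for most of the loss.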

Overlapping correlation clustering was shown to be NP-hard in (Bonchi et al., 2013); thus, no polynomial-time algorithm is known that solves it exactly, and none exists unless P = NP. To this end, we propose a greedy algorithm that iteratively refines the quality of the label

sets of one vertex (data-item) in the hypergraph at

a time. More speciﬁcally, given the label sets of all

the other vertices in the hypergraph, the greedy algo-

rithm applies a local optimization (on one vertex) to

improve the cost of the overall solution until conver-

gence. Algorithm 1 presents the pseudo-code of the

proposed overlapping clustering algorithm.

We begin by computing the normalized hypergraph matrix $N$ as described in Eq. 6 (lines 1–2). Next, each vertex $u \in V$ is initialized with a random label set of size r (equal to the replication factor), thereby allowing each data-item to be assigned to r data centers (line 3). After the initialization, we apply the greedy local optimization approach described above (lines 4–9). More specifically, the label set of each node $u \in V$ is iteratively improved, keeping the label sets of all the other nodes fixed, until the overall loss $\mathcal{L}_{Overlap}(V, F)$ converges. To understand the loss with respect to each node u, we rewrite Eq. 8 as:

$$\mathcal{L}_{Overlap}(V, F) = \sum_{u \in V} \mathcal{L}_{Overlap}(F(u) \mid F) \qquad (11)$$

where

$$\mathcal{L}_{Overlap}(F(u) \mid F) = \sum_{v \in V \setminus \{u\}} |S(F(u), F(v)) - N(u,v)| \qquad (12)$$
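The greedy refinement described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 1: the local step is assumed to be a brute-force search over all label sets of size exactly r, the similarity matrix is assumed symmetric, and the names `node_loss`, `greedy_overlap`, `seed`, and `max_rounds` are hypothetical.

```python
import random
from itertools import combinations


def node_loss(u, candidate, labels, similarity):
    """Eq. 12: loss of node u if its label set were `candidate`, others fixed."""
    return sum(
        abs((1 if candidate & labels[v] else 0) - similarity[u][v])
        for v in labels
        if v != u
    )


def greedy_overlap(similarity, num_centers, r, seed=0, max_rounds=100):
    """Re-label one node at a time until no single node can improve."""
    rng = random.Random(seed)
    # Assumption: every label set holds exactly r data centers (0..num_centers-1).
    candidates = [frozenset(c) for c in combinations(range(num_centers), r)]
    # Random initialization with label sets of size r.
    labels = {u: rng.choice(candidates) for u in similarity}
    for _ in range(max_rounds):
        improved = False
        for u in labels:
            current = node_loss(u, labels[u], labels, similarity)
            best = min(candidates, key=lambda c: node_loss(u, c, labels, similarity))
            # Accept strict improvements only; ties keep the current label set.
            if node_loss(u, best, labels, similarity) < current:
                labels[u] = best
                improved = True
        if not improved:
            break
    return labels


# Hypothetical toy instance: items a, b are co-requested, as are c, d.
similarity = {
    "a": {"b": 1.0, "c": 0.0, "d": 0.0},
    "b": {"a": 1.0, "c": 0.0, "d": 0.0},
    "c": {"a": 0.0, "b": 0.0, "d": 1.0},
    "d": {"a": 0.0, "b": 0.0, "c": 1.0},
}
placement = greedy_overlap(similarity, num_centers=4, r=2)
print(placement)
```

With a symmetric similarity matrix, every accepted move lowers the overall loss of Eq. 11, so the loop stops at a local optimum. The exhaustive candidate search is only practical for small l and r (84 candidates for l = 9, r = 3).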

5 EXPERIMENTS

In this section, the proposed CPR paradigm and overlapping clustering algorithm are evaluated through experiments on a large-scale location-based OSN.

5.1 Dataset

Following the literature on data placement in geo-distributed cloud services (Yu and Pan, 2017; Atrey et al., 2018), we extract a trace from the Gowalla social network dataset, publicly available from the SNAP repository (sna, 2017). The social network consists of 196591 vertices, corresponding to the users of the social network, and 950327 edges, each representing a friend relationship between two users. Additionally, the dataset contains 6442890 user check-ins registered from February 2009 to October 2010, triggering a total of 102314 data request patterns.

http://snap.stanford.edu/data/loc-gowalla.html


Table 1: (a) Traffic and storage costs, and (b) inter data center latency, based on geo-distributed Amazon clouds.

(a) Costs (in $)

| Region | Storage ($/GB-month) | Outgoing Traffic ($/GB) |
|---|---|---|
| Virginia | 0.023 | 0.02 |
| California | 0.026 | 0.02 |
| Oregon | 0.023 | 0.02 |
| Ireland | 0.023 | 0.02 |
| Frankfurt | 0.025 | 0.02 |
| Singapore | 0.025 | 0.02 |
| Tokyo | 0.025 | 0.09 |
| Sydney | 0.025 | 0.14 |
| Sao Paulo | 0.041 | 0.16 |

(b) Latency (in ms)

| Region | Virginia | California | Oregon | Ireland | Frankfurt | Singapore | Tokyo | Sydney | Sao Paulo |
|---|---|---|---|---|---|---|---|---|---|
| Virginia | 3.523 | 72.738 | 86.981 | 80.546 | 88.657 | 216.719 | 145.255 | 229.972 | 119.531 |
| California | 71.632 | 5.842 | 19.464 | 153.202 | 166.609 | 174.010 | 102.504 | 157.463 | 192.670 |
| Oregon | 88.683 | 19.204 | 5.551 | 136.979 | 159.523 | 161.367 | 89.095 | 162.175 | 182.716 |
| Ireland | 80.524 | 153.220 | 136.976 | 5.005 | 19.560 | 239.023 | 212.388 | 309.562 | 191.292 |
| Frankfurt | 88.624 | 166.590 | 159.542 | 19.533 | 4.425 | 325.934 | 236.537 | 323.483 | 194.905 |
| Singapore | 216.680 | 173.946 | 161.423 | 238.130 | 325.918 | 5.870 | 73.807 | 175.328 | 328.080 |
| Tokyo | 145.261 | 102.523 | 89.157 | 212.388 | 236.558 | 73.785 | 6.846 | 103.907 | 256.763 |
| Sydney | 229.748 | 157.843 | 161.932 | 309.562 | 323.152 | 175.355 | 103.900 | 4.889 | 322.494 |
| Sao Paulo | 119.542 | 192.700 | 181.665 | 191.559 | 194.900 | 327.924 | 256.665 | 322.523 | 6.076 |

5.2 Experimental Setup

All experiments are performed using code written in

C++ on an Intel(R) Xeon(R) E5-2698 28-core ma-

chine with 2.3 GHz CPU and 256 GB RAM run-

ning Linux Ubuntu 16.04. Owing to their non-

deterministic nature, results corresponding to the ran-

dom, hypergraph partitioning, and spectral clustering

methods are averaged over 10 runs.

We simulate a real-world geo-distributed cloud environment based on the AWS global infrastructure (aws, 2017a). To ensure consistency with previous research (Yu and Pan, 2015; Atrey et al., 2018), we use the l = 9 oldest and most prominent AWS data center regions, namely: Virginia, California, Oregon, Ireland, Frankfurt, Singapore, Tokyo, Sydney, and Sao Paulo. To closely mirror the actual AWS setup, the costs for storage and outgoing traffic are as advertised by Amazon. Moreover, the inter data center latencies between the chosen regions are packet transfer latencies measured using the Linux ping command (aws, 2017b). These data center characteristics are presented in Table 1.

An analysis of user check-ins revealed a disparity in check-in behavior across regions. Some data centers (e.g., Virginia and Frankfurt) register a very high number of check-ins, while others (e.g., Sao Paulo and Sydney) receive very few. This affects the amount of storage required at each data center, which depends on both the number of check-ins registered in a region and the size of the data request pattern triggered by each check-in. Based on this, the storage size for each data center region $L_j \in L$ is calculated as $S_j = \sum |R(v)|$, where the sum runs over every check-in $C = (R(v), L_j)$ registered at $L_j$. Let $S = \sum_{j=1}^{l} S_j$ be the total storage size; then the data center storage size follows a multinomial distribution and is calculated as $\Phi \sim [S_1/S, \ldots, S_l/S]$. The expected storage size at each data center calculated using Φ serves as the load-balancing factor, and is input to the data placement algorithm to facilitate load-balancing among the 9 data center regions.
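As a rough illustration of how the load-balancing factor could be derived from a check-in trace, the sketch below sums the request-pattern sizes per region and divides by the total. It assumes that each component of Φ is the region's share S_j / S, and that a check-in is a (request pattern, region) pair; the function name `storage_distribution` and the toy trace are hypothetical.

```python
from collections import defaultdict


def storage_distribution(check_ins, regions):
    """Return the assumed share S_j / S of storage per region.

    check_ins: iterable of (request_pattern, region) pairs, where
    request_pattern is a collection of data-items.
    """
    size = defaultdict(int)
    for pattern, region in check_ins:
        size[region] += len(pattern)  # |R(v)| added to the region's S_j
    total = sum(size.values())
    if total == 0:
        raise ValueError("no check-ins to derive a distribution from")
    return {region: size[region] / total for region in regions}


# Hypothetical toy trace, not taken from the Gowalla dataset.
regions = ["Virginia", "Frankfurt", "Tokyo"]
check_ins = [
    ({"u1", "u2"}, "Virginia"),
    ({"u3"}, "Frankfurt"),
    ({"u1", "u4", "u5"}, "Virginia"),
    ({"u2", "u6"}, "Tokyo"),
]
phi = storage_distribution(check_ins, regions)
print(phi)
```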

To summarize, for the Gowalla dataset the data

placement task reduces to partitioning 196591 data-

items corresponding to the social network users into

9 data centers based on the 102314 data request pat-

terns triggered from user check-ins.

Baselines: We compare the proposed data place-

ment algorithm for effectiveness, efﬁciency, and scal-

ability with four baselines.

• Random: partitions the set of data-items D ran-

domly into |L| data centers.

• Nearest: assigns each data-item to the data cen-

ter from where it has been requested the highest

number of times.

• Hypergraph Partitioning (Hyper): is the data

placement algorithm proposed by (Yu and Pan,

2015; Yu and Pan, 2017), which uses the hyper-

graph partitioning algorithms available in the Pa-

ToH toolkit (Catalyurek, 2011).

• Spectral Clustering (Spectral): is the data place-

ment algorithm proposed by (Atrey et al., 2018),

which uses fast approximate eigen decomposition

methods for efﬁciently performing spectral clus-

tering on hypergraphs.
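For concreteness, the Nearest baseline can be sketched in a few lines of Python. The check-in representation as (request pattern, region) pairs, the alphabetical tie-breaking rule, and the name `nearest_placement` are assumptions of this sketch; the source does not specify how ties are resolved.

```python
from collections import Counter, defaultdict


def nearest_placement(check_ins):
    """Assign each data-item to the region that requested it most often."""
    counts = defaultdict(Counter)
    for pattern, region in check_ins:
        for item in pattern:
            counts[item][region] += 1
    # Assumed tie-break: the alphabetically first region among the most frequent.
    return {
        item: min(per_region, key=lambda reg: (-per_region[reg], reg))
        for item, per_region in counts.items()
    }


# Hypothetical toy trace.
check_ins = [
    (["u1", "u2"], "Virginia"),
    (["u1"], "Tokyo"),
    (["u1", "u3"], "Virginia"),
    (["u2"], "Tokyo"),
]
print(nearest_placement(check_ins))
```

Here `u2` is requested once from each region, so the assumed tie-break places it in Tokyo.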

As discussed in Sec. 2, Hyper (Yu and Pan, 2017)

and Spectral (Atrey et al., 2018) serve as the repre-

sentative state-of-the-art methods for data placement

of data-intensive services in geo-distributed data cen-

ters. To ensure load balancing, all the techniques par-

tition the data-items according to the data center stor-

age size distribution Φ. Thus, the results for the bal-

ance evaluation metric are close to 1 for all the tech-

niques considered in this study.

Parameters. The weight vector W (Eq. 2) facilitates optimization of different objectives by prioritizing different hyperedge weights (Sec. 4.1), which represent different preferences or importance towards the considered evaluation metrics. To this end, we experiment with four settings for W: $W_1 = \{100, 1, 1, 1\}$ to minimize the data center span N(·); $W_2 = \{1, 100, 1, 1\}$ to minimize the inter data center traffic Γ(·); $W_3 = \{1, 1, 100, 1\}$ to minimize the inter data center latency κ(·); and $W_4 = \{1, 1, 1, 100\}$ to minimize the storage cost S(·). Note that in all the weight-vector settings, the value 100 is used only to indicate a higher relative importance of the corresponding metric. The portrayed results are not dependent on the specific value of 100, and can be reproduced with any value as long as it is ≫ 1. Further, Spectral uses the 100 smallest eigenvectors of the hypergraph Laplacian for spectral clustering. Following best practices in data storage management (rep, 2018), the replication factor r was set to 3.

Evaluation Metrics. We consider two categories of

evaluation metrics. The ﬁrst type is concerned with

the efﬁciency of the studied algorithms, while the

other is concerned with their efﬁcacy.

• Efﬁciency: We evaluate the efﬁciency of the

methods using their execution time, i.e., the time

required to produce the data placement output.

• Efﬁcacy: of the studied methods is measured

across the following metrics.

– Span (N (·)): of a data request pattern R(v) is

deﬁned as the average number of data centers

required to be accessed to fetch the data-items

requested in R(v). The span for the entire work-

load is calculated as the average of the data cen-

ter spans of each request pattern R(v) ∈ R .

– Trafﬁc (Γ(·)): The total trafﬁc cost of a data re-

quest pattern R(v) is deﬁned as the sum of out-

going trafﬁc prices of the data centers involved

in outgoing requests for the data-items in R(v).

The trafﬁc cost of the entire workload is calcu-

lated as the sum of trafﬁc costs of each request

pattern R(v) ∈ R .

– Latency (κ(·)): The inter data center latency of

a data request pattern R(v) is calculated as the

sum of access latencies required to fetch all the

data-items requested in R(v) from the data cen-

ter where they are placed to the data center from

where the request was triggered. The latency of

the entire workload is calculated as the sum of

the latencies of each request pattern R(v) ∈ R .

– Storage (S(·)): The total cost of storing all of the data-items corresponding to every data request pattern R(v) ∈ R in the data centers L prescribed by the data placement algorithm.

– Balance: is calculated as the Pearson correlation coefficient between the expected storage size distribution Φ and the actual storage size distribution obtained after performing data placement. A value close to 1 means that the two distributions are highly similar, while a value close to −1 means that they are inversely related.

– Objective (Obj.): is defined as the weighted sum of the considered performance metrics, where the weights are given by the weight vector W.
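The span and latency definitions above can be made concrete with a small Python sketch for a single request pattern. It assumes that, when an item has several replicas, it is read from the replica with the lowest latency to the requesting data center; the source does not state the replica-selection policy, so this is an assumption, as is the function name `serve_request`. The three latency values are taken from Table 1(b).

```python
def serve_request(pattern, origin, placement, latency):
    """Return (span, total latency) for one request pattern triggered at `origin`."""
    used = set()
    total = 0.0
    for item in pattern:
        # Assumed policy: read each item from its lowest-latency replica.
        source = min(placement[item], key=lambda dc: latency[dc][origin])
        used.add(source)
        total += latency[source][origin]
    return len(used), total


# Latencies (ms) towards Virginia, as listed in Table 1(b).
latency = {
    "Virginia": {"Virginia": 3.523},
    "Ireland": {"Virginia": 80.524},
    "Frankfurt": {"Virginia": 88.624},
}
# Hypothetical replica placement for three data-items.
placement = {
    "u1": {"Virginia", "Ireland"},
    "u2": {"Ireland", "Frankfurt"},
    "u3": {"Frankfurt"},
}
span, total_latency = serve_request(["u1", "u2", "u3"], "Virginia", placement, latency)
print(span, round(total_latency, 3))
```

Workload-level values would then be obtained by averaging the spans and summing the latencies over all request patterns, as the definitions state.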

Note that the results portrayed for each evaluation metric (barring Balance) have been normalized to the scale [0,1] by dividing each value by the highest observed value of that metric. For example, let $n_{max} = \max_{R(v) \in R} N(R(v))$ be the highest observed span value; then the span of each data request pattern $R(v) \in R$ is normalized as $N(R(v)) / n_{max}$. A similar operation is performed for the other evaluation metrics as well. Normalization ensures that all the values lie in a common range, thereby ensuring an equal and fair contribution of each evaluation metric towards Obj. Additionally, note that the optimization problem underneath CPR is concerned with the minimization of the evaluation metrics (barring Balance); hence, the smaller the portrayed values, the better the performance.
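The max-normalization, the weighted objective, and the Balance metric are simple enough to sketch directly. The helper names below (`normalize_by_max`, `weighted_objective`, `pearson`) and the example numbers are illustrative assumptions and are not results from the paper.

```python
import math


def normalize_by_max(values):
    """Scale non-negative values into [0, 1] by the largest observed value."""
    peak = max(values)
    return [v / peak for v in values]


def weighted_objective(metrics, weights):
    """Obj: weighted sum of (normalized) span, traffic, latency, and storage."""
    return sum(w * m for w, m in zip(weights, metrics))


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up numbers for illustration only.
spans = normalize_by_max([2, 4, 8])
obj = weighted_objective([0.5, 0.2, 0.1, 0.4], [100, 1, 1, 1])
balance = pearson([0.5, 0.3, 0.2], [0.48, 0.32, 0.20])
print(spans, obj, balance)
```

The weight list `[100, 1, 1, 1]` mirrors the setting that prioritizes span; any value much larger than 1 in that position plays the same role.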

5.3 Evaluation Results: Quality Metrics

Figs. 3–6 present the results on the considered evaluation metrics corresponding to the different weight vector settings $W_1$–$W_4$. The proposed overlapping clustering algorithm (Overlap) performs the best (achieving the lowest value) on the overall optimization objective (Obj) across all the weight vector settings, while being significantly better than the Random and Nearest methods. Additionally, Overlap outperforms Hyper and Spectral by roughly 30–40% and 20–30%, respectively.

Redirecting our focus to the other evaluation metrics, Nearest outperforms Hyper, Spectral, and Overlap in some cases; however, the latter are still significantly better than the Random method. For instance, in Fig. 3 Nearest is better on the inter data center traffic and latency metrics. This is because, under the weight vector setting $W_1$, minimizing the data center span holds the highest priority, while the traffic and latency metrics have lower weights in the optimization objective. A similar behavior is observed for the other three weight vector settings, $W_2$, $W_3$, and $W_4$ (Figs. 4–6). To understand this behavior better, let us analyze the results presented in Fig. 6. It is not hard to infer that storage cost might be inversely related to other metrics such as inter data center latency and traffic. Therefore, preferentially optimizing for lower storage costs ($W_4$), and thereby also obtaining better performance on Obj, might lead a technique to suffer on other metrics, i.e., a lower storage cost might lead to higher latencies or traffic cost. Despite this behavior, most importantly, Overlap significantly outperforms all the considered baselines on the evaluation metric that each weight-vector setting is tuned to optimize. More fundamentally, in addition to being better on Obj, Overlap outperforms the other methods in minimizing the data center span N(·), inter data center traffic cost Γ(·), inter data center latency κ(·), and storage cost S(·), when a higher preference is given to these metrics under the weight-vector settings $W_1$, $W_2$, $W_3$, and $W_4$, respectively.

Figure 3: Overlap reduces the data center span N(·) by ≈ 35% when compared to Spectral with $W_1 = \{100, 1, 1, 1\}$. (Bar chart of the normalized evaluation metrics Obj, Span, Traffic, Latency, Storage, and Balance for Random, Nearest, Hyper, Spectral, and Overlap.)

Moving ahead, we analyze the reason behind the

sub-optimal performance of the Nearest method. The

main limitation is that Nearest is inclined to assign

each data-item to a data center that receives the high-

est number of access requests for that data-item,

which consequently results in minimizing (on an av-

erage) the geographical distance between the data-

item and the source location of the data request. Note

that this optimization strategy is oblivious to the fact

that the storage or trafﬁc costs might not be corre-

lated with the distance, thereby leading to sub-optimal

performance in real-world settings that require multi-objective optimization. We also refer the reader to Table 2, which presents a quantitative summary of the performance of all the considered baselines, indicating how much worse each baseline is relative to Overlap.

Based on the above analysis, it is clear that Hyper, Spectral, and Overlap possess the capability to adapt the optimization based on the input weight vector setting. This is because of their higher-order modeling capabilities, courtesy of hypergraphs, which render them better suited for performing multi-objective optimization. Further, since Overlap models data placement and replication as a joint optimization problem (CPR), it achieves better performance on the evaluation metrics when compared to both Hyper and Spectral, which solve each problem independently.

Figure 4: Overlap reduces the inter data center traffic Γ(·) by ≈ 26% when compared to Spectral with $W_2 = \{1, 100, 1, 1\}$. (Bar chart of the normalized evaluation metrics Obj, Span, Traffic, Latency, Storage, and Balance for Random, Nearest, Hyper, Spectral, and Overlap.)

5.4 Evaluation Results: Efficiency and Scalability

As analyzed in Sec. 5.3, Hyper, Spectral, and Overlap stand out as techniques possessing good quality on the evaluation metrics, with Overlap possessing the best quality (≈ 30–40% better) among them. In this section, we study their execution time performance on the Gowalla dataset. Fig. 7 clearly portrays the superior efficiency of Overlap when compared to Hyper and Spectral: on average, it is ≈ 4–5× faster than Hyper and ≈ 2–3× faster than Spectral across the different weight vector settings. Given the enormous scale of today's real-world social networks, the ability to scale to large datasets is a paramount property for any data placement algorithm. Thus, this capability to efficiently and gracefully scale to large datasets serves as one of the major advantages of Overlap over Hyper and Spectral.

In summary, through extensive experiments we verify that the proposed overlapping clustering algorithm is efficient, scalable, and effective. Additionally, since a single algorithm (unlike previous works) jointly solves the data and replica placement problem, it offers a better and unified system design. Further, the capability to adapt to changes in the weight vector setting W facilitates handling of a variety of real-world scenarios, each described by a different weight vector.

Figure 5: Overlap reduces the inter data center latency κ(·) by ≈ 38% when compared to Spectral with $W_3 = \{1, 1, 100, 1\}$. (Bar chart of the normalized evaluation metrics Obj, Span, Traffic, Latency, Storage, and Balance for Random, Nearest, Hyper, Spectral, and Overlap.)

Figure 6: Overlap reduces the storage cost S(·) by ≈ 24% when compared to Spectral with $W_4 = \{1, 1, 1, 100\}$. (Bar chart of the normalized evaluation metrics Obj, Span, Traffic, Latency, Storage, and Balance for Random, Nearest, Hyper, Spectral, and Overlap.)

6 CONCLUSIONS

In this paper, we addressed the problem of com-

bined data and replica placement of data-intensive

services into geo-distributed clouds. Although repli-

cation is an integral part of data placement, we iden-

tiﬁed that instead of posing it as a joint optimization

problem, most of the techniques in the literature have

treated them as independent problems, and have em-

ployed a two-phase approach: performing data place-

ment followed by replication. Consequently, a uniﬁed

paradigm, CPR, capable of combining data place-

ment and replication was devised, thereby enabling

the two problems to be studied in unison. Speciﬁ-

cally, the proposed overlapping correlation clustering

algorithm on hypergraphs with its ability to partition

the set of data-items by assigning a data-item to mul-

tiple data centers facilitated this joint optimization.

Experiments on a real-world trace-based social network dataset portrayed the effectiveness, efficiency, and scalability of the proposed algorithm.

Table 2: Quantifying the performance of the considered baselines relative to the proposed overlapping correlation clustering algorithm on the evaluation metrics. Entries give the degradation in performance of each baseline relative to Overlap.

| Algorithm | Span | Traffic | Latency | Storage | Obj |
|---|---|---|---|---|---|
| Random | 377.78% | 255.24% | 510.87% | 100.87% | 274.68% |
| Nearest | 106.29% | 67.70% | 160.50% | 163.64% | 139.69% |
| Hyper | 50.91% | 35.78% | 51.85% | 31.67% | 37.98% |
| Spectral | 35.26% | 25.65% | 37.69% | 23.28% | 24.66% |

Figure 7: Comparing the execution times of the proposed overlapping clustering algorithm with spectral clustering (Atrey et al., 2018) and the hypergraph partitioning algorithm (Yu and Pan, 2017). (Plot of running time in seconds for Hyper, Spectral, and Overlap.)

Currently, the proposed algorithm learns a data and replica placement strategy from a historical snapshot of the social network trace. In the future, the focus will be on making CPR adaptive, so that it manages updates in the data (including changes to the data request patterns) in an online manner and dynamically updates the placement output. Additionally, the aim is to generalize the notion of CPR to broader and more generic classes of the data placement problem.

ACKNOWLEDGMENTS

This research is partly funded by VLAIO, under grant

number 140055 (SBO Decomads).

REFERENCES

(2017a). AWS Global Infrastructure. https://aws.amazon.com/about-aws/global-infrastructure/.

(2017b). Latency Between AWS Global Regions. http://zhiguang.me/2016/05/10/latency-between-aws-global-regions/.

(2017). SNAP Datasets. https://snap.stanford.edu/data/.

(2017). The Exponential Growth of Data. https://insidebigdata.com/2017/02/16/the-exponential-growth-of-data/.

(2018). Cisco Visual Networking Index: Forecast and Trends (2017–2022). https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.html.

(2018). HDFS Architecture Guide. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#References.


Agarwal, S., Dunagan, J., Jain, N., Saroiu, S., Wolman,

A., and Bhogan, H. (2010). Volley: Automated Data

Placement for Geo-distributed Cloud Services. In

NSDI.

Atrey, A., van Seghbroeck, G., Volckaert, B., and Turck,

F. D. (2018). Scalable data placement of data-

intensive services in geo-distributed clouds. In

CLOSER, pages 497–508.

Bonchi, F., Gionis, A., and Ukkonen, A. (2013). Overlap-

ping correlation clustering. Knowl. Inf. Syst., 35(1):1–

32.

Catalyurek, U. V. (2011). PaToH (Par-

titioning Tool for Hypergraphs).

http://bmi.osu.edu/umit/PaToH/manual.pdf.

Catalyurek, U. V., Boman, E. G., Devine, K. D.,

Bozdag, D., Heaphy, R., and Riesen, L. A. (2007).

Hypergraph-based Dynamic Load Balancing for

Adaptive Scientiﬁc Computations. In IPDPS, pages

1–11.

Chervenak, A., Deelman, E., Livny, M., Su, M., Schuler,

R., Bharathi, S., Mehta, G., and Vahi, K. (2007). Data

Placement for Scientiﬁc Applications in Distributed

Environments. In GRID.

Ding, Y. and Lu, Y. (2009). Automatic data placement and

replication in grids. In HiPC, pages 30–39.

Ebrahimi, M., Mohan, A., Kashlev, A., and Lu, S. (2015).

BDAP: A Big Data Placement Strategy for Cloud-

Based Scientiﬁc Workﬂows. In BigDataService,

pages 105–114.

Ferdaus, M. H., Murshed, M., Calheiros, R. N., and Buyya,

R. (2017). An algorithm for network and data-aware

placement of multi-tier applications in cloud data cen-

ters. JNCA, 98:65 – 83.

Golab, L., Hadjieleftheriou, M., Karloff, H., and Saha,

B. (2014). Distributed Data Placement to Minimize

Communication Costs via Graph Partitioning. In SS-

DBM, pages 1–12.

Grace, R. K. and Manimegalai, R. (2014). Dynamic replica

placement and selection strategies in data grids— A

comprehensive survey. JPDC, 74(2):2099 – 2108.

Guo, W. and Wang, X. (2013). A data placement strategy

based on genetic algorithm in cloud computing plat-

form. In WISA, pages 369–372.

Han, S., Kim, B., Han, J., Kim, K., and Song, J. (2017). Adaptive Data Placement for Improving Performance of Online Social Network Services in a Multicloud Environment. In Scientific Programming, pages 1–17.

Huguenin, K., Kermarrec, A. M., Kloudas, K., and Taïani, F. (2012). Content and Geographical Locality in User-generated Content Sharing Systems. In NOSSDAV, pages 77–82.

Jiao, L., Li, J., Du, W., and Fu, X. (2014). Multi-objective

data placement for multi-cloud socially aware ser-

vices. In INFOCOM, pages 28–36.

Kosar, T. and Livny, M. (2004). Stork: making data place-

ment a ﬁrst class citizen in the grid. In ICDCS, pages

342–349.

Kosar, T. and Livny, M. (2005). A framework for reliable

and efﬁcient data placement in distributed computing

systems. JPDC, 65(10):1146–1157.

Li, X., Zhang, L., Wu, Y., Liu, X., Zhu, E., Yi, H., Wang, F.,

Zhang, C., and Yang, Y. (2017). A Novel Workﬂow-

Level Data Placement Strategy for Data-Sharing Sci-

entiﬁc Cloud Workﬂows. IEEE TSC, PP(99):1–14.

Liu, X. and Datta, A. (2011). Towards Intelligent Data

Placement for Scientiﬁc Workﬂows in Collaborative

Cloud Environment. In IPDPSW, pages 1052–1061.

Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee,

H., Li, H. C., McElroy, R., Paleczny, M., Peek, D.,

Saab, P., Stafford, D., Tung, T., and Venkataramani,

V. (2013). Scaling Memcache at Facebook. In NSDI,

pages 385–398.

Rochman, Y., Levy, H., and Brosh, E. (2013). Re-

source placement and assignment in distributed net-

work topologies. In INFOCOM, pages 1914–1922.

Shabeera, T., Kumar, S. M., Salam, S. M., and Krishnan,

K. M. (2017). Optimizing vm allocation and data

placement for data-intensive applications in cloud us-

ing aco metaheuristic algorithm. IJEST, 20(2):616 –

628.

Shankaranarayanan, P. N., Sivakumar, A., Rao, S., and

Tawarmalani, M. (2014). Performance Sensitive

Replication in Geo-distributed Cloud Datastores. In

DSN, pages 240–251.

White, T. (2012). Hadoop: The Deﬁnitive Guide. O’Reilly

Media, Inc.

Yu, B. and Pan, J. (2015). Location-aware associated data

placement for geo-distributed data-intensive applica-

tions. In INFOCOM, pages 603–611.

Yu, B. and Pan, J. (2016). Sketch-based data placement

among geo-distributed datacenters for cloud storages.

In INFOCOM, pages 1–9.

Yu, B. and Pan, J. (2017). A Framework of Hypergraph-

based Data Placement among Geo-distributed Data-

centers. IEEE TSC, PP(99):1–14.

Yu, T., Qiu, J., Reinwald, B., Zhi, L., Wang, Q., and Wang,

N. (2012). Intelligent database placement in cloud en-

vironment. In ICWS, pages 544–551.

Yuan, D., Yang, Y., Liu, X., and Chen, J. (2010). A

data placement strategy in scientiﬁc cloud workﬂows.

FGCS, 26(8):1200 – 1214.

Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust,

M., Dave, A., Meng, X., Rosen, J., Venkataraman, S.,

Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S.,

and Stoica, I. (2016). Apache spark: A uniﬁed engine

for big data processing. CACM, 59(11):56–65.

Zhang, J., Chen, J., Luo, J., and Song, A. (2016). Efﬁ-

cient location-aware data placement for data-intensive

applications in geo-distributed scientiﬁc data centers.

Tsinghua Science and Technology, 21(5):471–481.

Zhao, Q., Xiong, C., and Wang, P. (2016a). Heuristic data

placement for data-intensive applications in heteroge-

neous cloud. Journal of Electrical and Computer En-

gineering, 2016:1–8.

Zhao, Q., Xiong, C., Zhang, K., Yue, Y., and Yang, J.

(2016b). A data placement algorithm for data inten-

sive applications in cloud. JGDC, 9(2):145–156.

CLOSER 2019 - 9th International Conference on Cloud Computing and Services Science