Workload Management for Dynamic Partitioning Schemes in Replicated Databases

M. Louis-Rodríguez¹, J. Navarro², I. Arrieta-Salinas¹, A. Azqueta-Alzuaz¹, A. Sancho-Asensio² and J. E. Armendáriz-Íñigo¹

¹ Universidad Pública de Navarra, 31006 Pamplona, Spain
² La Salle - Ramon Llull University, 08022 Barcelona, Spain
Keywords:
Distributed Databases, Distributed Transactions, Fine-grained Partitioning, Lookup Tables, Cloud Computing.
Abstract:
Recent advances in providing transactional support on the cloud rely on keeping databases properly partitioned in order to preserve their highly valued scalability features. However, the dynamic nature of cloud environments often leads to either inefficient partitioning schemes or unbalanced partitions, which prevents the resources from being utilized in an elastic fashion. This paper presents a load balancer that uses offline artificial intelligence techniques to derive the optimal partitioning design and replication protocol for a cloud database providing transactional support. The experiments performed prove the feasibility of our approach and encourage practitioners to progress in this direction by exploring online and unsupervised machine learning techniques applied to this domain.
1 INTRODUCTION
Cloud-based storage was initially aimed to overcome
the scalability limitations of transactional databases
(Gray et al., 1996) and meet the ever-growing stor-
age demands of everyday software applications. To this end, the properties of traditional databases were relaxed, giving rise to what was coined the NoSQL paradigm (Stonebraker, 2010).
More specifically, novel NoSQL systems forgo the ACID (i.e., atomicity, consistency, isolation, and durability) guarantees provided by classic databases and implement what are known as BASE (i.e., basically available, soft-state, and eventually consistent) properties (Brewer, 2012), which ideally allow them to scale up to infinity (Corbett et al., 2012; Chang et al., 2006; DeCandia et al., 2007). In order to meet the BASE principles, these systems often rely on (1) weak consistency models (Vogels, 2009), (2) in-memory key-value structures (Chang et al., 2006), and (3) lightweight concurrency control and replication protocols, which reduce interdependencies between data stored in the repository and thus boost its scalability.
However, the overwhelming number of critical applications that cannot forgo their transactional nature (e.g., electronic transfers, billing systems)
has driven practitioners to provide transactional sup-
port built on top of existing cloud repositories (Curino
et al., 2011; Levandoski et al., 2011). Actually, these
systems pursue the idea of deploying classic database
replication protocols (Wiesmann and Schiper, 2005)
over a cloud storage infrastructure (Das et al., 2010)
and hence offer transactional facilities while inherit-
ing the properties of the cloud. Typically, this is ad-
dressed by building a load balancer (Curino et al.,
2010; Curino et al., 2011; Levandoski et al., 2011)
that broadcasts transactions over a fixed set of par-
titions that are statically running a given replication
protocol. Nevertheless, despite its broad adoption, this approach is not aligned with the cloud philosophy, in the sense that it cannot adapt itself to the cloud's intrinsic elasticity, which paradoxically leads to underused or poorly scalable services.
The purpose of this paper is to propose a load
balancer for transactional cloud databases that re-
leases them from the aforementioned static config-
uration barriers. More specifically, this paper ex-
tends the proposal presented in (Arrieta-Salinas et al.,
2012) (which describes a cloud database with trans-
actional support) with a load balancing system tar-
geted at (1) monitoring the key performance param-
eters (e.g., throughput, transactions per second, ac-
cess pattern) of the database, (2) inferring the optimal
partitioning scheme for the database through machine
learning techniques, and (3) proposing the most suit-
able replication protocol to be run on each partition
according to the current user demands.
The remainder of this paper is organized as fol-
lows. Section 2 stresses the need for applying a proper
partitioning schema and using the appropriate replica-
tion protocol on a transactional cloud database. Sec-
tion 3 presents the proposed system architecture. Sec-
tion 4 details the experiments performed. Finally,
Section 5 concludes the paper.
2 PARTITIONING THE CLOUD
As clarified in (Brewer, 2012), a distributed system
may implement a weak form of consistency (Vogels,
2009), a relaxed degree of availability (White, 2012),
and reasonable network partition tolerance (DeCandia
et al., 2007) while still keeping itself scalable. Actu-
ally, most of the existing cloud data repositories be-
have in this direction (DeCandia et al., 2007; White,
2012). However, those applications that demand strict
transactional support are not able to straightforwardly
fit in this idea, since they generally require strong con-
sistency to provide correct executions (Birman, 2012)
while demanding the appealing characteristics offered by the cloud, therefore refusing to relax availability constraints. In this context, tolerance to net-
work partitions must be smartly addressed to truly
benefit from the cloud features, which suggests that practitioners should be very cautious when defining a partitioning scheme. This section (1) stresses the criticality of managing data partitions, (2) describes the
proposed graph-based partitioning technique that has
been applied in the presented load balancer and (3)
suggests supervised machine learning techniques as
an effective way to address this matter.
2.1 Motivation and Related Work
Partitioning is a very effective way to achieve high
scalability while preserving data consistency in a dis-
tributed database. Transactions that are executed
within one single data partition require no interaction
with the rest of partitions, hence reducing the com-
munication overhead (Aguilera et al., 2009; Curino et
al., 2010; Das et al., 2010). However, configuring
the partitioning scheme to minimize multi-partition
transactions while avoiding extra costs derived from
resource misuse requires a judicious criterion that
must carefully address the workload pattern to deter-
mine the optimal data partitions.
However, partitioning alone is not enough: if these partitions are not replicated, the system will face a single point of failure or suffer from performance limitations due to the bottleneck effect. Con-
sequently, besides deciding the most suitable parti-
tioning strategy, it is important to study the work-
load nature that generated each partition to determine
the most appropriate replication protocol. Roughly
speaking, in an update-intensive data partition, the
ideal candidate will be an update-everywhere repli-
cation protocol (Wiesmann and Schiper, 2005); oth-
erwise, the candidate will probably be a primary copy
replication protocol (Daudjee and Salem, 2006).
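To illustrate this heuristic, the following sketch (our own minimal example: the class name and the 0.5 threshold are assumptions and not part of the actual system) selects a replication protocol for a partition from the fraction of update operations observed in its workload.

// Hypothetical sketch: choose a replication protocol for a partition
// from the ratio of update operations in its recent workload.
public class ProtocolSelector {

    public enum Protocol { UPDATE_EVERYWHERE, PRIMARY_COPY }

    // Threshold chosen for illustration only; a real deployment would tune it.
    private static final double UPDATE_INTENSIVE_THRESHOLD = 0.5;

    public static Protocol selectProtocol(long reads, long updates) {
        long total = reads + updates;
        if (total == 0) {
            return Protocol.PRIMARY_COPY; // no information: favor the cheaper option
        }
        double updateRatio = (double) updates / total;
        return (updateRatio >= UPDATE_INTENSIVE_THRESHOLD)
                ? Protocol.UPDATE_EVERYWHERE
                : Protocol.PRIMARY_COPY;
    }

    public static void main(String[] args) {
        System.out.println(selectProtocol(200, 800)); // UPDATE_EVERYWHERE
        System.out.println(selectProtocol(900, 100)); // PRIMARY_COPY
    }
}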
Overall, there is a strong connection between the
number of partitions, the database workload, and
the replication protocol running on each partition.
The following subsection discusses a graph-based ap-
proach used to infer these three features.
2.2 Graph-based Partitioning
We propose a structure based on undirected graphs,
which can be used for determining the best partition-
ing scheme. This proposal will be explained by means of the example depicted in Figure 1, which defines a database consisting of two tables, PERSON and DEGREE, and a sample workload of four transactions.
This workload will drive the construction of the
undirected graph structure shown in Figure 1. More
specifically, each tuple of PERSON and DEGREE is represented by a node, whereas each edge between two
nodes reflects that they are accessed within the same
transaction. The weights of edges are increased ac-
cording to the number of transactions accessing the
connecting nodes. In addition, there is a counter asso-
ciated to each node representing the number of trans-
actions that access it.
The first statement of transaction 1 accesses the second tuple of PERSON and the first tuple of DEGREE. Hence, in Figure 1 we draw two nodes identified by the ID field of each tuple (i.e., node 2 and node 4) and plot an edge between them with an initial weight of one. In addition, we set the counters of nodes 2 and 4 to one to reflect the number of times they have been accessed. The next operation of transaction 1 modifies node 1, so we connect it with the previous nodes (2 and 4) using edges of weight one and set the counter of node 1 to one. Finally, the last operation of transaction 1 is a SELECT that accesses node 3. We therefore connect node 3 with nodes 1, 2 and 4 with edges of weight one and increase the counter of node 3 by one. Likewise, we proceed in the same way with
the rest of transactions. For the sake of clarity, in Fig-
ure 1 we have surrounded the nodes accessed by each
CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience
274
Figure 1: Example of a graph representation and its Distribution Data Table.
transaction with different dotted line formats.
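The construction described above can be sketched in a few lines of code. The following is a minimal illustration of our own (tuple identifiers are plain integers and the whole graph lives in memory); it is not the implementation used in the prototype.

import java.util.*;

// Minimal sketch of the workload graph: nodes are tuple identifiers,
// node counters record how many transactions access each tuple, and
// edge weights record how many transactions access both endpoints.
public class WorkloadGraph {

    private final Map<Integer, Integer> nodeCounters = new HashMap<Integer, Integer>();
    private final Map<String, Integer> edgeWeights = new HashMap<String, Integer>();

    // Register one transaction given the set of tuples it accessed.
    public void addTransaction(Set<Integer> accessedTuples) {
        List<Integer> tuples = new ArrayList<Integer>(accessedTuples);
        for (Integer t : tuples) {
            Integer c = nodeCounters.get(t);
            nodeCounters.put(t, c == null ? 1 : c + 1);
        }
        // Increase the weight of every edge between co-accessed tuples.
        for (int i = 0; i < tuples.size(); i++) {
            for (int j = i + 1; j < tuples.size(); j++) {
                String edge = edgeKey(tuples.get(i), tuples.get(j));
                Integer w = edgeWeights.get(edge);
                edgeWeights.put(edge, w == null ? 1 : w + 1);
            }
        }
    }

    // Undirected edge: order the endpoints so (a,b) and (b,a) share one key.
    private static String edgeKey(int a, int b) {
        return Math.min(a, b) + "-" + Math.max(a, b);
    }

    public static void main(String[] args) {
        WorkloadGraph g = new WorkloadGraph();
        // Transaction 1 of the example accesses nodes 2, 4, 1 and 3.
        g.addTransaction(new HashSet<Integer>(Arrays.asList(1, 2, 3, 4)));
        System.out.println(g.nodeCounters);
        System.out.println(g.edgeWeights);
    }
}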
Once all transactions have been executed and the
graph has been built, data must be partitioned so
that the nodes are distributed according to their re-
lationships within transactions. Thus, we are pur-
suing a twofold goal: minimizing the number of
multi-partition transactions (which implies cutting the edges with the minimum total weight) while minimizing the size of each group of nodes, which leads to an NP-complete problem. However, there exist algorithms, such as MeTis and its variant hMeTis (Karypis and Kumar, 1998; Karypis, 2011), that circumvent this efficiently.
In the previous example, the best cut consists of two partitions: one partition would include nodes
1 to 4, and the other one nodes 5 and 6. This solu-
tion represents a cut of three edges, all of them with a
weight of one. If we take a look at the executed trans-
actions, we can see that most of them access PERSON;
thus, it makes sense to place these nodes together to
minimize multi-partition transactions.
Once the partitioning scheme has been deter-
mined, the system will use the information provided
by the aforementioned graph to properly forward
client requests to their corresponding partition. With
the aim of speeding up this task, we have developed
a Distribution Data Table (DDT) which behaves like
the structure presented in (Tatarowicz et al., 2012). A
sample of the DDT is included in Figure 1. This struc-
ture, organized as a lookup table, contains a series of
record intervals of each database table and the parti-
tion number where each interval is stored. The DDT,
which is fully stored in main memory, is scanned
when a client request arrives. Moreover, it can be par-
tially cached in the client side to avoid continuously
requesting the same information.
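A possible in-memory layout for such a lookup table is sketched below. The class and the interval representation are our own assumptions for illustration; the actual DDT may be organized differently.

import java.util.*;

// Sketch of a Distribution Data Table: for each database table, a sorted map
// from the lower bound of a key interval to that interval and its partition.
public class DistributionDataTable {

    private static class Interval {
        final long low, high;
        final int partition;
        Interval(long low, long high, int partition) {
            this.low = low; this.high = high; this.partition = partition;
        }
    }

    private final Map<String, TreeMap<Long, Interval>> tables =
            new HashMap<String, TreeMap<Long, Interval>>();

    public void addInterval(String table, long low, long high, int partition) {
        TreeMap<Long, Interval> intervals = tables.get(table);
        if (intervals == null) {
            intervals = new TreeMap<Long, Interval>();
            tables.put(table, intervals);
        }
        intervals.put(low, new Interval(low, high, partition));
    }

    // Return the partition holding the given key, or -1 if it is not mapped.
    public int lookup(String table, long key) {
        TreeMap<Long, Interval> intervals = tables.get(table);
        if (intervals == null) return -1;
        Map.Entry<Long, Interval> e = intervals.floorEntry(key);
        if (e == null || key > e.getValue().high) return -1;
        return e.getValue().partition;
    }

    public static void main(String[] args) {
        DistributionDataTable ddt = new DistributionDataTable();
        // Illustrative intervals inspired by the example of Figure 1.
        ddt.addInterval("PERSON", 1, 4, 0);
        ddt.addInterval("PERSON", 5, 6, 1);
        System.out.println(ddt.lookup("PERSON", 3)); // 0
        System.out.println(ddt.lookup("PERSON", 6)); // 1
    }
}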
The approach presented herein analyzes the whole workload prior to performing any partitioning, relying on the assumption that transactions will tend to have the
same access pattern. In order to cope with those sit-
uations where this assumption is too ambitious, the
following subsection presents a prospective view on
this area and discusses some strategies based on ma-
chine learning techniques.
2.3 Smart Partitioning on the Cloud
Data mining techniques can be divided into two families based on the desired outcome of the algorithm: (1) those that assume an a priori underlying structure and thus require the use of existing information to obtain their knowledge (referred to as supervised
learning) and (2) those that do not assume any under-
lying structure (referred to as unsupervised learning).
Moreover, the two aforementioned families can
be further classified into (1) offline and (2) online
methods. Offline algorithms require all data to be
analyzed in order to build a comprehensive system
model. For instance, in (Curino et al., 2010) a C4.5
decision tree (Quinlan, 1993) is used to build an un-
derstandable model of the tuple dependencies existing
on a database partition. On the other hand, online ap-
WorkloadManagementforDynamicPartitioningSchemesinReplicatedDatabases
275
Client
Application Logic
Driver
Replication Protocol n
GCS
Replication Protocol 2
GCS
Replication Protocol 1
GCS
Client
Application Logic
Driver
Metadata Manager
Workload Manager
Transaction Manager
Replication Manager
Metadata
Repository
Replication clusters
Data flow
Data flow
Control flow
Client requests
Client requests
Monitoring info
Cheetah System
Workload Analyzer
Partition Manager
Migration Manager
Dolphin System
Statistics
Figure 2: System model.
proaches build a dynamic system model that attempts
to adapt itself to the environment specificities. For ex-
ample, in (Hulten et al., 2001), an online decision tree
has been designed to predict the class distribution on
a time-changing environment.
Formally speaking, the technique presented in the
previous subsection uses an offline approach to estab-
lish the best partitioning and replication schema. In
fact, the whole workload schema is previously ana-
lyzed in order to build the aforesaid graphs and par-
titions. Although this ensures reaching an optimal
solution, its application on some real environments
is doubtful in the sense that transactions are rarely
known in advance. This issue becomes even more rel-
evant in cloud environments, where loads drastically
change according to the elastic user demands.
Offline techniques can be exported to online envi-
ronments if variations in the system behavior are infrequent. In such situations, a windowing technique
can be used to continuously train the system and thus
get nearly-online results. However, cloud databases
cannot fully benefit from this approach since the characteristics of the workload can vary sharply. Hence, we propose to explore online machine learning techniques in order to derive the best possible par-
titioning layout and replication protocol at any time.
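For reference, the windowing idea mentioned above can be sketched as follows, reusing the WorkloadGraph sketch from Section 2.2. The window size and the rebuild policy are arbitrary assumptions of ours; an online learner would go beyond this simple scheme.

import java.util.*;

// Sketch of a windowed workload collector: only the last N transactions are
// kept, so the partitioning graph can be periodically rebuilt from recent
// behavior instead of from the full workload history.
public class WorkloadWindow {

    private final int windowSize;
    private final Deque<Set<Integer>> recentTransactions = new ArrayDeque<Set<Integer>>();

    public WorkloadWindow(int windowSize) {
        this.windowSize = windowSize;
    }

    public void record(Set<Integer> accessedTuples) {
        recentTransactions.addLast(accessedTuples);
        if (recentTransactions.size() > windowSize) {
            recentTransactions.removeFirst(); // discard the oldest transaction
        }
    }

    // Rebuild the workload graph from the current window only.
    public WorkloadGraph rebuildGraph() {
        WorkloadGraph graph = new WorkloadGraph();
        for (Set<Integer> tx : recentTransactions) {
            graph.addTransaction(tx);
        }
        return graph;
    }
}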
3 SYSTEM MODEL
The experiments presented in this paper have been
conducted over an extended version of the architec-
ture proposed in (Arrieta-Salinas et al., 2012), which
is an alternative approach to (Das et al., 2010; Levan-
doski et al., 2011; Curino et al., 2011) to provide
transactional support on the cloud. As in other cloud
systems, the core of our architecture (as shown in Fig-
ure 2) is a metadata manager that forwards all trans-
actions executed by client applications and manages
the replicas accordingly (Arrieta-Salinas et al., 2012).
Taking into account (1) the importance of choosing
a proper partitioning schema, (2) the significance of
selecting a suitable replication protocol in each par-
tition, and (3) the unavoidable existence of multi-
partition transactions in cloud environments, already
stressed in Section 2, this work extends the metadata
manager (Cheetah) presented in (Arrieta-Salinas et
al., 2012) and adds a new module coined as Dolphin
with the following entities.
The Workload Analyzer examines each transaction
to identify the accessed tuples by parsing the WHERE clause of each statement, and generates a log file. The Partition Manager uses this
log to build the graph that captures the relationships between tuples and transactions, and creates the respective
partitions as explained in Section 2.2. The Partition
Manager is also in charge of periodically evaluating
the system load information provided by the Statis-
tics module to define the number of partitions and
the amount of replicas per partition. Apart from this,
the Partition Manager chooses the replication proto-
col for each data partition depending on the predom-
inant type of operations (reads or updates) as pointed
out in Section 2.1 and selects the number of hierarchy
levels by inspecting the timeliness characteristics of
queries and the total number of replicas. The Migra-
tion Manager distributes partitions across replicas by
pointing out their position in the hierarchy level and,
if a replica is at the core, which replication protocol
must be run. Finally, the Statistics Module collects all
kind of system information to study the performance
in terms of scalability, handled TPS, number of repli-
cas per partition, monetary costs, etc.
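As a rough illustration of the Workload Analyzer's task, the sketch below is entirely hypothetical: the real module works on full statements, whereas this regular expression only covers trivial numeric equality predicates. It extracts accessed keys from a statement's WHERE clause.

import java.util.*;
import java.util.regex.*;

// Hypothetical sketch of the Workload Analyzer's key extraction:
// pull "<column> = <value>" predicates out of a WHERE clause.
public class WhereClauseAnalyzer {

    // Only matches simple numeric equality predicates such as "ID = 42".
    private static final Pattern EQUALITY =
            Pattern.compile("(\\w+)\\s*=\\s*(\\d+)");

    public static List<Long> extractKeys(String sql) {
        List<Long> keys = new ArrayList<Long>();
        int whereIndex = sql.toUpperCase().indexOf("WHERE");
        if (whereIndex < 0) return keys;
        Matcher m = EQUALITY.matcher(sql.substring(whereIndex));
        while (m.find()) {
            keys.add(Long.parseLong(m.group(2)));
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(extractKeys("SELECT * FROM PERSON WHERE ID = 2")); // [2]
    }
}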
4 EXPERIMENTAL EVALUATION
We have built a prototype (using Java 1.6) that cov-
ers the basic functionality of all system components.
We have also developed an implementation of a JDBC
driver that has allowed us to run a popular set of
benchmarks named OLTPBenchmark (Curino et al.,
2012), in order to assess the performance of the
developed prototype. In particular, we have used
the OLTPBenchmark implementation of the Yahoo!
Cloud Serving Benchmark (YCSB) (Cooper et al.,
2010), which defines a single table of records composed of a primary key and ten text fields. In the experiments performed, this table has been filled with 500,000 records, for a total size of 500 MB.

Figure 3: Maximum throughput depending on the scan workload: (a) Workload A; (b) Workload B.
Our testing configuration consists of six comput-
ers in a 100 Mbps switched LAN, where each ma-
chine is equipped with an Intel Core 2 Duo processor
at 2.13 GHz, 2 GB of RAM and a 250 GB hard disk.
All machines run the Linux distribution OpenSuse
v11.2 (kernel version 2.6.22.31.8-01), with a Java Virtual Machine 1.6.0 executing the application code. An
additional computer with the same configuration as
that of the replicas is used for running both clients
and the metadata manager. For the sake of simplicity,
instead of developing a distributed implementation of
the metadata manager to provide fault tolerance and
scalability, we have developed a centralized compo-
nent. All the data stored at the metadata manager is
kept in main memory. Moreover, each machine used
as a replica holds a local PostgreSQL 8.4.7 database,
whose configuration options have been tuned so that it
behaves as an in-memory only database (i.e., it acts as
a cache, without storing the database on disk). Spread
4.0.0 (Stanton, 2005) has been used as Group Com-
munication System, whereas point-to-point commu-
nications have been implemented using TCP.
In the experiments, we have tested two graph-
based heuristic algorithms to determine the parti-
tioning schema, MeTis and hMeTis (Karypis, 2011;
Karypis and Kumar, 1998). Both techniques try to
find the optimal solution focusing on finding parti-
tions by cutting as few edges as possible. The dif-
ference between them is that hMeTis is an optimiza-
tion of MeTis that uses hypergraphs and performs
more iterations in less time. Moreover, we have com-
pared these two heuristic algorithms with two other
approaches: Blocks and Round Robin. The former
tries to split the database into uniform blocks accord-
ing to the available capacity of the nodes, whereas the latter places each consecutive record in a different partition across all nodes, so that no two consecutive rows are in the same partition (see the sketch after this paragraph). For all
the partitioning algorithms, we have performed an of-
fline execution to determine the partitioning schema
and the replication protocols to be run on the core
layer of each partition. Besides, the algorithm reports
the number of replicas per partition and their location
along the replication hierarchy tree.
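For reference, the two baseline strategies can be expressed as simple key-to-partition functions. The sketch below is our own formulation, assuming integer keys in the range [0, totalRecords) and equally sized partitions.

// Sketch of the two baseline assignment strategies, assuming integer keys
// in the range [0, totalRecords) and equally sized partitions.
public class BaselinePartitioners {

    // Blocks: contiguous ranges of records go to the same partition.
    public static int blocksPartition(long key, long totalRecords, int partitions) {
        long blockSize = (totalRecords + partitions - 1) / partitions; // ceiling
        return (int) (key / blockSize);
    }

    // Round Robin: consecutive records land in different partitions.
    public static int roundRobinPartition(long key, int partitions) {
        return (int) (key % partitions);
    }

    public static void main(String[] args) {
        // With 6 records and 2 partitions, Blocks groups keys 0-2 and 3-5,
        // whereas Round Robin alternates partitions 0,1,0,1,...
        for (long k = 0; k < 6; k++) {
            System.out.println(k + " -> blocks: " + blocksPartition(k, 6, 2)
                    + ", round robin: " + roundRobinPartition(k, 2));
        }
    }
}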
Due to the early stage of the development, we have been able to run only read-only transactions. From the
set of transactions defined in the YCSB, we have cho-
sen scan transactions, which read the set of records
whose keys belong to a given interval. As we are
only dealing with read-only transactions, all partition-
ing schemes define a primary copy replication proto-
col. The experiments have been performed using two
workload types: i) workload A, which performs a scan
over at most 10 records following a Zipfian distribu-
tion and ii) workload B, which is the same as work-
load A but scans at most 100 records.
The first parameter we have measured is the num-
ber of multi-partition transactions, which should be
kept as low as possible in order to optimize system
performance. We have obtained the following number of multi-partition transactions for workload A: 1 with Blocks, 112 with MeTis, 1451 with hMeTis, and 7271 with Round Robin. For workload B, we have obtained 1 with Blocks, 3360 with MeTis, 2182 with hMeTis, and 20339 with Round Robin. It is clear that the Round Robin algorithm shows the worst behavior due to the nature of the transactions performed, as a scan must traverse several partitions. There is not such a big difference between
MeTis and hMeTis; however, we can see that with
workload B the hMeTis technique gives a better per-
formance. Nonetheless, the Blocks approach outper-
forms them all, since with this configuration the probability of a scan accessing two partitions is very low.
Furthermore, we have analyzed the system
throughput and added an ideal case and a centralized
solution. Figure 3 shows that the performance is more
or less the same for all the algorithms but the Round
Robin. We have also noticed the fast degradation of
system throughput (it does not reach 100 TPS even in
the centralized case), due to the bottleneck caused by
having a centralized metadata manager. A possible solution would be to implement a distributed metadata manager using a Paxos-like protocol (Lamport, 1998). Apart from this, holding the complete
structure that represents the object interdependencies
that determine the best partitioning policy is very ex-
pensive in terms of memory usage; thus, the main
memory at the metadata manager becomes easily sat-
urated. This problem could be alleviated by applying
aging or windowing strategies to prune such a structure.
5 CONCLUSIONS
This paper presents a load balancer that uses artifi-
cial intelligence techniques to obtain the optimal par-
titioning design for a cloud database with transac-
tional support. The experiments performed have em-
pirically verified that data partitioning can be seen as
a multi-objective optimization problem, since it aims to maximize the number of partitions while minimizing the number of multi-partition transactions. Furthermore, the proposed architecture makes it possible to heuristically define the most suitable repli-
cation protocol running on each partition according to
the system workload.
In addition, we have explored the feasibility of
further improving the obtained results by means of
online data mining techniques, which would allow the
system to automatically discover new workload pat-
terns and therefore optimize the partitioning schema
according to dynamic user demands.
ACKNOWLEDGEMENTS
The research leading to these results has received
funding from the Spanish National Science Founda-
tion (MINECO) (grant TIN2012-37719-C03-03) and
from Generalitat de Catalunya for its support under
grant 2012FI B 01058 for Andreu Sancho-Asensio.
REFERENCES
Aguilera, M. K., et al. (2009). Sinfonia: A new paradigm
for building scalable distributed systems. ACM Trans.
Comput. Syst., 27(3).
Arrieta-Salinas, I., et al. (2012). Classic replication tech-
niques on the cloud. In ARES 2012, pages 268–273.
Birman, K. P. (2012). Guide to Reliable Distributed
Systems–Building High-Assurance Applications and
Cloud-Hosted Services. Texts in computer science.
Springer.
Brewer, E. A. (2012). Pushing the CAP: Strategies for con-
sistency and availability. IEEE Comput., 45(2):23–29.
Chang, F., et al. (2006). Bigtable: A distributed storage
system for structured data. In OSDI 2006, pages 205–
218.
Cooper, B.F., et al. (2010). Benchmarking cloud serving
systems with YCSB. In SoCC 2010, pages 143–154.
Corbett, J.C., et al. (2012). Spanner: Google’s globally-
distributed database. In OSDI 2012, pages 251–264.
Curino, C., et al. (2010). Schism: A workload-driven ap-
proach to database replication and partitioning. Proc.
VLDB Endow., 3(1-2):48–57.
Curino, C., et al. (2011). Relational cloud: A database ser-
vice for the cloud. In CIDR 2011, pages 235–240.
Curino, C., et al. (2012). OLTPBenchmark. Accessible in
URL: http://oltpbenchmark.com.
Das, S., et al. (2010). ElasTraS: An elastic transactional
data store in the cloud. CoRR, abs/1008.3751.
Daudjee, K. and Salem, K. (2006). Lazy database repli-
cation with snapshot isolation. In VLDB 2006, pages
715–726.
DeCandia, G., et al. (2007). Dynamo: Amazon’s highly
available key-value store. In SOSP 2007, pages 205–
220.
Gray, J., et al. (1996). The dangers of replication and a
solution. In SIGMOD 1996, pages 173–182.
Hulten, G., et al. (2001). Mining time-changing data
streams. In SIGKDD 2001, pages 97–106.
Karypis, G. (2011). METIS: A software package for
partitioning meshes, and computing fill-reducing or-
derings of sparse matrices (v.5.0). In: http://
glaros.dtc.umn.edu/gkhome/metis/metis/overview.
Karypis, G. and Kumar, V. (1998). hMeTiS: A hypergraph partitioning package (v.1.5.3). In: http://glaros.dtc.umn.edu/gkhome/metis/hmetis/overview.
Lamport, L. (1998). The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169.
Levandoski, J.J., et al. (2011). Deuteronomy: Transaction
support for cloud data. In CIDR 2011, pages 123–133.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learn-
ing. Morgan Kaufmann Publishers.
Stanton, J. R. (2005). The Spread Toolkit. Accessible in
URL: http://www.spread.org.
Stonebraker, M. (2010). SQL databases v. NoSQL
databases. Commun. ACM, 53(4):10–11.
Tatarowicz, A.L., et al. (2012). Lookup tables: Fine-grained partitioning for distributed databases. In ICDE 2012, pages 102–113.
Vogels, W. (2009). Eventually consistent. Communications
of the ACM, 52(1):40–44.
White, T. (2012). Hadoop–The Definitive Guide: Storage
and Analysis at Internet Scale (3. ed.). O’Reilly.
Wiesmann, M. and Schiper, A. (2005). Comparison of
database replication techniques based on total or-
der broadcast. IEEE Trans. Knowl. Data Eng.,
17(4):551–566.
CLOSER2013-3rdInternationalConferenceonCloudComputingandServicesScience
278