Recovery in CloudDBAppliance’s High-availability Middleware
Hugo Abreu 1,2, Luis Ferreira 1,2, Fábio Coelho 1,2, Ana Nunes Alonso 1,2 and José Pereira 1,2
1 INESC TEC, Porto, Portugal
2 Universidade do Minho, Braga, Portugal
Keywords:
CloudDBAppliance, High-availability, Recovery.
Abstract:
In the context of the CloudDBAppliance (CDBA) project, fault tolerance and high-availability are provided in
layers: within each appliance, within a data centre and between datacentres. This paper presents the recovery
mechanisms in place to fulfill the provision of high-availability within a datacentre. The recovery mechanism
takes advantage of CDBA's in-middleware replication mechanism to bring failed replicas up-to-date. Along
with the description of different variants of the recovery mechanism, this paper provides their comparative
evaluation, focusing on the time it takes to recover a failed replica and how the recovery process impacts
throughput.
1 INTRODUCTION
CloudDBAppliance aims at delivering a database appliance (software and hardware) that exploits the capabilities of newer hardware, namely NUMA awareness, for terabyte-scale in-memory processing with high-availability. High-availability requires a
recovery mechanism for failed or even new repli-
cas. This paper focuses on the intra-datacentre re-
covery mechanism for CloudDBAppliance. Simply
put, the goal is to ensure that if an appliance fails,
a standby is able to take over transparently from the client's point of view, thus hiding faults and maintaining the perceived quality of service. The operational database is the fulcrum of the
high-availability effort, as it holds the data that needs
to be persisted. The state of other components, such as that of the in-memory many-core analytics framework
(also a part of the project) is considered volatile and
can be straightforwardly rebuilt from the data in the
operational database as needed, in the event of a fail-
ure.
In the context of the CDBA project, ensuring high
availability within a datacentre requires a replication
mechanism that, in case of failure, enables the failover
of the operational database to a consistent and up-to-
date standby replica, running in a different appliance.
There are a number of different models for database
replication that mainly differ on: whether transactions
are executed at each replica, or just at one while others
apply updates; and how (if at all) replicas are allowed
to diverge.
Recovery and replication capabilities work in
tandem to provide high-availability: the replica-
tion mechanism ensures there are multiple consistent
replicas of the data, so that the system tolerates the loss of some, while the recovery mechanism makes it possible to bring failed or clean replicas up to a consistent state.
In this paper we present the architecture of the re-
covery mechanism for CloudDBAppliance along with
the motivating design constraints. Section 2 covers
background on recovery and fault-tolerance mech-
anisms. Section 3 introduces the overall architec-
ture for the recovery and replication middleware, with
Section 4 showing some results on the selected recov-
ery approaches. Section 5 concludes the paper and
overviews the major takeaways.
2 BACKGROUND
Recovery, as discussed throughout this paper, refers to
the process of bringing failed or clean replicas up-to-
date, so that the set of available replicas is consistent
according to some criteria. Transactional recovery, in
the sense of recovering a single database from an in-
consistent (as in the C in ACID) state due to, for ex-
ample, a transaction being aborted, is out of the scope
of this discussion.
How to perform the recovery of failed replicas or
how to integrate clean replicas into a highly available
system should be a concern when designing replica-
tion mechanisms, even if this is often not the case
(Vilaca et al., 2009) in the literature (e.g., (Elnikety
et al., 2005; Kemme and Alonso, 2000)). However,
CDBA's goal of providing highly available database appliances requires a suitable recovery mechanism to be designed and implemented.
The replication approach (active vs passive) and
whether the replication protocol is synchronous or
asynchronous have a considerable impact on the de-
sign of the recovery protocol, namely to determine
when a recovered replica can be considered to be
available. For example, asynchronous replication pro-
tocols may accept a level of outdatedness in some
replicas which may enable less costly recovery strate-
gies or allow recovered replicas to be considered
available sooner without disrupting the expected con-
sistency level.
An important design decision is whether recov-
ery is to be done online (Kemme et al., 2001) or
offline (Amir, 1995). Offline recovery means that
the system somehow becomes unavailable whenever
a replica needs to be recovered. Online recovery, on
the other hand, means that the system remains avail-
able during a recovery process. Again, CDBA's goal
of providing highly available database appliances re-
quires recovery to be done online.
The main goal of the recovery process requires
some kind of state transfer to the recovering replica.
Several approaches can be undertaken, ranging from transferring state in bulk to applying updates incrementally, and from using a single donor to multiple donors. These ap-
proaches define trade-offs on how quickly a replica
is recovered and the impact on the performance of
the overall system. Different approaches may be se-
lected based on a multitude of factors: how far behind
the recovering replica is; whether its state is consis-
tent, even if outdated; the number of available donors,
etc. The impact of these factors on recovery has been
evaluated and analysed in (Vilaca et al., 2009). It
may also be possible to take advantage of workload
patterns and/or data partitioning to improve recovery,
namely when applying missing updates to a recover-
ing replica (Jiménez-Peris et al., 2002).
Intra-datacentre recovery, in the context of CDBA,
presents different challenges from the target models
generally considered when designing recovery proto-
cols. The main differences are: the substantial com-
puting power of each replica, due to the use of the
state-of-the-art, many-core Bullion hardware, as op-
posed to considering commodity or COTS servers;
a very low-latency, high-bandwidth communication
network connecting replicas, as opposed to higher la-
tency LAN or WAN networks; and the number of
available replicas, which is restricted to two, the min-
imum to provide high-availability, as opposed to quo-
rum based solutions, which require a minimum of 3
replicas. These differences have a fundamental im-
pact on the design of the replication mechanism and
consequently, the recovery mechanism, as these open
up the possibility of exploring unusual trade-offs between throughput and availability.
2.1 Replication in CloudDBAppliance
Due to the interdependence of the replication and re-
covery mechanisms, this section presents an overview
of CDBA's intra-datacentre replication protocol. Fur-
ther details can be found in (Ferreira et al., 2019).
The recovery mechanisms considered were de-
signed and implemented as part of the replication
middleware layer that provides isolation between
users and the underlying operational database. To
achieve this, the middleware is placed as a top-tier layer, intercepting SQL statements and perform-
ing all required steps to accommodate the replication
mechanisms. The replication middleware removes
non-determinism from requests and ensures these are
totally-ordered. The middleware approach simplifies
integration and extends the possibility of considering
this replication mechanism beyond the CloudDBAp-
pliance project, by offering a completely decoupled
solution. This can be done by embedding the mid-
dleware in the JDBC driver used for communication
between clients and database servers.
The replication middleware relies on a JDBC-enabled API that contains the key interfaces between a client-side proxy and a server-side stub. These interfaces were selected by reusing V-JDBC for JDBC request interception and hand-over between client proxies and servers. As detailed and assessed in (Ferreira et al., 2019), choosing V-JDBC allowed for a flexible environment where the transport protocol can be customised according to the application itself.
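As an illustration only, and not the project's actual V-JDBC code, the following Java sketch shows how SQL statements can be intercepted at the JDBC level by wrapping a Connection in a dynamic proxy; the class name, the interceptor callback and the connection URL are hypothetical.

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.Proxy;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.function.Consumer;

    public final class InterceptingConnectionFactory {

        // Wraps a real Connection: every prepareStatement/prepareCall invocation hands
        // the SQL text to the interceptor (e.g. the replication write path) before the
        // call is delegated to the underlying JDBC driver.
        public static Connection wrap(Connection real, Consumer<String> sqlInterceptor) {
            InvocationHandler handler = (proxy, method, args) -> {
                boolean carriesSql = ("prepareStatement".equals(method.getName())
                        || "prepareCall".equals(method.getName()))
                        && args != null && args.length > 0 && args[0] instanceof String;
                if (carriesSql) {
                    sqlInterceptor.accept((String) args[0]);
                }
                return method.invoke(real, args);
            };
            return (Connection) Proxy.newProxyInstance(
                    Connection.class.getClassLoader(),
                    new Class<?>[] { Connection.class },
                    handler);
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical connection URL; any JDBC driver would do.
            Connection raw = DriverManager.getConnection("jdbc:postgresql://localhost/ops");
            Connection conn = wrap(raw, sql -> System.out.println("intercepted: " + sql));
            conn.prepareStatement("UPDATE warehouse SET w_ytd = w_ytd + 10 WHERE w_id = 1");
        }
    }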
The replication architecture, depicted in Figure 1, is based on a set of reliable distributed logs. These
enable decoupling clients from the replication man-
ager instances and the operational databases. The log
allows requests to be stored reliably and with total
order guarantees. Briefly, when a client application
sends a SQL request, the request is sent to a write-
proxy that acts as a handler for the distributed log
structure. After being reliably stored, the distributed
log structure is used by the replication manager in-
stances that pull the requests and push these for exe-
cution at its local operational database instance. The
distributed log, conceptually considered as part of the
ADITCA 2019 - Special Session on Appliances for Data-Intensive and Time Critical Applications
448
Failure
Detector
Operational Database
Replication Manager
Application
Application
primary
SQL
data/acks
V-JDBC
client
V-JDBC Server
Failure
Detector
Operational Database
Replication Manager
standby
V-JDBC Server
Figure 1: Replication in CloudDBAppliance.
replication mechanism, allows for a customisable de-
ployment, making it possible to launch as many in-
stances as required. Storing requests reliably in the
distributed log makes it possible to keep processing
in the critical write path to a minimum and, therefore,
minimizing the performance impact of the replication
middleware on the operational database. Addition-
ally, read and write paths are decoupled.
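To make the pull-and-apply path concrete, the following is a minimal Java sketch of the loop a replication manager could run, assuming a hypothetical DistributedLog interface that exposes requests by total-order offset; it is not the project's actual implementation.

    import java.sql.Connection;
    import java.sql.Statement;
    import java.util.List;

    // Hypothetical view of the distributed log: totally-ordered entries addressed by offset.
    interface DistributedLog {
        List<LogEntry> read(long fromOffset, int maxEntries);
    }

    record LogEntry(long offset, String sql) {}

    final class ReplicationManagerLoop {
        private final DistributedLog log;
        private final Connection db;   // local operational database
        private long lastApplied;      // offset of the last request executed locally

        ReplicationManagerLoop(DistributedLog log, Connection db, long lastApplied) {
            this.log = log;
            this.db = db;
            this.lastApplied = lastApplied;
        }

        // Pulls requests from the log and executes them, in total order, against the
        // local operational database, keeping track of the last applied offset.
        void run() throws Exception {
            while (true) {
                List<LogEntry> batch = log.read(lastApplied + 1, 128);
                for (LogEntry entry : batch) {
                    try (Statement stmt = db.createStatement()) {
                        stmt.execute(entry.sql());
                    }
                    lastApplied = entry.offset();
                }
                if (batch.isEmpty()) {
                    Thread.sleep(1); // back off briefly when fully caught up
                }
            }
        }
    }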
Also, judiciously partitioning requests among different logs makes it possible to create independent request queues, establishing distinct shards and thus enabling requests from different shards to be processed in parallel, leading to performance gains.
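As a minimal illustration of such routing, and assuming (our assumption, not stated above) that the TPC-C warehouse identifier is used as the shard key, requests could be assigned to one of N independent logs as follows.

    // Illustrative shard routing: a request's shard key (assumed here to be the
    // TPC-C warehouse identifier) deterministically selects one of N independent logs.
    public final class ShardRouter {
        private final int numLogs;

        public ShardRouter(int numLogs) {
            this.numLogs = numLogs;
        }

        // Maps a shard key to a log partition; requests for different partitions
        // can be stored and applied in parallel.
        public int logFor(int warehouseId) {
            return Math.floorMod(warehouseId, numLogs);
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter(4);
            System.out.println(router.logFor(42)); // prints 2
        }
    }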
3 FAILOVER AND RECOVERY
FOR HIGH AVAILABILITY
The recovery mechanism and the replication mecha-
nism overviewed in Section 2 are complementary.
The failover and recovery mechanisms depend on the replication manager's failure detector compo-
nent. Briefly, the failure detector exchanges heart-
beats through a reliable communication channel and
monitors that channel for the presence of heartbeats
from the other replication manager instance. The ab-
sence of heartbeats from a given replication manager
instance triggers recovery and/or failover procedures
as appropriate: a failed primary replica triggers the
failover procedure; as a failed replica reboots and is reintegrated into the system, a recovery process takes place before it is able to enter the standby state.
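The following Java sketch illustrates the timeout-based idea behind such a heartbeat failure detector; the heartbeat transport is omitted and all names and time-outs are illustrative, not the project's actual code.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    final class HeartbeatFailureDetector {
        private final AtomicLong lastHeartbeatMillis = new AtomicLong(System.currentTimeMillis());
        private final long timeoutMillis;
        private final Runnable onSuspectedFailure; // triggers failover/recovery as appropriate

        HeartbeatFailureDetector(long timeoutMillis, Runnable onSuspectedFailure) {
            this.timeoutMillis = timeoutMillis;
            this.onSuspectedFailure = onSuspectedFailure;
        }

        // Called by the communication channel whenever a heartbeat from the peer arrives.
        void heartbeatReceived() {
            lastHeartbeatMillis.set(System.currentTimeMillis());
        }

        // Periodically checks whether the peer replication manager has been silent too long.
        void start() {
            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                long silent = System.currentTimeMillis() - lastHeartbeatMillis.get();
                if (silent > timeoutMillis) {
                    onSuspectedFailure.run(); // e.g. promote the standby or start recovery
                }
            }, timeoutMillis, timeoutMillis, TimeUnit.MILLISECONDS);
        }
    }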
Next, a description of the recovery process is provided; a simplified sketch of the corresponding control flow follows the list.
1. The replication manager requests a state transfer
from the live appliance to the restarted replica, in
order to bring it up to a consistent state.
2. The transferred state is installed in the recovering
replica.
3. The recovering replica checks the distributed log
for outstanding requests and determines the point
from which it can enter the standby state.
4. The recovering replica reaches the designated out-
datedness point and enters the standby state, end-
ing the recovery process.
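The sketch below mirrors steps 1 to 4, with state transfer, log access and replay abstracted behind hypothetical interfaces; it does not correspond to the project's actual classes.

    final class RecoveryCoordinator {

        interface StateTransfer {
            // Installs the donor's state locally and returns the log offset it covers.
            long transferFrom(String donor) throws Exception;
        }

        interface Replayer {
            // Applies outstanding requests between two log offsets, in order.
            void replayUntil(long fromOffsetExclusive, long toOffsetInclusive) throws Exception;
        }

        interface Log {
            long tailOffset(); // offset of the most recent request in the distributed log
        }

        enum ReplicaState { RECOVERING, STANDBY }

        ReplicaState recover(StateTransfer transfer, Log log, Replayer replayer, String donor)
                throws Exception {
            long installedUpTo = transfer.transferFrom(donor);   // steps 1 and 2
            long target = log.tailOffset();                      // step 3: pick the catch-up point
            replayer.replayUntil(installedUpTo, target);         // step 4: reach the designated point
            return ReplicaState.STANDBY;
        }
    }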
Because replication is asynchronous, the point of
outdatedness at which the recovering replica can enter
the standby state can be configured. One possibility
is to allow the recovering replica to enter the standby
state as soon as step 3 is performed.
In fact, different requirements can be defined to
end the recovery process depending on whether it was
the standby or the primary that failed. The failover
mechanism additionally requires the standby replica
to determine from the log the point (state) from which
it can start answering clients assuming the role of pri-
mary. Missing updates are reconstructed from the in-
formation in the distributed log. Thus, both the intra-
datacentre failover and recovery mechanisms take ad-
vantage of the persistence of the distributed log.
Because of the decoupling between clients and
replicas, neither recovery nor failover is visible, in
general, to client applications. This property depends
on the ability of the recovering replica to catch up.
The low-latency, high-bandwidth network connection
and the high computing capacity of the Bullion are instrumental in providing it. Results in Section 4.2
support this ability.
4 EVALUATION
The evaluation for the recovery process of the intra-
datacentre high-availability mechanisms assesses the
time to identify, request, transfer and reintegrate state
on the new or faulty replica. The most relevant metric
to observe is the Mean Time To Recovery (MTTR).
Figure 2: Configuration for the evaluation campaign (a login node running the TPC-C clients and JDBC proxies, connected to two Bullion appliances, each hosting a JDBC proxy, a replication manager, the distributed log and an operational database).

One major factor in the MTTR is expected to be the amount of data to be transferred, which depends
on overall database size and the type of failure. For
example, if the failed replica has consistent data snap-
shots available, it may only need to pull and synchro-
nise over any missing snapshots. Otherwise, recover-
ing the replica requires a full state transfer. However,
remote file sharing protocols can be used to comple-
ment and improve data transfer. We focus on two techniques for state transfer: a full state transfer using a backup mechanism, and an incremental transfer using rsync to minimize the amount of transferred data. The impact of the recovery process
on transactional activity is also analysed.
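As an illustrative sketch only, and assuming the operational database keeps its state in a file-system data directory that can be copied (paths, host names and flags are hypothetical), the two state-transfer options could be driven as follows.

    import java.io.IOException;
    import java.nio.file.Path;

    final class StateTransferTools {

        // Full state transfer: copy the donor's entire backup over the network.
        static void fullTransfer(String donorHost, Path donorBackupDir, Path localDataDir)
                throws IOException, InterruptedException {
            run("scp", "-r", donorHost + ":" + donorBackupDir, localDataDir.toString());
        }

        // Incremental transfer: rsync only sends the blocks that differ from the local copy.
        static void rsyncTransfer(String donorHost, Path donorDataDir, Path localDataDir)
                throws IOException, InterruptedException {
            run("rsync", "-az", "--delete", donorHost + ":" + donorDataDir + "/", localDataDir.toString());
        }

        private static void run(String... command) throws IOException, InterruptedException {
            Process p = new ProcessBuilder(command).inheritIO().start();
            if (p.waitFor() != 0) {
                throw new IOException("state transfer command failed: " + String.join(" ", command));
            }
        }
    }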
4.1 Benchmarking Campaign
Experiments were conducted in the scenario depicted
in Figure 2. Each appliance is a standalone state-of-
the-art Bullion machine, tailored for CloudDBAppli-
ance.
The hardware specification for the machines con-
sidered is detailed in Table 1.
The industry-standard TPC-C benchmark was
used to induce a transactional load on the system. The
TPC-C specification models a real-world scenario
where a company, comprised of several warehouses
and districts, processes orders placed by clients. The
workload is defined over 9 tables operated by a trans-
action mix comprised of five different transactions,
namely: New Order, Payment, Order Status, Deliv-
ery and Stock-Level. Each transaction is composed
of several read and update operations, where 92% are
update operations, which characterizes this as a write-heavy workload. The benchmark is divided into a
load and an execution stage. During the first stage,
the database tables are populated and during the sec-
ond stage, the transaction mix is executed over that
dataset. TPC-C defines how these tables are popu-
lated and also defines their size and scaling requirements, which are indexed to the number of configured
warehouses in the system. The outcome of this bench-
mark is a metric defined as the maximum qualified
throughput of the system, tpmC, or the number of
New Order transactions per minute.
In order to evaluate the scalability of the recovery
mechanism, the TPC-C benchmark was run in two load
configurations, featuring 100 and 1000 warehouses.
In these experiments, execution starts with two
live appliances executing the requests of the running
benchmark. At some point, a fault is induced. Be-
cause our focus is on evaluating the impact of differ-
ent techniques for state transfer, in these experiments,
recovery is manually triggered. This makes it pos-
sible to evaluate different recovery scenarios, induc-
ing different degrees of divergence between the live
replica and the failed replica. As such, we measured
the MTTR from the time the recovery process is trig-
gered until the state has been installed on the replica.
4.2 Results
Figure 3 reports on the MTTR and the impact
of recovering a failed standby on the benchmark’s
throughput using the default backup process. Us-
ing the default backup process creates a full copy
of all data handled by a replica. This process will
create a base copy, incrementally expanded as the
database grows. A recovery mechanism using the de-
fault backup process must do a full state transfer in
the event of a failure.
Table 1: Hardware description for the evaluation campaign.

Machine             Login node               Bullion 1                     Bullion 2
CPU Vendor          Intel                    Intel                         Intel
CPU Model           Xeon E5-2667 v3 3.2GHz   Xeon Platinum 8150 3.00GHz    Xeon Platinum 8153 2.00GHz
CPU Number          16                       192                           256
Memory              65.69 GB                 4.2 TB                        3.16 TB
Storage type        SSD                      SSD (shared)
Storage capacity    250 GB                   4.9 TB (shared)

Figure 3: Recovery using the default backup process for 100 warehouses (throughput in tpmC over time, annotated with the points at which the replica failed, recovery started, state transfer completed and state was installed, ready to receive requests).

Figure 4: Recovery using rsync for 100 warehouses (throughput in tpmC over time, with the same annotations).

The recovery process was triggered 20 seconds after the replica failed. As the failed replica tries to resume activity, its replication manager requests a state transfer from the live replica, which completed during the following 3 minutes. After installing the
transferred state, the replica determines the last (glob-
ally identified) transaction to be processed and then
resumes normal transaction execution from the dis-
tributed log. Overall, the Mean Time To Recovery
was approximately 4 minutes. During the entire re-
covery process, the observed throughput did not suf-
fer a significant impact, as depicted by the blue line in
Figure 3.
Figure 4 showcases the same configuration as in
Figure 3, but the rsync protocol is used instead to
bring the recovering database instance up-to-date with
the donor. The rationale is that using rsync may
avoid a full state transfer. Again, throughput during the recovery process was not considerably impacted. However, using rsync caused state transfer to take longer, over 5 minutes, having a negative impact on the MTTR, which increased to just below 7 minutes. Because the Bullions communicate through a high-speed connection, the overhead of having to determine exactly which data need to be transferred, particularly considering that the operational database instance remains online during the process, outweighs the cost of sending more data through the network.

Figure 5: Recovery using the default backup process for 1000 warehouses (throughput in tpmC over time, annotated with the points at which the replica failed, recovery started, state transfer completed and state was installed, ready to receive requests).
Figure 5 depicts the same scenario as Figure 3 but
for a configuration with 1000 warehouses. The main
goal was to evaluate what the impact on the MTTR
would be when considering a larger dataset. Also, the recovery process was only triggered after a considerable delay (over 5 minutes), to make sure the recovering replica was significantly behind the live replica. While state transfer might be expected to take significantly longer than in the previous experiment, results show that the default recovery mechanism scales, provided that the replicas communicate through a high-speed network connection. Again, the impact on throughput was minimal: on average, throughput remained stable at around 96 thousand tpmC, even if a small decay was registered during state transfer and installation.
5 CONCLUSION
This paper introduced CloudDBAppliance's intra-datacentre recovery mechanism, an essential part of its high-availability framework. It takes ad-
vantage of the middleware architecture of the repli-
cation mechanism and of the persistence provided
by the distributed log. This framework also benefits from differentiating characteristics of the Cloud-
DBAppliance project, stemming from its state-of-the-
art hardware. The recovery process was evaluated
using two alternative state transfer techniques. Re-
sults show that the impact of the recovery process
on a transactional workload induced by running the
TPC-C benchmark is negligible. Moreover, results
show the recovery mechanism scales well with an increased transactional workload and with increased divergence of the recovering replica. The proposed
replication, failover and recovery mechanisms pro-
vide high-availability while still decoupling clients
from database instances, enabling their applicability to other scenarios. Regarding future work, we plan to address the inter-datacentre configuration, where the recovery mechanism needs to accommodate updates propagated over the WAN.
ACKNOWLEDGMENTS
The research leading to these results has received
funding from the European Union's Horizon 2020, the EU Framework Programme for Research and Innovation 2014-2020, under grant agreement No. 732051, and by the ERDF (European Regional Development Fund) through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020).
REFERENCES
Amir, Y. (1995). Replication using group communication
over a partitioned network. PhD thesis, Citeseer.
Elnikety, S., Pedone, F., and Zwaenepoel, W. (2005).
Database replication using generalized snapshot isola-
tion. In 24th IEEE Symposium on Reliable Distributed
Systems (SRDS’05), pages 73–84. IEEE.
Ferreira, L., Coelho, F., Alonso, N., and Pereira, J. (2019).
Towards intra-datacentre high-availability in CloudDBAppliance. In CLOSER (1).
Jiménez-Peris, R., Patiño-Martínez, M., and Alonso, G.
(2002). Non-intrusive, parallel recovery of replicated
data. In 21st IEEE Symposium on Reliable Distributed
Systems, 2002. Proceedings., pages 150–159. IEEE.
Kemme, B. and Alonso, G. (2000). A new approach to de-
veloping and implementing eager database replication
protocols. ACM Transactions on Database Systems
(TODS), 25(3):333–379.
Kemme, B., Bartoli, A., and Babaoglu, O. (2001). Online
reconfiguration in replicated databases based on group
communication. In 2001 International Conference on
Dependable Systems and Networks, pages 117–126.
IEEE.
Vilaca, R. M. P., Pereira, J. O., Oliveira, R. C., Armendariz-
Inigo, J. E., and de Mendivil, J. R. G. (2009). On
the cost of database clusters reconfiguration. In 2009
28th IEEE International Symposium on Reliable Dis-
tributed Systems, pages 259–267. IEEE.