Modern Federated Database Systems: An Overview

Leonardo Guerreiro Azevedo, Elton Figueiredo de Souza Soares, Renan Souza and Marcio Ferreira Moreno

IBM Research, Brazil

Keywords:

Federated Database, Polyglot Database, Multistore, Polystore, Multidatabase, Heterogeneous Data Stores, NoSQL, DBaaS, Distributed File System, Data Processing Frameworks.

Abstract:

Modern applications usually manipulate datasets with diverse models, usages, and storage systems. “One size fits all” approaches are not sufficient for heterogeneous data, storage, and schemas. The rise of new kinds of data stores and processing, such as NoSQL data stores, distributed file systems, and new data processing frameworks, brought new possibilities to meet this scenario’s requirements. However, semantic, schema, and storage heterogeneity, autonomy, and distributed processing are still among the main concerns when building data-driven applications. This work surveys the literature to give an overview of the state of the art of modern federated database systems. It presents the background, characterizes existing tools, outlines guidelines to follow when creating solutions, and points out research challenges to consider in future work. It provides fundamentals for researchers and practitioners in the area.

1 INTRODUCTION

Several modern applications manipulate diverse datasets with different models and usages, e.g., in medical informatics and intelligent transportation. “One size fits all” is not effective in such scenarios. Using a single database and a single data model for data that follow different data models may degrade performance, and executing ETL (Extract-Transform-Load) processes to load all data into a single database may be very expensive (Stonebraker et al., 2007). Besides, manual data curation and maintenance of the ETL pipelines (due to adaptations caused by, e.g., domain evolution) are labor-intensive (Tan et al., 2017) (Bondiombouy and Valduriez, 2016) (Stonebraker, 2015).

The problem of accessing heterogeneous data sources has been studied in the context of multidatabase and data integration systems (Kolev et al., 2016a). Several new data management solutions have emerged, such as distributed file systems (e.g., GFS (Google File System) and HDFS (Hadoop Distributed File System)), NoSQL data stores (e.g., MongoDB, AllegroGraph, Neo4j, Titan, Dynamo, BigTable, Redis), and new data processing frameworks (e.g., Spark), as well as hybrid systems (multimodel, e.g., OrientDB and ArangoDB, or NewSQL, e.g., Google F1 and LeanXcale). The RDBMS (Relational Database Management System) has also evolved to manage different kinds of data (e.g., multimedia objects, XML documents, spatial data). IBM DB2, for example, was built on a standard SQL engine but has evolved into a hybrid data management system for structured and unstructured data. Still, using a single DBMS usually results in a loss of performance and flexibility for specific applications. For instance, a column-oriented DBMS is one order of magnitude better for Online Analytical Processing (OLAP) workloads than an RDBMS (Özsu and Valduriez, 2020), while an SDBMS (Stream Database Management System) is more efficient for stream data, which an RDBMS does not even support (Nayak et al., 2013). Thus, a variety of data-processing architectures may be required for specialized markets (Stonebraker et al., 2007).

Heterogeneity of schemas, semantics, and data sources, as well as autonomy and distributed processing, are still concerns (Tan et al., 2017). A federated system arises as a solution. It is a middleware that provides a seamless interface to heterogeneous data systems, each with an independent data model and (perhaps) independent data schemas (Stonebraker, 2015).

This work overviews the state of the art of the new generation of federation systems.

It is organized as follows. Section 2 presents the main concepts. Section 3 characterizes existing tools. Section 4 presents guidelines and research challenges. Finally, Section 5 concludes.

IBM DB2: https://www.ibm.com/analytics/db2

Azevedo, L., Soares, E., Souza, R. and Moreno, M.

Modern Federated Database Systems: An Overview.

DOI: 10.5220/0009795402760283

In Proceedings of the 22nd International Conference on Enterprise Information Systems (ICEIS 2020) - Volume 1, pages 276-283

ISBN: 978-989-758-423-7

2 MAIN CONCEPTS

The shift in federated databases arose from the storage requirements of modern applications, which led to the development of several distinct storage technologies to meet specific needs. These technologies are now required to work together. This section overviews the state of the art.

2.1 Storage Solutions

There are three main layers of storage: distributed storage, database management, and distributed processing (Bondiombouy and Valduriez, 2016).

Distributed Storage. Includes file and object storage. File storage works on unstructured data (i.e., sequences of bytes), organizing them as fixed-length or variable-length records. The system organizes files hierarchically and stores file metadata (e.g., file name, owner, access permissions) separately from the content. Examples of shared-nothing systems are GFS, HDFS, and GlusterFS; an example of a shared-disk system is Global File System 2 (GFS2). Object storage stores data as objects, each with a unique identifier (oid), properties, and metadata. Examples are Lustre and XtreemFS. Ceph and Ozone are systems that combine block and object storage.

NoSQL (Not Only SQL) Systems. Emphasize scalability, fault tolerance, and availability, sometimes at the expense of consistency. The main categories are key-value, wide column, document, and graph, as well as hybrid (multimodel or NewSQL) (Özsu and Valduriez, 2020). SolidIT presents a comparison of systems.

Key-value Systems: store data as key-value pairs where the key identifies the record and the value is schemaless data. Their typical operations are put(key, value), get(key), and delete(key). Examples are Redis, Dynamo, Memcached, and Riak. Extended key-value systems store records (sets of key-value pairs) in collections (or domains), e.g., the domain Customers, where each customer has a customer id, first name, last name, etc. Examples are Amazon SimpleDB and Oracle NoSQL Database.
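The three typical operations can be illustrated with a minimal in-memory sketch. The class name `KeyValueStore` and the sample records are hypothetical; this is not the API of Redis, Dynamo, or any other system cited above, and it ignores distribution, persistence, and replication.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store (illustrative only, not a product API)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # The value is opaque to the store: no schema is checked.
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Returns True only if a record was actually removed.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
# Values under different keys may have completely different shapes.
store.put("customer:1", {"first_name": "Ada", "last_name": "Lovelace"})
store.put("counter", 42)
```

Because the store never inspects the value, any filtering on value content has to happen in the application, which is one reason extended key-value and document systems add collections and richer query interfaces.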

Wide Column Systems: store data in tables but allow nested values in a schemaless way, i.e., a column may itself contain columns. Each column has a name, a value, and a timestamp (used for versioning). Examples are Google Bigtable, Apache HBase, Cassandra, and Accumulo.

SolidIT comparison of systems: https://db-engines.com/en/ranking

Document Systems: are advanced key-value systems where values are documents, such as JSON, YAML, or XML. They store records in collections (similar to tables). Records in a collection may have different schemas. Besides simple key-value operations, document stores offer an API or query language. Examples are MongoDB, CouchDB, Couchbase, RavenDB, and Elasticsearch.

Graph Database Systems: manipulate data as graphs. Their use has grown to manage data with an inherent graph-like nature, e.g., the Web, geographical systems, and transportation, telephone, social, and biological networks. The graph database model represents schema and instances as a (labeled) (directed) graph or a generalization of the graph structure (e.g., hypergraphs or hypernodes), together with graph integrity constraints (Angles and Gutierrez, 2008). Data are represented as nodes and edges (each connecting two nodes). E.g., horse and apple nodes and a likes edge represent that the horse likes the apple. Nodes and edges may have properties, e.g., name and birthday properties for horse and color for apple. Often, these systems provide query languages that allow for graph traversals and other typical graph operations, such as breadth-first and depth-first search. Examples are Neo4j, InfiniteGraph, Titan, GraphBase, Trinity, and Sparksee.
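The horse/apple example can be written down as a small property graph with a breadth-first traversal. This is a minimal sketch in plain Python, not the data model or query language of any system listed above; the property values and the extra `tree` node are assumptions added only to make the traversal non-trivial.

```python
from collections import deque
from typing import Dict, List, Tuple

# Nodes carry property dictionaries (property values are made up).
nodes: Dict[str, dict] = {
    "horse": {"name": "Spirit", "birthday": "2015-04-01"},
    "apple": {"color": "red"},
    "tree": {"kind": "apple tree"},  # hypothetical extra node
}

# Directed, labeled edges: (source, label, target).
edges: List[Tuple[str, str, str]] = [
    ("horse", "likes", "apple"),
    ("apple", "grows_on", "tree"),
]


def neighbors(node: str) -> List[str]:
    """Targets of all outgoing edges of `node`."""
    return [dst for src, _label, dst in edges if src == node]


def breadth_first(start: str) -> List[str]:
    """Nodes reachable from `start`, in breadth-first visiting order."""
    visited = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current):
            if nxt not in visited:
                visited.append(nxt)
                queue.append(nxt)
    return visited
```

Calling `breadth_first("horse")` visits `horse`, then `apple`, then `tree`. A graph DBMS would typically express the same traversal declaratively and rely on index-free adjacency or similar structures instead of scanning an edge list.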

Triplestores: also called RDF stores, are the systems of choice for storing and querying semantic datasets (Haslhofer et al., 2011) (Iancu and Georgescu, 2018), which are often described using RDF (Resource Description Framework), a standard model for data interchange. In RDF, datasets are represented as triples (subject, predicate, object), i.e., a value (object) of a property (predicate) of a resource (subject) (Zulkefli et al., 2013). E.g., (LeonardoDaVinci, hasCreated, TheMonalisa). Each part is represented as a Uniform Resource Identifier (URI), although objects may also be literal values. RDFS (RDF Schema) and OWL (Web Ontology Language) are RDF-serializable vocabularies commonly used to represent ontologies, which define classes and attributes of URIs and their relationships (Iancu and Georgescu, 2018). Moreover, triplestores are capable of processing large amounts of RDF data (Modoni et al., 2014), handling semantic queries, and using inference to uncover new information from the existing relations. Examples are AllegroGraph, GraphDB, MarkLogic, Mulgara, Profium Sense, Blazegraph, Virtuoso, Marmotta, Stardog, Apache Jena, RDF4J (formerly Sesame), and Oracle Database 12c (Iancu and Georgescu, 2018).
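Triple-pattern matching and inference can be sketched as follows. This is an illustration only: plain strings stand in for URIs, the `type` and `subClassOf` predicates and the `Painting`/`Artwork` classes are assumptions added to the paper's example, and the single rule shown is applied once rather than to a fixpoint as a real reasoner would.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

triples: Set[Triple] = {
    ("LeonardoDaVinci", "hasCreated", "TheMonalisa"),
    ("TheMonalisa", "type", "Painting"),    # assumed for illustration
    ("Painting", "subClassOf", "Artwork"),  # assumed for illustration
}


def match(s: Optional[str], p: Optional[str], o: Optional[str]) -> List[Triple]:
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def infer_types() -> Set[Triple]:
    """One RDFS-style rule: (x type C) and (C subClassOf D) imply (x type D)."""
    inferred: Set[Triple] = set()
    for x, _p, c in match(None, "type", None):
        for _c, _q, d in match(c, "subClassOf", None):
            inferred.add((x, "type", d))
    return inferred - triples  # keep only facts that were not stated explicitly
```

Here `match("LeonardoDaVinci", None, None)` plays the role of a one-pattern semantic query, and `infer_types()` derives that `TheMonalisa` is an `Artwork` although that triple was never asserted.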

Hybrid Data Stores: combine capabilities typically found in different data stores and DBMSs. They may be multimodel NoSQL systems, which combine multiple data models (examples are OrientDB, ArangoDB, and Microsoft Azure Cosmos DB), or NewSQL DBMSs, which combine the scalability of NoSQL with the strong consistency and usability of relational DBMSs (examples are Google F1, LeanXcale, and Apache Ignite, among others). Hybrid Transaction and Analytics Processing (HTAP) is a class of NewSQL systems that aims at performing OLAP and OLTP on the same data, allowing real-time analysis and avoiding ETL processing.

Data Processing Frameworks. Handle a high volume of data in real time (Zheng et al., 2015). They focus on data analysis to increase understanding, discover patterns, and gain insights. They handle data in batches, in a continuous stream, or both ways (Gurusamy et al., 2017). Typically, those systems support operators that are automatically parallelized (Bondiombouy and Valduriez, 2016). Examples are (Gurusamy et al., 2017): (i) Batch-only: Hadoop MapReduce; (ii) Stream-only: Apache Storm and Apache Samza; (iii) Hybrid: Apache Spark and Apache Flink.
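The difference between batch and stream handling can be sketched with a simple counting task. The function names are hypothetical and the code runs on a single machine; it does not use the API of Hadoop, Storm, Spark, or Flink, which would additionally partition and parallelize the work.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def batch_count(events: Iterable[str]) -> Dict[str, int]:
    """Batch style: the complete dataset is available before processing."""
    return dict(Counter(events))


def stream_count(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Stream style: emit an updated result after every incoming event."""
    counts: Dict[str, int] = {}
    for event in events:
        counts[event] = counts.get(event, 0) + 1
        yield dict(counts)  # snapshot of the running state
```

For the same input, the last snapshot produced by `stream_count` equals the single result of `batch_count`; the stream version simply makes intermediate results available while data is still arriving.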

2.2 Taxonomies

There are two main taxonomies to classify federated data systems.

Tan et al. proposed a taxonomy of four categories considering heterogeneity in data stores and query interfaces (Tan et al., 2017):

• Federated Database System. Homogeneous data stores and a single standard query interface. These systems feature a mediator-wrapper architecture and employ schema-mapping and entity-merging techniques for data integration. Semantic heterogeneity is a challenge. Example: Multibase.

• Polyglot System. Homogeneous data stores and multiple query interfaces. The different query interfaces offer different semantic expressiveness, which significantly simplifies query formulation. Example: Spark SQL allows access to data in relational and procedural modes.

• Multistore System. Heterogeneous data stores and a single query interface, categorized as:

– Systems that integrate distributed file systems with RDBMSs, such as HadoopDB, Polybase, and JEN.

– Systems that integrate NoSQL systems with RDBMSs, such as BigIntegrator, Forward, and D4M.

– Systems focused on optimizing data placement across data stores for query performance, such as ESTOCADA, Odyssey, and MISO.

– Systems that adopt ontologies and apply semantic approaches (schema-mapping and entity-resolution techniques) to mediate relational and non-relational data sources, such as TATOOINE and OPTIQUE.

• Polystore System. Heterogeneous data stores and multiple query interfaces, categorized as:

– Systems focused on query answering, such as BigDAWG, Myria, and Apache Drill.

– Systems that concentrate on multi-platform data-flow scheduling and analytics, such as QoX, Musketeer, and Rheem.

– Systems focused on data ingestion and derivation with heterogeneous data stores, such as AWESOME.
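The four categories form a two-by-two grid over store heterogeneity and the number of query interfaces. A minimal sketch of that reading of the taxonomy is shown below; the function name `classify` and its boolean parameters are illustrative and do not come from Tan et al.

```python
def classify(heterogeneous_stores: bool, multiple_query_interfaces: bool) -> str:
    """Map the two taxonomy dimensions to one of the four categories."""
    if heterogeneous_stores:
        return "polystore" if multiple_query_interfaces else "multistore"
    if multiple_query_interfaces:
        return "polyglot system"
    return "federated database system"
```

For instance, a system with heterogeneous stores and several query interfaces, as described for BigDAWG, falls in the polystore cell, while one with heterogeneous stores behind a single interface, as described for Polybase, is a multistore.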

Bondiombouy and Valduriez’s classification is based on the data coupling of the systems (Bondiombouy and Valduriez, 2016). They focus on cloud data stores and call the resulting systems multistores. A multistore is a system that provides integrated access to several data stores, such as NoSQL, RDBMS, or HDFS, sometimes through a data processing framework. Multistores can be classified as:

• Loosely Coupled System. Autonomous local data stores accessed through a common language or through their local language. Examples: BigIntegrator, Forward, and QoX.

• Tightly Coupled System. Local data stores accessed by the multistore system using a single language for querying structured and unstructured data. They aim at efficient querying for (big) data analytics and/or self-tuning. Examples: Polybase, HadoopDB, Estocada, Odyssey, and JEN.

• Hybrid System. Some data stores are loosely coupled, while others are tightly coupled. Examples: Spark SQL, CloudMdsQL, and BigDAWG.

2.3 Frameworks for Multidatabase Federation Characterization

Besides the taxonomies, Tan et al. and Bondiombouy and Valduriez proposed frameworks for characterizing federated data systems.

Tan et al. proposed a framework inspired by (Sheth and Larson, 1990) and composed of five dimensions: (i) Heterogeneity; (ii) Autonomy; (iii) Transparency; (iv) Flexibility; (v) Optimality.

Heterogeneity: in data-integration systems, heterogeneity implies a threefold design intent: (i) uniform access to and management of the stores’ data; (ii) taking advantage of the component processing engines; (iii) minimal loss of expressiveness of the underlying query interfaces. Heterogeneity may be classified into:

1. Data-store Heterogeneity: different modeling techniques, e.g., column stores, key-value stores.

2. Processing-engine Heterogeneity: different processing capabilities, e.g., processing engines modeled around relations, arrays, and graphs.

3. Query-interface Heterogeneity: different data models in query engines, supporting various formal algebras, towards expressiveness in a semantic context, e.g., a system with relational and array query interfaces allows expressing simple linear-algebra operations on relational data.

Autonomy: relates to regulations and constraints:

1. Association Autonomy: each local data store decides when to associate itself with or dissociate itself from the federation.

2. Execution Autonomy: local data stores may support both federation and native applications. They decide on prioritization when required.

3. Evolution Autonomy: local stores’ databases evolve independently from the federation, which has to adapt to the changes.

Transparency: concerns hiding the integration details of storage and layout:

1. Location Transparency: local stores provide mechanisms to hide location details.

2. Transformation Transparency: local stores provide mechanisms to hide details of data types and structures. Users can focus on logical-level transformations, and the transparent system infers and maps data types and adjusts data structures.

Flexibility: concerns the capacity to manage arbitrary formats and to support flexible workflows and user-defined functions.

1. Schema Flexibility: allows user-defined schemata and dynamic schema discovery, making it possible to automate transformations.

2. Interface Flexibility: the query interface allows user-defined functions and extensions instead of being built around a fixed algebra.

3. Architectural Flexibility: provides a modularized architecture that can be customized to different scenarios, making it possible to extend the query interface, query optimizer, backend engines, etc.

Optimality: concerns opportunities for optimization through improvements in data placement and federated query plan generation.

1. Federated Plan Optimization: generates federated query plans, composed of sub-queries or of data transformation and migration steps, to achieve better performance.

2. Data-placement Optimization: places the data in the best-fitting engine using, e.g., rule-based and cost-based methods.
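A cost-based placement decision can be sketched as choosing the engine with the lowest estimated total cost for an observed workload. The engines, query kinds, and cost figures below are entirely hypothetical and chosen only for illustration; real systems derive such estimates from statistics or monitoring.

```python
from typing import Dict

# Hypothetical relative cost of one query of each kind on each engine.
COST: Dict[str, Dict[str, float]] = {
    "relational": {"join": 1.0, "matrix_multiply": 20.0, "path_search": 15.0},
    "array": {"join": 8.0, "matrix_multiply": 1.0, "path_search": 25.0},
    "graph": {"join": 10.0, "matrix_multiply": 30.0, "path_search": 1.0},
}


def place(workload: Dict[str, int]) -> str:
    """Return the engine with the lowest total estimated cost.

    `workload` maps a query kind to how often it is expected to run.
    """
    def total(engine: str) -> float:
        return sum(COST[engine][kind] * count for kind, count in workload.items())

    return min(COST, key=total)
```

With these numbers, a join-heavy workload is placed on the relational engine and a workload dominated by matrix multiplication on the array engine. A rule-based method would replace the cost table with fixed rules such as "array data goes to the array engine".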

Tan et al. analyze polystore systems whose design and implementation emphasize query-processing and query-answering challenges, such as BigDAWG, CloudMdsQL, Myria, and Apache Drill.

The framework proposed by Bondiombouy and Valduriez to compare multistore systems is based on two dimensions: (i) Functionality, which covers the objective, the data model, the query language, and the supported data stores; (ii) Implementation Techniques, which cover special modules, schema management, query processing, and query optimization (Bondiombouy and Valduriez, 2016).

Related to Functionality, they point out:

• The major Objective of a multistore is the ability to integrate relational data (stored in an RDBMS) with other kinds of data in different data stores.

• Each multistore supports different kinds of Data Stores (e.g., RDBMS, NoSQL, BigTable, HDFS, Array DBMS, DSMS).

• In terms of the Data Model, most systems provide a relational abstraction. BigIntegrator, Polybase, and HadoopDB have relational data models. Forward and CloudMdsQL are JSON-based. QoX has a more general graph abstraction to capture analytic data flows. Spark SQL has a nested model. BigDAWG and Estocada have no unique model, since they allow access to the data stores through their native (or island) languages.

• Considering the Query Language, most systems provide a SQL-like language, such as BigIntegrator, HadoopDB (HiveQL), Spark SQL, and CloudMdsQL (with native subqueries). Polybase uses SQL. QoX is XML-based. Estocada and BigDAWG use native query languages.

Related to Implementation Techniques:

• Special Modules: refine the generic architecture or bring new functionalities. Examples of the former are the importer, absorber and finalizer, query processor, and query planner. Examples of the latter are the dataflow engine, HDFS bridge, and storage advisor.

• For Schema Management, most multistores manage a global schema following a Global-as-View (GAV) or a Local-as-View (LAV) approach, which indicates how the elements of the global schema can be derived, when needed, from the elements of the data source schemas (Lenzerini, 2002). In LAV, each data source schema is treated as a view definition over the global schema. In GAV, the global schema is defined as a set of views over the data source schemas. However, QoX, Estocada, Spark SQL, and CloudMdsQL do not support global schemas, although they provide mechanisms to deal with the data stores’ local schemas.

• The Query Processing techniques are usually extensions of known techniques from distributed database systems, e.g., data/function shipping, query decomposition (based on the data stores’ capabilities), bind join, and select pushdown.

• The Query Optimizations are usually supported by a (simple) cost model or heuristics.
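The GAV approach mentioned under schema management can be illustrated with a small sketch in which a global relation is defined as a view over two sources with different local schemas. All names (`relational_rows`, `document_store`, `global_customer`) and the sample records are hypothetical; no multistore discussed here is claimed to implement its mappings this way.

```python
from typing import Dict, Iterator, List

# Two hypothetical sources with different local schemas.
relational_rows: List[Dict] = [
    {"cust_id": 1, "full_name": "Ada Lovelace"},
]
document_store: List[Dict] = [
    {"_id": "c-2", "name": {"first": "Alan", "last": "Turing"}},
]


def global_customer() -> Iterator[Dict]:
    """GAV: the global relation customer(id, name) is a view over the sources."""
    for row in relational_rows:
        yield {"id": str(row["cust_id"]), "name": row["full_name"]}
    for doc in document_store:
        full_name = f'{doc["name"]["first"]} {doc["name"]["last"]}'
        yield {"id": doc["_id"], "name": full_name}


def query_names() -> List[str]:
    """A query over the global schema is answered by unfolding the view."""
    return sorted(customer["name"] for customer in global_customer())
```

Query answering is simple in GAV because the view definition says exactly how to compute the global relation, but adding a new source means editing `global_customer`. LAV inverts the trade-off: sources are described independently, and query answering requires rewriting queries using those descriptions.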

3 TOOLS

This section characterizes existing polystore tools.

• BigDAWG (Big Data Analytics Working Group) (Elmore et al., 2015; Tan et al., 2017).

– Description: polystore system for large-scale analytics, real-time streaming support, smaller analytics at interactive speeds, data visualization, and query processing over multiple databases. Each storage engine may have a different data model.

– Owner/License: Intel Science and Technology Center for Big Data (ISTC); BSD-3 license (see https://bigdawg.mit.edu and https://github.com/bigdawg-istc/bigdawg/blob/master/license.txt).

– Goal: data integration with a federated architecture over collections of vertically integrated database engines.

– Internal Data Representation and Platform for Data Operations: it employs no internal data model or intermediate algebra for query translation and data transformation.

– Context Segregation: BigDAWG separates data into islands, where each island has a data model, logical structure, query language or algebra, and one or more backend engines for data storage and query execution, e.g., a relational island may be composed of PostgreSQL and MySQL DBMSs.

– Query Specification: in the scope of the island, e.g., a SQL query to a relational island.

– Query Execution: queries are decomposed into subqueries executed by the database engines connected to the island. Queries over more than one island are expressed using a SCOPE operator. A CAST operator is used to change the semantic context in a cross-island query.

– Heterogeneity: handled by wrappers (shims).

– Main Components:

∗ Query-planning module (planner/optimizer);

∗ Performance-monitoring module (monitor);

∗ Data-migration module (migrator): moves data across database engines when needed;

∗ Query-execution module (executor).

– Demonstration: BigDAWG was demonstrated using the MIMIC II (“Multiparameter Intelligent Monitoring in Intensive Care II”) use case, which includes relational, array, stream, and key-value databases.
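The idea behind a cross-island query with a cast step can be sketched in plain Python. This is a conceptual illustration only: it does not use BigDAWG's query syntax or API, the function `cast_array_to_relation` merely mimics what a CAST between an array island and a relational island achieves, and the patient and waveform data are made up.

```python
from typing import Dict, List, Tuple

# Array-island object: a dense 2-D array stored as a list of rows.
array_island: Dict[str, List[List[float]]] = {
    "waveform": [[0.1, 0.4], [0.7, 0.2]],
}
# Relational-island object: tuples of (patient_id, waveform_row_index).
relational_island: Dict[str, List[Tuple[int, int]]] = {
    "patients": [(101, 0), (102, 1)],
}


def cast_array_to_relation(name: str) -> List[Tuple[int, int, float]]:
    """CAST-like step: re-express an array as (row, col, value) tuples."""
    return [
        (i, j, value)
        for i, row in enumerate(array_island[name])
        for j, value in enumerate(row)
    ]


def max_signal_per_patient() -> Dict[int, float]:
    """Cross-island query: join patients with the cast array and aggregate."""
    cells = cast_array_to_relation("waveform")
    result: Dict[int, float] = {}
    for patient_id, row_index in relational_island["patients"]:
        values = [value for i, _j, value in cells if i == row_index]
        result[patient_id] = max(values)
    return result
```

Once the array has been cast into tuples, the rest of the query is an ordinary relational join and aggregation, which is the point of changing the semantic context before continuing in another island.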

• CloudMdsQL (Tan et al., 2017; Kolev et al., 2016a; Kolev et al., 2016b).

– Definition: CloudMdsQL (Cloud Multidatastore Query Language) is a functional SQL-like language designed for querying multiple heterogeneous databases within a single query containing nested subqueries. The CloudMdsQL technology is now at LeanXcale, in a proprietary product. Only the compiler remains available for research.

– Owner/License: LeanXcale (https://www.leanxcale.com); proprietary.

– Goal: develop a functional SQL-like language for querying heterogeneous databases within a single query containing nested subqueries.

– Highlight: CloudMdsQL exploits local data stores by allowing part of the query to be expressed and processed as local native queries (e.g., a breadth-first search in a graph database), which can be called as functions. At the same time, the query can be optimized using a cost model, e.g., by pushing down select predicates, using bind joins, performing join ordering, or planning intermediate data shipping.

– Internal Data Representation and Platform for Data Operations: it uses a table-based common data model that supports other data types, such as arrays and JSON objects, to handle non-flat and nested data, with basic operators over them.

– Context Segregation: by database engines.

– Query Specification: SQL with embedded functional subqueries written in the native query languages of the database engines. The language also addresses distributed processing frameworks (e.g., Apache Spark), allowing the use of user-defined map/filter/reduce operators as subqueries.

– Query Execution: a subquery is defined as a named table expression in which the user defines the columns and types of the table and an expression, which is either a SQL SELECT (which the compiler can analyze and possibly rewrite) or a native expression (which is directly delegated to the corresponding data store). E.g., two named table expressions may query a relational database and a document store, and be joined to produce the result. The query compiler decomposes the query into a query execution plan (QEP) in the form of a directed acyclic graph of relational operators. Leaf nodes correspond to subqueries to be executed by the wrappers over the data stores.

– Heterogeneity: handled through a mediator/wrapper architecture and the table-based common model.

– Main Components:

∗ Query Planner: performs lexical and syntactic analysis, as well as query rewriting;

∗ Capability Manager: validates rewritten subqueries against data store capabilities;

∗ Query Optimizer: uses cost functions and database statistics in a cost model to select the best plan. Besides, users may define cost and selectivity functions. The optimizer may rewrite the QEP generated by the query compiler. CloudMdsQL uses bind joins to perform semi-joins across heterogeneous data stores.

∗ QEP Builder: generates plans and serializes them in JSON.

∗ Query Execution Controller: parses the QEP, identifies sub-plans, and invokes wrappers.

∗ Finalizer: translates subquery plans into native queries that can be executed by the engine.

∗ X-Ray (Guimarães and Pereira, 2015): monitors execution.
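The bind join mentioned for the optimizer can be sketched as follows: the join keys found on one side are shipped to the other store as a filter, so only matching rows travel back. The two "stores" are in-memory lists, and the names `orders`, `customers`, `fetch_customers`, and `bind_join` are hypothetical; this is not CloudMdsQL code.

```python
from typing import Dict, List, Set

# Outer side: rows from a hypothetical relational store.
orders: List[Dict] = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c3"},
]
# Inner side: documents in a hypothetical document store.
customers: List[Dict] = [
    {"_id": "c1", "name": "Ada"},
    {"_id": "c2", "name": "Alan"},
    {"_id": "c3", "name": "Grace"},
]
shipped: List[int] = []  # number of rows returned by each inner subquery


def fetch_customers(ids: Set[str]) -> List[Dict]:
    """Inner subquery with the join keys bound as a filter (like an IN list)."""
    rows = [c for c in customers if c["_id"] in ids]
    shipped.append(len(rows))
    return rows


def bind_join() -> List[Dict]:
    """Join orders with customers, retrieving only the customers needed."""
    keys = {o["customer_id"] for o in orders}             # 1. collect join keys
    by_id = {c["_id"]: c for c in fetch_customers(keys)}  # 2. bound subquery
    return [                                               # 3. local join
        {"order_id": o["order_id"], "name": by_id[o["customer_id"]]["name"]}
        for o in orders
        if o["customer_id"] in by_id
    ]
```

In this toy data set the inner store returns two of its three documents instead of all of them. The saving grows with the selectivity of the outer side, which is why an optimizer needs cost and selectivity estimates to decide when a bind join pays off.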

• Myria (Tan et al., 2017; Halperin et al., 2014).

– Definition: cloud service for big data management and analytics.

– Goal: simplify data upload and data science tasks with efficient query execution to process and explore the data.

– Owner/License: the University of Washington; BSD 3 license (see https://myria.cs.washington.edu/, http://www.washington.edu/, and https://github.com/uwescience/myria/blob/master/LICENSE).

– Internal Data Representation and Platform for Data Operations: relational data representation with a relational-algebra compiler.

– Context Segregation: at the level of the encapsulated data stores.

– Query Specification: in MyriaL, an imperative-declarative hybrid language, or via a Python API. It supports user-defined functions (UDFs) and user-defined aggregates (UDAs) via an exposed Python API.

– Query Execution: the Relational Algebra Compiler (RACO) parses queries in Myria algebra and transforms them into the specific API calls or query primitives supported by the local database engines. RACO uses rule-based optimization to generate federated query plans that take advantage of the performance characteristics of the supported database engines.

– Main Components:

∗ MyriaX: query-execution engine that uses a parallel, pipelined, possibly cyclic graph of dataflow operators with built-in support for asynchronous evaluation of recursive queries.

∗ RACO: query optimizer and federated query executor that uses relational algebra extended with imperative constructs capturing the semantics of array, graph, and key-value data models.

• Apache Drill (Tan et al., 2017; Hausenblas and Nadeau, 2013).

– Definition: distributed, massively parallel query engine.

– Owner/License: the Apache Software Foundation; Apache License (see https://drill.apache.org/ and https://drill.apache.org/apacheASF/).

– Goal: answer ad-hoc queries quickly over a huge amount of unstructured and weakly structured data spread across servers.

– Internal Data Representation and Platform for Data Operations: JSON-based data model.

– Context Segregation: at the level of the encapsulated data stores.

– Query Specification: supports ANSI SQL and MongoDB QL, as well as user-defined functions.

– Query Execution: queries are parsed and transformed into a logical plan, which is transformed and optimized into a physical plan.

– Main Components:

∗ Drillbit: daemon service running on cluster nodes, aiming at maximizing data locality.

∗ ZooKeeper: broker between the clients and Drillbits, and among Drillbits.

4 REQUIREMENTS AND RESEARCH CHALLENGES

The main requirements of a modern federated database system are:

• Location and Data Sources Encapsulation (Elmore et al., 2015): provide a smooth interface to free programmers from having to learn several query languages and storage engines.

• Deployment (Halperin et al., 2014): should support cloud practices; the complexity and cost of installation, administration, and maintenance should not be prohibitive; mechanisms for predicting and debugging performance, as well as for controlling costs, should be available.

• Query Language (Kolev et al., 2016b): should work on heterogeneous data stores; support arbitrary chains of queries, i.e., a query result from one database can be the input for a query in another database; be schema independent, i.e., allow the integration of databases with or without a schema; and allow data-metadata transformation, e.g., converting attributes or relations into data and vice versa.

• Query Tools: should support users and algorithm designers with an easy-to-use set of interfaces, languages, and APIs that scale from simple SPJ (Select-Project-Join) queries to advanced application-specific ones (Halperin et al., 2014).

• Processing (Halperin et al., 2014): should be efficient and support scalable queries and optimization, combining state-of-the-art and novel techniques, e.g., bind joins, semi-joins, and core parallel query processing concepts.

• Real-time Decision Support (Elmore et al., 2015): through stream processing able to connect historical and stream data.

• Visualization Tools (Elmore et al., 2015): supporting disparate data models and new user interaction mechanisms. E.g., big data application visualization should support questions like “give me something interesting from data” in an exploratory way.

• Shuffle Data among Backends (Elmore et al., 2015): i.e., support moving data and intermediate results from one storage system to another as needed to fit a user’s query and achieve high performance. E.g., each engine may know how to read binary data in parallel directly from another engine (Elmore et al., 2015). A monitoring system can learn the types of queries and move data accordingly, i.e., transfer the data to the engine that has the best data model to answer the query.

• Cross-system Solution (Elmore et al., 2015): able to include other polystores.

The research challenges are (Stonebraker, 2015) (Bondiombouy and Valduriez, 2016) (Elmore et al., 2015):

• Query Language: an easy-to-use query language with efficient processing over diverse stores. A SQL-like language facilitates integration with existing tools but compromises efficiency. Alternatives are to access the stores directly or to use a functional query language that allows native subqueries as functions within the query language.

• Complex Analytics: efficient combination of data from multiple stores with linear algebra algorithms. Analytics tasks have moved from relational analytics (e.g., COUNT, SUM, AVG with GROUP BY) to predictive models (e.g., machine learning, regression, statistics). The majority of such predictive models are based on linear algebra algorithms, such as regression analysis, singular value decomposition, eigenanalysis, and k-means clustering. Although linear algebra packages are optimized in software and hardware, they have different characteristics when compared to data systems, e.g., size of computation tiles, choice of networks, and compression techniques. So, algebra packages and data stores should be decoupled, and data should be converted back and forth between them inexpensively.

• Query Optimization: optimization should handle data-flow and multi-platform scheduling and be able to update the cost model or add new heuristics when data stores join or leave the system.

• Semantic Mapping and Record Linkage: automatic translation of utterances to the local dialect of a storage system and integration of the results.

• Automatic Copy: efficient copying of data between the stores and the federated system, considering data transformation and memory access techniques.

• Distributed Transactions: handle transactions

over distributed, heterogeneous store systems with

diverse local transaction models. The problem is

harder when considering, e.g., NoSQL data stores

that do not provide ACID (Atomicity, Consistency,

Isolation, and Durability) transaction support.

• Automatic Load Balancing and Provisioning:

automatically balance data load among data

stores by replicating or moving data, which

requires a monitoring feature.

• Novel User Interfaces: innovative data mining,

visualization, and browsing user interfaces. Typical

user interaction flows may not fit a modern federation.

For example, as data sources join or leave the

federation, new data may appear or disappear,

which brings new knowledge.

• Benchmarks: benchmarks should be developed to

evaluate federations, covering store combinations

and the processing of multiple query languages, like

PolyBench (Karimov et al., 2018).
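
As a rough illustration of the Query Optimization challenge listed above, the following minimal sketch shows one way a federation could keep a per-store cost model that changes as stores join or leave. All names (`StoreProfile`, `FederationOptimizer`, the operator names, and the cost figures) are hypothetical and are not taken from any of the surveyed tools; a real optimizer would rely on much richer statistics and heuristics.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class StoreProfile:
    """Hypothetical per-store cost estimates (arbitrary units per row)."""
    name: str
    cost_per_row: Dict[str, float]  # operator name -> estimated cost per row


@dataclass
class FederationOptimizer:
    """Toy optimizer: picks the cheapest registered store for an operator."""
    stores: Dict[str, StoreProfile] = field(default_factory=dict)

    def join(self, profile: StoreProfile) -> None:
        # A store joining the federation extends the cost model.
        self.stores[profile.name] = profile

    def leave(self, name: str) -> None:
        # A store leaving the federation removes its entries from the cost model.
        self.stores.pop(name, None)

    def cheapest(self, operator: str, rows: int) -> Tuple[str, float]:
        # Consider only the stores that support the requested operator.
        candidates = [
            (profile.cost_per_row[operator] * rows, profile.name)
            for profile in self.stores.values()
            if operator in profile.cost_per_row
        ]
        if not candidates:
            raise LookupError(f"no store supports operator {operator!r}")
        cost, name = min(candidates)
        return name, cost


# Assumed example data: two stores with made-up cost figures.
optimizer = FederationOptimizer()
optimizer.join(StoreProfile("relational", {"join": 2.0, "scan": 1.0}))
optimizer.join(StoreProfile("graph", {"traverse": 0.5, "scan": 3.0}))

first_choice = optimizer.cheapest("scan", rows=1000)
optimizer.leave("relational")  # the plan must adapt once this store is gone
second_choice = optimizer.cheapest("scan", rows=1000)
```

In this sketch the scan is first routed to the relational store and, after that store leaves, to the graph store at a higher estimated cost. Keeping the cost model keyed by store is one simple design choice that makes joining and leaving cheap; it says nothing about how the estimates themselves would be obtained.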

5 CONCLUSION

Modern applications require the manipulation of

structured and unstructured data, usually in high vol-

ume, over distributed and heterogeneous data sources.

The pattern “one size ﬁts all” does not hold any-

more. Thus, innovative solutions capable of access-

ing and manipulating data in such an environment are

required.

This work presented the state of the art, detailed

solutions, their main components, how queries are

speciﬁed and executed, and other features. Afterward,

we presented guidelines and challenges the solutions

should address. Researchers and practitioners can use

our findings to focus their work.

As future work, we aim at evaluating the tools in

practice through case studies and experiments to identify

how well they meet the challenges and to surface new

open issues. We also intend to perform a systematic

review towards a broader analysis, improving the

overview presented in this work.

REFERENCES

Angles, R. and Gutierrez, C. (2008). Survey of graph

database models. ACM Comp. Surveys, 40(1):1–39.

Bondiombouy, C. and Valduriez, P. (2016). Query process-

ing in multistore systems: an overview. Research Re-

port RR-8890, INRIA.

Elmore, A., Duggan, J., Stonebraker, M., Balazinska, M.,

Çetintemel, U., Gadepally, V., Heer, J., Howe, B.,

Kepner, J., Kraska, T., et al. (2015). A demonstra-

tion of the bigdawg polystore system. Proceedings of

the VLDB Endowment, 8(12):1908–1911.

Guimarães, P. and Pereira, J. (2015). X-ray: Monitoring

and analysis of distributed database queries. In IFIP

International Conference on Distributed Applications

and Interoperable Systems, pages 80–93. Springer.

Gurusamy, V., Kannan, S., and Nandhini, K. (2017). The

Real Time Big Data Processing Framework: Advan-

tages and Limitations. Intl. Journal of Computer Sci-

ences and Engineering (JCSE), 5(12):305–312.

Halperin, D., Teixeira de Almeida, V., Choo, L. L., et al.

(2014). Demonstration of the myria big data manage-

ment service. In Proceedings of the 2014 ACM SIG-

MOD Intl. Conf. on Mngt. of Data, pages 881–884.

Haslhofer, B., Momeni Roochi, E., Schandl, B., and Zan-

der, S. (2011). Europeana rdf store report. Technical

report, University of Vienna.

Hausenblas, M. and Nadeau, J. (2013). Apache drill: in-

teractive ad-hoc analysis at scale. Big data, 1(2):100–

104.

Iancu, B. and Georgescu, T. M. (2018). Saving Large Se-

mantic Data in Cloud: A Survey of the Main DBaaS

Solutions. Informatica Economica, 22(1).

Karimov, J., Rabl, T., and Markl, V. (2018). Polybench: The

ﬁrst benchmark for polystores. In Technology Confer-

ence on Performance Evaluation and Benchmarking,

pages 24–41. Springer.

Kolev, B., Bondiombouy, C., Valduriez, P., Jiménez-Peris,

R., Pau, R., and Pereira, J. (2016a). The CloudMdsQL

Multistore System. In Proc. of Intl. Conf. on Manage-

ment of Data (SIGMOD’16), pages 2113–2116. ACM.

Kolev, B., Valduriez, P., Bondiombouy, C., Jiménez-Peris,

R., Pau, R., and Pereira, J. (2016b). CloudMd-

sQL: Querying Heterogeneous Cloud Data Stores

with a Common Language. Distributed and parallel

databases, 34(4):463–503.

Lenzerini, M. (2002). Data integration: A theoretical per-

spective. In Proceedings of the Twenty-First ACM

SIGMOD-SIGACT-SIGART Symposium on Principles

of Database Systems, pages 233–246. ACM.

Modoni, G. E., Sacco, M., and Terkaj, W. (2014). A survey

of rdf store solutions. In Intl. Conf. on Engineering,

Technology and Innovation (ICE), pages 1–7. IEEE.

Nayak, A., Poriya, A., and Poojary, D. (2013). Type of

NOSQL databases and its comparison with relational

databases. International Journal of Applied Informa-

tion Systems, 5(4):16–19.

Özsu, M. T. and Valduriez, P. (2020). Principles of distributed

database systems. Springer, 4th edition.

Sheth, A. P. and Larson, J. A. (1990). Federated database

systems for managing distributed, heterogeneous, and

autonomous databases. ACM Computing Surveys

(CSUR), 22(3):183–236.

Stonebraker, M. (2015). The case for polystore.

https://wp.sigmod.org/?p=1629.

Stonebraker, M., Bear, C., Çetintemel, U., Cherniack, M.,

Ge, T., Hachem, N., Harizopoulos, S., Lifter, J.,

Rogers, J., and Zdonik, S. (2007). One size ﬁts all?

part 2: Benchmarking results. In Proc. CIDR.

Tan, R., Chirkova, R., Gadepally, V., and Mattson, T. G.

(2017). Enabling query processing across heteroge-

neous data models: A survey. In IEEE Intl. Conf. on

Big Data (Big Data), pages 3211–3220. IEEE.

Zheng, Z., Wang, P., Liu, J., and Sun, S. (2015). Real-Time

Big Data Processing Framework: Challenges and So-

lutions. Applied Math. & Inf. Sciences, 9(6):3169.

Zulkeﬂi, N. S. S., Rahman, N. A., Bakar, Z. A., Nordin, S.,

Sembok, T. M. T., and Teo, N. H. I. (2013). Evalua-

tion of triple indices in retrieving web documents. In

Intl. Conf. on Advanced Computer Science Applica-

tions and Technologies (ACSAT), pages 525–529.
