Design of a Portable Programming Abstraction
for Data Transformations
Johannes Luong, Dirk Habich and Wolfgang Lehner
Database Systems Group, Technische Universität Dresden, 01062 Dresden, Germany
Keywords:
Database Programming Languages, System Integration, Parallel Programming Models, Data Analyses.
Abstract:
Novel data intensive applications and the diversification of data processing platforms have changed data man-
agement significantly over the last decade. In this changed environment, the expressiveness of the traditional
relational algebra is often insufficient and data management systems have started to provide more powerful
special purpose programming languages. However, these languages create a tight coupling between applica-
tions and specific systems that can hinder further development on both sides of the equation. The goal of this
article is to start a discussion on the future of platform independent programming models for data processing
that re-establish the separation of application logic and implementation details that used to be a cornerstone of
data management systems. As a guide for that discussion, we introduce several recent related works on that
topic and also outline our own contribution, the Analytical Calculus.
1 INTRODUCTION
Over the last decade, two important drivers of spe-
cialization and adaptation have shaped the landscape
of data processing. On the one hand, new data heavy
applications, such as advanced statistical analyses,
have created the need for new expressive program-
ming models that exceed the capabilities of the tradi-
tional relational algebra. On the other hand, nonfunc-
tional requirements on data processing, such as data
volume or low latency, have resulted in the creation
of a large number of special purpose systems and li-
braries that achieve very high performance in particu-
lar scenarios. In the early days of the “one size does not fit all” era (Stonebraker et al., 2007), those two
forces have been reconciled in a mostly ad-hoc fash-
ion where individual applications use low-level data
processing APIs to implement their business logic.
Unfortunately, this ad-hoc reconciliation has created a
strong binding between applications and special pur-
pose processing systems that can hinder further de-
velopment on both sides. Applications are bound to
specific system-level APIs and cannot be migrated to better technology without significant rewrites, and system providers have to guarantee backwards compatibility in order to maintain the good will of their
customers. Furthermore, system level programming
requires specialized knowledge and experience which
increases the cost of using these technologies.
The desire to provide an easy to use high level pro-
gramming interface and to separate application logic
from system level details is certainly well known in
the database community. In database systems these
problems have been solved with the widespread adop-
tion of SQL and the relational algebra as program-
ming abstraction. SQL is an abstract declarative lan-
guage with operations for the selection, combination,
filtering, and aggregation of relational datasets. The
semantics of these operations are defined on the ab-
stract data type Relation and SQL does not stipulate
any further nonfunctional constraints on possible im-
plementations. Database system providers have used
the strong physical abstraction provided by SQL to
create a diverse set of relational database management
systems with widely varying nonfunctional proper-
ties. Application developers write their business logic
using the relatively easy to use abstract SQL interface
and have the freedom to choose an adequate database
system later on.
The relational model provided an excellent solu-
tion until the first driver of specialization, a new set of popular data-heavy applications, created the
need for a more flexible and expressive approach to
data-oriented programming. In the big data commu-
nity, this need was first answered by flexible but sys-
tem specific low-level APIs. But once the analyses
of large datasets had become more widespread, the
drawbacks of system-level programming became apparent very soon, and popular systems, such as Apache Hadoop, Apache Spark, or TensorFlow, began to introduce higher-level languages, such as Pig, Mahout, or Spark MLlib, that make programming of these systems much more accessible.
Although these languages succeed in simplifying the
use of big data platforms, most of them are special
purpose solutions that bind applications to a particu-
lar system. Further, some of these languages still ex-
pose low level system details, such as explicit caching
of intermediate results, explicit repartitioning of data,
and so forth.
Recently, several authors have recognized and dis-
cussed the problems of system specific programming
models and leaky abstractions in data processing. Jen-
nie Duggan and colleagues (Duggan et al., 2015) dis-
cuss issues that arise when multiple independent data
processing systems have to be integrated to work on a
common goal. They propose the BigDAWG polystore
system which introduces a unified query interface and
organizes data movements between participating sys-
tems. Ionel Gog and colleagues (Gog et al., 2015) ob-
serve that several big data cluster processing systems
use quite similar internal program representations that
can be mapped onto each other in an automated fash-
ion. Based on this finding they are able to decouple
high-level frontend languages from the languages' native execution environments and make programs in those languages portable across all supported engines.
Shoumik Palkar and colleagues (Palkar et al., 2017;
Palkar et al., 2018) investigate a similar scenario
where an application uses multiple data processing
libraries and data movement between those libraries
becomes a bottleneck. They propose Weld, a low-
level in-memory storage and processing system that
libraries can use to access and share datasets. Hol-
ger Pirk and colleagues (Pirk et al., 2017) look at the problem from a slightly different angle and investigate
a lower level programming model that can be auto-
matically mapped to efficient multicore parallelism,
vector parallelism, as well as GPU programs. Despite
the more technical focus, they tackle the same issue
of decoupling data processing from low level system
details by introducing a language abstraction.
The goal of this paper is to start a discussion on decoupling data-centric applications from processing engines with the intention of achieving greater flexibility, portability, and adaptivity. With this goal in mind, in Section 2 we begin with a detailed discussion of
each of the previously mentioned papers. In Section 3
we provide an introduction to some of our own work
on this topic and in Section 4 we end the paper with a
short conclusion.
2 RELATED WORK
In recent years, several new flexible and portable
programming abstractions for data intensive appli-
cations have been published. In the following sec-
tions we are going to discuss four exemplary papers
that provide a good overview of the practical work
that has been done recently and that showcase im-
portant design issues of that space. Our selection is
not meant to be comprehensive and we focus exclu-
sively on practical approaches that try to introduce a
layer of abstraction between system specifics and ap-
plications. We also leave out some important works,
such as Alexandrov’s comprehension interface for
Apache Flink (Alexandrov et al., 2015) or Microsoft’s
LINQ (Meijer and Bierman, 2011; Yu et al., 2008),
because they do not raise significant additional points
with regard to our focus.
2.1 BigDAWG
Jennie Duggan and colleagues (Duggan et al., 2015)
motivate their BigDAWG system with an interesting
application scenario that encompasses the use and in-
tegration of several specialized data types, such as
waveform data, plain text, structured records, and
semi structured documents. They argue that each
of these datatypes should be stored in a specialized
database system to benefit from the superior perfor-
mance and query interface that a dedicated system can
provide. However, effectively accessing four differ-
ent database systems increases application complex-
ity and achieving overall high performance might en-
tail a significant effort in application specific tweak-
ing and optimization.
To remedy these issues, Duggan et al. propose the
BigDAWG polystore system. Similar to a federated
database, BigDAWG provides a unified access inter-
face to multiple database systems. But, in contrast to
the federated systems from the 80s and 90s (Chawathe
et al., 1994; Carey et al., 1995; Stonebraker et al.,
1996), BigDAWG supports several data types and
query languages instead of just the relational model.
A given data model and its corresponding query lan-
guage, such as relations and SQL, forms a so-called
data lake and each data lake can be backed by multi-
ple database engines that implement that model. For
each data lake, BigDAWG defines a canonical variant
of the corresponding query language, such as a canonical SQL variant. To integrate with
BigDAWG, engines have to provide a shim program
that can translate a data lake’s canonical language
variant into the engine’s own query language. For
example, a particular relational database system has
to be able to translate BigDAWG's SQL into its own SQL dialect. Users can combine queries
of different data lakes using special SCOPE and CAST
operators. The authors provide the following example
to showcase this feature:
RELATIONAL(
  SELECT *
  FROM R, CAST(A, relation)
  WHERE R.v = A.v
);
In this case, RELATIONAL is a scope operator that annotates the following program as an SQL query and CAST
converts an array A into a relation so it can be used in
the relational context. Unfortunately, the authors do
not provide any more comprehensive code examples
which makes it hard to judge the convenience of com-
bining data lakes in a real world setting.
BigDAWG chooses a mostly hands-off approach to
query languages. For the most part, it reuses exist-
ing database languages and simply forwards queries
to implementations of the respective data lakes. This
approach has several benefits that, in theory, make
BigDAWG easy to adopt and easy to extend. First,
users can easily pick up the system and integrate it
with their existing software as they don’t have to
adopt a new language. Second, the reuse of existing
languages allows BigDAWG to rely on the query op-
timization that is provided by many existing database
engines. Third, in principle, the hands-off approach
also facilitates the straightforward incorporation of
a broad set of data lakes into the BigDAWG poly-
store, because interactions between lakes are mini-
mal. However, each data lake still requires a canon-
ical query language and a shared data format. The
canonical language acts as a common denominator
for all possible implementations and therefore has to
be conservative with language features. This might
especially hinder the integration of data lakes that do
not offer a widely accepted query language, such as
vector query languages or flexible UDF oriented in-
terfaces.
2.2 Musketeer
Ionel Gog and colleagues (Gog et al., 2015) inves-
tigate big data cluster processing engines, such as
Hadoop, Spark, or Naiad (Murray et al., 2013), and
the high-level query languages that these engines pro-
vide. The authors find that different engines show
widely different performance characteristics for the
same workload and conclude that it is advisable to
choose an engine based on the specific task at hand.
Further, the authors also find that the engine specific
high-level languages for relational and graph work-
loads create a certain lock-in effect that prevents users from switching engines once they have mastered their languages.
Gog et al. solve these issues with their Musketeer
system which decouples high-level languages from
their native runtimes and can automatically select
an adequate processing engine for a given workload.
Conceptually, Musketeer implements a modern com-
piler architecture where a set of frontend languages
are translated into a common internal representation
and the internal representation is compiled into ex-
ecutable code for a particular runtime environment.
But instead of generating executables, Musketeer pro-
duces workloads for big data processing engines and
is even able to split a program into partial workloads
that are executed on different engines. The internal
representation is a data flow language that is espe-
cially designed for parallel data processing. The data
flow language provides a set of typical data-parallel
operators, such as MAP, GROUP BY, JOIN, and AGG,
but also a dynamic WHILE operator for iterative algo-
rithms.
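To make the shape of such an internal representation concrete, the following sketch models a Musketeer-style data flow program as a small algebraic data type in Scala. The operator names follow the description above; the types and the example plan are illustrative assumptions, not Musketeer's actual implementation.

sealed trait DataFlow
case class Input(name: String)                               extends DataFlow
case class MapOp(in: DataFlow, f: String)                    extends DataFlow
case class GroupBy(in: DataFlow, key: String)                extends DataFlow
case class Join(left: DataFlow, right: DataFlow, on: String) extends DataFlow
case class Agg(in: DataFlow, op: String)                     extends DataFlow
case class While(cond: String, body: DataFlow)               extends DataFlow

object PlanDemo extends App {
  // a relational query such as
  //   SELECT r.key, SUM(s.value) FROM r JOIN s ON r.key = s.key GROUP BY r.key
  // maps almost one-to-one onto the data-parallel operators
  val plan: DataFlow =
    Agg(GroupBy(Join(Input("r"), Input("s"), on = "key"), key = "key"), op = "sum(value)")
  println(plan)
}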
Musketeer provides parsers for Hive SQL, Lindi (Murray et al., 2013), and the authors' own BEER DSL that supports relational and graph processing. It can create workflows for Hadoop, Spark,
Naiad (Murray et al., 2013), PowerGraph (Gonzalez
et al., 2012), GraphChi (Kyrola et al., 2012), and
Metis (Mao et al., 2010) and for testing purposes it
can also generate serial C code. When Musketeer re-
ceives a new program it first translates it, using the
adequate frontend, into the data flow representation.
SQL like languages can be easily mapped to the in-
ternal representation because the relational algebra is
a data flow language with data parallel operators it-
self. Moreover, Musketeer provides operators, such
as GROUP BY, JOIN, or AGG, which more or less di-
rectly implement relational semantics. Graph pro-
cessing, on the other hand, has a somewhat different
processing model, making the translation more com-
plicated, and the resulting intermediate representation bears little resemblance to the input program. In
general, there are several different graph processing
models but in this article the authors only consider
the popular Gather Apply Scatter model that is also
used by systems like Pregel (Malewicz et al., 2010).
In this model, processing happens as a sequence of
uniform steps and each step consists of three phases:
1) each node of the graph gathers incoming messages
from its neighbours, 2) each node updates its inter-
nal state using an update function, and 3) each node
scatters outgoing messages to its neighbours. The au-
thors do not go into detail on how they translate graph
programs but one possible approach would create an
internal data flow program that groups messages by
node IDs, joins the message groups with node states,
and maps over each (messages, state) tuple to gener-
ate new messages and node states for the next pro-
cessing step.
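The following Scala sketch illustrates this possible translation of one Gather Apply Scatter superstep into generic data-parallel operations (group messages by node ID, join them with the node state, map to new states and messages). The record types and the signature of the update function are illustrative assumptions of ours, not Musketeer code.

case class Message(to: Long, payload: Double)
case class NodeState(id: Long, value: Double)

// one superstep: gather messages per node, apply the update function, scatter new messages
def superstep(
    states: Seq[NodeState],
    messages: Seq[Message],
    update: (NodeState, Seq[Double]) => (NodeState, Seq[Message])
): (Seq[NodeState], Seq[Message]) = {
  // gather: group incoming messages by destination node id (a GROUP BY in the data flow model)
  val inbox: Map[Long, Seq[Double]] =
    messages.groupBy(_.to).map { case (id, ms) => id -> ms.map(_.payload) }
  // apply + scatter: join each node state with its inbox (a JOIN) and run the update function (a MAP)
  val stepped = states.map(s => update(s, inbox.getOrElse(s.id, Seq.empty)))
  (stepped.map(_._1), stepped.flatMap(_._2))
}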
A similar mismatch between processing mod-
els arises at the other end of the translation when
internal programs are mapped to executable work-
loads. Processing engines, such as Hadoop, Spark, or Naiad, use parallel data flow languages themselves, and Musketeer operators can be mapped to those languages in a straightforward manner (with the exception of the WHILE operator, which sometimes has to be handled in an external driver program). Dedicated
graph engines, on the other hand, usually implement a graph-specific processing model like Gather Apply Scatter, which can neither express all Musketeer data flow programs in a straightforward manner nor efficiently execute those programs. The authors solve
this issue by introducing the notion of code idioms.
A code idiom is a certain well known data flow pat-
tern that can be easily recognized by program analy-
ses and mapped to a different processing model like
Gather Apply Scatter. To make this approach work,
frontends have to be careful to actually generate the
appropriate data flow patterns, otherwise the backend
will not be able to detect the idioms and cannot target
graph engines.
In contrast to BigDAWG, Musketeer does not
choose a hands-off approach to query languages but de-
fines its own internal processing model. The data flow
model is broadly applicable and can express typical
query languages in a natural way. However, the model
is also more generic than some of the processing en-
gines that the authors want to address. This makes it
necessary to add a meta language of idioms which can
capture the semantics of certain compositions of data
flow operators.
2.3 Weld
Shoumik Palkar and colleagues (Palkar et al., 2017;
Palkar et al., 2018) investigate modern analytics ap-
plications and how they make use of external libraries, such as Pandas or NumPy. In contrast to database
systems that usually apply sophisticated query opti-
mization, these libraries are often not able to perform
any kind of cross function optimizations, especially
if the functions belong to different libraries. To im-
prove the performance of analytics libraries, Palkar et
al. propose Weld, an in-memory data store for shared
memory systems that offers a flexible processing ori-
ented query interface. If libraries are adapted to use
Weld as storage and low-level processing backend, the
system can perform important optimizations, such as
pipelining or loop fusion, across function calls and
across libraries. To achieve this cross function be-
haviour, Weld implements a lazy query evaluation ap-
proach, where computations are only executed once
their results are actually required. Applications and
libraries use the Weld runtime API to allocate mem-
ory objects and to perform data parallel operations on
those objects. The result of these operations is an ab-
stract handle that can be passed to the host applica-
tion and subsequently to other Weld enabled libraries.
Only when a library has to return an actual value will
it force evaluation of the handle and the Weld runtime
can choose an optimized plan to perform the requested
transformations.
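The following Scala fragment is a minimal sketch of this lazy-handle idea, assuming a hypothetical Handle type: operations only record what to compute, and nothing runs until result() is called, which is the point where a runtime such as Weld could pick an optimized plan. It illustrates the evaluation strategy only and is not Weld's actual API.

// each operation returns a new handle that records the pending computation
final class Handle[A] private (thunk: () => Vector[A]) {
  def map[B](f: A => B): Handle[B] = new Handle(() => thunk().map(f))
  def filter(p: A => Boolean): Handle[A] = new Handle(() => thunk().filter(p))
  // evaluation is forced only here; a real runtime would optimize the recorded plan first
  def result(): Vector[A] = thunk()
}
object Handle {
  def source[A](data: Vector[A]): Handle[A] = new Handle(() => data)
}

object LazyDemo extends App {
  val handle = Handle.source(Vector(1, 2, 3)).map(_ * 2).filter(_ > 2) // nothing is computed yet
  println(handle.result())                                             // Vector(4, 6)
}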
Weld offers a flexible functional query interface
that revolves around parallel loops and a set of
builders. The data model includes primitive scalars,
structures, vectors, and dictionaries. Loops are used
to process individual elements of vectors or dictionar-
ies and builders are used to aggregate values. List-
ing 1 shows a basic sample query. Some of the
builders include:
- appender[T], which appends values of type T to a vector of type vector[T]
- merger[T, func, id], which aggregates values of type T using an associative function (T, T) → T and an identity value
- vecmerger[T, func], which inserts values of type (Int, T) into a vector at a specific position, using func to merge the new value with the previous value at that position.
Weld makes the bold choice to ignore existing query
languages and instead proposes its own flexible lan-
guage that has close ties to monad and monoid com-
prehensions (Grust, 2003; Fegaras and Maier, 1995).
In their 2018 paper (Palkar et al., 2018), the authors
demonstrate that Weld can be integrated into popular
libraries, handle real world application scenarios, and
significantly improve overall performance in many
cases.
Listing 1: Basic Weld example.
// vector literal
in := [1, 2, 3];
// appender creates a new vector
a := appender[int];
// for returns its builder
// merge inserts into a builder
q := for(in, a,
  (in, v) => merge(in, v * v)
);
// returns [1, 4, 9]
result(q);
In contrast to the previously discussed papers, Weld binds its programming model to a particular execution environment, which limits the portability of
applications that decide to use the system. However,
shared memory performance is an important use case
that is often overlooked in the design of modern an-
alytics programming models and the authors clearly
demonstrate the importance of optimized low-level
machine access. What is more, we have good reasons
to believe that the Weld programming model could be
ported to other data engines without too much effort
as well.
2.4 Voodoo
Similar to Weld, Holger Pirk and colleagues (Pirk
et al., 2017) investigate a programming model for
data intensive processing on shared memory systems.
However, their work is less focused on creating a con-
venient processing environment for libraries and end
users, but rather on providing a portable way to ad-
dress different types of hardware parallelism, such as
multi core parallelism, SIMD vector instructions, or
GPU programming environments. Voodoo is a pro-
gramming environment and runtime system that al-
lows users to easily tune their programs for different
hardware scenarios. That is, users define their appli-
cation logic in an abstract data flow language but an-
notate the abstract logic with additional information
on how the data is to be partitioned and distributed
at runtime. These annotations are used by Voodoo in a predictable manner to decide whether an operator should use a multithreaded implementation, a SIMD
vector implementation, or a GPU implementation and
how each of these possibilities is configured in detail.
In contrast to the previously discussed models, this
approach gives users fine grained control over how
their applications are executed.
Listing 2 shows a basic Voodoo example that calculates the sum of a vector of doubles. A program in a typical data flow language would probably consist only of the first and last line of the program and leave it to a runtime or compiler to decide how best to compute the sum over a vector. Voodoo, however, offers much more explicit control over the desired execution strategy. In particular, the input data is first partitioned into batches of size 1024 that are aggregated independently before computing the overall result. This particular configuration results in a multithreaded execution strategy where Voodoo assigns data batches to a set of worker threads. However, by simply changing the calculation of the IDs to
// assign values to 2 SIMD lanes
IDs := range(data) % 2
the execution strategy can be changed into a SIMD based implementation, where four values are added in parallel using vector instructions.
Listing 2: Basic Voodoo example.
data := load("DoubleVec")
// create batches of size 1024
IDs := range(data) / 1024
positions := partition(IDs)
partitions := scatter(
  zip(data, IDs),
  positions
)
// compute the sum of each batch
partialSum := foldSum(
  partitions.val,
  partitions.id
)
// compute the overall sum
sum := foldSum(partialSum)
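To connect the annotations with their effect, the following plain Scala program mimics the strategy that Listing 2 encodes: each element gets a batch ID, batches are summed independently, and the partial sums are folded into the final result. This is only an illustration of the execution strategy that Voodoo derives from the IDs; the names and the batch size are taken from the listing, everything else is our own sketch.

object BatchedSum extends App {
  val data: Vector[Double] = Vector.tabulate(10000)(_.toDouble)

  // IDs := range(data) / 1024  -- one batch id per element
  val ids: Vector[Int] = data.indices.map(_ / 1024).toVector

  // partition/scatter: group the values by their batch id
  val partitions: Map[Int, Vector[Double]] =
    data.zip(ids).groupBy(_._2).map { case (id, vs) => id -> vs.map(_._1) }

  // foldSum per batch; a runtime would hand these batches to worker threads or SIMD lanes
  val partialSums: Iterable[Double] = partitions.values.map(_.sum)

  // foldSum over the partial sums
  println(partialSums.sum) // 49995000.0
}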
Voodoo offers an interesting approach to applica-
tion portability with regards to low-level parallelism.
The system makes it easy to explore implementation
variants and facilitates hardware specific tuning of
applications. Similar to GPU programming models,
Voodoo code is supposed to be compiled lazily, just
in time when the results are needed. This approach
enables very dynamic changes to execution strategies
even during runtime of an application. Unfortunately,
the authors limit their discussion to the implementa-
tion of traditional relational databases and do not ex-
plore richer semantics at the moment. We believe
that the approach could be of good use in a broader
environment as well.
3 THE ANALYTICAL CALCULUS
The separation of logic and execution in data inten-
sive applications is an important theme in our own re-
search as well. In a recent paper (Luong et al., 2017),
we have introduced the Analytical Calculus which is
our own proposal for a flexible, rich, and portable pro-
gramming interface for data analytics. The Analytical
Calculus is a lightweight, purely functional language
with a small core library of abstract data types for
parallelized data processing. The language has a sim-
ple static type system and contains few constructs be-
sides functions and function application. In the cur-
rent version, recursion is prohibited but we plan to
enable certain important recursion patterns (Meijer
et al., 1991) in future versions. A deliberate choice of
core types and the explicit support of domain specific
concepts allow the Analytical Calculus to support a
wide range of application scenarios. At the same time
these properties also allow the Analytical Calculus
to drive a large number of execution strategies that
scale from fast shared memory systems, such as Weld,
to large systems of systems, such as BigDAWG. The
Analytical Calculus enables adaptivity in data man-
agement by separating application logic from appli-
cation execution and by facilitating flexibility on both
sides of that separation. This flexibility can be ex-
ploited, for example, by a dynamic runtime system
that quickly adapts physical execution strategies to
changed workloads or data characteristics.
Similar to the relational algebra, the Analytical
Calculus is not meant to be written manually, but acts
as an intermediate representation that is generated by
high-level frontend languages, such as SQL, and that
is consumed by a runtime that translates the abstract
statements into physical operations. The lack of side
effects, the static type system, and the structured re-
cursion simplify code analyses and transformations
to a great degree and therefore make the Analytical
Calculus a very good fit for a flexible internal pro-
gram representation. Another similarity to the rela-
tional algebra is that the Analytical Calculus is not de-
signed for a specific runtime system but is supposed to
establish a general programming abstraction for data
processing that can be implemented by various run-
times. For example, we are currently developing a hy-
brid runtime that uses the Analytical Calculus to drive
a system that integrates a PostgreSQL database, a MongoDB database, and a Spark cluster. The
Analytical Calculus is not designed for a particular
frontend language but can serve as intermediate lan-
guage for a wide set of purposes. However, we are in
the process of developing the ACQL query language, a superset of SQL that will contain extensions for linear algebra. ACQL is the first language for the Analytical Calculus and will help us understand the capabilities and limits of our model.
⟨BagUnion⟩( λ c n. {(c.name, c.phone, n.name)} |
    c ← customer,
    n ← nation,
    λ c n. c.nationkey = n.key )
with BagUnion := ( Bags, ⊎, ∅ )
Figure 1: Join with a monoid comprehension.
3.1 Monoid Comprehensions
The Analytical Calculus uses monoid comprehen-
sions (Fegaras and Maier, 1995) as core computa-
tional concept to perform tasks such as filtering, trans-
forming, aggregating, and grouping values and to
build joins. Figure 1 shows how a monoid com-
prehension can be used to perform a natural equiv-
alence join and subsequent projection over two rela-
tions. The definition consists of two main parts: (i) a
monoid and (ii) a comprehension over that monoid.
A monoid is an algebraic structure that consists of a
set, an associative binary operation over that set, and a
neutral element for the operation. In the example, the
BagUnion monoid consists of (i) the set of all finite
bags, i.e. multisets, (ii) the bag union operation, and
(iii) the empty bag. A comprehension over a monoid
is a function that uses one or several datasources to
generate an element of its monoid. For example, the
comprehension in Figure 1 uses the two datasources customer and nation to create a bag of tuples. Com-
prehensions consist of two parts that are separated by
a vertical bar: the head and the tail. The tail is a se-
quence of bindings that bind the elements of a data-
source to a variable and filters that accept or discard
the bindings left of the filter. A tail that contains mul-
tiple bindings computes the cross product of all bound
sources. The head is a function over the comprehen-
sion’s bindings that returns an element of the com-
prehension’s monoid. For example, the head function
in Figure 1 returns a bag that contains an individual
result tuple. The head is applied to each sequence
of bindings that is not discarded by one of the filters
and all head results are eventually combined using the
monoid’s binary operation.
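As a concrete illustration, the following Scala sketch implements a comprehension over two sources and an arbitrary monoid exactly as described above and uses it to reproduce the join of Figure 1. The record types and field values are illustrative assumptions of ours; the Analytical Calculus itself is not embedded in Scala.

case class Customer(name: String, phone: String, nationkey: Int)
case class Nation(key: Int, name: String)

// comprehension over two sources with one filter, for an arbitrary monoid (zero, combine):
// the head is applied to every surviving combination of bindings and the results are folded
// with the monoid's binary operation
def comprehend2[A, B, M](as: Seq[A], bs: Seq[B])(filter: (A, B) => Boolean)
                        (head: (A, B) => M)(zero: M)(combine: (M, M) => M): M =
  as.foldLeft(zero) { (acc, a) =>
    bs.foldLeft(acc) { (acc2, b) =>
      if (filter(a, b)) combine(acc2, head(a, b)) else acc2
    }
  }

object JoinDemo extends App {
  val customer = Seq(Customer("Ada", "555-0100", 1), Customer("Bob", "555-0199", 2))
  val nation   = Seq(Nation(1, "France"), Nation(2, "Japan"))

  // Figure 1: BagUnion monoid (a List stands in for a bag), head builds a singleton bag
  val joined = comprehend2(customer, nation)((c, n) => c.nationkey == n.key) {
    (c, n) => List((c.name, c.phone, n.name))
  }(List.empty[(String, String, String)])(_ ++ _)

  println(joined) // List((Ada,555-0100,France), (Bob,555-0199,Japan))
}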
Figure 2 shows an example of a monoid compre-
hension that is defined over the set of integers with addition as operation and zero as neutral element.
⟨Sum⟩( λ n d. n / d |
    n ← [1, 2, 3, 4, 5, 6, 7, 8],
    d ← [2, 3, 4],
    λ n d. n mod d = 0 )
with Sum := ( ℕ, +, 0 )
Figure 2: Aggregation using a comprehension.
The comprehension's tail contains the two bindings n and d, which generate the cross product of the two
sources: (1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4), . . . ,
(8, 2), (8, 3), (8, 4). The filter accepts all bindings
where the second number is a divisor of the first num-
ber, such as (2, 2), (3, 3), (4, 2), or (4, 4). The head
simply returns the fraction of first and second number
for each accepted binding tuple and the comprehension builds the sum of all head results:
2/2 + 3/3 + 4/2 + 4/4 + . . . + 8/2 + 8/4
This example demonstrates how different
monoids can be used to produce various result types.
Even greater expressiveness can be achieved by nesting comprehensions, which makes it possible, for example, to build groups and outer joins.
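It is perhaps worth noting that mainstream languages offer a very similar surface syntax. The following standalone Scala for-comprehension restates the Figure 2 aggregation over the Sum monoid and can serve as a quick sanity check of the semantics described above.

object SumDemo extends App {
  // cross product of the two sources, filter to divisor pairs, head n / d, fold with +
  val result = (for {
    n <- 1 to 8
    d <- 2 to 4
    if n % d == 0
  } yield n / d).sum
  println(result) // 2/2 + 3/3 + 4/2 + 4/4 + 6/2 + 6/3 + 8/2 + 8/4 = 16
}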
We are, of course, not the first to promote the
use of comprehensions in data processing systems.
Authors like Grust (Grust, 2003) or Fegaras and
Maier (Fegaras and Maier, 1995) have pointed out
the theoretical benefits of comprehensions, such as
expressiveness and strong support of optimizations,
a long time ago. Microsoft’s language integrated
queries (Meijer and Bierman, 2011) provide an im-
plementation of comprehensions in a mainstream pro-
gramming language that can be used to access a vari-
ety of data sources in a convenient manner. More re-
cently, Alexandrov and colleagues (Alexandrov et al.,
2015) have demonstrated that comprehensions can be
nicely mapped to the data flow model of the big data
stream processing system Apache Flink, and in Sec-
tion 2 we have discussed the Weld system which uses
a language that is closely related to comprehensions
to drive a shared memory system and applies a va-
riety of important optimizations, such as pipelining
and loop fusion, to achieve very good performance
in that environment. In summary, we are very confi-
dent in the usefulness of comprehensions as a flexi-
ble and widely applicable basic building block of the
Analytical Calculus.
3.2 Explicit Domain Representation
A monoid comprehension is a generic transformation
operator that can be used to express a wide array
of important data processing functions. For exam-
ple, the entirety of the relational algebra can be eas-
ily mapped onto comprehensions and the same is true
for many basic operations of the linear algebra. Even
basic path matching in graphs can be achieved using
comprehensions. However, during this mapping from
a special purpose model to the more generic com-
prehensions, some information can be lost or obfus-
cated. For example, it might be easy to map a graph
analysis into a comprehension representation, but it is
much less obvious how to reverse this mapping and
decide whether a sequence of comprehensions repre-
sents a graph analysis. However, this reverse trans-
formation can be useful for several reasons. First,
many special purpose domains, such as the relational
algebra, the linear algebra, graph analysis, statistical
analysis, and so forth, define domain specific opti-
mization rules that can not be applied conveniently in
the comprehension representation. Therefore it would
be beneficial to reverse the comprehension mapping
and apply the optimizations in the original represen-
tation. Second, for some of these special purpose do-
mains there are dedicated processing systems with op-
timized support for the particular domain, such as re-
lational database systems or graph engines. The goal
of the Analytical Calculus is to be usable as common
intermediate language for all data intensive process-
ing and it should therefore be able to drive these spe-
cial purpose systems by reversing the mapping of do-
main logic to monoid comprehensions.
We have already encountered this issue in our discussion of Musketeer in Section 2. Gog and colleagues (Gog et al., 2015) want to use a generic data flow model to drive special purpose graph processing engines and rely on implicit code idioms to enable the necessary reverse mapping. However, in contrast to the implicit idioms of Musketeer, we decided to give domain specific functions an explicit representation
in the Analytical Calculus. For this purpose, we add a
set of domain libraries to the Analytical Calculus that
capture domain specific concepts with a set of well
known functions. These functions, such as Select,
Filter, NaturalJoin, or GroupBy, are implemented us-
ing ordinary language elements of the core Analytical
Calculus libraries, but their names are visible in the
program code and can be used to create domain spe-
cific behaviour in optimizations or code generation.
In contrast to implicit idioms, these explicit domain
functions can give guarantees. For example, the Fil-
ter function of the relational library can guarantee
that the provided predicate definition can be translated
into a SQL WHERE statement and reject incompatible predicates at compile time.
Listing 3 shows a simplified version of the relational filter function of the Analytical Calculus.
Listing 3: The Analytical Calculus filter function.
filter(relation: BagT, pred: FuncT) {
  comprehension(
    relation, // binding
    pred,     // filter
    bagOf,    // head
    bagUnion, // monoid operation
    bagEmpty  // monoid identity
  )
}
main() {
  pred = Func(row: RowT) {
    qty = row(quantity)
    gt100 = qty > 100
    prc = row(price)
    gt25 = prc > 25.0
    gt100 && gt25
  }
  li = catalog.sym(lineitem)
  filter(li, pred)
}
The function takes a target relation and a predicate
function as arguments and simply forwards those pa-
rameters to the comprehension function. The relation is used as the only binding of the comprehension and
the predicate function is used as comprehension filter.
The bagOf constructor provides the comprehension's
head function and the bag union and empty bag con-
structor define the comprehension’s monoid. In the
main function, filter is used to select lineitems that
have a quantity greater than 100 and a price greater
than 25. In itself, the filter function is rather unre-
markable. However, the significance of the function
lies in its well-known name “filter”, which is visible to analyses, transformations, and code generators.
Using that name, it is relatively easy to provide an
analysis that checks whether predicate functions can
actually be translated into SQL or not.
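The following Scala sketch illustrates the kind of analysis that such a well-known name makes possible. The tiny expression IR, the node names, and the set of SQL-translatable constructs are hypothetical and only stand in for the corresponding parts of the Analytical Calculus.

sealed trait Expr
case class Call(name: String, args: List[Expr])     extends Expr
case class Lambda(param: String, body: Expr)        extends Expr
case class Column(name: String)                     extends Expr
case class Lit(value: Double)                       extends Expr
case class Cmp(op: String, lhs: Expr, rhs: Expr)    extends Expr
case class BoolOp(op: String, lhs: Expr, rhs: Expr) extends Expr

// a predicate translates to a SQL WHERE clause if it only uses columns, literals,
// comparisons, and boolean connectives
def sqlTranslatable(e: Expr): Boolean = e match {
  case Column(_) | Lit(_) => true
  case Cmp(_, l, r)       => sqlTranslatable(l) && sqlTranslatable(r)
  case BoolOp(_, l, r)    => sqlTranslatable(l) && sqlTranslatable(r)
  case _                  => false
}

// the explicit domain function is recognized purely by its well-known name
def filterPushableToSql(e: Expr): Boolean = e match {
  case Call("filter", List(_, Lambda(_, body))) => sqlTranslatable(body)
  case _                                        => false
}

object CheckDemo extends App {
  val pred = Lambda("row",
    BoolOp("&&",
      Cmp(">", Column("quantity"), Lit(100)),
      Cmp(">", Column("price"), Lit(25.0))))
  println(filterPushableToSql(Call("filter", List(Column("lineitem"), pred)))) // true
}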
4 CONCLUSIONS
Over the last decade, novel data intensive applications
and the need for scalable and very fast hardware ar-
chitectures have reshaped the landscape of data pro-
cessing. At the beginning of this transition, appli-
cations and processing engines were closely coupled
into expensive single purpose solutions. Over time,
more accessible big data systems started to emerge
and these systems often provide their own dedicated
programming languages to simplify application de-
velopment. However, these languages create a tight
coupling between application and processing system
that can hinder further development of both applica-
tions and processing engines.
The goal of this article is to start a discussion
on the future of processing models for data inten-
sive applications. In the first part of the article
we provided an in-depth look at four recent related
works: BigDAWG, Musketeer, Weld, and Voodoo. The
BigDAWG polystore system integrates a set of dedi-
cated data processing engines behind a unified query
interface that mostly reuses existing query languages
and their optimizers. Musketeer defines a unified data
flow language that can be used to decouple special
purpose languages from their native processing en-
gines. Weld is a data processing engine for shared
memory systems that can be used by data analytics li-
braries to coordinate and optimize computations and
memory access. Voodoo is an abstract data pro-
cessing language and code generator that gives users
the ability to easily access different types of hardware
parallelism.
In the second part of the article, we have outlined
the Analytical Calculus, our own proposal for a mod-
ern programming model for data processing. The
Analytical Calculus is a small functional language
that uses monoid comprehensions as primary compu-
tational abstraction. The Calculus is used as an in-
termediate language that can be used as translation
target of high-level frontend languages, such as SQL,
and that can drive a wide array of data processing run-
times. The Analytical Calculus contains domain spe-
cific libraries with well known function names to fa-
cilitate domain specific optimizations and to enable
code generation for special purpose data processing
systems, such as RDBMS.
ACKNOWLEDGMENTS
The authors would like to thank the German Federal
Ministry of Education and Research (BMBF) for the
opportunity to do research in the VAVID project under
grant 01IS14005.
REFERENCES
Alexandrov, A., Thamsen, L., Kunft, A., Kao, O., Katsi-
fodimos, A., Herb, T., and Markl, V. (2015). Implicit
Parallelism through Deep Language Embedding. Pro-
ceedings of the 2015 ACM SIGMOD International
Conference on Management of Data, pages 47–61.
Carey, M. J., Haas, L. M., Schwarz, P. M., Arya, M., Cody,
W. F., Fagin, R., Flickner, M., Luniewski, A. W.,
Niblack, W., Petkovic, D., et al. (1995). Towards
heterogeneous multimedia information systems: The
garlic approach. In Research Issues in Data Engi-
neering, 1995: Distributed Object Management, Pro-
ceedings. RIDE-DOM’95. Fifth International Work-
shop on, pages 124–131. IEEE.
Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y., Ullman, J., and Widom, J. (1994). The TSIMMIS project: Integration of heterogeneous information sources.
Duggan, J., Elmore, A., Stonebraker, M., Balazinska, M.,
Howe, B., Kepner, J., Madden, S., Maier, D., Matt-
son, T., and Zdonik, S. (2015). The BigDAWG Poly-
store System. ACM SIGMOD Record, 44(2):11–16.
Fegaras, L. and Maier, D. (1995). Towards an effective cal-
culus for object query languages. In Proceedings of
the 1995 ACM SIGMOD international conference on
Management of data, pages 47–58.
Gog, I., Schwarzkopf, M., Crooks, N., Grosvenor, M. P.,
Clement, A., and Hand, S. (2015). Musketeer: all
for one, one for all in data processing systems. Eu-
roSys’15, pages 1–16.
Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin,
C. (2012). Powergraph: distributed graph-parallel
computation on natural graphs. In OSDI, volume 12,
page 2.
Grust, T. (2003). Monad Comprehensions: A Versatile Rep-
resentation for Queries. The Functional Approach to
Data Management, pages 288–311.
Kyrola, A., Blelloch, G. E., and Guestrin, C. (2012).
Graphchi: Large-scale graph computation on just a pc.
USENIX.
Luong, J., Habich, D., and Lehner, W. (2017). AL: Unified
Analytics in Domain Specific Terms.
Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C.,
Horn, I., Leiser, N., and Czajkowski, G. (2010).
Pregel: a system for large-scale graph processing. In
Proceedings of the 2010 ACM SIGMOD International
Conference on Management of data, pages 135–146.
ACM.
Mao, Y., Morris, R., and Kaashoek, M. F. (2010). Op-
timizing mapreduce for multicore architectures. In
Computer Science and Artificial Intelligence Labora-
tory, Massachusetts Institute of Technology, Tech. Rep.
Citeseer.
Meijer, E. and Bierman, G. (2011). A co-relational model
of data for large shared data banks. Communications
of the ACM, 54(4):49.
Meijer, E., Fokkinga, M., and Paterson, R. (1991). Func-
tional programming with bananas, lenses, envelopes
and barbed wire. pages 124–144.
Murray, D. G., McSherry, F., Isaacs, R., Isard, M., Barham,
P., and Abadi, M. (2013). Naiad: a timely dataflow
system. In Proceedings of the Twenty-Fourth ACM
Symposium on Operating Systems Principles, pages
439–455. ACM.
Palkar, S., Thomas, J., Narayanan, D., Thaker, P., Palamuttam, R., Negi, P., Shanbhag, A., Schwarzkopf, M., Pirk, H., Amarasinghe, S., Madden, S., and Zaharia, M. (2018). Evaluating End-to-End Optimization for Data Analytics Applications in Weld. PVLDB, 11(9):1002–1015.
Palkar, S., Thomas, J. J., and Shanbhag, A. (2017). Weld:
A common runtime for high performance data analyt-
ics. Conference on Innovative Data Systems Research
(CIDR).
Pirk, H., Moll, O., Zaharia, M., and Madden, S. (2017).
Voodoo - A Vector Algebra for Portable Database Per-
formance on Modern Hardware.
Stonebraker, M., Aoki, P. M., Litwin, W., Pfeffer, A., Sah,
A., Sidell, J., Staelin, C., and Yu, A. (1996). Mari-
posa: a wide-area distributed database system. The
VLDB Journal—The International Journal on Very
Large Data Bases, 5(1):48–63.
Stonebraker, M., Madden, S., Abadi, D. J., Harizopoulos,
S., Hachem, N., and Helland, P. (2007). The End of an
Architectural Era (It’s Time for a Complete Rewrite).
VLDB, 12(2):1150–1160.
Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P. K., and Currey, J. (2008). DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Proceedings of the 8th USENIX conference on Operating systems design and implementation, pages 1–14.