ONE-TO-MANY DATA TRANSFORMATIONS
As Relational Operations

Paulo Carreira

Faculty of Sciences, University of Lisbon, C6 - Piso 3, 1749-016 Lisboa, Portugal

Keywords:

Data Warehousing, Data Cleaning, Data Integration, ETL, Query optimization.

Abstract:

Transforming data is a fundamental operation in data management activities like data integration, legacy data migration, data cleaning, and extract-transform-load processes for data warehousing. Since data often resides in relational databases, data transformations are often implemented as relational queries that aim at leveraging the optimization capabilities of most RDBMSs.

However, due to the limited expressive power of Relational Algebra, several important classes of data transformations cannot be specified as SQL queries. In particular, SQL is unable to express data transformations that require the dynamic creation of several tuples for each tuple of the source relation.

This paper proposes to address this class of data transformations, common in data management activities, by extending Relational Algebra with a new relational operator named data mapper. A first contribution of this work is a study of the formal aspects of the mapper operator, focusing on its formal semantics and expressiveness. A further contribution is support for cost-based optimization of data transformation expressions that combine mappers with standard relational operators. To that end, a set of algebraic rewriting rules and different physical execution algorithms are being developed.

1 INTRODUCTION

Data transformation is a fundamental step of data management activities such as integration, cleaning, migration and warehousing of data. In these activities, data represented by a fixed source schema must be transformed into a fixed target data schema.

A frequent problem in this context is the existence of data heterogeneities, i.e., the use of different representations of the same data in the source and target schemas (Rahm and Do, 2000). For example, the use of different units of measurement, or of different representations for compound data (e.g., multiple attributes representing day, month and year information vs. a single date attribute), occurs frequently. Another important source of data heterogeneities is the representation of data according to different aggregation levels (e.g., hourly vs. daily). When the source data represents an aggregation of the target data (e.g., yearly aggregated data in the source and monthly data in the target), the data transformation that has to take place needs to generate several tuples for each source tuple. Let us henceforth designate this class of transformations as one-to-many data transformations.

Relation LOANS          Relation PAYMENTS
ACCT  AM                ACCTNO  AMOUNT  SEQNO
12    20.00             0012    20.00   1
3456  140.00            3456    100.00  1
901   250.00            3456    40.00   2
                        0901    100.00  1
                        0901    100.00  2
                        0901    50.00   3

Figure 1: Illustration of an unbounded data transformation. (a) The source relation LOANS on the left, and (b) the target relation PAYMENTS on the right.

Consider a relation LOANS[ACCT, AM] (represented in Figure 1) that stores the details of loans requested per account. Suppose LOANS data must be transformed into PAYMENTS[ACCTNO, AMOUNT, SEQNO], the target relation, according to the following requirements:

Carreira P. (2007). ONE-TO-MANY DATA TRANSFORMATIONS - As Relational Operations. In Proceedings of the Ninth International Conference on Enterprise Information Systems - DISI, pages 503-507. DOI: 10.5220/0002405105030507. Copyright © SciTePress.

1. In the target relation, all the account numbers are left-padded with zeroes. Thus, the attribute ACCTNO is obtained by (left) concatenating zeroes to the value of attribute ACCT.

2. The target system does not support payment amounts greater than 100. The attribute AMOUNT is obtained by breaking down the value of attribute AM into multiple installments with a maximum value of 100, in such a way that the sum of the amounts for the same ACCTNO is equal to the source amount for the same account. Furthermore, the target field SEQNO is a sequence number for the installment. This sequence number starts at 1 for each sequence of installments of a given account.

The implementation of data transformations similar to those requested for producing the target relation PAYMENTS is challenging, since solutions to the problem involve the dynamic creation of tuples based on the value of attribute AM.

The remainder of the paper is organized as follows: next, we present the problem statement and enumerate the main contributions of the paper. The research field of data transformations is reviewed in Section 2. Section 3 briefly presents the mapper operator, introducing the logical and physical optimization issues. The current status of the work is detailed in Section 4, and Section 5 concludes.

1.1 Problem Statement

To minimize development effort and maximize performance, data transformations must be written in a language that is declarative, optimizable, and expressive. Data transformations are often expressed as Relational Algebra (RA) expressions, a language that meets the two former requirements. In fact, many useful data transformations can be naturally expressed as RA queries. However, due to the limitations in the expressive power of RA, relational queries are insufficient for expressing many interesting data transformations (Lakshmanan et al., 1996; Miller, 1998). In particular, RA is not capable of deriving new data items (Paredaens, 1978) and thus cannot represent the class of one-to-many data transformations.

Currently, to develop one-to-many data transformations, one has to resort either to a general-purpose programming language, to some flavor of proprietary scripting in an ETL tool, or to a stored procedure written in the DBMS's proprietary programming language. In any case, besides the inadequacy of these solutions for expressing one-to-many data transformations concisely, there is little possibility of leveraging the dynamic optimization capabilities of the DBMS.

1.2 Contribution

This paper proposes a new operator named data mapper, which extends RA for expressing one-to-many data transformations.

Since data transformations are often performed by RDBMSs, or by tools and languages that are also based on RA to various extents, the new operator is a general solution for expressing one-to-many data transformations in these systems. In particular, by incorporating the mapper operator, RDBMSs will be capable of efficiently handling a new class of data transformations, enabling their use in a greater variety of data management activities that require data transformations.

An advantage of addressing the problem of one-to-many data transformations through an operator is that it can be embedded in expressions involving standard relational operators and also be logically and physically optimized. To this end, we envision a cost-based optimization of data transformations expressed as a combination of standard relational operators and mappers. We propose (i) to formalize the mapper operator, (ii) to study the formal properties of the mapper operator, focusing on its expressiveness and algebraic properties, (iii) to develop alternative physical execution algorithms, and (iv) to adapt current cost-based query optimization techniques to handle mappers.

2 DATA TRANSFORMATIONS

Data transformation is an old problem, and the idea of using a declarative language to specify such transformations was proposed back in the 1970s with two prototypes, Convert (Shu et al., 1975) and Express (Shu et al., 1977), both aiming at data conversion.

To support the growing span of applications of RDBMSs, several extensions to RA have been proposed since its inception, mainly in the form of new operators. Applications requiring data transformations bring a new requirement to RA, as their focus is no longer limited to the initial idea of selecting information, but also involves the production of new data items (Paredaens, 1978).

In the context of data cleaning, Potter's Wheel's fold operator (Raman and Hellerstein, 2001) and Ajax's map operator (Galhardas et al., 2000) were proposed for expressing one-to-many data transformations. The Data Fusion tool (Carreira and Galhardas, 2004) implements one-to-many data transformations in the context of legacy-data migrations. None of these, however, proposes an extension of the relational algebra or addresses logical and physical optimization issues.

Data transformations are also required in ETL processes. To the best of our knowledge, in most ETL tools, to express one-to-many data transformations, the user has to resort to some form of ad-hoc scripting. Furthermore, the optimization of ETL data transformations is a recent effort (Simitsis et al., 2005).

When performing data integration, data has to be transformed from the data sources to the integrated database or vice-versa. TSIMMIS MSL (Papakonstantinou et al., 1996) and Squirrel ISL (Zhou et al., 1996) are data integration languages whose main goal is to fuse data from several sources. These languages, like others for restructuring semi-structured data, e.g., YAT (Cluet et al., 1998) and TransScm (Milo and Zohar, 1998), have their expressiveness restricted to avoid potentially dangerous specifications (that may result in diverging computations). As a result, they cannot express the class of one-to-many data transformations.

3 THE MAPPER OPERATOR

The mapper operator can be formalized as a unary operator µ_F that takes a relation instance of the source relation schema as input and produces a relation instance of the target relation schema as output. The operator is parameterized by a set F of special functions, which we designate as mapper functions. The intuition is that each mapper function f_Ai expresses a part of the envisaged data transformation, focused on a non-empty set A_i of attributes of the target schema.

Consider relation schemas X and Y. Furthermore, let Y = A_1 · ... · A_k, where each A_i is a set of schema attributes. Given a tuple u of a source relation s(X), the expression µ_F({u}) denotes the tuples t in Dom(Y) such that, for every set of attributes A_i, t[A_i] takes one of the values given by f_Ai(u). Further details can be found in (Carreira et al., 2005a). The mapper operator is formally defined as follows: given a set of mapper functions F = {f_A1, ..., f_Ak}, the mapper of s with respect to F, denoted by µ_F(s), is the relation instance of the target relation schema defined by

    µ_F(s) = {t ∈ Dom(Y) | ∃u ∈ s s.t. t[A_i] ∈ f_Ai(u), for all 1 ≤ i ≤ k}

The data transformation of the introductory example can be expressed by means of a mapper µ_{acct,amt}, with two mapper functions. The function acct is the ACCT-mapper function that returns a singleton with the account number ACCT properly left-padded with zeroes. The function amt is the [AMOUNT, SEQNO]-mapper function such that amt(am) is given by

    {(100, i) | 1 ≤ i ≤ ⌊am/100⌋} ∪ {(am % 100, ⌊am/100⌋ + 1) | am % 100 ≠ 0}

where % represents the modulus operation. For instance, if v is the source tuple (901, 250.00), the result of evaluating amt(v) is the set {(100, 1), (100, 2), (50, 3)}. Given a source relation s including v, the result of the expression µ_{acct,amt}(s) is another relation that contains the set of tuples {⟨'0901', 100, 1⟩, ⟨'0901', 100, 2⟩, ⟨'0901', 50, 3⟩}.
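To make the semantics above concrete, here is a minimal Python sketch of the mapper evaluated on the running example. The function names acct and amt follow the paper; the dictionary-based tuple representation and the padding width of 4 are our own assumptions for illustration.

```python
from itertools import product

def acct(t):
    # ACCT-mapper: singleton with the account number left-padded
    # with zeroes (a width of 4 is assumed here).
    return {(str(t["ACCT"]).zfill(4),)}

def amt(t):
    # [AMOUNT, SEQNO]-mapper: break AM into installments of at most 100,
    # numbering the installments from 1.
    am = t["AM"]
    full = int(am // 100)
    result = {(100.0, i) for i in range(1, full + 1)}
    if am % 100 != 0:
        result.add((am % 100, full + 1))
    return result

def mapper(functions, relation):
    """Evaluate mu_F(s): for each source tuple, combine the outputs of
    every mapper function via Cartesian product and union the results."""
    out = set()
    for u in relation:
        parts = [f(u) for f in functions]
        for combo in product(*parts):
            # flatten the per-function value tuples into one output tuple
            out.add(tuple(v for part in combo for v in part))
    return out

loans = [{"ACCT": 901, "AM": 250.00}]
print(mapper([acct, amt], loans))
# contains ('0901', 100.0, 1), ('0901', 100.0, 2), ('0901', 50.0, 3)
```

Note how the one-to-many behavior comes entirely from amt returning a set with more than one element, while acct, returning a singleton, behaves like an ordinary projection function.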

3.1 Logical Optimization

Better plans for queries involving mappers can be achieved through the systematic application of a new set of algebraic rewriting rules. One such simple algebraic rewriting rule is µ_F(r ⋈ s) = µ_F(r) ⋈ µ_F(s), which holds if none of the mapper functions in F produces duplicate values.

An important property that influences the choice of a particular plan for binary operators is the expected number of tuples of each of its sub-plans. Since mappers can generate several output tuples for each input tuple, estimating the number of output tuples of a mapper is an interesting problem. One way to approach the problem consists of estimating the mapper fan-out factor. If a mapper was never evaluated before, an interesting question is how to find a good initial estimate for its fan-out. We believe that in many situations the fan-out factors of the individual mapper functions can be combined to produce an initial answer. Another interesting observation is that when mapper functions return empty sets, no output tuples are produced. Thus, in some situations the mapper may act as a filter, which makes selectivity another relevant factor. The already non-trivial problem of optimizing queries with mappers can be taken one step further: mapper functions can be expensive and, due to data skew, their cost is subject to change at query-execution time.
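One plausible first-cut model for combining per-function statistics, sketched below under an independence assumption that the paper does not state: since the output for one input tuple is the Cartesian product of the function outputs, the per-function fan-outs multiply, and so do the per-function selectivities (the fraction of inputs for which a function returns a non-empty set).

```python
# Our own illustration of an initial fan-out estimate for a mapper,
# assuming the mapper functions behave independently.

def mapper_fanout(fanouts, selectivities):
    """fanouts[i]: average output-set size of function i when non-empty;
    selectivities[i]: fraction of input tuples for which function i
    returns a non-empty set. Returns expected output tuples per input."""
    fanout = 1.0
    for f in fanouts:
        fanout *= f
    sel = 1.0
    for s in selectivities:
        sel *= s
    return fanout * sel

# e.g. amt yields 2.5 installments on average, acct exactly one value,
# and neither function ever filters a tuple out:
print(mapper_fanout([1.0, 2.5], [1.0, 1.0]))  # 2.5
```

Under this model a mapper with any low-selectivity function shrinks its input, which is exactly the filtering behavior noted above.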

3.2 Physical Optimization

The formal semantics of the mapper equips it with a simple iterator-based execution model: for each input tuple, evaluate each mapper function and then compute the Cartesian product of the results. The output relation is obtained by unioning all the tuples so obtained.
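This iterator-based model can be sketched as a Volcano-style pull operator; the class name and interface below are our own, chosen to mirror how RDBMS execution engines typically expose physical operators.

```python
from itertools import product

class MapperIterator:
    """Naive iterator-model sketch of the mapper: pull one tuple from
    the child operator, evaluate every mapper function on it, and emit
    the Cartesian product of the results before pulling the next tuple."""

    def __init__(self, child, functions):
        self.child = child          # any iterator over source tuples
        self.functions = functions  # list of mapper functions
        self.pending = iter(())     # output tuples not yet emitted

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            try:
                return next(self.pending)
            except StopIteration:
                u = next(self.child)  # propagates StopIteration when done
                parts = [f(u) for f in self.functions]
                self.pending = (
                    tuple(v for part in combo for v in part)
                    for combo in product(*parts)
                )
```

Because output tuples are produced one at a time, the operator pipelines naturally: it never materializes more than the product for the current input tuple.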


This simple model favors the integration of our mapper operator in the query execution mechanisms of an RDBMS. However, it turns out that in the presence of expensive functions, such as string matching or check-digit computations, this naïve execution of the mapper operator can be very inefficient.

The total cost of evaluating a mapper can be minimized by avoiding superfluous function evaluations. First, columns often have duplicate values, which suggests the use of caching techniques. In the presence of potentially many functions and tables with multi-million tuples, the choice of which functions to cache is an optimization problem in itself. Second, some functions return empty sets. When an empty set is found, no output tuples are produced for the given input tuple; thus, there is no need to evaluate the remaining functions. This observation suggests an interesting strategy that consists of evaluating the more selective functions first.
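The two ideas above, caching and short-circuiting on empty results, can be combined as in the following sketch. This is an assumed design for illustration, not the paper's algorithm; here "selectivity" is the fraction of inputs yielding a non-empty set, so the function most likely to return empty is evaluated first.

```python
# Sketch: evaluate memoized mapper functions in ascending order of
# selectivity, stopping as soon as one returns an empty set (an empty
# result means this input tuple contributes no output tuples at all).

def evaluate_with_cache(functions, selectivities, u, caches):
    """functions[i]: mapper function over tuple u; caches[i]: a dict
    memoizing earlier results so duplicate inputs are not re-evaluated.
    Returns the per-function result sets, or None if any set is empty."""
    order = sorted(range(len(functions)), key=lambda i: selectivities[i])
    results = [None] * len(functions)
    for i in order:
        if u not in caches[i]:
            caches[i][u] = functions[i](u)
        results[i] = caches[i][u]
        if not results[i]:        # empty set: skip remaining functions
            return None
    return results
```

A real implementation would key each cache on only the attributes the function actually reads, and bound cache memory, which is where the cache-selection optimization problem mentioned above arises.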

4 CURRENT STATUS

Defining a new operator is a significant research effort, as it requires both theoretical and practical insight. In such an effort, two issues need to be addressed upfront: the usefulness of the operator needs to be validated, and the class of problems being solved has to be formally defined.

To address the first issue, we pursued a commercial venture that resulted in the inclusion of native support for one-to-many data transformations in a commercial tool (Carreira and Galhardas, 2004). The tool is being used in several real-world legacy-data migration projects that corroborate the need for supporting one-to-many data transformations.

Up to this moment, we have been able to put forward a formal semantics for the new operator that enabled us to perform a formal study of its expressiveness. We developed the formal demonstration that the mapper-extended RA (MRA) is strictly more expressive than standard RA.

A formal definition of the class of one-to-many data transformations is underway. We conjecture that two sub-classes of one-to-many data transformations exist: one comprising data transformations expressible through RA, and another comprising those expressible only through MRA.

A set of algebraic rewriting rules for generating logical query plans involving mappers and some standard relational operators has been developed, together with formal proofs of correctness (Carreira et al., 2005a). A first set of rewriting rules for expressions involving mappers and joins has also emerged. Currently, a set of experiments is being conducted to determine the factors that influence the effectiveness of the proposed rewritings (Carreira et al., 2005b). Prototypical implementations of physical mapper operator algorithms are being developed in Java using the XXL framework (van den Bercken et al., 2000). These algorithms adapt the ideas of memoization and hybrid hashing proposed by Hellerstein and Naughton (1996) to multiple functions.

5 CONCLUSIONS

In this work, we address the problem of specifying one-to-many data transformations that are frequently required in data integration, data cleaning, legacy-data migration, and ETL scenarios. Since one-to-many data transformations are not expressible through standard RA queries, we proposed the mapper operator. The new operator makes it possible to naturally express one-to-many data transformations, while at the same time extending the expressive power of RA.

To date, several operators have been proposed for addressing the problem of expressing one-to-many data transformations (Cunningham et al., 2004; Galhardas et al., 2001; Raman and Hellerstein, 2001; Amer-Yahia and Cluet, 2004). Although these operators show similarities with mappers, most of them are only capable of expressing a subset of one-to-many transformations.

As data often resides in RDBMSs, data transformations specified as relational expressions can take direct advantage of their optimization capabilities. Following this trend, several RDBMSs, e.g., Microsoft SQL Server, already include additional software packages specific to ETL tasks. However, as far as we know, none of these extensions is supported by the corresponding theoretical background in terms of existing database theory. Therefore, the capabilities of relational engines, in terms of optimization opportunities, are not fully exploited in activities involving data transformations, like ETL or data cleaning.

REFERENCES

Amer-Yahia, S. and Cluet, S. (2004). A declarative approach to optimize bulk loading into databases. ACM Transactions on Database Systems, 29(2):233-281.

Carreira, P. and Galhardas, H. (2004). Efficient development of data migration transformations. In ACM SIGMOD Int'l Conf. on the Managt. of Data.

Carreira, P., Galhardas, H., Lopes, A., and Pereira, J. (2005a). Extending relational algebra to express one-to-many data transformations. In 20th Brazilian Symposium on Databases, SBBD'05.

Carreira, P., Galhardas, H., Pereira, J., and Lopes, A. (2005b). Data mapper: An operator for expressing one-to-many data transformations. In 7th Int'l Conf. on Data Warehousing and Knowledge Discovery, DaWaK '05, volume 3589 of LNCS. Springer-Verlag.

Cluet, S., Delobel, C., Siméon, J., and Smaga, K. (1998). Your mediators need data conversion! In ACM SIGMOD Int'l Conf. on the Managt. of Data.

Cunningham, C., Graefe, G., and Galindo-Legaria, C. A. (2004). PIVOT and UNPIVOT: Optimization and execution strategies in an RDBMS. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB'04), pages 998-1009. Morgan Kaufmann.

Galhardas, H., Florescu, D., Shasha, D., and Simon, E. (2000). Ajax: An extensible data cleaning tool. ACM SIGMOD Int'l Conf. on Managt. of Data, 2(29).

Galhardas, H., Florescu, D., Shasha, D., Simon, E., and Saita, C. A. (2001). Declarative data cleaning: Language, model, and algorithms. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB'01).

Hellerstein, J. M. and Naughton, J. F. (1996). Query execution techniques for caching expensive methods. ACM SIGMOD Int'l Conf. on Managt. of Data.

Lakshmanan, L. V. S., Sadri, F., and Subramanian, I. N. (1996). SchemaSQL - A language for querying and restructuring database systems. In Proc. Int'l Conf. on Very Large Databases (VLDB'96), pages 239-250.

Miller, R. J. (1998). Using schematically heterogeneous structures. Proc. of ACM SIGMOD Int'l Conf. on the Managt. of Data, 2(22):189-200.

Milo, T. and Zohar, S. (1998). Using schema matching to simplify heterogeneous data translation. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB'98).

Papakonstantinou, Y., Garcia-Molina, H., and Ullman, J. (1996). MedMaker: A mediator system based on declarative specifications. In Proc. Int'l Conf. on Data Engineering.

Paredaens, J. (1978). On the expressive power of the relational algebra. Information Processing Letters, 7(2):107-111.

Rahm, E. and Do, H.-H. (2000). Data cleaning: Problems and current approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 24(4).

Raman, V. and Hellerstein, J. M. (2001). Potter's Wheel: An interactive data cleaning system. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB'01).

Shu, N. C., Housel, B. C., and Lum, V. Y. (1975). CONVERT: A high level translation definition language for data conversion. Communic. of the ACM, 18(10):557-567.

Shu, N. C., Housel, B. C., Taylor, R. W., Ghosh, S. P., and Lum, V. Y. (1977). EXPRESS: A data extraction, processing and restructuring system. ACM Transactions on Database Systems, 2(2):134-174.

Simitsis, A., Vassiliadis, P., and Sellis, T. K. (2005). Optimizing ETL processes in data warehouses. In Proc. of the 21st Int'l Conf. on Data Engineering, ICDE 2005.

van den Bercken, J., Dittrich, J. P., and Seeger, B. (2000). XXL: A prototype for a library of query processing algorithms. In Proc. of the ACM SIGMOD Int'l Conf. on Managt. of Data. ACM Press.

Zhou, G., Hull, R., and King, R. (1996). Generating data integration mediators that use materialization. Journal of Intelligent Information Systems, 6(2/3):199-221.
