
 
research community has addressed these problems over 
the years, particularly regarding the mapping of BPMN 
processes to the BPEL language (Ouyang et al., 2007). 
The disparities in detail and scope between more abstract 
and more concrete models create a large gap between the 
two representations, simply because they serve different 
purposes. To simplify ETL development, we propose the 
application of a task-clustering technique that groups a 
set of finer-grained tasks into a collection of tasks, flows 
and control primitives, providing a method to organize 
them into layers of abstraction and to support the 
different levels of detail required by the several 
stakeholders in different project phases. The cluster 
categorization provides a way to identify and classify 
patterns that can be instantiated and reused in 
different ETL processes. Each pattern is scope 
independent and provides a specific skeleton that not 
only specifies its internal behaviour but also 
enables communication and data interchange with 
other patterns in a workflow context.
In this paper, we demonstrate the feasibility and 
effectiveness of the approach we developed by 
analysing a real-world ETL scenario and identifying 
common task clusters. In section 2 we discuss their 
generalization and use as general constructs in an ETL 
package. Then, we demonstrate how activities can be 
grouped to build ETL conceptual models at different 
abstraction layers (section 3), presenting two ETL 
skeletons (data lookup, and data conciliation and 
integration) and exposing their configuration and 
behaviour in a way that allows them to be executed in 
a target ETL tool (section 4). Next, some related work 
is discussed (section 5). Finally, in section 6, we 
evaluate the work done, pointing out some research 
guidelines for future work.
2 ETL TASK CLUSTERING
On-line Transaction Processing (OLTP) systems are 
responsible for recording all business transactions 
performed in enterprise operational systems. They are 
built to support specific business processes, storing the 
operational data produced by daily business operations, 
and are therefore the main data sources used by data 
warehousing systems. In more complex cases, data are 
distributed across the distinct business branches of a 
company and can be stored in sophisticated relational 
databases or in simpler data structures (e.g. text or 
spreadsheet files). Due to this variety of information 
sources, problems often occur when populating a data 
warehouse (Rahm and Do, 2000).
Generally, the tasks used in the extraction step (E) 
are responsible for gathering data from the data sources 
and placing it in a data staging area (DSA). The DSA is 
a working area where data is prepared before going to 
the data warehouse. For that, the DSA provides the 
metadata necessary to support the entire ETL process, 
offering, for example, support for data correction and 
data recovery mechanisms based on domain-oriented 
dictionaries, mapping mechanisms or quarantine tables. 
Transformation and cleaning (T) procedures are 
applied after data extraction, using the data already 
stored in temporary structures in the DSA. After this 
second step, data is loaded (L) into the target data 
warehouse, following its schema rules and operational 
and business constraints. Essentially, an ETL process 
is a data-driven workflow comprising a set of tasks and 
their associated control flows and business rules, which 
together express how the system should be coordinated. 
Typically, commercial tools use workflows to represent 
very specific tasks that are grouped together, again and 
again, each time the same procedure needs to be 
represented.
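For concreteness, the following is a minimal sketch of these three steps over an in-memory SQLite database; every table, column and value is a hypothetical illustration, not part of the approach proposed in this paper.

```python
# A minimal sketch of the E, T and L steps described above, using an
# in-memory SQLite database as source, DSA and warehouse. All tables,
# columns and values are hypothetical illustrations only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_flights (flight_id INTEGER, city TEXT);  -- operational source
    CREATE TABLE stg_flights (flight_id INTEGER, city TEXT);  -- DSA staging table
    CREATE TABLE quarantine  (flight_id INTEGER, city TEXT);  -- DSA correction area
    CREATE TABLE city_dict   (raw TEXT, canonical TEXT);      -- domain dictionary
    CREATE TABLE dw_flights  (flight_id INTEGER, city TEXT);  -- warehouse target
    INSERT INTO src_flights VALUES (1, 'Lisboa'), (2, 'Oporto'), (3, '???');
    INSERT INTO city_dict VALUES ('Lisboa', 'Lisbon'), ('Oporto', 'Porto');
""")

# E: gather data from the source and place it in the DSA.
conn.execute("INSERT INTO stg_flights SELECT * FROM src_flights")

# T: conform staged records with the dictionary; unmatched rows are
# moved to a quarantine table for later correction, as described above.
conn.execute("""
    INSERT INTO quarantine
    SELECT s.* FROM stg_flights s
    LEFT JOIN city_dict d ON s.city = d.raw WHERE d.raw IS NULL""")

# L: load only the conformed records into the target warehouse table.
conn.execute("""
    INSERT INTO dw_flights
    SELECT s.flight_id, d.canonical FROM stg_flights s
    JOIN city_dict d ON s.city = d.raw""")

print(conn.execute("SELECT * FROM dw_flights").fetchall())  # [(1, 'Lisbon'), (2, 'Porto')]
print(conn.execute("SELECT * FROM quarantine").fetchall())  # [(3, '???')]
```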
To reduce the impact of such situations, we 
propose an approach (also used in other application 
scenarios (Singh et al., 2008)) that allows us to 
organize ETL tasks into clusters and execute them as 
a single block. However, we went a step further, 
formalizing a set of “standard” clusters representing 
some of the most common techniques used in real-world 
ETL scenarios. Given a set of input parameters, the 
tasks that compose a cluster should produce a specific 
output. In effect, we are creating routines or software 
patterns whose use simplifies ETL development, 
reducing implementation errors and the time needed to 
implement ETL processes. With the ETL patterns 
identified, we propose a multi-layer approach to 
represent them, using BPMN pools to represent the 
several layers of abstraction, each composed of several 
lanes representing the tasks that should be applied to 
each group of similar records to be processed.
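As a rough illustration of this idea (the class and its fields below are our own hypothetical sketch, not the paper's formal pattern definition), a cluster can be seen as a named, parameterizable sequence of tasks that is executed as one block:

```python
# A rough sketch of a task cluster: a named group of finer-grain tasks,
# configured through input parameters and executed as one block. The class
# and its fields are illustrative assumptions, not a formal definition.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TaskCluster:
    name: str                                   # e.g. "data lookup"
    tasks: list = field(default_factory=list)   # finer-grain tasks, in order

    def run(self, params: dict, data: Any) -> Any:
        # Seen from the outside, the cluster behaves as one black-box
        # activity with a well-defined input (params, data) and output.
        for task in self.tasks:
            data = task(params, data)
        return data

# Usage: a tiny "surrogate key lookup" cluster built from two tasks.
lookup = TaskCluster("data lookup", [
    lambda p, rows: [r for r in rows if r.get(p["key"]) is not None],
    lambda p, rows: [{**r, "sk": p["dim"].get(r[p["key"]])} for r in rows],
])
out = lookup.run({"key": "city", "dim": {"Lisbon": 10}},
                 [{"city": "Lisbon"}, {"city": None}])
print(out)  # [{'city': 'Lisbon', 'sk': 10}]
```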
To demonstrate the potential of our approach, we 
present a common ETL scenario: an ordinary data 
extraction process that uses two different data sources, 
prepares the data by conforming it, and subsequently 
inserts it into a data warehouse. The first source, a 
relational schema, stores data about flights (dates and 
flight times), travel locations (city and country) and 
planes (brand, model and type, and manufacturer data). 
The spreadsheet source holds a subset of the data 
structures found in the relational schema.
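The two sources can be pictured as follows; the schemas and sample values are hypothetical stand-ins for the structures just described, with the spreadsheet modelled as rows already read from a file (e.g. via csv.DictReader):

```python
# A self-contained sketch of the scenario's two sources; all schemas and
# values are hypothetical stand-ins for the structures described above.
import sqlite3

db = sqlite3.connect(":memory:")  # the relational source
db.executescript("""
    CREATE TABLE flights (flight_id INTEGER, flight_date TEXT, flight_time TEXT,
                          city TEXT, country TEXT, plane_model TEXT);
    INSERT INTO flights VALUES
        (1, '2015-03-01', '10:30', 'Lisbon', 'Portugal', 'A320');
""")
relational_rows = db.execute("SELECT * FROM flights").fetchall()

# The spreadsheet source mirrors a subset of the relational structures;
# its rows are assumed to have been read already (e.g. with the csv module).
spreadsheet_rows = [(2, '2015-03-02', '08:15', 'Porto', 'Portugal', None)]

# Because both sources conform to one record layout, the same preparation
# clusters (such as the lookup above) can process them before the load.
all_rows = relational_rows + spreadsheet_rows
print(all_rows)
```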