Business Process Modeling Techniques for Data Integration Conceptual Modeling
Ana Ribeiro¹, Bruno Oliveira² and Óscar Oliveira²
¹University of Trás-os-Montes and Alto Douro, Portugal
²CIICESI, Portugal
al2024119380@alunos.utad.pt, {bmo, oao}@estg.ipp.pt
Keywords:
Data Pipelines, Business Process Model and Notation, Conceptual Modeling, Analytical Systems,
Methodology.
Abstract:
Data pipelines play a crucial role in analytical systems by managing the extraction, transformation, and loading
of data to meet decision-making needs; however, the inherent complexity of data management makes their development demanding.
Despite their importance, the development of data pipelines is frequently carried out in an ad-hoc manner,
lacking standardized practices that ensure consistency and coherence across implementations. In recent years,
the Business Process Model and Notation (BPMN) has emerged as a powerful tool for conceptual modeling in
diverse analytical and operational scenarios. BPMN offers an expressive framework capable of representing a
wide range of data processing requirements, enabling structured and transparent design. This work explores
the application of BPMN to data integration pipeline modeling, analyzing existing methodologies and propos-
ing a standardized set of guidelines to enhance its use.
1 INTRODUCTION
The need to capture and analyze data constantly
evolves as organizations expand their business ac-
tivities (Yaqoob et al., 2016). Current trends high-
light analytical and storage approaches designed to
handle large-scale, often less structured data (Inmon,
2016). However, having access to vast amounts of
data does not guarantee informed decision-making
(Janssen et al., 2017). It is crucial to ensure that data
quality standards are maintained to support decision-
making processes effectively.
Data integration, ETL, and data pipelines are in-
terconnected concepts for modern data processing.
Data integration refers to the process of combining
data from different sources to provide a unified view,
enabling organizations to analyze and make decisions
based on comprehensive datasets (Lenzerini, 2002).
ETL is a specific subset of data integration that fo-
cuses on extracting data from disparate sources, trans-
forming it into a format suitable for analysis, and then
loading it into a destination, such as a data ware-
house (Kimball and Caserta, 2004). Data pipelines,
on the other hand, are end-to-end workflows that au-
tomate and streamline the process of moving, trans-
forming, and storing data (Munappy et al., 2020). A
data pipeline can encompass various stages, including
ETL, but may also involve additional operations like
data validation, enrichment, and real-time processing
(Dayal et al., 2009). In essence, ETL is one of the
key processes within a data pipeline, while data in-
tegration is the broader goal that data pipelines aim
to achieve by facilitating the seamless flow of data
across systems.
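To make this relationship concrete, the following sketch (purely illustrative; stage names are not taken from any specific system) treats a pipeline as an ordered set of stages of which the ETL steps are a subset, all serving the broader data integration goal:

```python
# Illustrative only: a data pipeline as an ordered list of stages.
# ETL (extract, transform, load) is a subset of the pipeline, which may also
# include validation, enrichment, or (near) real-time processing steps.
PIPELINE_STAGES = [
    ("extract",   "pull changed data from source systems"),   # ETL
    ("validate",  "apply data quality rules"),
    ("transform", "reshape and clean data for analysis"),     # ETL
    ("enrich",    "add derived or external attributes"),
    ("load",      "write data to the warehouse"),             # ETL
]
ETL_STEPS = {"extract", "transform", "load"}

def run_pipeline(stages):
    # Data integration is the broader goal the whole pipeline serves.
    for name, description in stages:
        tag = " [ETL]" if name in ETL_STEPS else ""
        print(f"{name}: {description}{tag}")

run_pipeline(PIPELINE_STAGES)
```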
Due to the complexity of data management, ETL
processes consume significant resources. As a critical
component, ETL impacts the system’s adequacy: fail-
ure to deliver high-quality data compromises the sys-
tem’s reliability (Souibgui et al., 2019; Nwokeji and
Matovu, 2021). ETL development is typically driven
by ad-hoc practices, which, unlike traditional soft-
ware development, lack a solid methodology to guide
and document the transition from conceptual repre-
sentation to physical implementation. While some
ETL tools offer support in this regard, they often rely
on proprietary methods, notations, and methodolo-
gies. These tools primarily address the physical level
of ETL, offering limited support for conceptual prim-
itives to describe systems at a higher level (Prakash
and Rangdale, 2017).
Business Process Modeling and Notation (BPMN)
(Aagesen and Krogstie, 2015) is a standardized mod-
eling notation that offers a set of artifacts for model-
ing business processes. It focuses on two key aspects:
managing and planning a workflow and modeling the
implementation architecture. BPMN notation is ad-
vantageous, as it allows for process design to be ex-
plored from multiple perspectives, enhancing process
expressiveness, particularly during early development
phases. However, this flexibility can also introduce ambiguity, posing challenges when interpreting processes at the execution level during implementation.
The remainder of this paper is structured as fol-
lows: After the introduction, Section 2 reviews the
most relevant research in the field of ETL modeling.
Section 3 outlines the approach proposed in this pa-
per for ETL conceptual modeling. Section 4 presents
a case study demonstrating the application of the pro-
posed conceptual modeling approach to a real-world
scenario. Finally, Section 5 summarizes the key con-
clusions of this work and outlines directions for future
research.
2 RELATED WORK
Business Process Model and Notation (BPMN) is a graphical notation for modelling business processes [16]. BPMN emerged as a successor to flowcharts (simple diagrams with boxes and arrows) and to the Unified Modeling Language (UML), the first structured method for process flow representation (Oliveira et al., 2015). BPMN is designed to be
easy to use and understand by all involved in business
processes, including business analysts, programmers,
technical implementers, and managers responsible for
monitoring the processes. Its simplicity and expres-
siveness for representing flows, participants, condi-
tions, and events have made BPMN widely adopted
across domains. The latest specification, BPMN 2.0,
published in 2011, has since become the most popular
business process modelling language. It bridges the
gap between technical and non-technical users, offer-
ing a versatile and widely understood tool supported
by Business Process Management Systems (BPMS). The notation's flexibility is evident in its
three levels of modelling (Silver, 2011):
Descriptive Modeling: This level represents an
initial view of a business process, highlighting
its main activities. It focuses on simple docu-
mentation of workflows, such as mapping and de-
scribing routine processes within an organization.
High-level and occasionally non-compliant with
BPMN validation rules, descriptive modeling en-
hances communication across the organization. It
is ideal for depicting existing processes without
delving into specifics.
Analytical Modeling: Building on the previous
level, analytical modeling incorporates rules and
outcomes for more precise and detailed represen-
tations. It fosters collaboration among stakehold-
ers, including business analysts, technical staff,
and managers. This level clarifies process ac-
tivities and objectives, enabling detailed work-
flow representations and performance analyses for
optimization. Validation against BPMN specifi-
cations and hierarchical organization are empha-
sized. Analytical modelling is commonly used in
areas such as human resources, logistics, and pro-
curement.
Executable Modeling: This advanced level em-
phasizes precision and detail, enabling process
execution through simulations. Simulations gen-
erate performance data and validate compliance
with BPMN modelling rules. Executable mod-
elling requires detailed descriptions of process
attributes and aims to convert diagrams into
software-ready formats like XML-based specifi-
cations (e.g., XML Process Definition Language,
XPDL).
These three BPMN representation levels differ
in abstraction, information detail, complexity, utility,
and standardization. The choice of model depends on
these factors and the purpose of the process design,
ensuring alignment with project needs and goals. In
(Oliveira et al., 2021), the authors presented a BPMN-
based conceptual modeling approach for represent-
ing ETL processes across three distinct abstraction
layers. BPMN’s expressiveness, advantageous for
ETL representation, enables diverse conceptual mod-
els due to varying thought processes and representa-
tion styles among individuals. The authors proposed
an approach that organizes ETL conceptual model-
ing into separate layers, each focusing on a specific
level of detail. This structure provides the ETL devel-
opment team with tailored tools for communication
at different phases of ETL development. Each layer
builds upon the constructs described in the preceding
layer, progressively adding detail. This incremental
enrichment of system requirements contributes to a
more agile development process. The authors embod-
ied the notion of patterns when designing this top-
down approach for conceptual modelling, providing
pre-configurable components that represent common
tasks used in the ETL environment.
In the past, several other authors focused on mod-
elling ETL, proposing methodologies for ETL con-
ceptual modeling (Vassiliadis et al., 2005; Trujillo and Luján-Mora, 2003). Dupor and Jovanović (2014) developed a method focused on providing a simple visual overview to ease the representation of ETL processes. Biswas et al. (2017, 2019) utilized Systems
Modeling Language (SysML) to explore requirement
and activity diagrams for conceptual representation,
which could be transformed into XML Metadata In-
terchange (XMI) for programmable interpretation. Raj et al. (2020) took a broader approach by mod-
eling data pipelines, including ETL/ELT transfor-
mations across various applications and data types
(e.g., continuous or batch). The authors provide an
overview of designing a conceptual model for data
pipelines, which serves as a common communication
framework among different data teams. Additionally,
this model can facilitate the automation of monitor-
ing, fault detection, mitigation, and alerting at various
stages of the data pipeline.
However, some approaches rely on specific notations, which ETL development teams must learn and then communicate to non-technical stakeholders. Adapting widely used notations can
mitigate these issues. For example, Akkaoui et al.
(Akkaoui and Zimanyi, 2009) suggested that ETL
processes can be viewed as specialized business pro-
cesses, facilitating communication among technical
and non-technical staff. BPMN, as a widely adopted
business process modeling and execution standard,
has naturally extended into ETL modeling.
We believe that BPMN is a valuable tool for ETL
modeling, as ETL processes can be understood as a
specialized type of business process. Just like tradi-
tional business workflows, ETL involves a sequence
of structured activities—data extraction, transforma-
tion, and loading—each governed by rules, dependen-
cies, and business requirements. Given this parallel,
our work explores how methodologies originally de-
signed for modeling standard business processes can
be adapted to ETL, providing a structured and in-
tuitive representation of data workflows. By using
BPMN’s expressiveness and widespread adoption, we
aim to bridge the gap between business and technical
stakeholders, enhancing communication, improving
process transparency, and enabling automation. Our
approach seeks to demonstrate that BPMN can not
only facilitate conceptual modeling of ETL but also
support optimization, monitoring, and maintainabil-
ity of ETL workflows, ultimately contributing to more
efficient data integration practices.
3 MODELLING WITH LAYERS
The development of abstract models helps improve
the understanding of the process for all involved par-
ties, whether they are business owners, business ana-
lysts, or more technical users (Soffer et al., 2012). Es-
pecially during the early stages of development, con-
ceptual models play an extremely important role, as
users validate business requirements. BPMN offers a
simple yet powerful notation for process representa-
tion, which is highly suitable for ETL processes. Be-
yond its expressiveness, BPMN also provides mech-
anisms for execution. Additionally, one advantage is
that business users are already familiar with the nota-
tion, and existing business processes can be leveraged
to understand the logic of processes and data flow.
However, adopting a notation for modelling ETL pro-
cesses with BPMN is not straightforward. Conven-
tions need to be adopted to guide the modelling pro-
cess, ensuring that there are no divergent representa-
tions of the process that could lead to misinterpreta-
tions.
Three levels of representation are defined in (Sil-
ver, 2011): Descriptive Modeling, Analytical Mod-
eling, and Executable Modeling. These levels guide
the progressive and increasingly detailed representa-
tion of processes, aligning with the needs of different
project phases. Given BPMN’s suitability for ETL
processes, this study explores how this approach can
be applied in the ETL context.
The goal of this methodology is to adapt BPMN
modeling principles for ETL across different levels,
ensuring clarity, expressiveness, and consistency in
both specification and implementation. Following
the top-down BPMN modeling approach described
in (Silver, 2011), a hierarchical structure for ETL
models is proposed. Through an analysis of BPMN
process modeling methods, the applicability of these
techniques to ETL was assessed. Based on this anal-
ysis, relevant methods and rules were selected and
adapted to fit ETL-specific requirements, defining the
key elements of each modeling level. The Executable Modeling level relies on specialized execution engines rather than direct BPMN execution. The application of com-
position principles and carefully chosen BPMN ele-
ments ensures adherence to best practices for concep-
tual ETL modeling.
3.1 Level 1 - Descriptive Modeling
At Level 1, BPMN modeling is designed to be sim-
ple, intuitive, and easy to read, using a core set of
elements that resemble traditional flowcharts. This
level focuses on high-level process documentation,
avoiding technical complexity while ensuring clarity
in workflow representation. The Level 1 palette for
BPMN 2.0 includes essential components: tasks (for
individual activities), subprocesses (for grouping re-
lated tasks), gateways (for decision logic), and events
(for process initiation and completion). These ele-
ments provide a clear, structured view of ETL work-
flows, where tasks represent ETL operations, subprocesses organize process phases, and gateways control
flow decisions. Events define key moments affecting
the process, such as data availability or pipeline fail-
ures. Pools and lanes introduce role-based segmen-
tation, allowing the representation of data reposito-
ries, system components, or ETL phases (extraction,
transformation, loading). While message flows ex-
ist in BPMN, they are typically minimized in Level 1
models, focusing instead on sequence flows that de-
fine the core process execution order. Annotations
and artifacts further enhance the model by providing
descriptive details, supporting human readability and
process traceability.
In adapting BPMN to ETL modeling, the Level 1
approach ensures that data engineers and business an-
alysts can collaboratively design, analyze, and com-
municate ETL processes without technical barriers.
By maintaining a consistent modeling style and focus-
ing on core BPMN elements, Level 1 lays the founda-
tion for more detailed analytical and executable mod-
els, bridging the gap between conceptual design and
implementation.
This level, summarized in Figure 1, focuses on high-level representations of ETL workflows, using a simplified set of BPMN ele-
ments to ensure clarity and comprehensibility. These
elements provide a structured way to describe the core
steps of an ETL process while maintaining an intu-
itive, flowchart-like visual representation. The key
BPMN elements used in this layer include:
1. Flow Objects
(a) Events: Represent the start and end of the pro-
cess.
i. Start Event: Marks the beginning of an ETL
process (e.g., data extraction initiation).
ii. End Event: Indicates the completion of the
ETL process (e.g., after loading data into the
destination system).
(b) Activities: Define process steps and transfor-
mations.
i. Task: Represents a single ETL operation, such
as data cleansing, transformation, or aggrega-
tion.
ii. Subprocess: Groups multiple tasks into a sin-
gle, reusable unit to simplify complex work-
flows.
(c) Gateways: Control the flow of execution by
defining decision points.
i. Exclusive Gateway (XOR): Directs the pro-
cess flow based on conditions (e.g., different
paths for valid vs. invalid data).
ii. Parallel Gateway (AND): Splits or synchro-
nizes multiple concurrent ETL tasks.
(d) Connecting Objects
i. Sequence Flow: Defines the execution order
of tasks within the ETL pipeline.
ii. Message Flow (limited use in Level 1): Repre-
sents interactions with external entities but is
generally reserved for higher modeling levels.
iii. Associations: Link tasks to additional infor-
mation (e.g., documentation or annotations).
2. Swimlanes
(a) Pools: Represent the overall ETL system or a
high-level organizational boundary.
(b) Lanes: Subdivide pools to illustrate responsi-
bilities, such as extraction, transformation, and
loading phases.
3. Artifacts
(a) Data Objects: Represent inputs, intermediate
results, or outputs in the ETL process.
(b) Text Annotations: Provide clarifications and
additional details for process elements.
(c) Groups: Organize related tasks for better visu-
alization and structure.
The descriptive modeling level ensures that ETL
processes are represented in a structured yet accessi-
ble format. By using a limited but powerful set of
BPMN elements, Level 1 diagrams effectively com-
municate the flow of data, transformations, and de-
pendencies while maintaining simplicity. This ap-
proach enables collaboration between business users
and technical teams, fostering a shared understand-
ing of ETL workflows before progressing to more de-
tailed analytical and executable models.
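To make the Level 1 palette concrete, the following sketch shows one possible way to capture a descriptive model of the sales ETL overview as plain data structures. It is an illustration in Python, not part of the proposed notation, and all identifiers and names are hypothetical:

```python
# Minimal, illustrative encoding of a Level 1 (descriptive) BPMN model.
# Element identifiers and names are hypothetical; only the core Level 1
# palette is used: events, subprocesses, gateways, and sequence flows.
from dataclasses import dataclass, field

@dataclass
class Element:
    id: str
    kind: str   # "startEvent", "endEvent", "subProcess", "task", "parallelGateway", ...
    name: str = ""

@dataclass
class DescriptiveModel:
    elements: list = field(default_factory=list)
    sequence_flows: list = field(default_factory=list)   # (source_id, target_id) pairs

    def add(self, id, kind, name=""):
        self.elements.append(Element(id, kind, name))
        return id

    def flow(self, source, target):
        self.sequence_flows.append((source, target))

model = DescriptiveModel()
start = model.add("start", "startEvent", "Daily at 23:00")
split = model.add("gw_split", "parallelGateway")          # dimension loads are independent
dims = [model.add(f"dim_{n.lower().replace(' ', '_')}", "subProcess", f"Load {n} Dimension")
        for n in ("Temporal", "Customer", "City", "Employee", "Stock Item")]
join = model.add("gw_join", "parallelGateway")            # fact load waits for all dimensions
fact = model.add("fact", "subProcess", "Load Sales Fact")
end = model.add("end", "endEvent")

model.flow(start, split)
for d in dims:
    model.flow(split, d)
    model.flow(d, join)
model.flow(join, fact)
model.flow(fact, end)
```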
3.2 Level 2 - Analytical Modeling
Level 2 builds upon the foundational elements intro-
duced in Level 1 by incorporating intermediate events
to model dynamic behaviors within processes and
boundary events to handle task-specific conditions.
While Level 1 relies on a strict sequence where each
step follows the completion of the previous one, Level
2 introduces these elements to allow processes to re-
act to specific conditions or interruptions during ex-
ecution. This capability is particularly relevant in
ETL workflows, where tasks may require compen-
satory actions to handle exceptions or prevent process
failures.
Figure 1: BPMN Elements Used in Level 1 - Descriptive Modeling.
For example, in a data validation scenario,
an unexpected value encountered during a name nor-
malization task might trigger a redirection of affected
records to a quarantine table for further analysis.
Boundary events, depicted as double-lined circles
on the edge of activities, play a crucial role in en-
abling these adaptive behaviors. These events define
alternative process flows based on specific triggers,
ensuring greater fault tolerance and resilience in ETL
pipelines. Key boundary events include:
Timer events, which introduce time-based triggers
to control execution timing.
Message events, which handle external communi-
cations between process components.
Error events, which manage process exceptions
by directing workflows toward predefined com-
pensatory actions.
By leveraging these elements, Level 2 BPMN
models enhance ETL process resilience, addressing
real-world challenges such as data validation errors,
timeouts, and system failures. This approach im-
proves error handling, adaptability, and process au-
tomation, ensuring robust ETL pipeline execution.
Beyond process flow control, Level 2 also formalizes
data interactions through data objects, data stores, as-
sociations, and annotations:
Data Objects represent transient data repositories,
modeling inputs, outputs, and intermediate stor-
age.
Data Stores represent persistent data repositories,
such as database records.
Directed Associations link activities to these ob-
jects, specifying their role in the workflow.
Parameterized Annotations enhance documenta-
tion by explicitly defining input/output relation-
ships and process descriptions, improving model
clarity and implementation accuracy.
For error recovery, BPMN introduces compensa-
tion activities, which serve to reverse or mitigate the
effects of a completed task. These activities are linked
to compensation events, ensuring that corrective ac-
tions restore process integrity before normal execu-
tion resumes. Notably, compensation activities do not
execute by default; they must be explicitly triggered
when a process requires rollback actions. By integrat-
ing these mechanisms, Level 2 BPMN models rein-
force error management and process transparency in
ETL workflows.
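To illustrate the behavior these constructs describe, the fragment below mimics the effect of an error boundary event on a customer-validation task: records violating an assumed rule are diverted to a quarantine store while valid records continue along the normal flow. This is a sketch of the runtime effect, not BPMN execution semantics; the rule, record fields, and store names are assumptions:

```python
# Sketch of the behavior captured by an error boundary event on an ETL task:
# records failing an (assumed) validation rule are diverted to a quarantine
# store, while valid records continue along the normal sequence flow.

def normalize_customer(record):
    """Main task: name normalization; raises on data it cannot handle."""
    if record.get("address") is None:              # hypothetical validation rule
        raise ValueError("null address")
    record["name"] = record["name"].strip().title()
    return record

def load_customers(records, staging, quarantine):
    for record in records:
        try:
            staging.append(normalize_customer(record))
        except ValueError:
            # Boundary error event: alternative flow to the quarantine table (DSA).
            quarantine.append(record)

staging, quarantine = [], []
load_customers(
    [{"name": " ana silva ", "address": "Porto"},
     {"name": "bruno", "address": None}],
    staging, quarantine)
# staging holds the normalized record; quarantine holds the record with a null address.
```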
In Level 2, BPMN elements are used to create
models that support decision-making, process opti-
mization, and analysis. The main focus is on ensuring
that the models are easy to understand and reflect the
real-world process dynamics, without overcomplicat-
ing them with implementation details.
Table 1 highlights the key differences between
BPMN elements used in Level 1 and Level 2. While
no elements are explicitly excluded in Level 2, this
level refines and extends the constructs introduced in
Level 1 by incorporating additional details and be-
havioral nuances. Level 2 introduces intermediate
events, including timer, message, and error events,
which enable processes to respond dynamically to
specific conditions. It also expands control flow capa-
Table 1: Comparison of BPMN Elements in Level 1 and Level 2.
BPMN Element | Used in Level 1? | Used in Level 2? | Notes
Events
Start Event | Yes | Yes | Used in both levels.
End Event | Yes | Yes | Used in both levels.
Intermediate Events | Rarely | Yes | Used in Level 2 for dynamic behavior.
Timer Event | No | Yes | Time-based triggers for workflows.
Message Event | No | Yes | Handles communication between processes.
Error Event | No | Yes | Manages process exceptions.
Activities
Task | Yes | Yes | Used in both levels.
Sub-process | Rarely | Yes | Expanded in Level 2 for modular design.
Ad-hoc Sub-process | No | Yes | Allows dynamic execution of tasks.
Service Task | No | Yes | Represents system-automated tasks.
Manual Task | No | Yes | Represents human-performed tasks.
Gateways
Exclusive Gateway (XOR) | Yes | Yes | Used for decision-making.
Parallel Gateway (AND) | Yes | Yes | Used for parallel execution.
Inclusive Gateway (OR) | No | Yes | Allows multiple paths based on conditions.
Complex Gateway | No | Yes | Used for advanced decision logic.
Event-based Gateway | No | Yes | Decisions based on events.
Data Objects and Flows
Data Object | No | Yes | Represents transient data.
Data Store | No | Yes | Represents persistent data.
Data Input/Output | No | Yes | Models data exchanges.
Sequence and Message Flows
Sequence Flow | Yes | Yes | Used in both levels.
Message Flow | Rarely | Yes | More common in Level 2 for inter-process communication.
Conditional Flow | No | Yes | Represents flows based on conditions.
Default Flow | No | Yes | Indicates the default path in decisions.
Swimlanes
Pool | Yes | Yes | Represents process participants.
Lane | Rarely | Yes | More detailed role distinction in Level 2.
Artifacts
Text Annotation | Rarely | Yes | Used more in Level 2 for documentation.
Group | No | Yes | Helps group related activities.
bilities with complex and inclusive gateways, enhanc-
ing decision-making flexibility. Furthermore, Level 2
formalizes data interactions through data objects and
flows, while also supporting more advanced process
structuring with ad-hoc sub-processes. Artifacts such
as grouping and text annotations improve documen-
tation and process clarity. Unlike Level 1, which pri-
marily focuses on a high-level, descriptive represen-
tation of workflows, Level 2 enhances complexity by
emphasizing process analysis, conditions, data depen-
dencies, and event-driven execution, making it more
suitable for analytical modeling and execution-ready
process definitions.
3.3 Level 3 - Executable Modeling
Level 3 builds upon the descriptive and analytical
foundations established in the previous levels, extend-
ing BPMN representations to facilitate the transition
from conceptual models to deployable ETL work-
flows. Unlike Levels 1 and 2, which focus on high-
level abstraction and dynamic behavior modeling,
Level 3 introduces execution-related elements nec-
essary for the operationalization of ETL processes.
However, rather than directly executing BPMN mod-
els, this level assumes integration with specialized ex-
ecution engines, such as ETL platforms and workflow
automation tools, which interpret and implement the
defined logic in a structured, repeatable manner.
Executable modeling at this level emphasizes
the precise specification of process elements to en-
sure consistency in implementation. Key compo-
nents include service tasks, script tasks, and busi-
ness rule tasks, which serve as placeholders for data
transformation operations, validations, and orchestra-
tions. These tasks are augmented with technical de-
tails, such as parameterized configurations, metadata-
driven transformations, and control flow directives,
aligning the conceptual model with practical execu-
tion requirements. Furthermore, explicit data map-
pings, connections to external systems, and integra-
tion with logging and monitoring frameworks ensure
a seamless transition from design to deployment.
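One way to read such a specification is as technology-neutral configuration data handed to an execution engine. The sketch below is an assumption rather than a standard BPMN serialization; an ETL platform or workflow engine would translate it into its own job format:

```python
# Hypothetical, engine-neutral specification of a Level 3 service task.
# An ETL platform or workflow engine would translate this into its own job
# format; none of the names below come from a specific tool.
service_task = {
    "id": "cdc_customer",
    "type": "serviceTask",
    "name": "CDC Customer",
    "configuration": {                       # parameterized configuration
        "connection": "SourceDB",            # logical name resolved at deployment time
        "command": "EXEC newCustomers(:lastETLCutOffTime)",
        "parameters": {"lastETLCutOffTime": "${process.cutoff}"},
    },
    "dataMapping": {                         # explicit data mappings
        "inputs": ["Customers"],
        "outputs": ["DSA.Customers"],
    },
    "onError": {"boundaryEvent": "error", "target": "DSA.CustomerQuarantine"},
    "monitoring": {"logTo": "ETL_Lineage"},  # integration with logging/monitoring
}
```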
The composition principles established in earlier
levels remain fundamental in Level 3, ensuring that
BPMN diagrams retain their clarity and expressive-
ness despite the increased complexity introduced by
executable elements. Subprocesses are leveraged to
encapsulate reusable logic, reducing redundancy and
enhancing maintainability, while event-driven trig-
gers enable real-time adaptations to changes in data
quality, availability, or external conditions. Compen-
sation mechanisms and fault tolerance strategies are
explicitly modeled to handle process failures, ensur-
ing robust execution of ETL workflows.
Despite its focus on execution, Level 3 remains
a conceptual modeling stage, distinguishing itself
from physical implementations by abstracting low-
level configurations such as SQL scripts, API calls, or
system-specific ETL operations (considering the ETL
context). Instead, it provides a structured represen-
tation of execution logic that can be translated into
platform-specific implementations while maintaining
adherence to best practices for ETL modeling.
4 CASE STUDY
The case study involves a Data Warehouse (DW)
structured around the "Sales" fact table,
where each record represents a sale of a stock item.
The dimensions in the Sales DW include:
Temporal: Dimension with daily granularity.
Stock Item: Represents details about stock items.
Customer: Stores customer data and acts as a role-playing dimension with two roles: the customer making the purchase and the one receiving the invoice.
Employee: Contains information about employ-
ees involved in sales.
City: Represents geographic location details.
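For reference, a compact sketch of the resulting star schema is given below; attribute names are illustrative, and the role-playing Customer dimension appears under two keys:

```python
# Illustrative outline of the Sales star schema described above; attribute
# names are indicative only. The role-playing Customer dimension is referenced
# twice by the fact table through two different keys.
sales_star_schema = {
    "fact": {
        "Sales": {
            "measures": ["quantity", "tax_amount", "profit"],
            "dimension_keys": [
                "date_key",               # Temporal (daily granularity)
                "stock_item_key",
                "customer_key",           # role: customer making the purchase
                "bill_to_customer_key",   # role: customer receiving the invoice
                "employee_key",
                "city_key",
            ],
        }
    },
    "dimensions": ["Temporal", "Stock Item", "Customer", "Employee", "City"],
}
```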
The modeling follows a top-down sequence of steps:
Creating a high-level map to understand the entire
ETL process.
Developing a top-level BPMN diagram based on
this map.
Expanding the top-level diagrams into detailed
child-level diagrams.
Adding message flows to refine the process inter-
actions.
For each fact, a set of measures is defined, e.g.,
quantity, tax amount or profit. Based on the approach
described in the previous section, a proposal for ETL
process modeling is presented, focusing on the first
two levels of modeling using BPMN notation: Level 1
- Descriptive Modeling and Level 2 - Analytical Mod-
eling. Descriptive Modeling (Level 1) serves as an
initial representation of the ETL process, highlight-
ing its key activities with the primary goal of docu-
menting the process flow in a simple and clear man-
ner. Using Level 1 elements and applying the pro-
posed methodology, a top-down modeling approach
is suggested.
The first step in modeling the ETL process is
defining its scope. The process runs daily at 11:00
PM, concluding once data is loaded into the Sales
fact table. Each instance represents a Sales schema
population cycle, with a single predefined end event.
The ETL process follows a top-down approach, start-
ing with a high-level overview and refining it into de-
tailed BPMN diagrams. Figure 2 illustrates a concep-
tual model providing an overview of the ETL process,
specifically depicting the data load into the dimension
tables: Temporal, City, Customer, Employee, and
Product. To ensure referential integrity, dimensions
are populated first, followed by the fact tables—in
this case, the sales fact table. This dependency is rep-
resented using Parallel Gateways, which do not nec-
essarily enforce parallel execution but rather indicate
the independence of tasks.
Following a hierarchical modeling approach, each child-level process is depicted in a separate diagram, linked to a subprocess activity referenced in the high-level diagram. The subprocess "Load Customer Dimension" is presented in Figure 3. The detailed subprocess for Customer Dimension loading begins with three support tasks:
Update Lineage Table: This activity logs the start
time of the ETL process, capturing when the data
loading operation begins. Once the process com-
pletes, the lineage table is updated again to record
the completion time, ensuring a detailed audit trail
of the data ingestion process.
Clear Staging Table: This activity ensures that
the staging area is properly prepared for new
data ingestion by removing outdated or temporary
records from previous ETL runs. Since the stag-
ing table acts as an intermediary storage space be-
fore loading data into the target system, clearing
it prevents duplication, inconsistencies, and data
conflicts.
Get Last ETL Cutoff Time: This step retrieves the
cutoff timestamp, which defines the starting point
for extracting new or updated customer records
from the source system. The cutoff time is stored
in a metadata or control table and represents the
last successful ETL execution. By using this
timestamp, the process ensures that only new or
modified customer records since the last data load
are extracted, optimizing performance and pre-
venting redundant data processing.
Figure 2: BPMN general conceptual model - Level 1.
These support tasks are followed by the main tasks (a sketch of the full pattern follows):
Extract Customer Data: Identifies and loads customer data.
Load Customer Dimension: Stores transformed data into the target dimension.
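A minimal sketch of this support-task pattern is given below. The db helper object, table names, and queries are hypothetical, and the actual loads would be expressed in SQL or in the chosen ETL tool:

```python
# Sketch of the lineage/cutoff support-task pattern around the customer load.
# The db helper, table names, and queries are assumptions for illustration.
from datetime import datetime

def load_customer_dimension(db):
    run_start = datetime.now()
    # Update Lineage Table: record the start of this run for auditing.
    db.execute("INSERT INTO ETL_Lineage (process, started_at) VALUES (?, ?)",
               ("LoadCustomerDimension", run_start))
    # Clear Staging Table: remove leftovers from previous runs.
    db.execute("DELETE FROM DSA.Customers")
    # Get Last ETL Cutoff Time: starting point for incremental extraction.
    cutoff = db.query_one("SELECT last_cutoff FROM ETL_Control "
                          "WHERE process = 'LoadCustomerDimension'")

    # Extract Customer Data: only records new or changed since the cutoff.
    rows = db.query("SELECT * FROM Customers WHERE modified_at > ?", (cutoff,))
    db.bulk_insert("DSA.Customers", rows)

    # ... Transform Customer Data and Load Customer Dimension would follow ...

    # Update Last ETL CutOff Time and close the lineage record.
    db.execute("UPDATE ETL_Control SET last_cutoff = ? "
               "WHERE process = 'LoadCustomerDimension'", (run_start,))
    db.execute("UPDATE ETL_Lineage SET finished_at = ? "
               "WHERE process = ? AND started_at = ?",
               (datetime.now(), "LoadCustomerDimension", run_start))
```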
Subprocesses in BPMN are essential for creating
hierarchical abstraction levels, simplifying model rep-
resentation through a top-down approach. Any task
that can be further decomposed should be modeled
as a subprocess, while atomic activities should re-
main as individual tasks. The ”Load Customer Data
to DSA” subprocess consists of identifying changes
in customer data for further transformation and cleaning ("Transform Customer Data"). The "Load Customer Dimension" subprocess loads the data into the dimension, and the Lineage Table is updated.
The expansion of the ”Extract Customer Data”
subprocess is shown in Figure 4. It consists of
four subprocesses focused on identifying changes re-
lated to customers, including changes in the purchase groups table, the customer categories table, and the customer records table. These activ-
ities are powered by Change Data Capture (CDC), a
technique used to efficiently detect and track changes
in source tables. CDC ensures that only the new
or modified data since the last update is captured,
preventing the unnecessary extraction of unchanged
data and improving the overall performance of the
ETL process. These tasks ensure that only the rele-
vant customer data (new or changed) is extracted and
staged for further processing, allowing efficient data
management and ensuring the target database reflects
the most up-to-date customer information. The final
tasks involve update CutOff time and the Lineage ta-
ble. Level 2 Modeling, or Analytical Modeling, in-
volves a detailed representation of the ETL process
designed for interpretation by advanced users. As the
level of abstraction increases, so does the complexity
of the model. To elaborate on the details of the ETL
flow, new elements, such as annotations and data ob-
jects, are added. Annotation elements should include the following: input data for the activity (input), a description of the task's functionality (description), and output data from the activity (output). Each process
should be documented in as much detail as possible,
which will significantly facilitate the physical imple-
mentation and interpretation of the process. The Level 2 method involves adding data objects to the already
implemented diagrams. These objects represent the
states of data flowing within a process, whether they
are data inputs, data outputs, or data storage.
Figure 5 provides a detailed view of the concep-
tual model shown in Figure 3 for Load Customer Data
to DSA. In this expanded version, database objects
corresponding to the data sources and destinations in-
volved in the Change Data Capture (CDC) tasks are
included. This expansion enhances the model by of-
fering a more granular representation of the process
flow, showcasing how data is captured, transformed,
and loaded between various system components, in-
cluding the source and target databases. Additionally,
Figure 3: Load Customer Data sub-process Conceptual Model.
Figure 4: Extract Customer Data sub-process Conceptual Model.
parameterized annotations are incorporated from the
child-level expansion to provide implementers with a
clearer understanding of the ETL activities. These an-
notations offer valuable details, including the purpose
of each task, the input data, and the expected output
data. The ”Command” field is also utilized to add
specific metadata, which can be leveraged later dur-
ing the actual process implementation.
A compensation intermediate event is applied to
the ”CDC Customer” task, indicating that if a cus-
tomer’s address is found to be null, that customer’s
record should be moved to a quarantine table in
the Data Staging Area (DSA). This quarantine table
serves as a temporary holding area for records that
require further validation or cleansing before being
processed further in the ETL pipeline. By isolating
problematic records in the quarantine table, the ETL
process can continue without disruption, while en-
suring that data integrity is maintained. In addition
to the Input, Output, Description, and Command fields used in annotations to support ETL tasks in Level 2 BPMN modeling, several other fields can be added to further enrich the ETL annotation model. For example:
Transformation Logic: Describes the specific
rules or logic applied to the data during the
”Transform” phase of ETL. This can include data
cleansing, formatting, aggregations, or any other
transformations applied to the raw input data. Example: "Normalize values between 0 and 100", "Merge data from two sources", or "Remove duplicates";
Error Handling: Details how errors or exceptions
are managed within the ETL process. This is im-
portant for ensuring that the ETL process can re-
cover from failures or issues. Example: "Address IS NOT NULL";
Data Source/Target: Specifies the source system
and the target system for data being extracted
or loaded. This is useful in an ETL process
where data flows between different systems or
databases. Example: "Customer.address1 -> DimCustomer.City".
Performance Metrics: Provides information about
performance expectations or key metrics for the
ETL task. This can help track efficiency and op-
timize the ETL process. Example: "Max allowable processing time: 2 hours".
Figure 5: Load Customer Data to DSA sub-process Conceptual Model. (The annotation attached to the "CDC Buying Groups" task reads: Input: BuyingGroups(BuyingGroupID, BuyingGroupName, AddedDate), Process(lastETLCutOffTime); Description: Identify new buying groups; Output: BuyingGroupID, BuyingGroupName; Command: EXEC newBuyingGroups(lastETLCutOffTime).)
These annotations can also be structured using
common notations, ensuring that the information is
both human-readable and machine-executable. A
JSON structure can serve as an intermediate represen-
tation, dynamically generating executable code or, at
the very least, a skeleton of an executable model. This
structured approach enables automation, consistency,
and maintainability in ETL processes. A concrete
example of this approach can be seen in SSIS, one
of the most widely used traditional ETL tools. Figure 6 presents a JSON representation of the annotation for "CDC Buying Groups". It can be translated to a physical model us-
ing BIML (Business Intelligence Markup Language).
BIML is an XML-based language that allows devel-
opers to define SSIS packages programmatically. By
structuring annotations in JSON, we can automate the
translation into BIML, which in turn generates a fully
functional SSIS package.
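As a hedged sketch of this idea (the structure below is illustrative rather than the exact JSON of Figure 6), the annotation attached to "CDC Buying Groups" in Figure 5 can be held as structured data, serialized to JSON, and used to emit a package skeleton; the XML shown is BIML-flavored but simplified and not validated against the real BIML schema:

```python
import json
from xml.sax.saxutils import escape

# Structured annotation for the "CDC Buying Groups" task, mirroring the
# Input / Description / Output / Command fields shown in Figure 5.
annotation = {
    "task": "CDC Buying Groups",
    "input": ["BuyingGroups(BuyingGroupID, BuyingGroupName, AddedDate)",
              "Process(lastETLCutOffTime)"],
    "description": "Identify new buying groups",
    "output": ["BuyingGroupID", "BuyingGroupName"],
    "command": "EXEC newBuyingGroups(lastETLCutOffTime)",
}
print(json.dumps(annotation, indent=2))   # machine-readable form of the annotation

# Emit a simplified, BIML-flavored package skeleton from the annotation.
# Element names are indicative only and not validated against the BIML schema.
skeleton = f"""<Packages>
  <Package Name="{escape(annotation['task'])}">
    <Tasks>
      <ExecuteSQL Name="{escape(annotation['description'])}">
        <DirectInput>{escape(annotation['command'])}</DirectInput>
      </ExecuteSQL>
    </Tasks>
  </Package>
</Packages>"""
print(skeleton)
```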
In ETL modeling, data annotations need to be
highly specific to ensure that the transformation logic,
error handling, and data mappings are clear and exe-
cutable. Unstructured text is generally avoided be-
cause it introduces ambiguity, making it difficult to
translate into executable primitives. When annota-
tions are too vague or flexible, misunderstandings
can arise, leading to inconsistencies between con-
ceptual models and their physical implementations.
On the other hand, if we over-detail data annota-
tions, we essentially create a custom configuration
language. While this increases precision, it also in-
troduces complexity—users may find that once a lan-
guage becomes too structured, it is more efficient to
directly use well-known technical languages such as
SQL, Python, or specific ETL scripting languages.
This creates a trade-off between abstraction and us-
ability. In the age of Artificial Intelligence (AI), more
abstract annotations can be leveraged as prompts for
AI agents. These agents can validate the annotations
against execution requirements, disambiguate unclear
transformations or mappings, and assist users in gen-
erating physical ETL models from high-level specifi-
cations. AI can also work in reverse: once a physical
model (e.g., a Microsoft Integration Services - SSIS
package) is generated, a trained AI agent can map it
back to conceptual primitives, ensuring that the doc-
umentation remains synchronized with the evolving
project. This bi-directional mapping helps maintain
consistency and improves collaboration between business users and technical teams. This two-way mapping (from conceptual to executable and vice versa) helps in model-driven development, where business logic and technical implementation stay in sync.
Figure 6: Example JSON structure for representing data annotations.
5 CONCLUSIONS
Data pipelines are often the most complex and time-
consuming aspect of building data systems. Many
tools are available, each offering specific features to
facilitate data processing, but the complexity of the
processes still presents challenges. Activities like ex-
traction, transformation, and loading (ETL) can be
represented in various ways, often implemented in
multiple languages, leading to intricate designs.
Designing an efficient, error-free data pipeline is
a resource-intensive task, and despite extensive re-
search, there is no consensus on best practices for
modeling them. Different methods, including UML,
BPMN, and proprietary models, have been proposed,
but no standardized structure for data pipeline work-
flows exists. A structure that reduces the program-
ming workload in design, optimization, and main-
tenance is necessary, offering recommendations for
aligning with business requirements and enhancing
performance.
BPMN provides a method for translating busi-
ness requirements into a conceptual model, indepen-
dent of specific tools. However, no single standard
has emerged for conceptualizing data pipelines, and
existing approaches require significant developer in-
put and business user validation. BPMN simpli-
fies data pipeline representation with a well-known
language, focusing on core process aspects without
technical details. However, it can create inconsis-
tencies, as the same process can be modeled dif-
ferently. A structured approach reduces redundancy
and enhances clarity, as validated in this work in the context of data pipelines.
The proposed approach offers general rules for
solving problems, regardless of context or implemen-
tation tool. It bridges the conceptual model and phys-
ical implementation, providing detailed documenta-
tion for both business and advanced users. The ap-
proach draws from pioneering work in BPMN ap-
plied to data pipelines, including three levels of rep-
resentation: Descriptive, Analytical, and Executable
Modeling. BPMN enables multi-perspective design
but may introduce ambiguity in execution, which this
work aims to address through a methodology offering
guidelines and best practices. Future research could
involve modeling real-world pipeline scenarios to val-
idate and refine the approach.
The approach presented can also help to bridge the
gap between conceptual models and physical plans,
enhancing the value of conceptual modeling in under-
standing and implementing data pipelines. It provides
a solid foundation for physical implementation, which
can be enriched with ”physical” details while keeping
process logic at higher levels. While this paper ex-
plores how structured metadata can be leveraged to
generate ETL implementations (e.g., using BIML for
SSIS), it does not cover AI-driven translation due to
scope and space limitations. However, future work
could focus on:
Developing AI models capable of translating
physical ETL implementations back into concep-
tual primitives.
Creating AI-assisted ETL design tools that inter-
actively guide users in refining and validating an-
notations.
Exploring model synchronization techniques to
ensure that business logic and technical imple-
mentations remain aligned throughout the project
lifecycle.
This integration of AI-driven ETL modeling could
significantly improve ETL design, making it more
flexible, automated, and easier to maintain over time.
ACKNOWLEDGEMENTS
This work has been supported by national funds
through FCT - Fundação para a Ciência e Tecnologia through projects UIDB/04728/2020 and UIDP/04728/2020.
REFERENCES
Aagesen, G. and Krogstie, J. (2015). BPMN 2.0 for Mod-
eling Business Processes, pages 219–250. Springer
Berlin Heidelberg.
Akkaoui, Z. E. and Zimanyi, E. (2009). Defining etl workflows using bpmn and bpel. In Proceedings of the ACM
twelfth international workshop on Data warehousing
and OLAP DOLAP 09, pages 41–48. ACM.
Biswas, N., Chattapadhyay, S., Mahapatra, G., Chatterjee,
S., and Mondal, K. C. (2019). A new approach for
conceptual extraction-transformation-loading process
modeling. International Journal of Ambient Comput-
ing and Intelligence, 10:30–45.
Biswas, N., Chattopadhyay, S., and Mahapatra, G. (2017).
Sysml based conceptual etl process modeling. In
International Conference on Computational Intel-
ligence, Communications, and Business Analytics,
pages 242–255. Springer Singapore.
Dayal, U., Castellanos, M., Simitsis, A., and Wilkinson,
K. (2009). Data integration flows for business in-
telligence. In Proceedings of the 12th International
Conference on Extending Database Technology: Ad-
vances in Database Technology, pages 1–11. ACM.
Dupor, S. and Jovanović, V. (2014). An approach to concep-
tual modelling of etl processes. In 37th International
Convention on Information and Communication Tech-
nology, Electronics and Microelectronics (MIPRO).
IEEE.
Inmon, B. (2016). Data Lake Architecture: Designing the
Data Lake and Avoiding the Garbage Dump. Technics
Publications; 1st edition.
Janssen, M., van der Voort, H., and Wahyudi, A. (2017).
Factors influencing big data decision-making quality.
Journal of Business Research, 70:338–345.
Kimball, R. and Caserta, J. (2004). The Data Warehouse
ETL Toolkit: Practical Techniques for Extracting,
Cleaning, Conforming, and Delivering Data. John
Wiley & Sons, Inc.
Lenzerini, M. (2002). Data integration. In Proceedings
of the twenty-first ACM SIGMOD-SIGACT-SIGART
symposium on Principles of database systems, pages
233–246. ACM.
Munappy, A. R., Bosch, J., and Olsson, H. H. (2020). Data
Pipeline Management in Practice: Challenges and
Opportunities, pages 168–184. Springer-Verlag.
Nwokeji, J. C. and Matovu, R. (2021). A Systematic Liter-
ature Review on Big Data Extraction, Transformation
and Loading (ETL), pages 308–324. Springer, Cham.
Oliveira, B., Oliveira, O., and Belo, O. (2021). Using bpmn
for etl conceptual modelling: A case study. In Van-
DerAalst, W., editor, Proceedings of the 10th Interna-
tional Conference on Data Science, Technology and
Applications (DATA), pages 267–274. SCITEPRESS.
Oliveira, B., Santos, V., Gomes, C., Marques, R., and
Belo, O. (2015). Conceptual-physical bridging - from
bpmn models to physical implementations on ket-
tle. In Daniel, F. and Zugal, S., editors, CEUR
Workshop Proceedings, volume 1418, pages 55–59.
CEUR-WS.org.
Prakash, G. H. and Rangdale, S. (2017). Etl data con-
version: Extraction, transformation and loading data
conversion. International Journal of Engineering and
Computer Science, 6:22545–22550.
Raj, A., Bosch, J., Olsson, H. H., and Wang, T. J. (2020).
Modelling data pipelines. In 46th Euromicro Confer-
ence on Software Engineering and Advanced Applica-
tions, SEAA, pages 13–20. IEEE.
Silver, B. (2011). Bpmn Method and Style: A Levels-Based
Methodology for Bpm Process Modeling and Improve-
ment Using Bpmn 2.0. Cody-Cassidy Press, second
edition edition.
Soffer, P., Kaner, M., and Wand, Y. (2012). Towards Un-
derstanding the Process of Process Modeling: Theo-
retical and Empirical Considerations, pages 357–369.
Springer, Berlin, Heidelberg.
Souibgui, M., Atigui, F., Zammali, S., Cherfi, S., and Yahia,
S. B. (2019). Data quality in etl process: A prelim-
inary study. Procedia Computer Science, 159:676–
687.
Trujillo, J. and Luján-Mora, S. (2003). A uml based approach
for modeling etl processes in data warehouses. Con-
ceptual Modeling - ER 2003 - Lecture Notes in Com-
puter Science, 2813:307–320.
Vassiliadis, P., Simitsis, A., Georgantas, P., Terrovitis, M.,
and Skiadopoulos, S. (2005). A generic and customiz-
able framework for the design of etl scenarios. Infor-
mation Systems, 30:492–525.
Yaqoob, I., Hashem, I. A. T., Gani, A., Mokhtar, S., Ahmed,
E., Anuar, N. B., and Vasilakos, A. V. (2016). Big
data: From beginning to future. International Journal
of Information Management, 36:1231–1247.