Managing Evolution of Heterogeneous Data Sources of a Data Warehouse
Darja Solodovnikova, Laila Niedrite and Lauma Svilpe
Faculty of Computing, University of Latvia, Raina blvd. 19, Riga, Latvia
Keywords:
Data Warehouse, Evolution, Change Propagation, Metadata, Big Data.
Abstract:
The evolution of heterogeneous integrated data sources has become a topical issue as data nowadays is very
diverse and dynamic. For this reason, novel methods are necessary to collect, store and analyze data from
various data sources as efficiently as possible, while also handling changes in data structures that have occurred
as a result of evolution. In this paper, we propose a solution to problems caused by the evolution of integrated data
sources and of information requirements for the data analysis system. Our solution incorporates an architecture
that makes it possible to perform OLAP operations and other kinds of analysis on integrated big data, and a mechanism for
detecting changes that have occurred in heterogeneous data sources and propagating them automatically or semi-automatically
to a data warehouse.
1 INTRODUCTION
Over the last few years, due to the increase in the volume of data stored in data warehouses and the emergence
of new kinds of data sources, such as unstructured and semi-structured data, new challenges related to
data storage, maintenance and usage have arisen. Therefore,
novel efficient algorithms for the integration and analysis of heterogeneous data are necessary for the development
of data warehouses. Moreover, even when data sources are already integrated
into a data warehouse system, automated solutions are needed to deal with data
variability and evolution problems.
Traditional methods applied in relational
databases may not be leveraged for new types
of data; therefore, tools, technologies, and frameworks have been introduced to support large-scale
data analytics, such as the Apache Hadoop file system,
the Apache HBase database management solution, the Hive
data warehouse solution and others. These tools are
mainly aimed at managing data growth, leaving
the problems caused by the evolution of data and their
structure unresolved. Hence, handling various types
of changes in data structure still requires a large
amount of manual work on the part of the developer,
as existing solutions do not support automatic or
semi-automatic propagation of changes in data
sources to a data warehouse.
To address the evolution problem, we propose a
solution for handling changes caused by the evolution of
heterogeneous integrated data sources. Our solution
includes a data warehouse architecture that, on the one
hand, supports various types of analysis of data
integrated in a data warehouse and, on the other hand,
is able to discover changes in structured and semi-structured
data sources and propagate them automatically or semi-automatically
through the system to maintain continuous system operation.
The rest of this paper is organized as follows. In
Section 2 the recent studies related to the topic are
discussed. In Section 3 we outline the proposed data
warehouse architecture. Section 4 is dedicated to the
description of the case study system and the running
example we use throughout the paper to illustrate our
solution. We overview the metadata employed in our
solution to handle evolution in Section 5. The main
contribution of this paper is presented in Section 6,
where we discuss the adaptation scenarios for changes
in data sources and present the mechanism for change
handling. Finally, we conclude with directions for fu-
ture work in Section 7.
2 RELATED WORK
The problem of data warehouse evolution has been
studied extensively in relational database environ-
ments. There are in general two approaches to solv-
ing evolution problems. One approach is to adapt just
the existing data warehouse schema (Bentayeb et al.,
2008) or ETL processes (Wojciechowski, 2018) without keeping the history of changes, and another approach
(Ahmed et al., 2014), (Golfarelli et al., 2006),
(Malinowski and Zimányi, 2008) is to maintain multiple versions of the schema, each valid during some
period of time.
The topicality of evolution problems in big data
environments is discussed in several recent review papers. The authors in (Kaisler et al., 2013)
mention dynamic design challenges for big data applications, which include data expansion that occurs
when data becomes more detailed. A review paper
(Cuzzocrea et al., 2013) indicates research directions
in the field of data warehousing and OLAP. Among
others, the authors mention the problem of designing
OLAP cubes according to user requirements. Another
recent vision paper (Holubová et al., 2019) discusses
the variety of big data stored in multi-model polystore architectures and suggests that efficient management of schema evolution and propagation of schema
changes to affected parts of the system is a complex
task and one of the topical issues.
We have also found several studies that deal with
evolution problems in the big data context. A solution for
handling data source evolution in the data integration
field was presented in the paper (Nadal et al.,
2019). The authors propose the big data integration
ontology for the definition of the integrated schema, source schemata,
their versions and local-as-view mappings between
them. When a change at a data source occurs, the
ontology is supplemented with a new release that reflects the change. Our approach differs in that the proposed architecture is OLAP-oriented and is capable of
handling not only changes in data sources, but also changes in
requirements.
Another study that considers evolution is pre-
sented in the paper (Chen, 2010). The author pro-
poses a data warehouse solution for big data analysis
implemented using the MapReduce paradigm. The
system supports two kinds of changes: slowly chang-
ing dimensions are managed with methods proposed
in (Kimball and Ross, 2019) and fact table changes
are handled by schema versions in metadata. Unlike
our proposal, the system does not process changes in
big data sources.
An architecture that exploits big data technolo-
gies for large-scale OLAP analytics is presented in the
paper (Sumbaly et al., 2013). The architecture sup-
ports source data evolution by means of maintaining
a schema registry and enforcing the schema to remain
the same or compatible with the desired structure.
A recent study presented in (Wang et al., 2020) is dedicated to evolution problems in heterogeneous integrated data sources. The authors propose to use deep learning to automatically deal with
schema changes in such sources.
For our solution, we adapted the metadata model
proposed in (Quix et al., 2016) to describe data
sources of a data lake. The authors distinguish three
types of metadata: structure metadata that describe
schemata of data sources, metadata properties and se-
mantic metadata that contain annotations of source el-
ements. In our approach, we extended the model with
metadata necessary for evolution support.
3 DATA WAREHOUSE
ARCHITECTURE
To solve problems caused by evolution, we propose a data warehouse architecture for the analysis of
big data, which supports OLAP operations and other
types of analysis on integrated data sources and
includes algorithms for detecting and handling various changes in structured, semi-structured, and unstructured data sources of a data warehouse and in information requirements. A detailed description of
the architecture is given in the paper (Solodovnikova
and Niedrite, 2018). The architecture consists of various components that provide data flow and processing from the source level to the data stored in the data
warehouse. The interaction of these components is
shown in Figure 1.
The basic components of the data warehouse ar-
chitecture are data sources, data highway, metastore
and adaptation component. At the source level, data
is obtained from various heterogeneous sources (in-
cluding big data sources) and loaded into the first
level of the data highway (a data processing pipeline
that may be considered a data lake) in its original format for further processing. Because big data
includes different types of data, the system supports
structured sources (database tables), semi-structured
sources (such as XML, JSON or CSV format), and
unstructured data (log files, photos, videos).
The data highway consists of several levels. The
idea and the concept of the data highway was first
presented in (Kimball and Ross, 2019). At the first
level, raw data is stored. The data for each subse-
quent data highway level is obtained from the pre-
vious level by performing transformations, aggrega-
tions and integrating separate data sets. Usually, data
at the later levels is updated less frequently. The num-
ber of levels, their contents and the frequency of their
updating are determined by the requirements of the
particular system. The final level of the highway is
a data warehouse which stores structured aggregated
multidimensional data.

Figure 1: Data warehouse architecture.
Business analysts can access already processed
data from pre-calculated OLAP cubes introduced to
improve query performance, while data analysts can analyze data at various levels of the data highway and
are involved in the data retrieval and transformation
process. Developers access metadata in the metastore
via the metadata management tool.
A unique feature of the architecture is the adaptation component, which is aimed at handling changes in
data sources or at other levels of the data highway. The
metadata management tool is intended for users, and
the adaptation component is integrated with that tool.
4 CASE STUDY
As a proof of concept, we have applied the solution
presented in this paper to the data warehouse that
integrates data on research publications authored by
the faculty of our university. In this section, we will
briefly describe the case study system and present a
running example of a change.
4.1 Publication Data Warehouse
Data about publications are integrated from four data
sources. Structured data about faculty (authors of ar-
ticles) are acquired from the university information
system. Semi-structured data are gathered in XML
and JSON formats from the library information sys-
tem and indexing databases Scopus and Web of Sci-
ence. The system architecture consists of three lev-
els of the data highway, the last of them being a data
warehouse storing data from all four sources fully in-
tegrated within the first and second levels of the high-
way.
During the operation of the publication data ware-
house, several changes in data sources and data high-
way levels were detected by the adaptation compo-
nent or introduced manually via the metadata man-
agement tool. Such real-world changes include the addition of new data items and the removal of existing data
items in data sources, the addition of a new data source,
and a change in the value of a data set property. We
have also emulated other types of changes described
in Section 6 in order to practically verify our solution.
4.2 Running Example
In order to demonstrate our approach in the following sections of this paper, we will use a running example of
a change that occurred in the publication data warehouse. A new XML element citeScoreYearInfoList
was added to the XML document Scopus metrics obtained from SCOPUS. It was composed of several
subelements that were also absent in the previously
gathered documents. Before the change, data about
other Scopus metrics were used in the data warehouse to evaluate publications. The new metric defined by SCOPUS should also be taken into account in
this evaluation.
5 METADATA
The operation of the data warehouse architecture is
mainly based on the data in the metadata repository.
Using a metadata management tool and defining dif-
ferent metadata, the developer determines how the
system will work. The metadata repository stores six
types of interconnected metadata. Schematic meta-
data describe schemata of data sets stored at different
levels of the highway. Mapping metadata define the
logic of ETL processes. Information about changes in
data sources and data highway levels is accumulated
in the evolution metadata. Cube metadata describe
schemata of precomputed cubes. Adaptation meta-
data accumulate proposed changes in the data ware-
house schema. Finally, adaptation rules store addi-
tional information provided by the developer required
for change propagation.
5.1 Schematic, Mapping and Evolution
Metadata
To describe the schemata of data sources and data highway levels necessary for the analysis, along with
changes in the structure and other properties of the involved
data sets, we developed the metadata shown in Figure 2. In this section, we briefly describe the physical
implementation of the schematic, mapping and evolution metadata. A detailed description of these types of
metadata is given in the paper (Solodovnikova et al.,
2019).
5.1.1 Schematic Metadata and Mappings
The table DataSet represents a collection of
DataItems that are individual pieces of data. Data sets
are assigned FormatTypes that are stored in the table
Type and grouped into parent types. Parent types and
corresponding format types currently supported in the
system are defined in Table 1. Data items are also
assigned types, such as table columns, XML elements
or attributes, objects or arrays in JSON, keywords or
tags that describe unstructured data sets, and others.
Table 1: Format types of data sets.

Parent Type               Format Types
Structured data set       Table
Semi-structured data set  XML, JSON, CSV, RDF, HTML, Key-Value
Unstructured data set     Text, image, other multimedia
A data set can be obtained from a DataSource or it can
be part of a DataHighwayLevel. In addition to data set
formats, information on the Frequency and Velocity of
data set retrieval (batch, near real-time, real-time or
stream) is also stored. If a data set is a part
of a data warehouse, a Role in the multidimensional
model is assigned to such a data set (dimension or fact)
and to the corresponding data items (attribute or measure).
If there is a link between data items within
the same data set or across different data sets, the tables Relationship and RelationshipElement are used to connect
the respective child and parent objects. The types of
relationships supported include composition, foreign
key and predicate.
Data sets in the system are either extracted from
data sources or obtained from other levels of the data
highway; thus, information about the provenance of
data sets within the data highway must be maintained
to make it possible to follow their lineage and implement the change handling process. For this purpose, mappings were introduced in the metadata. A record in
the table Mapping indicates a transformation (saved
in the column Operation) that is used to obtain a data
item at the next data highway level from data items at
the previous levels associated via the table MappingOrigin.
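To make the structure of these metadata more concrete, the following Python sketch shows a simplified in-memory view of the schematic and mapping metadata. The class layout and field names are illustrative assumptions that mirror Figure 2; they do not reproduce the actual implementation of the metastore.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DataItem:
    name: str
    item_type: str               # e.g. "column", "XML element", "JSON object"
    role: Optional[str] = None   # "attribute" or "measure" if part of the data warehouse

@dataclass
class DataSet:
    name: str
    format_type: str                 # one of the format types listed in Table 1
    frequency: Optional[str] = None
    velocity: Optional[str] = None   # batch, near real-time, real-time or stream
    role: Optional[str] = None       # "dimension" or "fact" if part of the data warehouse
    items: List[DataItem] = field(default_factory=list)

@dataclass
class Mapping:
    target: DataItem                 # data item at the next data highway level
    operation: str                   # transformation saved in the column Operation
    origins: List[DataItem] = field(default_factory=list)  # rows of MappingOrigin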
5.1.2 Metadata Properties
Even though several common properties of data sets
and items, such as types, velocity and frequency, are explicitly reflected in our metadata model, there still
might be other characteristics of data that need to be stored. Examples of such properties include
file names, sizes, character sets, check constraints,
data types, the mechanism used to retrieve data from a
data source (for instance, an API request), etc. To this
end, we included the table MetadataProperty. Any
name:value pair relevant to a schema element present
in the model can be added to the metadata. Considering that various elements may possess different properties, such an approach allows for some flexibility. Furthermore, it allows storing user-defined properties or
so-called conversational metadata associated with the
Author who recorded the property.
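As a minimal, hypothetical illustration (the property names and values below are invented for the example and do not form a fixed vocabulary), such name:value pairs could be pictured as follows:

# Hypothetical name:value properties attached to a schema element; in the metastore
# each pair is one row of MetadataProperty referencing the element and, optionally,
# the Author who recorded it.
properties = {
    ("DataSet", "Scopus metrics"): {
        "fileName": "scopus_metrics.xml",        # example file-level property
        "characterSet": "UTF-8",                 # example technical property
        "retrievalMechanism": "API request",     # how data is pulled from the source
    },
}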
5.1.3 Evolution Metadata
Thus far, we have discussed the schematic and mapping metadata; however, information on the evolution
of data sets must also be maintained in the metastore.
Hence, we identified a set of change types that are
discussed in more detail in Section 6.1, and
we included the table Change intended for recording
changes that have been automatically discovered or
introduced manually. Every record of this table stores
the date and time of the change, its Type and Status (new,
propagated or in process of propagation) and is associated with the schema element affected by the change.
The columns AttrName, OldAttrValue and NewAttrValue are filled with the name, previous and updated
values of a property if a value has been changed. If
the change was produced manually, we associate it
with the corresponding Author.
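For illustration only, one record of the table Change describing an update of an attribute value might look as follows; the keys follow the column names in Figure 2, while the concrete values are invented for the example.

from datetime import datetime

# A hypothetical row of the table Change for an update of an attribute value.
change = {
    "DateTime": datetime(2020, 6, 1, 3, 15),
    "Type": "update of an attribute value",
    "Status": "new",                        # new, in process of propagation, propagated
    "Element": ("DataSet", "Scopus metrics"),
    "AttrName": "characterSet",             # filled only when a property value changes
    "OldAttrValue": "ISO-8859-1",
    "NewAttrValue": "UTF-8",
    "Author": None,                         # set only for manually introduced changes
}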
Figure 2: Schematic and evolution metadata model.
5.1.4 Running Example
The schematic metadata of the publication data warehouse were gathered automatically and supplemented
with additional information manually. The information about Scopus metrics and the addition of a new
element as indicated in Section 4.2 was represented
in the metadata as shown in Table 2. The table
shows an example of a relationship between the parent element citeScoreYearInfoList and the child element citeScoreCurrentMetric in the XML document. The
new element citeScoreYearInfoList included a more
complicated hierarchy of child elements that are not
displayed in the table due to space limitations.
5.2 Adaptation Metadata
After changes have been recorded in the metadata by
the change discovery algorithm, the change handling
mechanism of the adaptation component determines
possible scenarios for the propagation of each change. The
information about possible scenarios for each change
type is stored in the adaptation metadata. Scenarios
that may be applied to handle changes are provided to
the data warehouse developer, who chooses the most
appropriate ones to be implemented. Since
certain scenarios may require additional data from the
developer, such data are entered by the developer via
the metadata management tool and stored in the adaptation rules metadata after the respective scenario has
been chosen.

Table 2: Metadata describing the addition of a new element.

Table                 Data                      References
DataSource            SCOPUS
DataSet               Scopus metrics            DataSource: SCOPUS
DataItem              citeScoreYearInfoList     DataSet: Scopus metrics
DataItem              citeScoreCurrentMetric    DataSet: Scopus metrics
Relationship          Type: Composition         Parent: citeScoreYearInfoList
RelationshipElement                             Child: citeScoreCurrentMetric
Change                Type: Addition            DataItem: citeScoreYearInfoList
Figure 3 demonstrates the adaptation metadata model utilized in the change handling mechanism. The model incorporates three tables from the
evolution and schematic metadata (Figure 2): Change,
Type and Author. The table Change stores information about identified changes. It is possible to determine which type of change has occurred by processing the entries in this table. The overall change handling
process is based on the records in this table.
The table Author stores information about system
users. This table is used to identify which user per-
formed adaptation after a certain change in the sys-
tem. The table Type is used as a classifier, which de-
termines the types of various elements used in the sys-
tem. To implement the evolution mechanism, several
types and subtypes have been created for storing dif-
ferent statuses and data types in the adaptation meta-
data.
5.2.1 Change Adaptation Scenarios and
Operations
The tables ChangeAdaptationOperation and
ChangeAdaptationScenario are intended for storing
information about change adaptation scenarios and
operations to be performed to implement each
scenario. These tables are filled with data manually
before any changes occur.
Change adaptation operations are steps that must
be taken to handle a change in the system. They
are defined to be as short and universal as possible. Each
of the operations is stored as a record in the table
ChangeAdaptationOperation. An operation is as-
signed a type that indicates whether it can be per-
formed manually or automatically. In the former case,
the column Operation stores a textual description of
what the developer must do to perform the operation.
In the latter case, the column Operation stores the
name of the procedure to be executed.
A change adaptation scenario is a series of sequen-
tial operations that are performed to successfully han-
dle a change in the system. Multiple adaptation sce-
narios correspond to each change type. The steps of
each scenario are stored in the table ChangeAdapta-
tionScenario. Each step of a scenario contains a ref-
erence to the adaptation operation, a change type, as
well as a parent record identifier from the same table.
Storing the parent record identifier maintains the se-
quence of operations. The first step of each scenario does
not have a parent record. Each subsequent step contains a reference to the previous step. Such a structure facilitates
adjustments. For example, adding an operation in the
middle of a scenario requires only two table records
to be edited.
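The following sketch illustrates how the execution order of a scenario can be reconstructed from these parent references; the dictionary layout is an assumption made for the example and does not reproduce the actual table definition.

def order_scenario_steps(steps):
    """steps: list of dicts with 'id', 'parent_id' (None for the first step) and
    'operation'. Returns the operations in the order they must be executed."""
    by_parent = {s["parent_id"]: s for s in steps}
    ordered, current = [], by_parent.get(None)       # the first step has no parent record
    while current is not None:
        ordered.append(current["operation"])
        current = by_parent.get(current["id"])       # follow the chain to the next step
    return ordered

# Hypothetical three-step scenario stored in arbitrary order.
scenario = [
    {"id": 2, "parent_id": 1, "operation": "modify ETL process"},
    {"id": 1, "parent_id": None, "operation": "collect metadata of the new element"},
    {"id": 3, "parent_id": 2, "operation": "notify the developer"},
]
print(order_scenario_steps(scenario))
# ['collect metadata of the new element', 'modify ETL process', 'notify the developer']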
Several scenarios may overlap in terms of their op-
erations; therefore, operations are stored in a separate table to avoid duplication of information in the
metadata.
In order to follow the change handling process and
store information about the execution of each opera-
tion, we introduced the table ChangeAdaptationPro-
cess which stores the adaptation scenario correspond-
ing to each actual change. This table is populated
automatically during the change handling process.
When a change has taken place, the adaptation com-
ponent inserts potential alternative adaptation scenar-
ios into the table ChangeAdaptationProcess for that
change. Each record in the table ChangeAdaptation-
Process stores the link to the operation of the adap-
tation scenario (defined in the table ChangeAdapta-
tionScenario). We also store a reference to a record
of the table Change to identify which of the changes
is being handled. In order to keep track of the change
handling process, the status of the operation to be per-
formed (whether the operation has been performed or
not) is also stored, as well as the date, time and the
user who executed or launched the operation.
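As a hypothetical example (the keys follow the columns named above, while the values are invented), one record of the table ChangeAdaptationProcess could be represented as:

# One step of an adaptation scenario instantiated for a concrete change.
process_record = {
    "ChangeId": 17,              # reference to the record of the table Change being handled
    "ScenarioStepId": 5,         # operation of the adaptation scenario being applied
    "Status": "not performed",   # updated once the operation has been executed
    "ExecutedAt": None,          # date and time filled in after execution
    "ExecutedBy": None,          # user who executed or launched the operation
}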
5.2.2 Branching Conditions of Change
Adaptation Scenarios
To store the branching conditions of adaptation sce-
narios and to manage the fulfillment of the conditions,
the tables ChangeAdaptationCondition and CaCon-
ditionMapping have been introduced. Although the
type of change can be determined at the time of its
occurrence, it does not guarantee the existence of an
unambiguous change adaptation scenario. There are
various conditions under which a scenario can branch
out. They can be evaluated automatically or manu-
ally. In the former case, the adaptation component
determines which of the scenario branches to follow.
In the latter case, the developer needs to evaluate the
specific situation and make a decision.
Just like operations, the conditions for handling
different types of changes overlap, so they are stored
in a separate table ChangeAdaptationCondition. Each
record of this table stores the type of the condition
(whether the condition is evaluated manually or auto-
matically) and the condition definition. If the condition is evaluated manually, a textual description that
must be assessed by the developer is stored in the
column Condition. If the condition can be evaluated
automatically, the name of the function to be executed is
stored there. The conditions are manually described
for each change type in advance and are entered in the
table before any changes occur.
Figure 3: Adaptation metadata model.
Because each operation that is executed within a
scenario can have multiple conditions, the table Ca-
ConditionMapping is used to provide the N:N rela-
tionship between the tables ChangeAdaptationCondi-
tion and ChangeAdaptationScenario.
Conditions that can be evaluated automatically
can be checked before each operation since they do
not require the intervention of the developer. Thus,
it is possible to automatically move along one of the
branches defined in the scenario. However, when per-
forming a manual condition evaluation (the developer
decides on the future course of the scenario), it is nec-
essary to keep information about the decision the de-
veloper has made in order for this decision to be taken
into account throughout the whole change handling
process.
The table CaManualConditionFulfillment is used
to store the decisions of the developer. Records in this table are linked to the particular change and condition,
and determine the condition fulfillment status
(whether the condition is fulfilled or not). The table
CaManualConditionFulfillment is filled in automatically during the handling of a specific change.
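A minimal sketch of how such conditions might be evaluated during scenario execution follows; it assumes that automatic conditions map to callable checks and that developer decisions are looked up by change and condition identifiers, which is a simplification of the actual tables.

def condition_fulfilled(condition, change_id, manual_decisions, automatic_checks):
    """condition: dict with 'id', 'type' ('automatic' or 'manual') and 'definition'.
    manual_decisions: dict keyed by (change_id, condition_id) with recorded decisions.
    automatic_checks: dict mapping a condition definition to a callable check."""
    if condition["type"] == "automatic":
        # the column Condition names the function to execute for this change
        return automatic_checks[condition["definition"]](change_id)
    # manual condition: reuse the decision the developer has already recorded
    return manual_decisions.get((change_id, condition["id"])) == "fulfilled"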
5.2.3 Additional Information for the Change
Handling Process
During the change handling process, the developer may be required to provide various additional data. Such data might be needed to perform operations included in the scenario as well as to evaluate
conditions.
If any additional information from the developer
is required during the change handling, it is saved
in the table ChangeAdaptationAdditionalData, which
stores a reference to the table Change, information on the purpose for which the additional data is
used, as well as the data itself. The format of the data
depends on its type. For example, a stored identifier is a simple number. However,
there might be situations when the data has another format (JSON, XML, CSV, etc.); therefore, a textual data
type with a large capacity is used for the column Data.
6 EVOLUTION SUPPORT
The main element of the proposed architecture that
is responsible for handling changes in data sources
and information requirements is the adaptation com-
ponent. Its main task is to detect changes that have
taken place and to generate change adaptation scenar-
ios for each change. To ensure the desired function-
ality, the adaptation component uses data in the meta-
data repository, as well as additional data that cannot
be automatically identified, in which case the devel-
oper provides this data manually using the metadata
management tool. When the developer has chosen the
specific scenarios to execute and provided all necessary data, or the scenario choice has been made automatically based on the evaluation of automatic conditions,
the adaptation component manages the execution of
scenarios. Since data from data sources are initially loaded
into the data highway in their original format, data acquisition may continue even during
the change handling process. In this section, we concentrate on the changes supported in our proposed solution as well as their propagation mechanism.
6.1 Atomic Changes
Various kinds of changes to data sets employed in
each level of the data highway must be handled by
the adaptation component. The list of atomic changes
supported in our proposed solution, classified according to the part of the metadata model they affect, is as follows:
Schematic Changes: addition of a data source,
deletion of a data source, addition of a data high-
way level, deletion of a data highway level, addi-
tion of a data set, deletion of a data set, change of
data set format, renaming a data set, addition of
a data item, change of a data item type, renam-
ing a data item, deletion of a data item from a
data set, addition of a relationship, deletion of a
relationship, addition of a mapping, deletion of a
mapping;
Changes in Metadata Properties: addition of a
metadata property, deletion of a metadata prop-
erty, update of an attribute value.
Most of these changes may be identified automati-
cally by the change discovery algorithm implemented
as a part of the adaptation component and described
in detail in the paper (Solodovnikova and Niedrite,
2020). Manually introduced changes are processed by
the metadata management tool, whereas automatic
change discovery is triggered when any new data are
loaded from data sources into the data highway by
wrappers or during ETL processes.
Initially, the change detection algorithm gath-
ers schema metadata and properties of existing data
sources and data highway levels in temporary meta-
data. For metadata collection, special procedures are
used depending on the format of data sets. Struc-
tured, semi-structured and unstructured data sets are
handled by different procedures. To identify changes
in metadata, a change discovery algorithm first pro-
cesses data sources and data highway levels, then data
sets and data items, and finally mappings and rela-
tionships. For each processed element, the algorithm
compares the metadata describing the element available
in the metastore with the collected temporary metadata
and identifies differences that determine the types of
atomic changes that have occurred. The identified changes are
then saved in the table Change.
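The comparison step can be pictured with the following simplified sketch, which diffs the stored metadata of one data set against the freshly collected temporary metadata; the dictionary layout and the covered change types are assumptions made for the example.

def diff_data_items(stored, temporary):
    """stored, temporary: dicts mapping a data item name to its properties.
    Returns (change type, item name) pairs to be recorded in the table Change."""
    changes = []
    for name in temporary.keys() - stored.keys():
        changes.append(("addition of a data item", name))       # item newly observed
    for name in stored.keys() - temporary.keys():
        changes.append(("deletion of a data item", name))       # item no longer present
    for name in stored.keys() & temporary.keys():
        if stored[name].get("type") != temporary[name].get("type"):
            changes.append(("change of a data item type", name))
    return changes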
6.2 Change Adaptation Scenarios
After any changes have been detected, the adapta-
tion component of our proposed architecture must
first generate potential adaptation scenarios for each
change and then execute scenarios according to
branching conditions. We have predefined adapta-
tion scenarios for each change type and operations
necessary for each scenario. In total, 34 differ-
ent change adaptation scenarios were defined for 19
atomic changes. The average proportion of automatic
operations and conditions within each change adapta-
tion scenario is almost 47%. In the following subsec-
tions, scenarios for each atomic change are described.
6.2.1 Addition of a Data Source
If a new data source is required for decision mak-
ing, examples of data sets from the new source must
be added manually so that the metadata collection
procedures can generate the necessary metadata for
the new source structure. The adaptation component
must then create the data structures according to the
data sets of the new source at the first level of the
data highway (see the change 6.2.5). The developer
must then define schemas of other new data highway
levels, ETL processes, and create the corresponding
metadata.
6.2.2 Deletion of a Data Source
If a data source from which data was previously
downloaded is no longer available, then three manual
adaptation scenarios are possible:
Replacement of a Missing Data Source with Data
from Other Sources. The objective of this sce-
nario is to make it possible to continue data load-
ing to data items that depend on the deleted data
source. To implement the replacement of the
deleted data source, the developer must provide
information on alternative data sources. For each
data item from the missing source, an alternative
data item from another source must be specified
or a formula that calculates data items of the miss-
ing source from data items of other sources must
be provided. For this adaptation scenario, it may
be necessary to add metadata describing the data
structure and properties of the new data source
(see the change 6.2.1).
Data Source Skipping. If replacement of a miss-
ing data source is not possible, data items that de-
pend on the missing data source can not be further
updated. The adaptation scenario in such a case
includes the modification of ETL processes (not
filling dependent data items) and modification of
the mapping metadata.
Hybrid Scenario. If it is possible to replace some
of the data items that were obtained from the missing
source, then for such data items new ETL processes,
the data structure of new sources (if needed) and other
properties must be defined. Other data items that
can no longer be obtained are left blank.
6.2.3 Addition of a Data Highway Level
If the developer has added a new data highway level,
for this change the adaptation scenario consists of
describing the data structure of the new level in the
metadata, creating the data structure according to the
definition and defining ETL processes in the mapping
metadata. If any new data source is required to popu-
late the new data highway level, the metadata describ-
ing the data structure of the source must be obtained
automatically. In this case, the developer must pro-
vide examples of data sets from the new data source
to detect the data structure.
6.2.4 Deletion of a Data Highway Level
If the developer has deleted a data highway level, it
must be examined whether data sets of other data
highway levels depend on the deleted level. If such
data sets exist, ETL processes must be modified to
replace the missing data highway level. This can be
done automatically if a data source that was used to
retrieve the deleted data is available. If the data source
is not available, this change should be treated as the
change 6.2.2.
6.2.5 Addition of a Data Set
If the developer has added a data set to a data highway
level, then this change is implemented manually. The
developer must define the structure (data items) of the
new data set, the data highway level to which the new
data set is added, the mapping metadata and metadata
properties. If an additional data source is required to
obtain data for the new data set, it must be added by
implementing the change 6.2.1.
If a new data set is added to a data source, the
data set metadata should be automatically collected
and the new data set must be added to the first level of
the data highway.
6.2.6 Deletion of a Data Set
The following adaptation scenarios are possible for
this change type, depending on the origin of the
deleted data set.
If a data set is deleted from a data source, three
manual adaptation scenarios are possible:
Replacement of a Deleted Data Set with Data
from Other Data Sets. This scenario is possible
only if the developer can provide information on
alternative data sets. For each data item in the
deleted data set, an alternative data item from an-
other data set must be specified or a formula that
calculates data items of the deleted data set from
data items of other data sets must be provided.
For this adaptation scenario, it may be necessary
to add metadata describing the data structure and
properties of the new data set (see the change
6.2.5).
Data Set Skipping. If replacement of a deleted
data set is not possible, data items that depend on
the deleted data set are not further updated. The
adaptation scenario includes the modification of
ETL processes (not filling dependent data items)
and mapping metadata adjustment.
Hybrid Scenario. If it is possible to replace some
of the data items belonging to the deleted data set,
then for such data items new ETL processes, the data
structure of new sources (if needed) and other
properties must be defined.
If the developer has deleted a data set from a data
highway level, it must be determined whether other
data sets depend on the deleted set. If such data sets
exist, ETL processes must be modified to replace the
deleted data set. This can be done automatically if the
data source that was used to retrieve the deleted data
set is still available. If the data source is not available,
this change must be handled as one of the scenarios
described in this subsection for the case when a data
set is deleted from a data source.
6.2.7 Change of Data Set Format
If only a data set format has changed and its struc-
ture (data items) has remained the same, such change
can be propagated automatically if it is possible to re-
define related ETL processes to use a different data
set format. This implies that mapping metadata must
be modified to use a different format for data load.
Such a solution may also require additional informa-
tion from the developer on the modification of ETL
processes.
If there have been considerable changes in the for-
mat, for example from unstructured data to structured
data or vice versa, then they would probably result in
the change of the structure of the data set. Hence, such
change must be processed as deletion and addition of
data items (see the changes 6.2.9 and 6.2.12).
6.2.8 Renaming a Data Set
This change is handled automatically as it only affects
metadata and possibly ETL processes. The adapta-
tion scenario consists of renaming the data set in the
table DataSet and transforming ETL processes to re-
flect the new data set name.
6.2.9 Addition of a Data Item
If a data item has been added to an existing data set
which is a part of a data highway level, the metadata
of the new data item must be created. If the developer
has made this change, he or she must specify a name,
type, and (if applicable) data warehouse role of the
new data item, the data set that contains the new data
item, mapping metadata, and metadata properties. If
an additional data source which was not previously
used in the system is required for data loading, it must
be added by implementing the change 6.2.1.
If a new data item has been added to a source data
set, such change must be processed automatically.
Metadata describing the new data item must be col-
lected by the adaptation component, and the new data
item must be added to the data set at the first level of
the data highway that corresponds to the source data
set.
6.2.10 Renaming a Data Item
This change only affects metadata and possibly ETL
processes, thus it is handled automatically. Similarly
as in the case of data set renaming, the adaptation sce-
nario consists of renaming the data item in the table
DataItem and transforming ETL processes to reflect
the new data item name.
6.2.11 Change of a Data Item Type
The change of a data item type can be handled auto-
matically if it is possible to redefine ETL processes
affected by the change to use another data item type.
In such a case, the mapping metadata are updated
to reflect the changed data type. To implement such an
adaptation scenario, additional information from
the developer on the modification of ETL processes
might be required.
6.2.12 Deletion of a Data Item from a Data Set
Several adaptation scenarios are possible for this
change type, depending on the origin of the deleted
data item.
If the deleted data item belonged to a source data
set, two adaptation scenarios can be applied:
Replacement of a Deleted Data Item with Data
from Other Sources or Data Sets. In order to
implement this adaptation scenario, the developer
must provide additional information on an alter-
native data item or a formula that calculates the
deleted data item from other data items. Since the
alternative data item or other data items used in
the formula might not be present in the system, it
may be necessary to add metadata about the struc-
ture and properties of the new data items (see the
change 6.2.9).
Data Item Skipping. If the deleted data item can
not be replaced by others or calculated by any for-
mula, it is necessary to determine data items of the
data highway affected by the change and modify
ETL processes along with the mapping metadata
to skip these affected data items.
If the developer has deleted a data item from a data
set that is a part of a data highway level, any other
data items that have been obtained from the deleted
data item must be identified by analysing the map-
ping metadata. If such data items exist, the deleted
data item should be replaced in the mapping metadata
and ETL processes. Such replacement is performed
automatically if a data source from which the deleted
data item was extracted is still available. If the data
source is no longer available, the change must be
processed by one of the scenarios described above.
6.2.13 Addition of a Relationship
This change may be required for processing other
changes, such as addition of a new data set (see sub-
section 6.2.5). This change may also affect ETL pro-
cesses. If a new relationship between data items is
defined in the data source, this information should be
recorded in the metadata. The developer must be in-
formed of the change so that he or she decides on the
further use of this information.
6.2.14 Deletion of a Relationship
If a relationship between data items has been deleted,
the adaptation scenarios of this change depend on the
type of the relationship.
If the deleted relationship is a foreign key, it must
be determined which ETL processes use that for-
eign key to connect data. The developer must be
notified of the change. The information on the
mapping metadata where the foreign key is used is
included in the notification. Further change han-
dling must be performed manually.
There is a special case of a relationship with
the type Composition that is handled differently.
Such relationships may exist between data items
in XML or JSON documents. Deleting a rela-
tionship with the type Composition means that the
structure of XML or JSON elements has changed
and one of the data items has been moved else-
where in the document structure. Thus, such in-
formation must be determined and, in order to
handle the deletion of a composition relation-
ship, a new relationship with the type Composi-
tion needs to be established according to the new
document structure.
6.2.15 Addition of a Mapping
A new mapping may only be added manually by the
developer using the metadata management tool. To
handle this change type, a transformation function
that calculates a data item from other data items must
be specified. Data items must be previously created in
the data highway and metadata describing them must
already be available.
6.2.16 Deletion of a Mapping
If a mapping has been deleted, there are two possible
adaptation scenarios:
Replacement of a Transformation Function. A
data item obtained using a transformation func-
tion included in the deleted mapping should be
replaced by another function, if it is possible. In
this case, the developer adds information about
the mapping replacement and the adaptation com-
ponent adds a new mapping by performing the
change 6.2.15. If necessary, new data items or
data sets are also added.
Deletion. If it is not possible to replace the map-
ping by another mapping, the adaptation compo-
nent marks the mapping as deleted and removes
the deleted mapping from ETL processes.
6.2.17 Addition of a Metadata Property
If a new property has been added to an element in the
schematic metadata, the developer must be notified
about the change and must make a decision regarding
further usage of the new property. Hence, this change
is processed manually.
6.2.18 Deletion of a Metadata Property
If a change discovery algorithm detects a deletion
of a metadata property, it is necessary to determine
whether any ETL procedures use the missing prop-
erty. If so, it must be assessed whether the ETL
processes can be automatically adapted or whether
additional information from the developer is required.
6.2.19 Update of an Attribute Value
If a value of an attribute has been updated and that
modification has not been recorded as another change
type (for example, as a change of data set format
or data item type), it must be checked whether the
changed attribute has been used in ETL procedures.
If so, all dependent ETL procedures must be
adapted to utilize the new value of the attribute.
6.3 Change Propagation
In order to successfully propagate any change to the
data highway, the change handling mechanism an-
alyzes schematic and mapping metadata as well as
adaptation scenarios and operations predefined for
each change type. Based on the analysis results, the
mechanism populates the tables ChangeAdaptationProcess and CaManualConditionFulfillment included in
the adaptation metadata and tries to apply automatic
conditions and operations. The change handling mechanism consists of two stages, described in detail in the
following subsections.
6.3.1 Initial Change Processing
The goal of this stage is to determine potential change
adaptation scenarios and create initial records in the
tables ChangeAdaptationProcess and CaManualCon-
ditionFulfillment.
The high-level pseudocode of the initial change
processing stage implemented as a procedure Creat-
eChangeAdaptProcess is presented as Algorithm 1.
First, the procedure selects data on changes with the
status New. Then for each new change, the change
type is determined by the function GetChangeType
by analyzing data in the table Change. After
that, all scenario steps predefined for the change
type are selected in the correct order and the func-
tion InsertChangeAdaptationProcess inserts records
that correspond to each scenario step into the table
ChangeAdaptationProcess. Then, manual conditions
for the current scenario step are selected from the ta-
ble ChangeAdaptationCondition, linked with the cur-
rently processed change and saved in the table Ca-
ManualConditionFulfillment with the status Not ex-
ecuted. Finally, the status of the change is updated
to In progress so that handling of this change can be
considered as initiated.
Algorithm 1: Initial change processing.

Procedure CreateChangeAdaptProcess()
    while exists Change C where C.status = 'New' do
        vProcessCreated ← false;
        vChangeType ← GetChangeType(C);
        if vChangeType is not null then
            foreach scenario step S that exists for vChangeType do
                P ← InsertChangeAdaptationProcess(S, C);
                InsertManualConditionFulfillment(P, S, C);
                vProcessCreated ← true;
            end
            if vProcessCreated then
                UpdateChangeInProgress(C);
            end
        end
    end
end
6.3.2 Scenario Execution
After initial change processing, the second stage of
the change handling mechanism, scenario execution, is run. Scenario execution is based on condition
checks and the execution of operations. Change handling steps can be both automatic and manual, and
the developer can make decisions about the change
handling process. If the execution of the scenario requires the intervention of the developer, the algorithm
is stopped and resumed only when the developer has
made his or her decision regarding manual conditions
or performed the specified operation.
Algorithm 2 demonstrates the pseudocode
of the procedure RunChangeAdaptationScenario that
executes an adaptation scenario for a specific
change. First, the function GetChangeAdaptationProcessSteps retrieves the adaptation process steps
defined during the initial change processing and
stored in the table ChangeAdaptationProcess. Then,
for each step that has not been previously executed
and has the status Not adapted, the adaptation process
is continued only if the operation to be performed is automatic and the corresponding conditions are met. Manual conditions are checked using the table CaManualConditionFulfillment. For automatically evaluable conditions, a procedure name
is obtained from the column Condition of the table
ChangeAdaptationCondition; by running the corresponding procedure, it is possible to evaluate the
condition fulfillment. Following the same principle, the procedures for performing the steps of the
change adaptation process are also executed. Their
names are stored in the column Operation of the table
ChangeAdaptationOperation.

Algorithm 2: Scenario execution.

Procedure RunChangeAdaptationScenario(C: Change)
    Steps ← GetChangeAdaptationProcessSteps(C);
    foreach process step S in Steps do
        if S.StatusType = Not adapted then
            O ← GetProcStepOperation(S);
            if O.OperationType = Automatic and ConditionsFulfilled(C, S) then
                ExecuteAdaptationProc(C, O);
                SetProcessStepAdapted(S);
            end
            exit;
        end
    end
end
6.4 Running Example
The change described in Section 4.2 corresponds to
the atomic change type Addition of a Data Item to an
existing source data set. For the running example, the
change that occurred was discovered automatically by the
change discovery algorithm during the execution of
an ETL process. The information about the change was
registered in the metadata as shown in Table 2.
According to the scenario described in Section
6.2.9 for the case of a data source extension, this
change is propagated automatically. The metadata
describing the structure of the new element was already gathered by the change discovery algorithm,
so the change handling mechanism must only copy
the metadata describing the structure of the new element
citeScoreYearInfoList from the source metadata to the
metadata describing the corresponding data set at the
first level of the data highway and inform the developer that new data have become available. Such a change
does not affect the operation of the system, so the developer may decide at any time how to use the new data.
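A hedged sketch of this automatic propagation step is shown below; the data layout and the notification call are illustrative assumptions made for the example, not the actual implementation.

def propagate_new_source_item(item_name, source_metadata, level1_metadata, notify):
    # copy the schematic metadata of the new element (and its subelements) from the
    # source data set to the corresponding data set at the first data highway level
    level1_metadata["items"][item_name] = dict(source_metadata["items"][item_name])
    notify(f"New data item '{item_name}' is available at the first data highway level")

source = {"items": {"citeScoreYearInfoList": {"type": "XML element",
                                              "children": ["citeScoreCurrentMetric"]}}}
level1 = {"items": {}}
propagate_new_source_item("citeScoreYearInfoList", source, level1, print)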
7 CONCLUSIONS
In this paper, we presented our approach to dealing
with changes in heterogeneous data sources caused
by their evolution, as well as by the development of information requirements. We proposed a data warehouse
architecture that includes data acquisition from various sources, as well as ETL processes for transforming data into an integrated structure so that it can be
loaded into a data warehouse. The operation of the
system is based on metadata that describe the schemata
of all data sets involved in the system as well as all
changes identified by the change discovery algorithm.
The main contribution of this paper is the mechanism for processing discovered changes and
changes performed manually. As a proof of concept,
the proposed solution has been successfully applied
to the publication data warehouse.
There are several benefits of the proposed approach compared to the manual processing of changes in
data sources and information requirements. Changes
of certain types are discoverable automatically, and
comprehensive information about the changes that have occurred is
available to the developer. Management of evolution
is ensured with less human participation. Change processing is transparent, as all operations performed and
conditions verified are available to the developer. The
proposed approach is flexible and may be extended
by defining additional operations and conditions in
the corresponding metadata tables, then building new
change adaptation scenarios from them and assigning
these scenarios to change types.
Possible directions of future work include the definition of developer preferences regarding adaptation scenarios for
various change types, which would be used to choose scenarios automatically.
ACKNOWLEDGEMENTS
This work has been supported by the European
Regional Development Fund (ERDF) project No.
1.1.1.2./VIAA/1/16/057.
REFERENCES
Ahmed, W., Zimányi, E., and Wrembel, R. (2014). A logical model for multiversion data warehouses. In Data
Warehousing and Knowledge Discovery, pages 23–34,
Cham. Springer International Publishing.
Bentayeb, F., Favre, C., and Boussaid, O. (2008). A
user-driven data warehouse evolution approach for
concurrent personalized analysis needs. Integrated
Computer-Aided Engineering, 15(1):21–36.
Chen, S. (2010). Cheetah: A high performance, custom data
warehouse on top of mapreduce. Proc. VLDB Endow.,
3(1–2):1459–1468.
Cuzzocrea, A., Bellatreche, L., and Song, I.-Y. (2013). Data
warehousing and olap over big data: Current chal-
lenges and future research directions. In Proceedings
of the 16th International Workshop on Data Ware-
housing and OLAP, DOLAP ’13, page 67–70, New
York, NY, USA. ACM.
Golfarelli, M., Lechtenbörger, J., Rizzi, S., and Vossen, G.
(2006). Schema versioning in data warehouses: Enabling cross-version querying via schema augmentation. Data & Knowledge Engineering, 59(2):435–459.
Holubová, I., Klettke, M., and Störl, U. (2019). Evolution management of multi-model data. In Heteroge-
neous Data Management, Polystores, and Analytics
for Healthcare, pages 139–153, Cham. Springer In-
ternational Publishing.
Kaisler, S., Armour, F., Espinosa, J. A., and Money, W.
(2013). Big data: Issues and challenges moving for-
ward. In 2013 46th Hawaii International Conference
on System Sciences, pages 995–1004.
Kimball, R. and Ross, M. (2019). The data warehouse
toolkit: The definitive guide to dimensional modeling. Wiley.
Malinowski, E. and Zimányi, E. (2008). A conceptual
model for temporal data warehouses and its transformation to the ER and the object-relational models. Data
& Knowledge Engineering, 64(1):101–133.
Nadal, S., Romero, O., Abelló, A., Vassiliadis, P., and Vansummeren, S. (2019). An integration-oriented ontology to govern evolution in big data ecosystems. Information Systems, 79:3–19.
Quix, C., Hai, R., and Vatov, I. (2016). Metadata extraction
and management in data lakes with gemms. Complex
Systems Informatics and Modeling Quarterly, (9):67–
83.
Solodovnikova, D. and Niedrite, L. (2018). Towards a
data warehouse architecture for managing big data
evolution. In Proceedings of the 7th International
Conference on Data Science, Technology and Ap-
plications, DATA 2018, page 63–70, Setubal, PRT.
SCITEPRESS - Science and Technology Publications,
Lda.
Solodovnikova, D. and Niedrite, L. (2020). Change discov-
ery in heterogeneous data sources of a data warehouse.
In Databases and Information Systems, pages 23–37,
Cham. Springer International Publishing.
Solodovnikova, D., Niedrite, L., and Niedritis, A. (2019).
On metadata support for integrating evolving hetero-
geneous data sources. In New Trends in Databases
and Information Systems, pages 378–390, Cham.
Springer International Publishing.
Sumbaly, R., Kreps, J., and Shah, S. (2013). The big data
ecosystem at linkedin. In Proceedings of the 2013
ACM SIGMOD International Conference on Manage-
ment of Data, SIGMOD ’13, page 1125–1134, New
York, NY, USA. ACM.
Wang, Z., Zhou, L., Das, A., Dave, V., Jin, Z., and Zou, J.
(2020). Survive the schema changes: Integration of
unmanaged data using deep learning. arXiv preprint
arXiv:2010.07586.
Wojciechowski, A. (2018). Etl workflow reparation by
means of case-based reasoning. Information Systems
Frontiers, 20(1):21–43.