A FRAMEWORK FOR ON-DEMAND INTEGRATION OF

ENTERPRISE DATA SOURCES

Matti Heikkurinen

, Tapio Niemi

, Marko Niinim

aki

, and Vesa Sivunen

CERN IT Division, Openlab, CH-1211 Geneva, Switzerland

Department of Computer Sciences, FIN-33014 University of Tampere, Finland

Helsinki Institute of Physics, CERN Ofﬁces, CH-1211 Geneva, Switzerland

Keywords:

Grid, OLAP, Data warehousing

Abstract:

Deploying a data warehouse system in a company is usually an expensive and risky investement. Constructing

a data warehouse is a large project that can take very long time and a company cannot know in advance exactly

what beneﬁts a data warehouse will offer. Thus, in many cases, data warehousing projects have either been

abandoned or been shown to be at least partial failures.

We propose a new method by providing a platform to implement business intelligence systems on. The basic

idea is to construct the analysis database (i.e. an OLAP cube) on demand and only include the data that is

needed for the analysis at hand from the operational databases. In this way the data is always up-to-date,

suitable for the current analysis, and some of the biggest risks associated with data warehouse systems can be

avoided. In addition, external data can be included in the analysis. The computational costs related to the cube

construction are likely to remain at acceptable level, since only the relevant part of the data for the current

analysis is needed from operational databases.

We outline the use of Grid techologies in the implementation to offer a cost-effective way to harness enough

computing power used on parallel processing and sufﬁcient security infrastructure (GSI). To deal with het-

erogenous data sources the XML language with XSL transformations is applied.

1 INTRODUCTION

Data warehousing and OLAP are offered as solutions

for analysing business data, e.g. (Mohania et al.,

1999; Chaudhuri and Dayal, 1997). The motivation

is to collect relevant data from different operational

databases into a one large database or a collection of

smaller databases (often called “data marts”). The

collection is usually done periodically, for example

once a week. The data warehouse software is used to

answer queries about the aggregate data of the busi-

ness transactions, not to process transactions them-

selves. Therefore, data warehouse systems are called

OLAP (On-Line Analytical Processing) in contrast to

OLTP (On-Line Transaction Processing). A common

method is to use a multidimensional data model to

represent the data warehouse data - in other words the

data are represented as a multidimensional hyper cube

where the coordinates can be based on variables such

as time, department, product sold and so on.

Data warehouse systems are usually large and ex-

pensive investments, and tend to be inherently mis-

sion critical due to their close ties with the opera-

tional data management and top level decision mak-

ing. Constructing a data warehouse is typically a large

project that can easily take even a year, tying up re-

sources from multiple levels of management (tactical

and strategic) in addition to the IT personnel. As al-

ways in the case of large IT projects, it is difﬁcult to

predict in advance what beneﬁts the system will of-

fer to justify the huge cost of the deployment project.

Some studies suggest that a third of all the informa-

tion systems projects are abandoned before comple-

tion (Ewusi-Mensah, 1997) and over half of the com-

pleted ones are almost 200% over budget further. It is

thus not surprising that in many cases data warehouse

projects have not really offered what the company has

believed. This illustrates the large risks involved in

choosing this kind of approach to data management

integration.

To mitigate the risks outline above we suggest a

new method for business intelligence systems that

would enable reaping most of the beneﬁts of a tra-

ditional data warehouse without the strategic risks in-

volved. The approach is based on applying the re-

cent advances in the Grid technology development in

604

Heikkurinen M., Niemi T., Niinimäki M. and Sivunen V. (2004).

A FRAMEWORK FOR ON-DEMAND INTEGRATION OF ENTERPRISE DATA SOURCES.

In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 604-608

DOI: 10.5220/0002655806040608

 SciTePress

a way that would allow efﬁcient and modular inte-

gration of various legacy systems into a ”virtual data

warehouse”. This modular integration would also al-

low replacing these legacy systems independently of

each other. This would make it possible to optimise

the components in their local environment without

limiting their relevance as sources of information for

strategic decision making.

Some recent developments in the design and pro-

totyping tools are addressing these problems by mak-

ing it possible to centralize the information about the

corporate data sources and prototype the system that

would result from the use of the planned ETL (Ex-

traction, Transformation, Loading) process. The pro-

posed approach will push these ideas even further

by making it possible to iterate these prototypes in

quick succession with the real transactional data, and

then put the ﬁnal, “accepted” version into production.

This is done by leveraging Grid Data access proto-

cols, which make possible hiding the details of the

local storage systems from the ETL process.

Another advantage of the use of the Grid technol-

ogy comes from its ability to create “Virtual Organi-

zations” from groups of entities that belong to differ-

ent security domains. By mapping the local identities

(user accounts, server certiﬁcates and other entities

managed by the corporate security infrastructure) to

Grid certiﬁcates, it is possible to integrate data from

sources that are not controlled by a single entity. Data

access and security features make the proposed ap-

proach especially interesting in cases like managing

the information systems in recently merged compa-

nies - or establishing an extended enterprise type col-

laboration with strategic partners.

The basic idea behind the virtual data warehouse

is to construct the analysis database (i.e. an OLAP

cube) on demand, and only include the data that are

needed for the analysis at hand. Since the data can be

gathered directly from the operational data sources,

it is always up-to-date and suitable for near-realtime

analysis of emerging trends in the company and its

environment. The most difﬁcult problem on-demand

construction of OLAP cubes is the extraction of data

from operational databases. We offer the following

three approaches to deal with the problem.

1. In most cases, only a small subset of data stored in

OLTP systems is fetched to the OLAP cube. This is

possible since we aim at building the OLAP cube to

solve some speciﬁc problem. In the traditional data

warehouse approach the OLAP cube is constructed

for general purpose analyses.

2. Grid technologies facilitate dynamic allocation of

computing power from larger resource pool for

ETL processing. While it is likely that OLTP sys-

tems themselves are more efﬁcient now than cou-

ple of years ago, the development does not in itself

help ETL phase, since the OLTP systems are of-

ten legacy systems that will not be updated very of-

ten. Furthermore, the amount of data and the users

of the OLTP system will usually increase in phase

with the system capacity.

3. The selection of the relevant data for the analysis

can be done remotely in the operational databases

before shipping the data for analysis. Agent tech-

nology can partially perform the local data selec-

tion and aggregation to decrease the need of the

processing power of the local database servers.

Business intelligence systems, like OLAP, have

traditionally been limited to the data stored in the

data warehouses or some other well-deﬁned, struc-

tured databases. There can be cases where this pre-

deﬁned set of data sources is not enough, since the

phenomenon under analysis can depend on something

outside the scope of the company. For example, the

oil price or the weather can have a remarkable effect

on business through some complex cause and effect

chain. If the rules that are used to test scenarios are

limited to the data that is in the corporate data ware-

house, the analysis cannot ﬁnd all the possible expla-

nations for a phenomenon. Virtual data warehouse

methodology enables the user to include external data

to the OLAP cube through Grid Data access.

The motivation to use Grid technologies in the im-

plementation is related to the capacity of the Grid

frameworks to provide enough secure computing and

storage capacity on demand to handle much larger

datasets than traditional systems with similar costs.

This kind of use of parallel processing and shared

computing resources requires a strong, universally ac-

cepted security infrastructure that is used to access the

computing resources of external service provider (this

type of Grid can be seen as a advanced ”Utility Com-

puting” solution). In addition to Grid technologies,

we use XML language with XSL transformations for

data source integration (The World Wide Web Con-

sortium, 1999).

2 RELATED WORK

Zurek and Sinnwell (Zurek and Sinnwell, 1999) have

studied how changes in companies should reﬂect to

data warehouses. They have noticed that companies

tend to change their organization often, necessitating

the realignment of data warehouse schema quite fre-

quently. Especially the required dimensions of data

and their hierarchies can change even more frequently

than the organization itself.

Sypherlink is one of the companies advertising

more ﬂexible prototyping and ETL tools for data

warehousing systems (Sypherlink, 2003). This solu-

A FRAMEWORK FOR ON-DEMAND INTEGRATION OF ENTERPRISE DATA SOURCES

605

tion addresses some of the issues associated with fre-

quent organizational changes.

Integrating data sources for OLAP using XML are

studied e.g. Pedersen et al. (Pedersen et al., 2002).

Using Grid computing in OLAP cube construction are

studied, for example, by Niemi et al. (Niemi et al.,

2003b; Niemi et al., 2003a).

3 BACKGROUND

3.1 OLAP

The OLAP cube is a multidimensional database in

which the dimension attributes (i.e. co-ordinates) de-

termine the value accessed. For example, the dimen-

sions can be time, location, product, and the measures

are sales and proﬁt. The dimensions usually have a hi-

erarchical structure, which enable analysing the data

on different levels of details - for example at day- or

month-level. In the latter case, the monthly data is ag-

gregated from the daily data. The user analyses data

by choosing dimensions to determine the viewpoint

to data and “drilling down” or “rolling up” in the di-

mension hierarchy.

The contents of OLAP databases are typically col-

lected from other data repositories, such as opera-

tional databases. For a well-deﬁned and targeted sys-

tem where the information needs are well known, it

may be straightforward to collect the right data at the

right time. However, with more and more data gener-

ally available also the information needs do develop.

Consequently, it gets more difﬁcult to anticipate the

needs of the OLAP users. This leads to a situation in

which it is increasingly difﬁcult to know in advance

what data are required - and when - for the desired

analysis tasks.

In a corporate environment the need for up-to-date

data means that the OLAP cube needs to be con-

structed almost in real time (Akinde et al., 2003). Ge-

ographical distribution adds challenges related to the

management of the user rights, fault tolerance of the

computation over the wide area networks and limited

amount of available bandwidth.

3.2 Grid

Our design applies Grid technologies in security and

parallel processing. Foster and Kesselman describe

the Grid as “a software infrastructure that enables

ﬂexible, secure, coordinated resource sharing among

dynamic collections of individuals, institutions and

resources” (Foster et al., 2001).

Essential Grid components for our work are the

Grid Security Infrastructure and distributed comput-

ing based on a Java agent technology. The main ben-

eﬁt for selecting these methods for the data integra-

tion is their large user base, indicating a large com-

munity of developers and tool providers. Further-

more, the standardization has progressed to a level

where most of the components have several imple-

mentations, both open source and commercially sup-

ported. This offers a ﬂexible way to balance the sav-

ings in the license fees in prototyping with the ben-

eﬁts of commercial support structure for production

system.

Security, user authentication and user authoriza-

tion An essential components of our framework is

the Grid Security Infrastructure (GSI), which allows

secure connections to potentially all computers in the

Grid. GSI is based on public key encryption, X.509

certiﬁcates, proxy certiﬁcates with a limited validity

(e.g. 12 hours), and the Secure Sockets Layer (SSL)

communication protocol (The Globus project, 2002).

Our design uses HTTP/HTTPS protocol to access re-

mote databases, since it is normally allowed to pass

through ﬁrewalls. The user authentication is based on

a certiﬁcate, thus separate user IDs or passwords are

not needed.

Many aspects of the data access designed here

is based on technologies of the European Data-

Grid, EDG (European DataGrid, 2002). The client’s

request for EDG data security manager contains

the user’s proxy certiﬁcate, which is used to au-

thenticate the user. The authentication is based

on either the subject of the certiﬁcate or certiﬁ-

cate extension ﬁelds described in the next para-

graph. As an example of the ﬁrst case, let us as-

sume that the user has a certiﬁcate with subject

“/O=Grid/O=NorduGrid/OU=hip.ﬁ/CN=John Doe”.

The authentication component can assign database

update rights to person with this subject. Moreover, it

can assign database read rights to anyone whose cer-

tiﬁcate subject contains “O=NorduGrid/OU=hip.ﬁ”.

EDG data security scheme can be applied to data ac-

cesses independently of the type of the data source.

VOMS In the Grid, data accesses often cross organ-

isational boundaries, making it impractical to bind the

access to the user’s certiﬁcate subject alone. For this

reason, virtual organization (VO) extensions to proxy

certiﬁcate ﬁles have been introduced.

The Virtual Organization Membership System

(VOMS) presents an extension to a user’s X509

proxy certiﬁcate that includes their VO membership

information. When a VOMS-proxy is generated,

a VOMS server is contacted to request a VOMS-

extended proxy certiﬁcate that follows the standard

X509v3 (Housley et al., 1999) certiﬁcate format. All

the standard ﬁelds of the proxy certiﬁcate are used

to store the user’s authentication information. The

ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

606

X509v3 extension (1.3.6.1.4.1.8005.100.100.1) part

is used to include authorization information in the

user’s proxy certiﬁcate. The authorization informa-

tion is stored in triplets with the following syntax:

GROUP: string, ROLE: string, CAP: string,

where GROUP represents the user’s group in the

VO (like “OLAP users”), ROLE their role (like

“Administrators”), and CAP optional capability

speciﬁcations (like “10 GB disk space”).

Distributed Computing Using Agent Technology

Another key component for our system is a mobile

Java agent framework (GridBlocks Agent (GB Agent)

technology (Karppinen et al., 2003)) that allows dis-

tributed ETL processing without shipping all the data

over the network. This Java agent technology offers

also a way to hide the structure of the stored data from

the OLAP system, since all the data sources (different

types of databases and also other sources of data) will

offer the GB agent interface for the OLAP cube con-

struction. The necessary data transformations for this

are encoded in XML, using the XSLT language.

3.3 XML and XML Tools

XML (eXtensible Markup Language) is offered as a

solution while operating with data in a heterogeneous

environment. XML aims to be “the universal format

for structured documents and data on the Web” (The

World Wide Web Consortium, 2002a). As such, XML

is a meta language – a language for describing other

languages – which lets one design customised markup

languages for different types of documents, using a

declaration syntax deﬁned in recommendation.

In addition to the private data of the company, lots

of usefull data for analysis purposes is available on the

public Internet. Yet, utilising this data is difﬁcult since

the data formats tend to differ from the ones used in

the analysis system. XML can be used to solve this

problem since the e.g. HTML language of WWW

can be seen as an XML sub language.

The key is-

sue for our system is that an XML sub language can

be transformed to another by using XSLT transforma-

tions. XSLT is a programming language, but many

tools novel tools for writing XSLT programs are be-

ing published, see e.g. (Altova, 2003). These enable

the user to generate the XSLT code via graphical user

interface operating on more abstract level.

Several XML query languages have been made for

retrieving the XML based data, for example XQL,

XML-QL, and Quilt. At the moment XQuery de-

veloped by the W3C XML Query Working Group

- is becoming de facto standard query language for

XML databases. The mission of the W3C group is

Strictly speaking, only XHTML is a “proper” XML sub

language (The World Wide Web Consortium, 2002b).

to (W3C, 2003): “provide ﬂexible query facilities to

extract data from real and virtual documents on the

Web... Ultimately, collections of XML ﬁles will be

accessed like databases.” XQuery has a SQL-like syn-

tax grafted on to XPath, which is a language for ad-

dressing parts of an XML document. The expectation

is that XQuery will be for XML databases what SQL

is for relational databases, i.e. a standard and ven-

dor independent way to query and retrieve data (W3C,

2003).

4 FRAMEWORK

ARCHITECTURE

Our work focuses on the ETL process, in which we

combine the extraction and transformation processes

and include aggregation computing in them. Figure 1

illustrates the system architecture.

Figure 1: System architecture

The data sources can be legacy OLTP systems

with a simple system accepting local processing code

added. This interface can be protected using the Grid

security framework. The collection server hides the

complexity of the underlying system, making it pos-

sible to use OLAP tools with no direct Grid support.

Data Collection and Integration We assume that

for conﬁdential data, Grid security access mechanism

like the one described in Section 3.2 has been ap-

plied. Once the data has been extracted, it must be

transformed in order to provide a uniform base for our

analysis.Writing XSLT instructions is quite complex

but, in our case, the situation is easier since the tar-

get XML language is always the same, the one used

by our ETL system. Therefore, it is quite straightfor-

ward to implement a tool to help the user to tell how

the source XML (or even HTML) is transformed to

the target XML. The aim is that the user works with

a graphical interface mainly just clicking the ﬁelds of

A FRAMEWORK FOR ON-DEMAND INTEGRATION OF ENTERPRISE DATA SOURCES

607

the XML document to indicate which ﬁeld of the doc-

ument corresponds to the right ﬁeld of the target docu-

ment. The method also enables to deﬁne calculations

in the transformation process.

Data Selection and Aggregation In the data selec-

tion we use XML query languages. The data selection

is done in the network node where the data is stored.

Data aggregation could also be done by using XSLT

and XQuery but the current implementations are inef-

ﬁcient for large amounts of data. Thus, we use Java

agent technology since Java is platform independent -

a crucial feature in a heterogeneous environment.

Data Loading Most of the commercial OLAP

servers support uploading of XML data. Failing this,

writing a (e.g. XSLT) program to transfer the data

into the form understood by the server and/or using

the API of the OLAP server is necessary.

5 CONCLUSIONS

We have presented a design that applied Grid com-

ponents for data access and security to combine data

from different sources. In our method, the data for

the analysis at hand is collected ’on-line’ from oper-

ational databases. In this way, we can avoid some

of the issues of monolithic data warehouse projects

by deploying the system step by step, utilizing exist-

ing legacy systems in the new framework. Grid tech-

nologies are applied to distribute the computation for

the OLAP cube construction to achieve more perfor-

mance. Finally, the data is uploaded into an OLAP

server to enable the use of common OLAP tools.

From the technical standpoint, our design is a promis-

ing alternative to existing data warehousing methods

and thus a candidate for larger-scale tests, involving

industrial usage scenarios and real operational data.

ACKNOWLEDGMENTS

We wish to thank Richard Randell from Imagina-

tion Lab, Lausanne. The work was supported by

Academy of Finland, project number 203554.

REFERENCES

Akinde, M., Bhlen, M., Johnson, T., and et al. (2003). Efﬁ-

cient OLAP query processing in distributed data ware-

houses. Information Systems, 28.

Altova (2003). Altova corp.

Available on: http://www.altova.com.

Chaudhuri, S. and Dayal, U. (1997). An overview of data

warehousing and OLAP technology. SIGMOD Rec.,

26(1):65–74.

European DataGrid (2002). European datagrid.

Available on: http://www.eu-datagrid.org.

Ewusi-Mensah, K. (1997). Critical issues in abandoned in-

formation systems development projects. Communi-

cations of the ACM, 9(40):74–80.

Foster, I., Kesselman, C., and Tuecke, S. (2001). The

anatomy of the Grid: Enabling scalable virtual organi-

zations. International Journal of Supercomputer Ap-

plications, 15(3).

Housley, R., Ford, W., Polk, W., and Solo, D. (1999).

RFC 2459, internet X.509 public key infrastruc-

ture certiﬁcate and CRL proﬁle. Available on

http://www.ietf.org/rfc/rfc2459.txt.

Karppinen, J., Niemi, T., and Niinim

aki, M. (2003). Mo-

bile analyzer - new concept for next generation of dis-

tributed computing. In CCGrid 2003. A poster.

Mohania, M., Samtani, S., Roddick, J., and Kambayashi,

Y. (1999). Advances and research directions in data

warehousing technology. Australian Journal of Infor-

mation Systems.

Niemi, T., Niinimaki, M., Nummenmaa, J., and Thanisch,

P. (2003a). Applying grid technologies to XML based

OLAP cube constraction.

Niemi, T., Niinim

aki, M., and Sivunen, V. (2003b). Inte-

grating distributed heterogeneous databases and dis-

tributed grid computing. In ICEIS 2003, volume 1,

pages 96–103. ICEIS Press.

Pedersen, D., Riis, K., and Pedersen, T. B. (2002). Query

optimization for OLAP-XML federations. In DOLAP

2002, pages 57–64. ACM Press.

Sypherlink, I. (2003). Data warehousing.

Available on: http://www.sypherlink.com/

solutions/dataware.asp.

The Globus project (2002). Commodity Grid Kits. Avail-

able on: http://www-unix.globus.org/cog/.

The World Wide Web Consortium (1999). XSL

transformations (XSLT). Available on:

http://www.w3.org/TR/xslt.

The World Wide Web Consortium (2002a). Exten-

sible markup language (XML). Available on:

http://www.w3.org/XML/.

The World Wide Web Consortium (2002b). XHTML, the

extensible hypertext markup language. Available on:

http://www.w3.org/TR/xhtml1.

W3C (2003). XML query (XQuery).

Available on: http://www.w3.org/XML/Query.

Zurek, T. and Sinnwell, M. (1999). Data warehousing has

more colours than just black & white. In VLDB 1999.

Morgan Kaufmann.

ICEIS 2004 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

608