AN ARCHITECTURE FRAMEWORK

FOR COMPLEX DATA WAREHOUSES

Jérôme Darmont, Omar Boussaïd, Jean-Christian Ralaivao and Kamel Aouiche

ERIC, University of Lyon 2

5 avenue Pierre Mendès-France

69676 Bron Cedex

France

Keywords:

Data warehousing, Complex data, XML warehousing, Metadata, Domain-speciﬁc knowledge.

Abstract:

Nowadays, many decision support applications need to exploit data that are not only numerical or symbolic,

but also multimedia, multistructure, multisource, multimodal, and/or multiversion. We term such data complex

data. Managing and analyzing complex data involves a lot of different issues regarding their structure, storage

and processing, and metadata are a key element in all these processes. Such problems have been addressed

by classical data warehousing (i.e., applied to “simple” data). However, data warehousing approaches need

to be adapted for complex data. In this paper, we ﬁrst propose a precise, though open, deﬁnition of complex

data. Then we present a general architecture framework for warehousing complex data. This architecture

heavily relies on metadata and domain-related knowledge, and rests on the XML language, which helps store

data, metadata and domain-specific knowledge together, and facilitates communication between the various

warehousing processes.

1 INTRODUCTION

Data warehousing and OLAP (On-Line Analytical

Processing) technologies (Inmon, 2002; Kimball and

Ross, 2002) are now considered mature. They are

aimed, for instance, at analyzing the behavior of a

customer, product, or company, and may help mon-

itoring one or several activities (commercial or med-

ical pursuits, patent deposits, etc.). More precisely,

they help analyze these activities in the form of

numerical data. However, in real life, many decision

support ﬁelds (customer relationship management,

marketing, competition monitoring, medicine...) need

to exploit data that are not only numerical or sym-

bolic. We term such data complex data. Their avail-

ability is now very common, especially since the

broad development of the World Wide Web. For example,

a medical file usually consists of several

pieces of data under various forms. A patient’s med-

ical history might be recorded as plain text. Various

biological exam results might be indicated in many

ways. The ﬁle could also include radiographies (im-

ages) or echographies (video sequences). Successive

diagnoses and therapies might be recorded as text or

audio documents, etc. Another example could be

a collection of web documents concerning a given

topic, which would be available under various formats

(videos, images, sounds, texts, etc.).

Complex data might be structured or not, and are

often located in different and heterogeneous data

sources. Browsing these data necessitates an adapted

approach to help collect, integrate, structure and even-

tually analyze them. A data warehousing solution is

interesting in this context, though adaptations are ob-

viously necessary to take into account data complex-

ity. Measures might not necessarily be numerical, for

instance. Data volumetry and dating are further arguments

in favor of the warehousing approach. Furthermore,

complex data produce different kinds of information

that are represented as metadata. These

metadata, along with domain-speciﬁc knowledge, are

essential when processing complex data and play an

important role when integrating, managing, or ana-

lyzing them. Hence, metadata need to be given even

more importance than in classical data warehousing.

The notion of complex data is not straightforward.

To clarify this concept, we propose in this paper one

deﬁnition that presents the different aspects we iden-

tify in complex data (Section 2). This deﬁnition is,

to the best of our knowledge, somewhat complete and

pertinent, but its scope could certainly be widened.

We also propose a general architecture framework


Darmont J., Boussaïd O., Ralaivao J. and Aouiche K. (2005).

AN ARCHITECTURE FRAMEWORK FOR COMPLEX DATA WAREHOUSES.

In Proceedings of the Seventh International Conference on Enterprise Information Systems, pages 370-373

DOI: 10.5220/0002538503700373

 SciTePress

for warehousing complex data (Section 3). This

model heavily relies on metadata and domain-speciﬁc

knowledge. It also rests on the XML language that

we use for different purposes: to store complex data,

if necessary; to store metadata and knowledge about

these complex data; and to facilitate communication

between the different warehousing processes — ETL

(Extract, Transform, Load) and integration, adminis-

tration and monitoring, and analysis and usage. We

ﬁnally conclude this paper and provide research per-

spectives (Section 4).

2 A DEFINITION OF COMPLEX

DATA

Many researchers in several communities start to

claim they work on complex data. However, this

emerging concept of complex data varies a lot, even

within a single research community such as the database

community. Hence, in a first step, we performed

an extensive literature study to identify all the different

sorts of data researchers dealt with. We particularly,

but not exclusively, focused on publications and

events that explicitly mentioned the term “complex

data”, which particularly emerges in the data mining

field (Gançarski and Trousse, 2004). After compiling

all this information, we were able to propose

a ﬁrst deﬁnition and concluded that data could be

qualiﬁed as complex if they were: (1) multiformat,

i.e., represented in various formats (databases, texts,

images, sounds, videos...); and/or (2) multistructure,

i.e., diversely structured (relational databases, XML

document repositories...); and/or (3) multisource, i.e.,

originating from several different sources (distributed

databases, the Web...); and/or (4) multimodal, i.e., de-

scribed through several channels or points of view (ra-

diographies and audio diagnosis of a physician, data

expressed in different scales or languages...); and/or

(5) multiversion, i.e., changing in terms of deﬁnition

or value (temporal databases, periodical surveys...).
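As an illustration only, the five criteria above can be read as independent flags attached to the description of a dataset. The following Python sketch is one possible encoding; the names `Complexity`, `DatasetProfile`, `complexity_of` and `is_complex`, as well as the rule "more than one format, structure, source, modality or version", are assumptions made for this example and are not part of the definition itself.

```python
from dataclasses import dataclass, field
from enum import Flag, auto


class Complexity(Flag):
    """The five criteria of the first definition (hypothetical encoding)."""
    MULTIFORMAT = auto()
    MULTISTRUCTURE = auto()
    MULTISOURCE = auto()
    MULTIMODAL = auto()
    MULTIVERSION = auto()


@dataclass
class DatasetProfile:
    """Illustrative description of a dataset; every field is an assumption."""
    formats: set = field(default_factory=set)      # e.g. {"text", "image"}
    structures: set = field(default_factory=set)   # e.g. {"relational", "xml"}
    sources: set = field(default_factory=set)      # e.g. {"hospital_db", "web"}
    modalities: set = field(default_factory=set)   # e.g. {"radiography", "audio"}
    versions: int = 1                              # number of known versions


def complexity_of(profile: DatasetProfile) -> Complexity:
    """Return the criteria met by a profile; any subset may hold ("and/or")."""
    result = Complexity(0)  # empty set of flags
    if len(profile.formats) > 1:
        result |= Complexity.MULTIFORMAT
    if len(profile.structures) > 1:
        result |= Complexity.MULTISTRUCTURE
    if len(profile.sources) > 1:
        result |= Complexity.MULTISOURCE
    if len(profile.modalities) > 1:
        result |= Complexity.MULTIMODAL
    if profile.versions > 1:
        result |= Complexity.MULTIVERSION
    return result


def is_complex(profile: DatasetProfile) -> bool:
    """Data qualify as complex as soon as at least one criterion is met."""
    return bool(complexity_of(profile))
```

A medical file holding text, images and video from a single source would, under these assumptions, be flagged as multiformat but not multisource. New axes of complexity could be added as further flags without changing the rest of the sketch.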

However, it appeared in subsequent meetings with

fellow researchers that this ﬁrst deﬁnition was not

sufﬁcient to cover the wide variety of complex data.

It could indeed be viewed as an axis of complex-

ity, among other axes dealing with data semantics or

processing, for instance (Figure 1). Data volumetry

could also be such an axis. Though data volume is

not an expression of intrinsic complexity if viewed in

terms of database tuples, it becomes a complex prob-

lem to deal with in statistics or data mining when it is

the number of attributes that increases. In conclusion,

we define in this section a framework that helps identify

what we term complex data. However, since

this deﬁnition cannot be exhaustive, we leave it open

to new axes of complexity.

Figure 1: Axes of data complexity (labels shown in the figure: Complex data; Semantics; Content; Interpretation; Context; Syntax; Form; Heterogeneity; Structure; Processing; ...)

3 COMPLEX DATA WAREHOUSE

ARCHITECTURE

FRAMEWORK

In contrast to classical solutions, complex data

warehouse architectures may be numerous and very

different from one another. However, two approaches

seem to emerge.

The first, main family of architectures is data-driven

and based on a classical, centralized data warehouse

where data are the main focus. XML document

warehouses (Xyleme, 2001; Baril and Bellahsène, 2003;

Hümmer et al., 2003; Nassis et al.,

2004) are examples of such solutions. They often

exploit XML views, which are XML documents gen-

erated from whole XML documents and/or parts of

XML documents. A data cube is then a set of XML

views.
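To make the notion of an XML view more concrete, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The function `xml_view`, the element names (`record`, `patient`, `diagnosis`, `view`) and the sample documents are hypothetical; the cited systems define their own view mechanisms.

```python
import copy
import xml.etree.ElementTree as ET


def xml_view(name, documents, path):
    """Build a view: a new XML document made of the fragments matching `path`.

    `documents` is an iterable of parsed XML elements and `path` uses the
    limited XPath syntax supported by ElementTree.
    """
    view = ET.Element("view", {"name": name})
    for doc in documents:
        for fragment in doc.findall(path):
            # Copy the fragment so that the source documents stay intact.
            view.append(copy.deepcopy(fragment))
    return view


# Two illustrative source documents.
d1 = ET.fromstring(
    "<record><patient>1</patient><diagnosis>flu</diagnosis></record>"
)
d2 = ET.fromstring(
    "<record><patient>2</patient>"
    "<diagnosis>asthma</diagnosis><diagnosis>flu</diagnosis></record>"
)

# A "cube" in the sense above: simply a named set of XML views.
cube = {
    "diagnoses": xml_view("diagnoses", [d1, d2], "./diagnosis"),
    "patients": xml_view("patients", [d1, d2], "./patient"),
}
```

Here each view is generated from parts of the source documents; passing `"."` as the path would instead copy whole documents into the view.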

The second family of architectures includes so-

lutions based on virtual warehousing, which are

process-driven and where metadata play a preemi-

nent role. These solutions are based on mediator-

wrapper approaches and exploit distributed data

sources. The schemas of these sources are among the main

pieces of information that mediators exploit to answer user queries.

Data are collected and multidimensionally modeled

(as data cubes, to constitute OLAP analysis con-

texts) on the ﬂy to answer a given decision support

need (Ammoura et al., 2001).
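The mediator-wrapper principle can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the approach of the cited work: the classes `Wrapper` and `Mediator`, the method names, and the choice of a numeric sum as the aggregate are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class Wrapper:
    """Hypothetical wrapper: exposes one source's schema and rows uniformly."""

    def __init__(self, name: str, schema: Set[str], rows: List[dict]):
        self.name = name
        self._schema = set(schema)
        self._rows = rows

    def schema(self) -> Set[str]:
        # The schema is what the mediator inspects to route a query.
        return self._schema

    def fetch(self) -> Iterable[dict]:
        # A real wrapper would query the remote source here.
        return iter(self._rows)


class Mediator:
    """Hypothetical mediator: builds a cube on the fly, nothing is stored."""

    def __init__(self, wrappers: Iterable[Wrapper]):
        self.wrappers = list(wrappers)

    def cube(self, dimensions: Tuple[str, ...], measure: str) -> Dict[tuple, float]:
        needed = set(dimensions) | {measure}
        cells: Dict[tuple, float] = defaultdict(float)
        for wrapper in self.wrappers:
            # Use source schemas to skip sources that cannot answer the query.
            if not needed <= wrapper.schema():
                continue
            for row in wrapper.fetch():
                key = tuple(row[d] for d in dimensions)
                cells[key] += row[measure]  # assumes a numeric, additive measure
        return dict(cells)
```

A query such as `Mediator([...]).cube(("region", "year"), "sales")` would then aggregate only the sources whose schema contains all three attributes. Since measures of complex data might not be numerical, the additive aggregate is the part of this sketch most likely to need replacing.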

Whatever the type of architecture, the various

processes used in data warehousing always deal with

metadata and domain-speciﬁc knowledge, in order to

achieve better exploitation and good performance.

Note that complex data are generally represented by

descriptors that may either be low-level information

(an image’s size, an audio ﬁle’s duration, the speed

of a video sequence...) or relate to semantics (rela-

tionships between objects in a picture, topic of an au-

dio recording, identiﬁcation of a character in a video


sequence...). Processing the data thus amounts to

processing their descriptors. Original data are stored,

for instance as binary large objects (BLOBs), and can

also be exploited to extract information that could enrich

their own characteristics (descriptors and metadata).
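As a rough illustration of processing descriptors rather than the original data, the Python sketch below stores low-level and semantic descriptors of one object in an XML fragment and queries them. The element and attribute names (`complex_object`, `low_level`, `semantic`, `descriptor`, `uri`) and the two helper functions are assumptions made for this example, not a published schema.

```python
import xml.etree.ElementTree as ET


def describe(uri, low_level, semantic):
    """Build an XML description of one complex object.

    `uri` points to the original data (e.g. a BLOB); `low_level` and
    `semantic` are dictionaries mapping descriptor names to values.
    """
    obj = ET.Element("complex_object", {"uri": uri})
    low = ET.SubElement(obj, "low_level")
    for name, value in low_level.items():
        ET.SubElement(low, "descriptor", {"name": name}).text = str(value)
    sem = ET.SubElement(obj, "semantic")
    for name, value in semantic.items():
        ET.SubElement(sem, "descriptor", {"name": name}).text = str(value)
    return obj


def descriptor(obj, level, name):
    """Look a descriptor up without touching the original (BLOB) data."""
    node = obj.find(f"./{level}/descriptor[@name='{name}']")
    return None if node is None else node.text


# Illustrative object: a radiography referenced by a made-up URI.
scan = describe(
    "blob://scan-001",
    {"width": 512, "height": 512},
    {"topic": "chest radiography"},
)
```

With this layout, `descriptor(scan, "semantic", "topic")` answers a query from the description alone, and descriptors extracted later from the original data can be appended to the same fragment.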

The architecture framework we propose for com-

plex data warehousing (Figure 2) exploits the XML

language. Using XML indeed facilitates the inte-

gration of heterogeneous data from various sources

into the warehouse; the exploitation of metadata

and knowledge (namely regarding the application do-

main) within the warehouse; and data modeling and

storage. The presence of metadata and knowledge

in the data warehouse is aimed at improving global

performance, even if their actual integration is still

the subject of several research projects (McBrien and

Poulovassilis, 2001; Baril and Bellahsène, 2003; Shah

and Chirkova, 2003).

This architecture framework is essentially made of:

the data warehouse kernel, which may be either ma-

terialized as an XML warehouse, or virtual (where

cubes are computed at run time); data sources; source

type drivers that notably include mapping speciﬁca-

tions between the sources and XML; and a metadata

and knowledge base layer that includes three submod-

ules related to three management processes.

The three processes for managing a data warehouse

are: the ETL and integration process that feeds the

warehouse with source data from operational data-

bases (DS Op) by using drivers that are speciﬁc to

each source type (ST ); the administration and mon-

itoring process (MD&KR) that manages metadata

and knowledge (the administrator interacts with the

data warehouse through this process); and the analy-

sis and usage process that runs user queries, produces

reports, builds data cubes, supports OLAP, etc. Each

of these processes exploits and updates the metadata

and the knowledge base. There are four types of

ﬂows: the external ﬂow, which includes the ETL and

integration ﬂow and the exploitation (analysis and us-

age) ﬂow (the warehouse may thus be considered as

a black box); the internal ﬂow, between the ware-

house kernel and the metadata and knowledge base

layer and between the metadata and knowledge base

layer and the source type drivers; the metadata and

knowledge management and maintenance ﬂow, which

acquires new knowledge and enriches existing knowl-

edge; and the reference ﬂow, which illustrates the fact

that the external ﬂow always refers to the metadata

and knowledge base layer for integration, ETL, and

analysis and usage in general.
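The interplay between the three processes and the metadata and knowledge base layer can be summarized by the following Python sketch. It is a deliberately small model built on assumptions: the classes `MetadataKnowledgeBase` and `Warehouse`, their methods, and the representation of a source type driver as a simple field-to-element mapping are hypothetical and only meant to show that every process both refers to and updates the layer.

```python
import xml.etree.ElementTree as ET


class MetadataKnowledgeBase:
    """Stand-in for the metadata and knowledge base layer (hypothetical API)."""

    def __init__(self):
        self.mappings = {}  # source type -> {source field: XML element name}
        self.log = []       # (process, action, detail) entries

    def mapping_for(self, source_type, process):
        # Reference flow: a process consults the layer before acting.
        self.log.append((process, "read", source_type))
        return self.mappings[source_type]

    def record(self, process, fact):
        # Maintenance flow: a process enriches the layer after acting.
        self.log.append((process, "update", fact))


class Warehouse:
    """Materialized XML kernel plus the three management processes."""

    def __init__(self, mkb):
        self.mkb = mkb
        self.kernel = ET.Element("warehouse")

    def administer(self, source_type, mapping):
        """Administration and monitoring: define a source-to-XML mapping."""
        self.mkb.mappings[source_type] = dict(mapping)
        self.mkb.record("admin", f"mapping defined for {source_type}")

    def etl(self, source_type, records):
        """ETL and integration: load records through the source type mapping."""
        mapping = self.mkb.mapping_for(source_type, "etl")
        for rec in records:
            doc = ET.SubElement(self.kernel, "document", {"source_type": source_type})
            for source_field, element in mapping.items():
                ET.SubElement(doc, element).text = str(rec[source_field])
        self.mkb.record("etl", f"loaded {len(records)} {source_type} record(s)")

    def analyze(self, element):
        """Analysis and usage: return all values stored under one element."""
        values = [e.text for e in self.kernel.iter(element)]
        self.mkb.record("analysis", f"queried {element}")
        return values
```

For example, after `administer("relational", {"pid": "patient_id", "dx": "diagnosis"})` and an `etl` call on two records, `analyze("diagnosis")` returns the two loaded diagnoses, and the log of the layer shows entries from all three processes.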

Note that analysis results in the form of cubes,

reports, queries, or any other intermediary results may

constitute new data sources (DS Res) that may be

reintegrated into the warehouse.

Though our proposal is only an architecture framework,

it helps us formalize the warehousing process

of complex data as a whole. Thus, we are able to iden-

tify the issues to be solved. We can also point out the

great importance of metadata in managing and ana-

lyzing complex data. Furthermore, piloting and syn-

chronizing the data warehouse processes we identify

in this framework is a research problem in itself. Optimization

techniques will be necessary to achieve an

efﬁcient management of data and metadata. Commu-

nication techniques, presumably based on known pro-

tocols, will also be needed to build up efﬁcient data

exchange solutions.

4 CONCLUSION AND

PERSPECTIVES

We addressed in this paper the problem of warehous-

ing complex data. We ﬁrst clariﬁed the concept of

complex data by providing a precise, though open,

deﬁnition of complex data. Then we presented a

general architecture framework for warehousing com-

plex data. It heavily relies on metadata and domain-

speciﬁc knowledge, which we identify as a key ele-

ment in complex warehousing, and rests on the XML

language, which helps store data, metadata and

knowledge, and facilitates communication between

the various warehousing processes. This proposal

takes into account the two main possible families of

architectures for complex data warehousing (namely

virtual data warehousing and centralized, XML warehousing).

Finally, we briefly presented the main issues

in complex data warehousing, especially regarding

data integration, the modeling of complex data

cubes, and performance.

This study opens many research perspectives. Up

to now, our work mainly focused on the integration

of complex data in an operational data store (ODS). Though we also worked

on the multidimensional modeling of complex data,

this was our ﬁrst signiﬁcant advance into the actual

warehousing of complex data. In order to test and

reﬁne our hypotheses in the ﬁeld, we plan to apply

our proposals on three different application domains

we currently work on (medicine, banking and geography).

Such practical applications should help us devise

solutions to the many issues regarding metadata

management and performance, and experiment with

both the virtual and XML warehousing solutions.

ICEIS 2005 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

Figure 2: Complex data warehouse architecture framework

One of our important perspectives deals with the

selection of a representation mode for metadata and

domain-specific knowledge. Knowledge related to

the application domain is actually operational information

about complex data. It may be considered

as metadata. In order to remain in the XML-based,

homogeneous environment of our architecture framework,

the formalisms that seem best-fitted to represent

metadata are XML and RDF (Resource Description

Framework) schemas. These tools are appropriate

to represent both low-level and semantic descriptors.

Furthermore, they are suited to reasoning

for metadata exploitation. The Common Warehouse

Metamodel (CWM), an OMG standard for data warehouses

(OMG, 2003), could also help us manage

metadata and knowledge. But can the CWM meta-

models integrate the performance factors of a com-

plex data warehouse? Should these metamodels be

extended or would it be more interesting to propose

new submodels instead? These are largely open ques-

tions.

REFERENCES

Ammoura, A., Zaiane, O. R., and Goebel, R. (2001). To-

wards a Novel OLAP Interface for Distributed Data

Warehouses. In 3rd International Conference on Data

Warehousing and Knowledge Discovery (DaWaK 01),

Munich, Germany, volume 2114 of LNCS, pages 174–

185.

Baril, X. and Bellahsène, Z. (2003). Designing and Managing

an XML Warehouse, pages 455–473. XML Data

Management. Addison Wesley.

Gançarski, P. and Trousse, B., editors (2004). Complex

data mining in a KDD process (EGC 04 workshop),

Clermont-Ferrand, France.

Hümmer, W., Bauer, A., and Harde, G. (2003). XCube:

XML for data warehouses. In 6th ACM International

Workshop on Data Warehousing and OLAP (DOLAP

03), New Orleans, USA, pages 33–40.

Inmon, W. H. (2002). Building the Data Warehouse. John

Wiley & Sons, third edition.

Kimball, R. and Ross, M. (2002). The Data Warehouse

Toolkit: The Complete Guide to Dimensional Model-

ing. John Wiley & Sons, second edition.

McBrien, P. and Poulovassilis, A. (2001). A Semantic

Approach to Integrating XML and Structured Data

Sources. In 13th International Conference on Ad-

vanced Information Systems Engineering (CAiSE 01),

Interlaken, Switzerland, volume 2068 of LNCS, pages

330–345.

Nassis, V., Rajugan, R., Dillon, T. S., and Rahayu, W. J.

(2004). Conceptual Design of XML Document Ware-

houses. In 6th International Conference on Data

Warehousing and Knowledge Discovery (DaWaK 04),

Zaragoza, Spain, pages 1–14.

OMG (2003). Common Warehouse Metamodel (CWM)

Speciﬁcation version 1.1. Object Management Group.

Shah, A. and Chirkova, R. (2003). Improving Query Perfor-

mance Using Materialized XML Views: A Learning-

Based Approach. In 1st International Workshop on

XML Schema and Data Management (XSDM 03 - ER

03 Workshops), Chicago, USA, volume 2814 of LNCS,

pages 297–310.

Xyleme, L. (2001). A Dynamic Warehouse for XML

Data of the Web. IEEE Data Engineering Bulletin,

24(2):40–47.
