DATA ACCESS THROUGH A DYNAMIC DATA MODEL
A Concept for Accessing Heterogenic Data Structures in RDF Databases
Alexander Wendt, Benjamin Dönz and Dietmar Bruckner
Institute of Computer Technology, Vienna University of Technology, Gusshausstrasse 27-29, A-1040 Vienna, Austria
Keywords: Data model, Ontology, OWL, RDF, Graph, Triple-store, Query, Concept, SPARQL, Protégé.
Abstract: This paper introduces an alternative method for using ontologies to create a dynamic data model for RDF
databases or other schema-less databases. The main challenge is how to continuously adapt the data model
and its queries to new data, which may be imported with any given structure.
1 INTRODUCTION
Schema-less NoSQL databases (Levitt, 2010),
(Stonebraker, 2010) make it possible to import
data in its original structure. Often, this
data is imported from flat tables without further
normalization. The advantage of such an import is
that any type of data can be imported and that the
effort of pre-processing data is reduced. Instead, this
work is transferred from the import phase to the data
access. The challenge in building a relational data
model is to create a model, which is robust and
general enough to be able to accommodate new data.
The challenge in building systems with schema-less
databases is to find a way to give the user access to
all the data in a heterogenic structure within a single
request to the system.
The dynamic data model introduced in this paper
provides a method for implementing ontologies as
an additional layer between the user interface and a
database with heterogenic structure. The challenge
and the motivation for developing such a model can
be summarized by five requirements:
(1) The database shall be schema-less, so that data can be imported in its original structure.
(2) The user shall be able to access all data with a simple query.
(3) The data model shall be exchangeable and expandable at any time.
(4) The user shall be able to adapt existing queries and create new ones if new sources are added or new demands emerge.
(5) It shall be investigated how ontologies can be used to classify and sort the data in a schema-less database.
2 BACKGROUND
The concept is evaluated by implementing the
dynamic data model on a system to which the
requirements listed in Section 1 apply.
2.1 System Environment and Data Sources
The evaluation system can be categorized as a
decision support system (Shim et al., 2002). Its goal
is to provide the user with an overview of the
situation of a geographical region in the hypothetical
case of a disaster. This makes it possible to
estimate the consequences of a disaster and
provides insights for strategic planning.
The system is a web-based solution, where a
common data store is accessible by several users via
a web interface. The user interface provides
possibilities to execute queries and display the
results on a map (Google Maps, (n. d.)) and in
tabular form. New sources and data structures can be
added during operation. For this system, the RDF
(Resource Description Framework) database
AllegroGraph (AllegroGraph, (n. d.)) was used. It
supports geospatial queries and several
programming interfaces, e.g. Java Jena, Java
Sesame, Python and Lisp.
Typical questions in the case of an epidemic, which
should be answered by combining different data
sources in this system, are: “In a certain region, how
many hospitals are there per inhabitant?” or “What
could be the potential economic loss for companies
in the grocery branch in a given region if all
fowl in another region were lost?”. Such
questions are answered by combining several data
sources and by performing calculations within the
queries.
Wendt A., Dönz B. and Bruckner D.
DATA ACCESS THROUGH A DYNAMIC DATA MODEL - A Concept for Accessing Heterogenic Data Structures in RDF Databases.
DOI: 10.5220/0003873704390444
In Proceedings of the 4th International Conference on Agents and Artificial Intelligence (ICAART-2012), pages 439-444
ISBN: 978-989-8425-95-9
Copyright © 2012 SCITEPRESS (Science and Technology Publications, Lda.)
The available data sources can be categorized
into the following four domains: Political regions,
demography, company data and branch data for
categorizing companies. The original data sources
are collected in six tables. Additionally, political
maps are available as shape-files (ESRI, 1998). The
tables are imported from CSV files, where each row
defines the subject URI, each column the predicate
URI and each cell value the object literal. A cell
value can also be converted to a URI of its own.
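The row/column/cell mapping described above can be sketched as follows. This is a minimal illustration, not the importer used in the system: the namespace `BASE`, the `company/` path segment, the key column `id` and the sample values are all assumptions made for the example.

```python
import csv
import io

# Hypothetical namespace; the paper does not state the URIs it used.
BASE = "http://example.org/"

def rows_to_triples(csv_text, key_column):
    """Turn each CSV row into (subject, predicate, object) triples.

    The key column of a row yields the subject URI, every other column
    name yields a predicate URI, and the cell value is kept as a plain
    literal. Empty cells produce no triple.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = BASE + "company/" + row[key_column]
        for column, value in row.items():
            if column == key_column or value == "":
                continue
            triples.append((subject, BASE + column, value))
    return triples

# Invented sample data in the "year1"/"revenue1" layout of the paper.
sample = "id,year1,revenue1,year2,revenue2\nc1,2010,500,2009,450\nc2,2010,120,,\n"
triples = rows_to_triples(sample, "id")
for triple in triples:
    print(triple)
```

The structure of the table is carried over unchanged, so the pairs “year1”/“revenue1” and “year2”/“revenue2” end up as four unrelated predicates on the same subject.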
In this paper, the dynamic data model will be
explained with an example of the domain “Company
data”. Companies have provided financial data like
profit, asset and revenue. The example is based on
the querying of some attributes of that company
data. In the original input table, revenue is available
as pairs of “year1” and “revenue1”, “year2” and
“revenue2” and so on. “Asset” and “Profit” refer to
the latest year of revenue, which is “year1”.
2.2 Potential Usage of Ontologies
The source data structure uses several predicates and
subjects for the same thing, in this case pairs of
attributes like “year1” and “revenue1”. One of the
advantages of ontology-based models is the potential
use of reasoners for classifying data. In projects like
HarmonISA (HarmonISA, (n. d.)), the task is to
classify land types (grassland, forest, sea). The main
task for the reasoner is to classify new data from
different sources into a skeleton ontology based on
its attributes, see (Peedell et al., 2005). This is
used to merge data sources and models and to query
the whole system, which contains several models,
with a single query. The query retrieves everything
that fulfills certain criteria, independent of the
reference model.
Another application of reasoners is presented in
(Fallahi et al., 2008, p. 354). In this service oriented
architecture, the reasoner is used for matchmaking
of requests to services. Each available service is
modeled in an ontology. The requests, which are
also modeled in the ontology, are more specialized
than the services and are classified into classes by
the inference engine. From the potential services,
which could fulfill the request, the best match is
used for the task.
In our system and in the example with company
data, the reasoner could be used to classify
companies into e.g. small, medium-sized and large
companies depending on certain defined criteria. A
company could be defined as anything that has some
values from the classes “address”, “employees” and
“revenue”. However, for that to work, the
“address” of a company must not be a literal
attached directly to the instance of the class
“company”, which is what happens when flat tables
are imported unprocessed in their original form.
Instead, it has to be assigned its own class
“address”. A new layer of hierarchy has to be
inserted between the company URI and the actual
address values. The flat data structure in the
database would have to be normalized like in
relational databases, i.e. literals would have to be
transformed to URIs. Otherwise,
an ontology model would be highly populated with
only a few subjects and several predicates
connecting them to objects or in this case to literals.
As long as the data is not processed, this type of
classification does not make much sense here. In
order to still be able to combine the heterogenic data
sources in flat tables, the dynamic data model was
developed.
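To make the normalization step discussed above concrete, the following sketch moves address literals from a company onto an address node of its own. It only illustrates the preprocessing that the dynamic data model is meant to avoid; the predicate `hasAddress`, the class `Address`, the `/address` suffix and the sample data are hypothetical.

```python
def normalize_address(triples, address_predicates, base="http://example.org/"):
    """Insert an address node between a company and its address literals."""
    out = []
    linked = set()  # companies that already received an address node
    for subject, predicate, obj in triples:
        if predicate in address_predicates:
            address_node = subject + "/address"
            if subject not in linked:
                # New layer of hierarchy: company -> address node -> literals.
                out.append((subject, base + "hasAddress", address_node))
                out.append((address_node, "rdf:type", base + "Address"))
                linked.add(subject)
            out.append((address_node, predicate, obj))
        else:
            out.append((subject, predicate, obj))
    return out

flat = [
    ("c1", "street", "Main St 1"),
    ("c1", "city", "Vienna"),
    ("c1", "revenue1", "500"),
]
normalized = normalize_address(flat, {"street", "city"})
for triple in normalized:
    print(triple)
```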
3 CONCEPT
This data model is based on two main concepts: The
creation of artificial classes and the creation of
database queries by combining elements of subject-
predicate relations. If the content of the model is
queried, a new query is generated with the concepts
of the model and used for the actual database. The
main function of the data model is to provide a
flexible way to automatically generate queries for
the database by considering the requests of the user.
This is done by defining a meta-ontology, which
consists of the following classes: Class,
SubjectClass, QueryConcept, SubQueryConcept,
Group, AtomQuery. The classes and instances in the
meta-model are completely separated from the
classes and instances in the actual database. The
only thing they share is the RDF-database as a
storage medium. In Figure 1, the database is shown
to the left with a class “Class1”, an instance
“Instance1”, two literals “Literal1” and “Literal2”,
which are connected with “Instance1” via the
predicates “Predicate1” and “Predicate2”.
Within the meta-ontology, on the right side of
Figure 1, instances of Class are created (meta-ontology
classes are written in italic). They represent
subdomains in the database and are usually defined
from the predicates in the database, i.e. the
predicates in the actual database are transferred into
instances of Class in the model. The instances of
Class are independent of the real classes in the
database (to the left in Figure 1). In Figure 1, the real
class “Class1” is created as “Class1”, which is an
instance of Class. In addition, SubjectClass instances
are created, which are mapped to the real existing
classes in the database. “Class1” in the database is
represented here by the SubjectClass instance
“SubjectClassInstance”. This is done by adding the
URL and name of “Class1” to
“SubjectClassInstance”. SubjectClass instances are
then mapped to the instances of Class with the
predicate “belongsTo”. In that way, the data model
is independent of the existing classes. Further, the
predicates in the database, e.g. “Predicate1”, can also
be created as classes. This is useful when bundling
data. The instances of Class are directly accessible
on the user interface. Selecting an instance of Class
lets the user execute the queries assigned to that
instance. The instances of Class can thus be seen as
categorizers for queries.
Figure 1: Creation of artificial classes for the data model.
In the company data example, the following
instances are created in Class: “Organization”,
“Finance”, “Revenue”, “Profit” and “Asset”. In the
database, the class “Organization” can be found, but
not the class “Finance”. “Finance” is created as a
new hierarchical layer between the “Organization”
and its attributes “Revenue”, “Profit” and “Asset”.
“Revenue” is made an instance of Class, as it
contains pairs of “revenue1” and “year1” and so on.
Although the database is flat, the user interface gets
a hierarchical structure.
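The hierarchy of the company data example can be sketched with two small data structures. The paper does not state how the nesting between instances of Class is stored, so the `children` list, the Python class names and the example URI are assumptions; only the instance names and the “belongsTo” mapping come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class ModelClass:
    """An instance of the meta-ontology class Class (an artificial class)."""
    name: str
    children: list = field(default_factory=list)  # assumed way to nest instances

@dataclass
class SubjectClass:
    """Maps a real database class (URI and name) to an instance of Class."""
    uri: str
    name: str
    belongs_to: ModelClass  # corresponds to the predicate "belongsTo"

revenue = ModelClass("Revenue")
profit = ModelClass("Profit")
asset = ModelClass("Asset")
# "Finance" exists only in the model, not in the database.
finance = ModelClass("Finance", [revenue, profit, asset])
organization = ModelClass("Organization", [finance])

# Only "Organization" has a counterpart among the real database classes.
org_subject = SubjectClass("http://example.org/Organization", "Organization", organization)

def outline(node, depth=0):
    """Return the hierarchy as indented lines, as a user interface might list it."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(organization)))
```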
In order to query the data, QueryConcepts are
defined and assigned to the corresponding instances
of Class. A QueryConcept is a template of the
structure of a query for the database. It works like a
function with filter parameters as inputs and a
SPARQL-query string as output.
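The function-like behaviour of a QueryConcept can be illustrated with a small sketch: filter parameters go in, a SPARQL string comes out. The function name, its parameters and the namespace bound to `x:` are hypothetical, and the numeric comparisons assume that the values are stored as typed numeric literals.

```python
# Hypothetical namespace for the prefix x: used in the paper's listings.
PREFIX = "PREFIX x: <http://example.org/> "

def revenue_query(min_revenue=None, year=None):
    """Hypothetical QueryConcept: filter parameters in, SPARQL string out."""
    patterns = ["?company x:year1 ?year", "?company x:revenue1 ?revenue"]
    filters = []
    if year is not None:
        filters.append(f"?year = {int(year)}")
    if min_revenue is not None:
        filters.append(f"?revenue >= {float(min_revenue)}")
    body = " . ".join(patterns)
    if filters:
        # All constraints are combined in a single FILTER expression.
        body += " . FILTER (" + " && ".join(filters) + ")"
    return PREFIX + "SELECT ?year ?revenue WHERE { " + body + " }"

print(revenue_query())
print(revenue_query(year=2010, min_revenue=100000))
```

Leaving a parameter unset simply omits the corresponding constraint, so one template serves both the unfiltered and the filtered request.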
The idea of using QueryConcepts originates from
(Lutz, 2007), (Klien et al., 2004) and (Bügel et al.,
2007), where a concept with the same name is
applied for service discovery. Services in systems
and requirements stated by the user can be described
using ontologies. It is possible to compare service
descriptions with the requirements’ descriptions of a
certain operation. According to (Lutz, 2007, p. 14),
domain ontologies contain the primitives of a
domain and provide a shared vocabulary for services
in the system. Application ontologies, which are
derived from a domain ontology, contain necessary
constraints for a certain operation of the system.
They allow the creation of semantic queries (Lutz,
2007, p. 21), which contain information about
requirements and conditions from a certain user
request. Semantic advertisements are equivalent to
the semantic queries, but are created for each
provided service. The semantic queries and semantic
advertisements are matched against each other, in
order to select the best fitting service for a selected
operation. It is possible for a user to select a domain,
choose an operation and set constraints for that
operation via a QueryConcept. In the dynamic data
model, only the idea to define a domain and then to
select an operation is used.
A QueryConcept can be explained bottom-up,
starting with the smallest element, the AtomQuery.
An AtomQuery is a representation equal to a
SPARQL statement of one subject and predicate (see
below). The subject is related to one instance of
SubjectClass. This is a subject-predicate relation.
The constraints for the object are set by filters. An
example of the structure of a QueryConcept is
shown in Figure 2.
Figure 2: The structure of a QueryConcept.
In the company data example, the predicate
“revenue1” together with its domain “Company”,
which is represented as an instance of SubjectClass,
forms such an AtomQuery. The AtomQuery additionally
contains information about the data type of the
predicate for use in filters. As shown in Section 4,
the order of the SPARQL statements plays
a major role for the performance of the system.
Therefore, it is also possible to define a certain order
for a statement, which can be adapted for each
QueryConcept.
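A possible in-memory form of an AtomQuery, including the data type and the configurable position of its statement, is sketched below. The field names and the integer `order` are assumptions; the paper only states that a data type and an order can be stored.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AtomQuery:
    subject_var: str  # variable bound to an instance of the SubjectClass
    predicate: str    # predicate in the database, e.g. "x:revenue1"
    object_var: str   # variable that receives the object value
    datatype: str     # data type of the object, kept for building filters
    order: int        # position of the statement within the generated query

    def to_sparql(self):
        """Render the AtomQuery as one triple pattern."""
        return f"?{self.subject_var} {self.predicate} ?{self.object_var}"

def ordered_patterns(atoms):
    """Return the triple patterns sorted by their configured order."""
    return [atom.to_sparql() for atom in sorted(atoms, key=lambda a: a.order)]

atoms = [
    AtomQuery("company", "x:revenue1", "revenue", "xsd:decimal", 2),
    AtomQuery("company", "x:year1", "year", "xsd:integer", 1),
]
print(ordered_patterns(atoms))
```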
One or more AtomQueries are assigned to a
Group. Within a Group, each AtomQuery is
connected by a logical conjunction (AND, in
SPARQL “.”).
The purpose of the Group is to allow parallel
querying of the same thing, which is addressed by
different predicates.
One or more Groups define a SubQueryConcept.
Within the SubQueryConcept, the Groups are
connected with a logical disjunction (OR, in
SPARQL “UNION”). In this way, several Groups
are queried in parallel but with common result
columns. This process simulates the use of a higher
normal form in the database. The result format and
column names of the executed queries for each
group are the same for all groups.
Finally, the SubQueryConcepts define the
QueryConcept, where the SubQueryConcepts are
connected with a logical conjunction (AND, in
SPARQL “.”). It allows adding common predicates
of parallel groups to the query without having to
include them in each group.
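The three composition rules (AND within a Group, UNION between Groups, AND between SubQueryConcepts) can be sketched with plain string functions. The function names are hypothetical and the prefix `x:` is left undeclared, as in the paper's listings; the predicates are those of the company data example.

```python
def group(patterns):
    """AND: join the AtomQuery patterns of one Group with '.'."""
    return "{ " + " . ".join(patterns) + " }"

def sub_query_concept(groups):
    """OR: join the Groups of one SubQueryConcept with UNION."""
    return " UNION ".join(groups)

def query_concept(variables, sub_query_concepts):
    """AND: join the SubQueryConcepts and wrap them in a SELECT."""
    select = " ".join("?" + v for v in variables)
    where = " . ".join(sub_query_concepts)
    return f"SELECT {select} WHERE {{ {where} }}"

# Both Groups bind the same variables, so the result columns are shared.
revenue_years = sub_query_concept([
    group(["?company x:year1 ?year", "?company x:revenue1 ?revenue"]),
    group(["?company x:year2 ?year", "?company x:revenue2 ?revenue"]),
])
# A common predicate is added once instead of being repeated in each Group.
company_name = sub_query_concept([group(["?company x:name1 ?name"])])

query = query_concept(["name", "revenue", "year"], [company_name, revenue_years])
print(query)
```

Because every Group binds `?year` and `?revenue`, the differently named source predicates appear to the user as one pair of columns.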
The construction of a QueryConcept can be
demonstrated with the company data example. Two
AtomQueries contain the predicates “year1” and
“revenue1”, respectively. The subject of each
statement is an instance of the real class “Company”
(in the database). The AtomQueries are assigned to
the Group “revenueyear1”. The next Group
“revenueyear2” is made up of “year2” and “revenue2”.
The SubQueryConcept “revenueyears” is a union of the
Groups “revenueyear1” and “revenueyear2”.
Another SubQueryConcept is the company address
“companyaddress”, which contains one Group. This
Group contains the AtomQueries for “Street”,
“ZIP-Code” and “City”. The SubQueryConcepts
“companyaddress” and “revenueyears” now define
the QueryConcept “companyrevenueaddress”. The
result of the executed QueryConcept contains the
following columns: “Street”, “ZIP-Code”, “City”,
“year” and “revenue”. The listings below show
what a standalone AtomQuery, Group,
SubQueryConcept and QueryConcept could look
like:
AtomQuery:
SELECT ?year WHERE { ?company x:year1 ?year }
Group:
SELECT ?year ?revenue WHERE { ?company x:year1 ?year; x:revenue1 ?revenue }
SubQueryConcept:
SELECT ?year ?revenue WHERE {
  { ?company x:year1 ?year; x:revenue1 ?revenue }
  UNION
  { ?company x:year2 ?year; x:revenue2 ?revenue }
}
QueryConcept:
SELECT ?name ?revenue ?year WHERE {
  ?company x:name1 ?name .
  { ?company x:year1 ?year; x:revenue1 ?revenue }
  UNION
  { ?company x:year2 ?year; x:revenue2 ?revenue }
}
Because the data model is independent of the
actual database structure, several different data
models can be used at the same time. The
data model is customizable for each user. It allows
data to be imported in its original structure. After the
import of the original data, where flat tables are
imported, the data model can be generated in an
OWL-editor like Stanford Protégé (Stanford
Protégé, (n. d.)). There, the data model has to be
manually adapted for the new data sources.
Afterwards, it can be imported into its own
namespace and context in the database.
4 IMPLEMENTATION AND RESULTS
The dynamic data model was implemented with an
AllegroGraph RDF database on a set of about
300 000 Austrian companies. A Java Sesame API
was used to access the database. In the software,
each class of the meta-ontology was mirrored by a
corresponding Java class. From these classes, the user
interface was automatically generated based on the
data model. To use the model, the user provides
inputs through the user interface (see Figure 3). The
user first selects the class (here equivalent to a
domain) that contains the requested query. Once the
class of interest is selected, all available
QueryConcepts are shown. The user selects a
QueryConcept, fills out the constraints and generates
the query. The query is executed in the database,
returning a result table and/or geographic structures
on the map.
Figure 3: A screenshot of the user interface.
Tests on an Intel® Xeon® 3.0 GHz processor
with 4 GB RAM showed that it was possible to
construct several types of SPARQL queries by
combining the “building blocks”. Joins can also be
created efficiently. The main issue when using the
dynamic data model was performance, but that is an
issue for SPARQL and triple stores in general.
Often, the execution time was not satisfactory and
queries did not complete. In these cases, the query
complexity was too high due to the use of several
joins as well as keywords like ORDER BY. In order
to be able to complete such a query, the LIMIT
keyword had to be used. Furthermore, one of the most
important performance factors that can be optimized
is the order of the single statements within the query
(Stocker et al., 2008), (Vidal et al., 2010). Therefore,
statement ordering was considered within the data
model. For instance, with a suitable statement order
a certain query completed within 2 s, whereas
otherwise it did not complete before the execution
time-out (4 min). This shows the impact of wrong
statement ordering.
The usage of Stanford Protégé as a data model
editor has both advantages and drawbacks. Users
who are supposed to modify the model complain
about the large effort needed to understand how to
use the editor and to create queries. On the other
hand, for a person with basic knowledge about
ontologies, Stanford Protégé provides a very fast
method for constructing database queries. It is done
by populating the data model with instances,
without the need to know SPARQL.
5 CONCLUSIONS AND OUTLOOK
The purpose of the dynamic data model was to fulfil
the requirements stated in Section 1. Results showed
that it is possible to adapt the data model on a
running system when new data is imported into the
schema-less database. The user just needs to add the
new structure to the model, create new
SubQueryConcepts for the corresponding
QueryConcepts and replace the model in the database.
This way, the model is exchangeable and
expandable at any time. A user who knows Stanford
Protégé or any other OWL editor can add and
modify each query by manipulating the instances of
the meta-model. Finally, it could be shown that it is
not useful to apply a reasoner to this type of
data, but that this approach provides an alternative
for accessing data. The main issues are the performance
and the high effort to understand how to use
Stanford Protégé. It is possible to optimize the
queries by changing the order of the statements.
Further development of the dynamic data model
would be to involve functions from other programs
or databases, i.e. to let the result table of a
QueryConcept be the input of another QueryConcept
containing functions. An AtomQuery could contain
the link to an operation instead of the link to a
predicate in the database. That way, the dynamic
data model would be more powerful in calculation
and simulation tasks. Another use could be to
transport the dynamic data model to a relational
database instead of the RDF-Triple store and use it
as a query editor. The queries would not be
SPARQL, but SQL. This would solve the
performance problem and the RDF-Triple store
would only be used as storage of the data model.
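How the same Group structure might be rendered as SQL instead of SPARQL can be sketched with the SQLite module from the Python standard library. This is only an illustration of the outlook, not an implemented part of the system: the table layout, the function `groups_to_sql` and the sample row are assumptions. `UNION ALL` is used because, like SPARQL UNION, it keeps duplicate rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Flat table in the original import layout (invented sample row).
conn.execute(
    "CREATE TABLE company (name TEXT, year1 INTEGER, revenue1 REAL, "
    "year2 INTEGER, revenue2 REAL)"
)
conn.execute("INSERT INTO company VALUES ('Acme', 2010, 500.0, 2009, 450.0)")

def groups_to_sql(table, shared_columns, groups):
    """Render one SELECT per Group and combine them with UNION ALL.

    Each Group is a mapping {result_column: source_column}, so all
    Groups deliver the same result columns.
    """
    selects = []
    for mapping in groups:
        renamed = [f"{source} AS {alias}" for alias, source in mapping.items()]
        columns = ", ".join(list(shared_columns) + renamed)
        selects.append(f"SELECT {columns} FROM {table}")
    return " UNION ALL ".join(selects)

sql = groups_to_sql("company", ["name"], [
    {"year": "year1", "revenue": "revenue1"},
    {"year": "year2", "revenue": "revenue2"},
])
rows = sorted(conn.execute(sql).fetchall())
print(sql)
print(rows)
```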
ACKNOWLEDGEMENTS
This work was partly supported by the FFG
(Austrian Research Funding Organization) [819065].
REFERENCES
AllegroGraph (n. d.), AllegroGraph® RDFStore 4.2.1.
Retrieved June 14, 2011, from http://www.franz.com/agraph/allegrograph
Bügel, U., Hilbring, D., Denzer, R. (2007). Application of
Semantic Services in ORCHESTRA, at the
International Symposium on Environmental Software
Systems (ISESS), Prague, May 22-25, 2007, Czech
Republic
ESRI, Environmental Systems Research Institute, Inc.
(1998). ESRI Shapefile Technical Description, an
ESRI White Paper, July 1998. Retrieved June 21,
2011, from http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf, USA
Fallahi, G. R., Frank, A. U., Mesgari, M. S., Rajabifard, A.
(2008). An ontological structure for semantic
interoperability of GIS and environmental modeling,
in International Journal of Applied Earth Observation
and Geoinformation 10 (2008) 342–357,
doi:10.1016/j.jag.2008.01.001
Google Maps (n. d.), Google Maps. Retrieved June 21,
2011, from http://maps.google.com
HarmonISA (n. d.), HarmonISA. Retrieved June 21, 2011,
from http://www.isamap.info/html/harmonisa.html
Klien, E., Lutz, M., Kuhn, W. (2004), Ontology-based
discovery of geographic information services—An
application in disaster management, in Computers,
Environment and Urban Systems 30 (2006) 102–123,
doi:10.1016/j.compenvurbsys.2005.04.002
Levitt, N. (2010). Will NoSQL Databases Live Up to
Their Promise? in Computer 43 Issue:2,
doi:10.1109/MC.2010.58
Lutz, M. (2007). Ontology-Based Descriptions for
Semantic Discovery and Composition of
Geoprocessing Services, in Geoinformatica, 11:1-36,
doi:10.1007/s10707-006-7635-9
Peedell, S., Friis-Christensen, A., Schade, S. (2005).
Approaches to Solve Schema Heterogeneity at the
European Level, 11th EC GI & GIS
Shim, J. P., Warkentin, M., Courtney, J. F., Power, D. J.,
Sharda, R., Carlsson, C. (2002). Past, present, and
future of decision support technology, in Decision
Support Systems 33 page 111–126,
doi:10.1016/S0167-9236(01)00139-7
Stanford Protégé. (n. d.). welcome to protégé. Retrieved
June 21, 2011, from http://protege.stanford.edu
Stocker, M., Seaborne, A., Bernstein, A., Kiefer, C.,
Reynolds, D. (2008). SPARQL Basic Graph Pattern
Optimization Using Selectivity Estimation, in
Proceeding of the 17th international conference on
World Wide Web WWW 08 (2008), page 595-604,
ACM Press, ISBN: 9781605580852,
doi:10.1145/1367497.1367578
Stonebraker, M. (2010). SQL databases v. NoSQL
databases, in Communications of the ACM Volume 53
Issue 4, New York, NY, USA
Vidal, M., Ruckhaus, E., Lampo, T., Martínez, A., Sierra,
J., Polleres, A. (2010). On the efficiency of joining
group patterns in SPARQL queries, in Proceedings of
the 7th European Semantic Web Conference
(ESWC2010), Heraklion, Greece. Springer