Citable by Design
A Model for Making Data in Dynamic Environments Citable
Stefan Pröll¹ and Andreas Rauber¹,²
¹SBA Research, Vienna, Austria
²Technical University of Vienna, Vienna, Austria
Keywords: Dynamic Data Citation, Relational Databases, SQL, Persistent Identifiers.
Abstract: Data forms the basis for research publications. Still, the focus of researchers is on the paper-based publication; data is rather seen as a supplement that can be offered as a download, often without further comment. Yet validation, verification, reproduction and re-use of existing knowledge are only possible when the research data is accessible and identifiable. For this reason, precise data citation mechanisms are required that allow experiments to be reproduced with exactly the same data basis. In this paper, we propose a model for citing, identifying and referencing specific data sets within their dynamic environments. Our model allows the selection of subsets that support experiment verification and result re-utilisation in different contexts. The approach is based on assigning persistent identifiers to timestamped queries which are executed against timestamped and versioned databases. This facilitates a transparent implementation and scalable means to ensure that identical result sets are delivered upon re-invocation of the query.
1 INTRODUCTION
Scientific research has fully arrived in the digital age, where researchers have powerful infrastructures at their fingertips (Hey et al., 2009). Within the area of eScience, increasingly complex experiments are based on large data sets. Still, many scientists focus primarily on the paper-based publication. In principle, these publications have remained conceptually the same for decades. Despite the fact that it has never been so easy to publish not only the results in written form, but also the underlying data on which these results are founded, little attention is paid to the research data.
Many funding bodies, such as the FWF¹ in Austria or the European Union², but also governments and journals, e.g. Nature³, demand or at least recommend the availability of the data and other material that is required for the re-execution of an experiment. So far, data is often considered a supplement or metadata to the publication that has to be cited in its entirety. Thus, although several approaches address the data citation problem, there are open issues specifically concerning the scalable and machine-readable citation of subsets of potentially dynamically changing, evolving and growing data sets. If data is deposited, it is often submitted in large, indivisible units and offered as a download. Data will only be reused if it can be utilised within different scientific contexts. Hence a more flexible way of citing specific subsets is required.

¹http://www.fwf.ac.at/en/downloads/pdf/free-research-needs-the-free-circulation-of-ideas.pdf
²http://ec.europa.eu/research/science-society/document_library/pdf_06/recommendation-access-and-preservation-scientific-information_en.pdf
³http://www.nature.com/authors/policies/availability.html
Data sets need to be identifiable in order to foster reuse and to enable validation, reproduction and re-execution of scientific experiments. We propose a model for citing subsets of large-scale research data. In this paper our focus is specifically on relational database management systems (RDBMS), which allow precise subsets to be defined with the SQL language. We concentrate on the queries and their results, not on large, indivisible data dumps, as the basis for reference. Our model increases the scalability of data citation by assigning unique identifiers only to the queries that select the data used in subsequent experiments. Because the model builds upon temporal database aspects and unambiguous result presentation, persistently citing only the query is sufficient. It guarantees not only consistent result sets across time, but also consistent result lists.
The remainder of this paper is structured as follows: Section 2 provides an overview of current data citation practices and motivates the need for a new model for data citation. Section 3 introduces a model for citing data in dynamic environments. The model is described for relational databases and generalised for generic data sources. Section 4 concludes the paper and provides an outlook on our future work.
2 HOW DATA IS CITED TODAY
Publications increasingly contain references to data that was used or generated during the research that substantiates the work. However, research data sets are often treated as one entity, i.e. indivisible, static and referenceable as one unit. In many cases, data is referenced bibliographically. As a minimum (Brase, 2009), the following metadata about a data set are required (Australian National Data Service, 2011): author, title, date, publisher, identifier and access information. The data itself is then often deposited at an institutional site and referenced by providing a URL. Obviously this mechanism is not suitable for sustainable data citation, for several reasons. Uniform Resource Locators (URLs)⁴ have not been designed to be stable for the long term. As their name implies, URLs refer to a location, not to the object itself. As a result, many URLs that served as data citation references are no longer accessible, either because the author of the data set left the institution and the Web page was taken down, or because the server moved and the location changed.
To overcome the problem of changing locations, the concept of persistent identifiers was introduced. Persistent identifiers (PIDs) provide unique identification of digital objects and reliable locations of Internet resources. PIDs require organisational effort to manage the link between the data and the identifier. Also, services for locating and accessing objects are necessary. The organisations providing these services are called Registration Authorities (RAs). These RAs are responsible for the long-term access, resolution and maintenance of the identifiers they issue for digital objects. Different solutions exist for the implementation of persistent identifiers; the authors of (Hilse and Kothe, 2006) provide an overview of the most common approaches. In (Bellini et al., 2008), six steps are identified for implementing a persistent identifier system:
1. Select the resource that needs persistent identification and define the granularity.
2. Decide which RA is trustworthy and suitable.
3. Define resolution granularity and access rights.
4. Assign a resource name and register the object.
5. Execute the resolution service.
6. Maintain the link between the PID and the resource.

⁴www.ietf.org/rfc/rfc1738.txt
Although persistent identifiers solve the problem of locating digital objects, there are drawbacks for dynamic data. As stated in the enumeration above, the granularity of the identifiers can be adjusted to the requirements of the data set, and subsets require their own identification and metadata. Assigning PIDs to data portions of finer granularity, i.e. database rows or even cells, would require enormous numbers of unique identifiers and yield infeasible citations. PID approaches are very well suited for static data, which only serves as a reference point once it has been created; the identifier and additional metadata are then sufficient to search for, identify and retrieve the data again. However, many settings require us to go beyond these limitations and introduce scalable and machine-actionable methods that can be used in dynamically changing, very large databases. Also, many data sets continue to grow and are updated while they are used in experiments. In order to enable data citation in dynamic environments, versioning support is required. Furthermore, different stakeholders may be interested in diverse portions of the data. Hence, clearly defined subsets of the data need to be identifiable and citable as well. These are some of the reasons why PIDs assigned to entire data sets or databases are not sufficient for several applications.
3 CITING DYNAMIC DATA
In many cases research data is not just static: it can change and evolve over time, and records can be updated or deleted. To understand which data was actually involved in an experiment and to reference that data, a new model is required. In order to unambiguously and transparently cite subsets of data under such conditions, the following requirements need to be met:

1. Subsets of large data collections can be referenced.
2. Dynamic data can be handled.
3. Scalability is enabled.
4. Implementation is transparent.
The first requirement covers the reuse of data, which enables new analyses to be performed on old data and therefore generates new knowledge. The second requirement covers the capability of citing dynamically changing data. Data sources can potentially be huge in size; citing individual attributes and cells would require enormous numbers of unique identifiers and yield infeasible citations. Hence the third requirement calls for scalable solutions that are feasible for large data sources. The fourth requirement regards usability: a solution will only be accepted if it is pragmatic and transparent. The proposed requirements are valid for all kinds of research data formats. We demonstrate and motivate the proposed model by using relational databases to tackle these four requirements. Section 3.3 then introduces a generic model that can be applied to other data, such as flat files, streaming data or various other data formats.
3.1 Dynamic Data Citation using Relational Databases
Research data is often stored in relational database management systems (RDBMS), and the results they deliver are the basis for further processing. We concentrate on the queries and their results, not on large, indivisible data portions, as the basis for reference. Our model increases the scalability of data citation by assigning unique identifiers only to the query itself. Furthermore, our model increases the preservation awareness and readiness of research projects: it provides guidance on how to enhance the data model used for processing research data, in order to ensure that the data can be reliably cited and re-used in the future.
RDBMS support many of our requirements off the shelf and can be used to retrieve arbitrary subsets of data. Hence we concentrated on this database model for a first pilot study before discussing the general applicability. Our model is based on timestamped SELECT statements and versioned data. Queries can be used to persistently identify subsets of arbitrary complexity and size. The dynamic nature of research databases requires mechanisms that allow all changes to be traced and monitored over time. Hence, temporal aspects have to be included in the model: this timing information needs to be stored for each UPDATE, INSERT or DELETE statement and the affected records, enabling all changes that occurred to be traced. As relational database systems are set based, a sort order is not inherently defined. Therefore, we need to specify stable sorting criteria that are automatically applied to the subsets. Depending on the size of the data set, the schema and the complexity of the query, the retrieval of the result set can be challenging. If these properties are met, persistently citing only the query is sufficient to meet our requirements. It guarantees not only consistent result sets across time, but also consistent result lists, even in the case of missing or ambiguous result set sorting in the initial query, and even after migration to a different DBMS.
3.2 A Basic Model for Citing Data Sets in Relational Databases
In a timestamped RDBMS, timestamps are recorded for all records. This ensures that specific versions of the data can be retrieved without having to store additional copies of the database tables. As records can change, they need to be versioned, i.e. all changes that affect the data need to be traceable. This entails that statements such as DELETE or UPDATE must not destroy the data, but rather set markers indicating that a record has been marked as deleted or has been superseded by a more recent version.
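Against the hypothetical schema sketched above, such destructive statements could be rewritten as follows (a sketch; the record id 42 and all values are made up for illustration):

    -- Soft DELETE: close the current version and mark it as deleted
    -- instead of physically removing the record.
    UPDATE measurement
       SET valid_to = CURRENT_TIMESTAMP, deleted = TRUE
     WHERE id = 42 AND valid_to IS NULL;

    -- Soft UPDATE (alternative to the above): close the current version ...
    UPDATE measurement
       SET valid_to = CURRENT_TIMESTAMP
     WHERE id = 42 AND valid_to IS NULL;

    -- ... and insert the new state as the now-current version.
    INSERT INTO measurement (id, version, sensor, value, valid_from, deleted)
    VALUES (42, 2, 'S-01', 23.7, CURRENT_TIMESTAMP, FALSE);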
Subsets of complex databases can easily be constructed by issuing SELECT queries against the RDBMS. To enable the data citation facilities, the SQL query also has to be augmented with a timestamp. This timestamp maps the subset to a specific state of the data. As the records in the database can be altered individually, it needs to be ensured that the version that was valid at the query's timestamp is selected for inclusion in the subset. Hence the timestamp of the query can be used to retrieve arbitrary subsets of a specific version of the data.
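Assuming the schema from the sketch above, a query could be rewritten so that exactly the record versions valid at the query's timestamp are returned; the literal timestamp stands for the stored execution time of the query:

    -- Re-executable subset query: only versions that were current at the
    -- query's timestamp (and not marked as deleted) are selected.
    SELECT id, sensor, value
      FROM measurement
     WHERE sensor = 'S-01'
       AND deleted = FALSE
       AND valid_from <= TIMESTAMP '2013-05-01 12:00:00'
       AND (valid_to IS NULL OR valid_to > TIMESTAMP '2013-05-01 12:00:00');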
There are several possibilities for implementing this version information (Snodgrass, 1999). The temporal timestamp contains the explicit date at which the data was changed. Suitable timestamps are granular enough to capture the point in time that differentiates between two versions of the data. The actual chronon to be picked depends on the potential frequency of changes in the data, which is not a trivial choice (Jensen and Lomet, 2001); the granularity can range from days to milliseconds. Snodgrass et al. differentiate between valid time and transaction time (Jensen et al., 1993). Valid time refers to the period during which the data is considered a true fact, whereas transaction time refers to the time when the change occurred on the system, independent of its temporal meaning for the actual data. The valid time concept is a reference to the real world; the transaction time only refers to the system time at which a change of data was manifested. Both concepts could be used for managing versions in our model. As we are interested in the state of the database at a given point in time, the transaction time concept is clearly better suited.
It is essential to be able to identify all records uniquely. This property is handled by any RDBMS via the concept of primary keys; hence our citable database schema requires each table to be equipped with one. Primary keys are by definition unique, and therefore allow a unique sorting of the records to be included in the subset. To achieve this stable sorting, each query needs to specify a standard sort order based on the primary keys, as in the sketch below.
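Applied to the hypothetical schema above, this amounts to appending an ORDER BY clause over the primary key columns to the timestamped query:

    -- Deterministic ordering: sorting by the primary key is unique by
    -- definition, so re-execution returns rows in the identical sequence.
    SELECT id, sensor, value
      FROM measurement
     WHERE deleted = FALSE
       AND valid_from <= TIMESTAMP '2013-05-01 12:00:00'
       AND (valid_to IS NULL OR valid_to > TIMESTAMP '2013-05-01 12:00:00')
     ORDER BY id, version;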
The queries themselves need to be stored and augmented with a timestamp that reflects the time at which the query was issued. The query's timestamp defines which versions of the records are included in the subset. A hash function computed over the SELECT query allows queries that have already been issued against the system to be recognised. Additionally, a mechanism to identify the queries and the subsets they produce is required. Here PIDs become very useful, as a query that identifies a precise subset is static. If no changes to the records have occurred between two runs of the identical query, the same PID needs to be assigned to both runs.
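One possible layout for such a query store is sketched below; all table and column names are illustrative assumptions, the hash function (e.g. a hex-encoded SHA-256) is an example choice, and the bind parameter :hash_of_incoming_query stands for the hash computed by the citation layer:

    -- Hypothetical query store: one row per distinct, timestamped query.
    CREATE TABLE query_store (
        pid          VARCHAR(255) PRIMARY KEY,  -- persistent identifier
        query_text   TEXT         NOT NULL,     -- normalised SELECT statement
        query_hash   CHAR(64)     NOT NULL,     -- e.g. hex-encoded SHA-256 of query_text
        executed_at  TIMESTAMP    NOT NULL,     -- fixes the record versions used
        result_hash  CHAR(64)                   -- optional check of the result set
    );

    -- Before minting a new PID, check whether the identical query has
    -- already been issued; if the data is unchanged since then, the
    -- existing PID is reused.
    SELECT pid, executed_at
      FROM query_store
     WHERE query_hash = :hash_of_incoming_query
     ORDER BY executed_at DESC;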
Figure 1 illustrates the interaction of the components of the framework. The database contains all records of a data set and maintains their versions. Queries are stored in the Query Store together with a timestamp of their issuing. This ensures that a subset can be reproduced by knowing the query and the time of its execution. The citation itself uses PIDs: the PID Store makes it possible to identify queries again and to reuse the subsets they created.

Figure 1: Data Citation Model for Relational Databases with PIDs and Versioned Data.
The creation of timestamps for data-altering events as well as for queries can easily be automated. This allows dynamic data citation to be implemented transparently, i.e. no specific action is required on the user side: whenever a researcher selects a subset of data for an experiment, the data is returned together with a PID. This ensures that upon re-invoking the query, the PID is recognised and an identical set is returned, even in identical order.
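As an illustration of how this automation might look, the following PostgreSQL-style trigger (an assumption for the sketch; the model does not prescribe a specific DBMS) stamps every inserted row version without any user interaction:

    -- Sketch in PL/pgSQL: set the transaction-time columns automatically,
    -- so that no specific action is required on the user side.
    CREATE OR REPLACE FUNCTION stamp_version() RETURNS trigger AS $$
    BEGIN
        NEW.valid_from := CURRENT_TIMESTAMP;  -- version becomes current now
        NEW.valid_to   := NULL;               -- open-ended until superseded
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER measurement_stamp
    BEFORE INSERT ON measurement
    FOR EACH ROW EXECUTE PROCEDURE stamp_version();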
The model introduced in this section describes how arbitrary subsets of data in potentially large relational databases can be created and retrieved at a later point in time.
3.3 Generalisation: Expanding the Model to Generic Data Sources
The dynamic data citation model introduced in Section 3.2 is not limited to relational databases. As nicely generalised from (Pröll and Rauber, 2013) in (Moore, 2013), the core concepts can be mapped to other data models as well. The following requirements enable dynamic data citation on a generic level:
1. Uniquely identifiable data records
2. Timestamps of data
3. Versioned data, including markings for deleted, altered or inserted data records
4. A query language for constructing subsets
5. A persistent query store that keeps the queries and the timestamps of their issuing
6. An identification mechanism for queries that enables access
The basic requirements are uniquely identifiable data records that can be included in subsets of the data. The records that form a subset need to be identifiable on an individual level. Furthermore, a versioning scheme must be available. These versions should reflect events such as insertions, updates or deletions; no change to the data may be lost, regardless of the data model used. The versioning mechanism should include timestamps that allow the set of valid records at a given point in time to be derived. For constructing subsets, the data source must provide a query language that is powerful enough to select specific records based on precise criteria. To enable the citation of subsets, it is sufficient to store the queries that led to the subsets and combine them with the timestamp. This timestamp provides the mapping between the query and the different versions of the records. The query is the key to the subset; hence it needs to be identifiable in order to retrieve the subset at a later point in time. With these requirements in place, arbitrary data sources can be cited, and data that evolves within the data source becomes citable.
CitablebyDesign-AModelforMakingDatainDynamicEnvironmentsCitable
209
4 CONCLUSIONS AND FUTURE WORK
Digitally driven research is a rather young discipline that evolves fast. As a result, the tools and the data are rarely developed with a focus on long-term availability. What matters most to researchers are fast results and prompt publications. Whether the data they produce today can be understood, interpreted or even accessed in the future is not addressed with the same attention. We want to change this paradigm and highlight the need for preservation-aware research data.
We therefore introduced a model for citing data in dynamically changing environments. We described how the model can be applied to relational database management systems and extended the framework to generic data sources, identifying the requirements that enable data sources to provide citable subsets of data. Once the framework has been applied, most parts can be automated; hence transparent data citation capabilities are easy to offer. The easier and more transparently this citation process can be implemented, the higher the acceptance among the target audience and the designated community.
The concepts are currently being addressed as part of a larger working group within the Research Data Alliance (RDA⁵). Our goal is to provide proofs of concept, mock-ups and prototype implementations that can be tested and used by the community in the near future. A first prototype will be implemented by inserting the query re-writing and the time-stamped storage of the query into the JDBC layer, and testing it on several data sources used for scientific experiments. Future work will focus on other data formats that are widely used within research, including specialised file formats from various disciplines and areas.

⁵http://forum.rd-alliance.org
Besides the criteria introduced in Section 3.2, additional considerations have to be made. The requirements mentioned so far only consider internal properties of the system in which the data resides. Clearly, external influences that can alter data without being recognised by the data storage system need to be prevented. Furthermore, side effects that depend on the query system, the query language or specific properties of the data sets need to be removed in order to enable reproducibility. If the query language provides functions that are based on non-deterministic calculations, they have to be handled. This includes all sorts of randomised functions (e.g. a random number generator) and relative time specifications (e.g. CURDATE()). Such operations hinder the re-execution of a query for retrieving the exact same result, as they depend on external influences. How this issue can be mitigated will be part of our future work. Schema or format changes are a further challenge that needs to be addressed.
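A simple illustration of the problem with relative time specifications, again using the hypothetical measurement table sketched in Section 3 (standard-SQL interval syntax assumed; MySQL's CURDATE() corresponds to CURRENT_DATE here):

    -- Not reproducible: the selected rows change with every day of
    -- re-execution, because CURRENT_DATE is evaluated at runtime.
    SELECT id, sensor, value
      FROM measurement
     WHERE valid_from >= CURRENT_DATE - INTERVAL '7' DAY;

    -- Reproducible rewrite: the citation layer would have to freeze the
    -- relative expression to the query's stored execution timestamp.
    SELECT id, sensor, value
      FROM measurement
     WHERE valid_from >= TIMESTAMP '2013-05-01 12:00:00' - INTERVAL '7' DAY;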
ACKNOWLEDGEMENTS
Part of this work was supported by the projects
APARSEN and TIMBUS, partially funded by the EU
under the FP7 contracts 269977 and 269940.
REFERENCES
Australian National Data Service (2011). Data Citation Awareness. http://ands.org.au/guides/data-citation-awareness.pdf.

Bellini, E., Cirinnà, C., and Lunghi, M. (2008). Persistent identifiers distributed system for cultural heritage digital objects. In iPRES 2008: The Fifth International Conference on Preservation of Digital Objects.

Brase, J. (2009). DataCite - A Global Registration Agency for Research Data. In COINFO 2009: Proceedings of the Fourth International Conference on Cooperation and Promotion of Information Resources in Science and Technology, Washington, DC, USA. IEEE Computer Society.

Hey, T., Tansley, S., and Tolle, K., editors (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research.

Hilse, H.-W. and Kothe, J. (2006). Implementing Persistent Identifiers: Overview of concepts, guidelines and recommendations. Consortium of European Research Libraries, London.

Jensen, C., Soo, M., and Snodgrass, R. (1993). Unifying temporal data models via a conceptual model. Information Systems, 19:513–547.

Jensen, C. S. and Lomet, D. B. (2001). Transaction timestamping in (temporal) databases. In Proceedings of the International Conference on Very Large Data Bases, pages 441–450.

Moore, R. (2013). Workflow virtualization. In (Pröll and Rauber, 2013). Research Data Alliance - Launch and First Plenary, March 18-20, 2013, Gothenburg, Sweden.

Pröll, S. and Rauber, A. (2013). BoF-Session on Data Citation. Research Data Alliance - Launch and First Plenary, March 18-20, 2013, Gothenburg, Sweden.

Snodgrass, R. (1999). Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann.
DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications
210