A Scalable Framework for Dynamic Data Citation of Arbitrary Structured Data

Stefan Pröll, Andreas Rauber

2014

Abstract

Sharing research data is becoming increasingly important because it enables peers to validate and reproduce data-driven experiments. Exchanging data also allows scientists to reuse data in different contexts and to gather new knowledge from available sources. With increasing data volumes, however, researchers need to reference exact versions of datasets. Until now, access to research data has often been based on single archives of data files, where support for versioning and subsetting is limited. In this paper we introduce a mechanism that allows researchers to create versioned subsets of research data which can be cited and shared in a lightweight manner. We demonstrate a prototype that supports researchers in creating subsets by filtering and sorting source data. These subsets can be cited for later reference and reuse. The system produces evidence, based on cryptographic hashing, that allows users to verify the correctness and completeness of a subset. We describe a replication scenario for enabling scalable data citation in dynamic contexts.
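
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' prototype: the `RowVersion` layout, the logical timestamps, the helper names, and the choice of SHA-256 are all assumptions made for illustration. The sketch keeps every version of a record, re-executes a filter-and-sort query against the state valid at citation time, and compares a hash of the result with the hash stored when the subset was cited.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class RowVersion:
    """One version of a record (hypothetical layout, not from the paper)."""
    key: int                  # stable record identifier
    values: tuple             # column values of this version
    valid_from: int           # logical time at which this version became valid
    valid_to: Optional[int]   # logical time at which it was superseded (None = current)


def snapshot(history: List[RowVersion], timestamp: int) -> List[RowVersion]:
    """Return the row versions that were valid at the given logical time."""
    return [
        r for r in history
        if r.valid_from <= timestamp
        and (r.valid_to is None or timestamp < r.valid_to)
    ]


def execute_subset(history: List[RowVersion], timestamp: int,
                   predicate: Callable[[RowVersion], bool],
                   sort_key: Callable[[RowVersion], object]) -> List[RowVersion]:
    """Filter and sort the data as it existed at `timestamp`."""
    rows = [r for r in snapshot(history, timestamp) if predicate(r)]
    return sorted(rows, key=sort_key)


def subset_hash(rows: List[RowVersion]) -> str:
    """Hash the rows in their sorted order; SHA-256 is an assumed choice."""
    digest = hashlib.sha256()
    for r in rows:
        # Unit separator between fields, newline between rows.
        line = "\x1f".join(str(v) for v in (r.key, *r.values)) + "\n"
        digest.update(line.encode("utf-8"))
    return digest.hexdigest()


def by_min_age(r: RowVersion) -> bool:
    return r.values[1] >= 25


def by_name(r: RowVersion):
    return r.values[0]


# Toy history: record 2 is updated at time 5, record 3 is added at time 7.
history = [
    RowVersion(1, ("alice", 30), valid_from=1, valid_to=None),
    RowVersion(2, ("bob", 25), valid_from=1, valid_to=5),
    RowVersion(2, ("bob", 26), valid_from=5, valid_to=None),
    RowVersion(3, ("carol", 41), valid_from=7, valid_to=None),
]

# Citing the subset at logical time 3 stores the query, the time, and the hash.
cited_at = 3
cited_rows = execute_subset(history, cited_at, by_min_age, by_name)
cited_hash = subset_hash(cited_rows)

# Later, the same query is re-executed against the state valid at `cited_at`.
rerun_hash = subset_hash(execute_subset(history, cited_at, by_min_age, by_name))
is_verified = rerun_hash == cited_hash

# The current state (time 8) yields a different subset and a different hash.
current_hash = subset_hash(execute_subset(history, 8, by_min_age, by_name))
```

In this sketch the citation only needs to store the query, the timestamp, and the hash rather than a copy of the data, which is one way to read "lightweight" in the abstract. Because the hash covers the rows in their sorted order, a missing, extra, altered, or reordered row changes the digest.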



Paper Citation


in Harvard Style

Pröll S. and Rauber A. (2014). A Scalable Framework for Dynamic Data Citation of Arbitrary Structured Data. In Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA, ISBN 978-989-758-035-2, pages 223-230. DOI: 10.5220/0004991802230230


in Bibtex Style

@conference{data14,
author={Stefan Pröll and Andreas Rauber},
title={A Scalable Framework for Dynamic Data Citation of Arbitrary Structured Data},
booktitle={Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA},
year={2014},
pages={223-230},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004991802230230},
isbn={978-989-758-035-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA
TI - A Scalable Framework for Dynamic Data Citation of Arbitrary Structured Data
SN - 978-989-758-035-2
AU - Pröll S.
AU - Rauber A.
PY - 2014
SP - 223
EP - 230
DO - 10.5220/0004991802230230