A CKAN Plugin for Data Harvesting to the Hadoop Distributed File System

Robert Scholz, Nikolay Tcholtchev, Philipp Lämmel, Ina Schieferdecker

2017

Abstract

Smart Cities will mainly emerge around the opening of large amounts of data, which are currently kept closed by various stakeholders within an urban ecosystem. This data needs to be cataloged and made available to the community so that applications and services can be developed for citizens and companies, and for optimizing processes within the city itself. In this context, the present work develops concepts and prototypes that demonstrate how data cataloging and data storage can be combined to provide large amounts of data in urban environments. The developed concepts, prototype, case study and associated evaluations are based on integrating common technologies from the domains of Open Data and large-scale data processing in data centers, namely CKAN and Hadoop.



Paper Citation


in Harvard Style

Scholz R., Tcholtchev N., Lämmel P. and Schieferdecker I. (2017). A CKAN Plugin for Data Harvesting to the Hadoop Distributed File System. In Proceedings of the 7th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-758-243-1, pages 47-56. DOI: 10.5220/0006230200470056


in Bibtex Style

@conference{closer17,
author={Robert Scholz and Nikolay Tcholtchev and Philipp Lämmel and Ina Schieferdecker},
title={A CKAN Plugin for Data Harvesting to the Hadoop Distributed File System},
booktitle={Proceedings of the 7th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2017},
pages={47-56},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006230200470056},
isbn={978-989-758-243-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - A CKAN Plugin for Data Harvesting to the Hadoop Distributed File System
SN - 978-989-758-243-1
AU - Scholz R.
AU - Tcholtchev N.
AU - Lämmel P.
AU - Schieferdecker I.
PY - 2017
SP - 47
EP - 56
DO - 10.5220/0006230200470056