From Group-by to Accumulation: Data Aggregation Revisited

Alexandr Savinov

2017

Abstract

Most existing query languages and data processing frameworks rely on some form of the group-by operation for data aggregation. In this paper, we critically analyze the properties of this operation and describe its major drawbacks. We also describe an alternative approach to data aggregation based on accumulate functions and demonstrate how it addresses these drawbacks. Based on this analysis, we argue that accumulate functions should be preferred to group-by as the main operation for data aggregation.
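The contrast between the two aggregation styles can be illustrated with a minimal Python sketch. It is not taken from the paper: the sample data and the names `total_by_group`, `accumulate_total`, and `total_by_accumulation` are hypothetical, and the sketch assumes that an accumulate function takes the current aggregate and a single record and returns the updated aggregate.

```python
from collections import defaultdict

# Hypothetical sample data: (group key, measure) pairs.
sales = [("apples", 3.0), ("pears", 2.0), ("apples", 4.5), ("pears", 1.0)]


def total_by_group(rows):
    """Group-by style: materialize each group first, then reduce it."""
    groups = defaultdict(list)
    for key, amount in rows:
        groups[key].append(amount)
    return {key: sum(values) for key, values in groups.items()}


def accumulate_total(current, amount):
    """Accumulate style (assumed form): update the aggregate with one record."""
    return current + amount


def total_by_accumulation(rows, accumulate, initial=0.0):
    """Apply the accumulate function record by record, without building groups."""
    totals = {}
    for key, amount in rows:
        totals[key] = accumulate(totals.get(key, initial), amount)
    return totals


grouped = total_by_group(sales)
accumulated = total_by_accumulation(sales, accumulate_total)
```

Under these assumptions both functions produce the same totals. The difference is that the accumulate version never builds intermediate groups, and its per-record update logic sits in a small function that can be swapped out.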


Paper Citation


in Harvard Style

Savinov A. (2017). From Group-by to Accumulation: Data Aggregation Revisited. In Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, ISBN 978-989-758-245-5, pages 370-379. DOI: 10.5220/0006359803700379


in Bibtex Style

@conference{iotbds17,
author={Alexandr Savinov},
title={From Group-by to Accumulation: Data Aggregation Revisited},
booktitle={Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS},
year={2017},
pages={370-379},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006359803700379},
isbn={978-989-758-245-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS
TI - From Group-by to Accumulation: Data Aggregation Revisited
SN - 978-989-758-245-5
AU - Savinov A.
PY - 2017
SP - 370
EP - 379
DO - 10.5220/0006359803700379