Integrating Lightweight Compression Capabilities into Apache Arrow

Juliana Hildebrandt; Dirk Habich; Wolfgang Lehner

doi:10.5220/0009820100550066

2020

Abstract

With the ongoing shift to a data-driven world in almost all application domains, the management and, in particular, the analytics of large amounts of data are gaining importance. For that reason, a variety of new big data systems have been developed in recent years. In addition, a revision of data organization and formats has been initiated as a foundation for these big data systems. In this context, Apache Arrow is a novel cross-language development platform for in-memory data with a standardized, language-independent columnar memory format. The data is organized for efficient analytic operations on modern hardware, although Apache Arrow supports only dictionary encoding as a specific compression approach. However, there exists a large corpus of lightweight compression algorithms for columnar data that help to reduce the required memory space as well as to increase processing performance. In this paper, we therefore present a flexible and language-independent approach for integrating lightweight compression algorithms into the Apache Arrow framework. With our so-called ArrowComp approach, we preserve the unique properties of Apache Arrow while enhancing the platform with a large variety of lightweight compression capabilities.

Paper Citation

in Harvard Style

Hildebrandt J., Habich D. and Lehner W. (2020). Integrating Lightweight Compression Capabilities into Apache Arrow. In Proceedings of the 9th International Conference on Data Science, Technology and Applications - Volume 1: DATA, ISBN 978-989-758-440-4, pages 55-66. DOI: 10.5220/0009820100550066

in Bibtex Style

@conference{data20,
author={Juliana Hildebrandt and Dirk Habich and Wolfgang Lehner},
title={Integrating Lightweight Compression Capabilities into Apache Arrow},
booktitle={Proceedings of the 9th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2020},
pages={55--66},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009820100550066},
isbn={978-989-758-440-4},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Integrating Lightweight Compression Capabilities into Apache Arrow
SN - 978-989-758-440-4
AU - Hildebrandt J.
AU - Habich D.
AU - Lehner W.
PY - 2020
SP - 55
EP - 66
DO - 10.5220/0009820100550066