Low Cost Big Data Solutions: The Case of Apache Spark on Beowulf Clusters

Marin Fotache, Marius-Iulian Cluci, Valerică Greavu-Şerban

2020

Abstract

With distributed computing platforms deployed on affordable hardware, Big Data technologies have democratised the processing of huge volumes of structured and semi-structured data. Still, the cost of installing and operating even a relatively small cluster of commodity servers, or of hiring cloud resources, could prove prohibitive for many companies and institutions. This paper builds two predictive models for estimating the main drivers of data processing performance for one of the most popular Big Data systems (Apache Spark), deployed on a gradually increasing number of nodes of a Beowulf cluster. Data processing performance was assessed with randomly generated SparkSQL queries on the TPC-H database schema, with a variable number of joins (including self-joins), predicates, groups, aggregate functions and subqueries in the FROM clause. Using two machine learning techniques, random forest and extreme gradient boosting, the predictive models estimate query duration from predictors related to cluster setup and query structure, and assess the importance of each predictor for the outcome variability. Results were positive and encourage extending both the number of cluster nodes and the database scale.
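
The query-generation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' generator: the function name `generate_query`, the join chain, the candidate predicates and the feature names are all assumptions made for illustration, and self-joins and FROM-clause subqueries are left out for brevity. Only the table and column names come from the standard TPC-H schema.

```python
import random

# Join chain over TPC-H tables; each further entry adds one join.
# The chain itself is an illustrative assumption, not the paper's design.
JOIN_CHAIN = [
    ("lineitem", None),
    ("orders", "l_orderkey = o_orderkey"),
    ("customer", "o_custkey = c_custkey"),
    ("nation", "c_nationkey = n_nationkey"),
    ("region", "n_regionkey = r_regionkey"),
]

# Hypothetical candidate predicates and grouping columns per table.
PREDICATES = {
    "lineitem": ["l_quantity < 25", "l_discount >= 0.05"],
    "orders": ["o_orderstatus = 'F'", "o_totalprice > 100000"],
    "customer": ["c_acctbal > 0"],
    "nation": ["n_name <> 'FRANCE'"],
    "region": ["r_name = 'EUROPE'"],
}
GROUP_COLUMNS = {
    "lineitem": ["l_returnflag"],
    "orders": ["o_orderstatus"],
    "customer": ["c_mktsegment"],
    "nation": ["n_name"],
    "region": ["r_name"],
}
# Aggregates use only lineitem columns, which are always in scope.
AGGREGATES = ["COUNT(*)", "SUM(l_extendedprice)", "AVG(l_quantity)", "MAX(l_discount)"]


def generate_query(rng, max_joins=4):
    """Return a random SQL string and the structural features describing it."""
    n_joins = rng.randint(0, max_joins)
    tables = JOIN_CHAIN[: n_joins + 1]
    names = [name for name, _ in tables]

    from_clause = names[0]
    for name, condition in tables[1:]:
        from_clause += f" JOIN {name} ON {condition}"

    # Pick predicates and grouping columns only from the joined tables.
    predicate_pool = [p for n in names for p in PREDICATES[n]]
    predicates = rng.sample(predicate_pool, rng.randint(0, len(predicate_pool)))
    group_pool = [c for n in names for c in GROUP_COLUMNS[n]]
    group_cols = rng.sample(group_pool, rng.randint(0, min(2, len(group_pool))))
    aggregates = rng.sample(AGGREGATES, rng.randint(1, len(AGGREGATES)))

    sql = "SELECT " + ", ".join(group_cols + aggregates) + " FROM " + from_clause
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    if group_cols:
        sql += " GROUP BY " + ", ".join(group_cols)

    features = {
        "n_joins": n_joins,
        "n_predicates": len(predicates),
        "n_group_cols": len(group_cols),
        "n_aggregates": len(aggregates),
    }
    return sql, features


if __name__ == "__main__":
    sql, features = generate_query(random.Random(42))
    print(sql)
    print(features)
```

In a setup like the one the abstract describes, each generated query would be timed on the cluster, and the structural features, together with cluster settings such as the number of nodes, would form one training row for the duration models.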

Paper Citation


in Harvard Style

Fotache M., Cluci M. and Greavu-Şerban V. (2020). Low Cost Big Data Solutions: The Case of Apache Spark on Beowulf Clusters. In Proceedings of the 5th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, ISBN 978-989-758-426-8, pages 327-334. DOI: 10.5220/0009407903270334


in Bibtex Style

@conference{iotbds20,
author={Marin Fotache and Marius-Iulian Cluci and Valerică Greavu-Şerban},
title={Low Cost Big Data Solutions: The Case of Apache Spark on Beowulf Clusters},
booktitle={Proceedings of the 5th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS},
year={2020},
pages={327-334},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009407903270334},
isbn={978-989-758-426-8},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 5th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS
TI - Low Cost Big Data Solutions: The Case of Apache Spark on Beowulf Clusters
SN - 978-989-758-426-8
AU - Fotache M.
AU - Cluci M.
AU - Greavu-Şerban V.
PY - 2020
SP - 327
EP - 334
DO - 10.5220/0009407903270334