using (mainly outside office hours) a small subset of 
modest organisational workstations.  
Even though the variability of some predictors (such 
as the number of cluster nodes) was low, both machine 
learning models predicted the query duration well 
from the main query and cluster parameters. The 
Random Forests model performed slightly better than 
the xgboost model, with a concordance correlation 
coefficient above 90% and an R² of about 85%. 
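For readers who wish to reproduce the two goodness-of-fit measures quoted above, the following is a minimal sketch, not the code used in this study. It assumes Lin's concordance correlation coefficient and the 1 - SS_res/SS_tot form of R² (the exact R² definition used is not restated here). The function names and the sample durations are hypothetical.

```python
from statistics import fmean


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_obs = fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def concordance_cc(observed, predicted):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(observed)
    mean_o, mean_p = fmean(observed), fmean(predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted)) / n
    # Unlike Pearson's r, the denominator penalises a shift in means.
    return 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


if __name__ == "__main__":
    # Hypothetical query durations in seconds (illustration only).
    observed = [12.0, 35.5, 48.0, 90.2, 130.7]
    predicted = [14.1, 33.0, 52.3, 85.0, 121.9]
    print("CCC:", concordance_cc(observed, predicted))
    print("R2 :", r_squared(observed, predicted))
```

The concordance coefficient equals 1 only when predictions match observations exactly, so it is stricter than a plain correlation: a model whose predictions are perfectly correlated with the observed durations but systematically shifted still scores below 1.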
The variable importance reported by both models 
suggests, as expected, that query complexity 
(approximated by the number of Spark tasks needed to 
complete the query and by the number of joins) is the 
main driver of query performance. The database size 
was also ranked as an important predictor. 
Unexpectedly, predictors such as the number of 
cluster nodes, the gap between the cluster memory 
and the database size, tuple grouping and group 
filtering, and the cluster manager were ranked as 
less important (for the outcome variability) by both 
models. 
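As an illustration of how a predictor can be ranked as more or less important, the sketch below implements permutation importance, a model-agnostic technique that is not necessarily the importance measure reported by the Random Forests and xgboost models discussed here. The toy model, its coefficients, the feature names and the data are all hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row (a tuple) with feature j replaced.
            shuffled = [r[:j] + (v,) + r[j + 1:]
                        for r, v in zip(rows, column)]
            error = mse(y, [predict(r) for r in shuffled])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical duration model: the node count has no effect."""
    tasks, joins, nodes = row
    return 0.5 * tasks + 2.0 * joins + 0.0 * nodes


# Hypothetical rows: (spark_tasks, n_joins, n_nodes)
rows = [(10, 1, 3), (40, 2, 3), (80, 4, 4),
        (120, 5, 4), (200, 7, 5), (30, 0, 5)]
y = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, y)

for name, score in zip(("spark_tasks", "n_joins", "n_nodes"), scores):
    print(name, score)
```

A predictor with little variability in the sample, or with no effect on the fitted model, yields a small or zero increase in error when shuffled, which is one way a variable such as the number of nodes can end up ranked low.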
Further research directions may include: 
  Increasing the number of cluster nodes; 
  Running the queries on TPC-H databases with 
larger sizes; 
  Adding Kubernetes as a cluster manager in 
order to cover all the available resource 
managers; 
  Optimizing the JVM, garbage collection, and OS 
parameters to accelerate Spark performance; 
  Assessing the performance of other Spark 
features, such as Streaming, Machine Learning 
and GraphX, to see how they perform on a 
Beowulf cluster; 
  Testing with the dataset in formats and storage 
options other than the default generated by 
TPC-H (e.g. Avro, Parquet, blob storage, AWS 
S3), to see whether there are any performance 
gains; 
  Diversifying the hardware resources and storage 
types (e.g. adding SSDs or RAID configurations); 
  Taking into account the hardware bottlenecks 
which might occur during testing, and 
quantifying their effect on performance; 
  Running the queries on other Big Data systems 
(such as Hive and Pig) to compare the 
performance. 
Overall, the results suggest that running SQL queries 
on Spark using modest Beowulf clusters is a viable 
solution, but this needs to be confirmed by 
subsequent comparisons with other Big Data 
solutions, on disk (e.g. Hive, Pig) or in-memory (e.g. 
in-memory features of SQL servers, MemSQL, VoltDB, 
Impala). 
 
REFERENCES 
Armbrust, M., Das, T., Torres, J., Yavuz, B., Zhu, S., Xin, 
R., Ghodsi, A., Stoica, I. & Zaharia, M., 2018. 
Structured Streaming: A Declarative API for Real-
Time Applications in Apache Spark, Proc. of the 
SIGMOD'18, 601-613. 
Assunção, M.D. et al., 2015. Big Data computing and 
clouds: Trends and future directions. Journal of 
Parallel and Distributed Computing, 79, 3-15. 
Breiman, L., 2001. Random Forests. Machine Learning, 45, 
5-32. 
Chaowei, Y., Huang, Q., Li, Z., Liu, K., Hu, F., 2017. Big 
Data and cloud computing: innovation opportunities 
and challenges. International Journal of Digital Earth, 
10, 13-53. 
Chen, Q., Wang, K., Bian, Z., Cremer, I., Xu, G. and Guo, 
Y., 2016. Simulating Spark Cluster for Deployment 
Planning, Evaluation and Optimization, SIMULTECH 
2016, SCITEPRESS, 33-43 
Chen, T. & He, T., 2019. xgboost: Extreme Gradient 
Boosting. R package version 0.90.0.2, 
https://CRAN.R-project.org/package=xgboost 
Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree 
boosting system. Proc. of the 22nd ACM SIGKDD 
International Conference on Knowledge Discovery and 
Data Mining. ACM Press, 785-794. 
Chiba, T., Onodera, T., 2016. Workload characterization 
and optimization of TPC-H queries on Apache Spark, 
Proc. of the ISPASS 2016, 112-121. 
Cluci, M.I., Fotache, M., Greavu-Șerban, V., 2019. Data 
Processing Performance of Apache Spark on Beowulf 
Clusters. An Overview. In Proc. of the 34th IBIMA 
Conference 
Cutler A., Cutler D.R., Stevens J.R., 2012. Random Forests. 
In: Zhang C., Ma Y. (eds) Ensemble Machine Learning. 
Springer, Boston, MA 
Fotache, M., Hrubaru, I., 2016. Performance Analysis of 
Two Big Data Technologies on a Cloud Distributed 
Architecture. Results for Non-Aggregate Queries on 
Medium-Sized Data. Scientific Annals of Economics 
and Business, 63(SI), 21-50 
Fotache, M., Tică, A., Hrubaru, I., Spînu, M.T., 2018a. Big 
Data Proprietary Platforms. The Case of Oracle 
Exadata, Review of Economic and Business Studies, 11 
(1), 45-78 
Fotache, M., Greavu-Șerban, V., Hrubaru, I., Tică, A., 
2018b. Big Data Technologies on Commodity 
Workstations. A Basic Setup for Apache Impala. Proc. 
of the 19th International Conference on Computer 
Systems and Technologies (CompSysTech'18), ACM 
Press 
Friedman, J., Hastie, T., Tibshirani, R., 2000. Additive 
logistic regression: a statistical view of boosting. The 
Annals of Statistics, 28(2), 337–407. 
GCP, 2019. Google Cloud Platform blog and 
documentation, [Online], [Retrieved September 22, 
2019], https://cloud.google.com/blog/products/gcp/. 
Gopalani, S., Arora, R.R., 2015. Comparing Apache Spark 
and Map Reduce with Performance Analysis using K-