Simulating Spark Cluster for Deployment Planning, Evaluation and Optimization

Qian Chen, Kebing Wang, Zhaojuan Bian, Illia Cremer, Gen Xu, Yejun Guo

2016

Abstract

As the most active project in the Hadoop ecosystem these days (Zaharia, 2014), Spark is a fast, general-purpose engine for large-scale data processing. Thanks to its advanced Directed Acyclic Graph (DAG) execution engine and in-memory computing mechanism, Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk (Apache, 2016). However, Spark performance is affected by many system software, hardware and dataset factors, especially those related to memory and the JVM, which makes capacity planning and tuning for Spark clusters extremely difficult. Current planning methods are mostly estimation-based and depend heavily on experience and trial and error. These approaches are neither efficient nor accurate, especially as software stack complexity and hardware diversity increase. Here, we propose a novel Spark simulator based on CSMethod (Bian et al., 2014), extended with a fine-grained, multi-layered memory subsystem, that is well suited to Spark cluster deployment planning, performance evaluation and optimization before system provisioning. The proposed simulator covers the whole Spark application execution life cycle, including DAG generation, Resilient Distributed Dataset (RDD) processing and block management. Hardware activities derived from these software operations are dynamically mapped onto architecture models for processors, storage and network devices. The performance behaviour of the cluster memory system at multiple layers (Spark, JVM, OS, hardware) is modeled as an enhanced, fine-grained, individual global library. Experimental results with several popular Spark micro-benchmarks and a real-case IoT workload demonstrate that our Spark simulator achieves high accuracy, with an average error rate below 7%. With lightweight computing resource requirements (a laptop is enough), our simulator runs at a speed comparable to native execution on a multi-node high-end cluster.
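The abstract does not define how the average error rate is computed. As a minimal sketch, assuming it is the mean absolute relative difference between simulated and measured execution times, the metric could be expressed as follows. The function names and the sample timings are hypothetical and are not taken from the paper.

```python
def relative_error(simulated_s, measured_s):
    """Absolute relative error of one simulated run against its measured run."""
    if measured_s <= 0:
        raise ValueError("measured time must be positive")
    return abs(simulated_s - measured_s) / measured_s


def average_error_rate(pairs):
    """Mean relative error over (simulated, measured) execution-time pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (simulated, measured) pair is required")
    return sum(relative_error(s, m) for s, m in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only (not paper results).
    runs = [(95.0, 100.0), (210.0, 200.0), (48.0, 50.0)]
    print(f"average error rate: {average_error_rate(runs):.1%}")
```

Under this assumed definition, an over-estimate and an under-estimate of the same relative size count equally, so errors in opposite directions do not cancel out.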

References

  1. http://spark-summit.org/wp-content/uploads/2014/07/Sparks-Role-in-the-Big-Data-Ecosystem-MateiZaharia1.pdf.
  2. Zhaojuan Bian, Kebing Wang, Zhihong Wang, Gene Munce, Illia Cremer, Wei Zhou, Qian Chen, Gen Xu, 2014. “Simulating big data clusters for system planning, evaluation and optimization,” ICPP-2014, September 9-12, 2014, Minneapolis, MN, USA.
  3. Xin, Reynold; Rosen, Josh; Zaharia, Matei; Franklin, Michael; Shenker, Scott; Stoica, Ion, 2013. "Shark: SQL and Rich Analytics at Scale". SIGMOD.
  4. Matei Zaharia, 2011. Spark: In-Memory Cluster Computing for Iterative and Interactive Applications. Invited Talk at NIPS 2011 Big Learning Workshop: Algorithms, Systems, and Tools for Learning at Scale.
  5. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, 2010. “Spark: Cluster Computing with Working Sets,” HotCloud'10: Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pages 10-10, 2010, CA, USA.
  6. Apache Software Foundation, 27 February 2014. "The Apache Software Foundation Announces Apache Spark as a Top-Level Project". Retrieved 4 March 2014.
  7. Wagner Kolberg, Pedro de B. Marcos, Julio C.S. Anjos, Alexandre K.S. Miyazaki, Claudio R. Geyer, Luciana B. Arantes, 2013. “MRSG - a MapReduce simulator over SimGrid,” Parallel Computing, Volume 39, Issue 4-5, Pages 233-244, April 2013.
  8. Wang, G., Butt, A. R., Pandey, P., and Gupta, K., 2011. “A simulation approach to evaluating design decisions in MapReduce setups,” Proceedings of the 17th Annual Meeting of the IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS '11), London, UK, 2011.
  9. Palson R Kennedy and T V Gopal, 2013. “A MR simulator in facilitating cloud computing,” International Journal of Computer Applications 72(5):43-49, June 2013. Published by Foundation of Computer Science, New York, USA.
  10. A. Verma, L. Cherkasova, and R.H. Campbell, 2011. “Play It Again, SimMR!” Proc. IEEE Int'l Conf. Cluster Computing (Cluster '11).
  11. Steven S. Skiena, 2008. The Algorithm Design Manual. Springer.
  12. https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html.
  13. P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner, 2002. “Simics: A full system simulation platform,” IEEE Computer, 35(2):50-58, February 2002.
  14. Edgar A. Leon, Rolf Riesen, Patrick G. Bridges, Arthur B. Maccabe, 2009. “Instruction-Level Simulation of a Cluster at Scale,” HPCC, Nov 14-20, 2009, Portland, OR, USA.

Paper Citation


in Harvard Style

Chen Q., Wang K., Bian Z., Cremer I., Xu G. and Guo Y. (2016). Simulating Spark Cluster for Deployment Planning, Evaluation and Optimization. In Proceedings of the 6th International Conference on Simulation and Modeling Methodologies, Technologies and Applications - Volume 1: SIMULTECH, ISBN 978-989-758-199-1, pages 33-43. DOI: 10.5220/0005952300330043


in Bibtex Style

@conference{simultech16,
author={Qian Chen and Kebing Wang and Zhaojuan Bian and Illia Cremer and Gen Xu and Yejun Guo},
title={Simulating Spark Cluster for Deployment Planning, Evaluation and Optimization},
booktitle={Proceedings of the 6th International Conference on Simulation and Modeling Methodologies, Technologies and Applications - Volume 1: SIMULTECH},
year={2016},
pages={33-43},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005952300330043},
isbn={978-989-758-199-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Simulation and Modeling Methodologies, Technologies and Applications - Volume 1: SIMULTECH
TI - Simulating Spark Cluster for Deployment Planning, Evaluation and Optimization
SN - 978-989-758-199-1
AU - Chen Q.
AU - Wang K.
AU - Bian Z.
AU - Cremer I.
AU - Xu G.
AU - Guo Y.
PY - 2016
SP - 33
EP - 43
DO - 10.5220/0005952300330043