PROCESSING WIKIPEDIA DUMPS - A Case-study Comparing the XGrid and MapReduce Approaches

Dominique Thiébaut, Yang Li, Diana Jaunzeikare, Alexandra Cheng, Ellysha Raelen Recto, Gillian Riggs, Xia Ting Zhao, Tonje Stolpestad, Cam Le T. Nguyen

2011

Abstract

We present a simple comparison of the performance of three cluster platforms: Apple’s XGrid, a local Hadoop cluster of Linux workstations, and an Elastic MapReduce cluster rented from Amazon. Hadoop is the open-source implementation of Google’s MapReduce. Performance is measured as the total execution time each platform takes to parse a 27-GByte XML dump of the English Wikipedia. We show that for this specific workload, XGrid yields the fastest execution time, with the local Hadoop cluster a close second. The overhead of fetching data from Amazon’s Simple Storage Service (S3), along with the inability to skip the reduce, sort, and merge phases on Amazon, penalizes this platform, which is targeted at much larger data sets.

References

  1. Amazon (2002). http://aws.amazon.com/.
  2. Baldeschwieler, E. (2008). Yahoo! launches world's largest Hadoop production application. http://developer.yahoo.net/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html.
  3. Cloudera (2009). Cloudera. http://cloudera.com/.
  4. Dean, J. and Ghemawat, S. (2010). MapReduce: A flexible data processing tool. Communications of the ACM, 53(1):72-77.
  5. Hughes, B. (2006). Building computational grids with Apple's Xgrid middleware. In Buyya, R. and Ma, T., editors, Fourth Australasian Symposium on Grid Computing and e-Research (AusGrid 2006), volume 54 of CRPIT, pages 47-54, Hobart, Australia. ACS.
  6. Iosup, A. and Epema, D. (2006). Grenchmark: A framework for analyzing, testing, and comparing grids. In CCGRID '06: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, pages 313-320, Washington, DC, USA. IEEE Computer Society.
  7. Kokaly, M., Al-Azzoni, I., and Down, D. G. (2009). Mgst: A framework for performance evaluation of desktop grids. Parallel and Distributed Processing Symposium, International, 0:1-8.
  8. MediaWiki (2002). http://www.mediawiki.com/.
  9. Nokia (2005). Nokia Qt cross-platform application and UI framework library. http://qt.nokia.com/.
  10. O'Malley, O. and Murthy, A. (2009). Hadoop sorts a petabyte in 16.25 hours and a terabyte in 62 seconds. http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html.
  11. Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., and Stonebraker, M. (2009). A comparison of approaches to large-scale data analysis. In SIGMOD Conference, pages 165-178.
  12. Raicu, I., Dumitrescu, C., Ripeanu, M., and Foster, I. (2006). The design, performance, and use of diperf: An automated distributed performance testing framework. In the Journal of Grid Computing, Special Issue on Global and Peer-to-Peer Computing.
  13. Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., and Kozyrakis, C. (2007). Evaluating MapReduce for multi-core and multiprocessor systems. In HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13-24, Washington, DC, USA. IEEE Computer Society.
  14. White, T. (2009). Hadoop: The Definitive Guide. O'Reilly, first edition.


Paper Citation


in Harvard Style

Thiébaut D., Li Y., Jaunzeikare D., Cheng A., Raelen Recto E., Riggs G., Ting Zhao X., Stolpestad T. and Le T. Nguyen C. (2011). PROCESSING WIKIPEDIA DUMPS - A Case-study Comparing the XGrid and MapReduce Approaches. In Proceedings of the 1st International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-8425-52-2, pages 391-396. DOI: 10.5220/0003385603910396


in Bibtex Style

@conference{closer11,
author={Dominique Thiébaut and Yang Li and Diana Jaunzeikare and Alexandra Cheng and Ellysha Raelen Recto and Gillian Riggs and Xia Ting Zhao and Tonje Stolpestad and Cam Le T. Nguyen},
title={PROCESSING WIKIPEDIA DUMPS - A Case-study Comparing the XGrid and MapReduce Approaches},
booktitle={Proceedings of the 1st International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2011},
pages={391-396},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003385603910396},
isbn={978-989-8425-52-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - PROCESSING WIKIPEDIA DUMPS - A Case-study Comparing the XGrid and MapReduce Approaches
SN - 978-989-8425-52-2
AU - Thiébaut D.
AU - Li Y.
AU - Jaunzeikare D.
AU - Cheng A.
AU - Raelen Recto E.
AU - Riggs G.
AU - Ting Zhao X.
AU - Stolpestad T.
AU - Le T. Nguyen C.
PY - 2011
SP - 391
EP - 396
DO - 10.5220/0003385603910396

