Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction

Sanzhar Aubakirov, Paulo Trigo, Darhan Ahmed-Zaki

Abstract

In this paper we compare different technologies that support distributed computing as a means to address complex tasks. We address the task of n-gram extraction from text, which is computationally demanding when a large amount of textual data has to be processed. To deal with such complexity we have to adopt and implement parallelization patterns. Nowadays there are several patterns, platforms and even languages that can be used for the parallelization task. We implemented this task on three platforms: (1) MPJ Express, (2) Apache Hadoop, and (3) Apache Spark. The experiments used two kinds of datasets, composed of: (A) a large number of small files, and (B) a small number of large files. Each experiment uses both datasets and is repeated for a set of different file sizes. We compared performance and efficiency among MPJ Express, Apache Hadoop and Apache Spark. As a final result we are able to provide guidelines for choosing the platform best suited to each kind of dataset, according to its overall size and the granularity of the input data.
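
As an illustration only, the n-gram extraction task can be sketched in the map/reduce pattern that the compared platforms parallelize. This is a minimal single-process sketch in Python, not the implementation evaluated in the paper. The function names (`extract_ngrams`, `merge_counts`, `count_ngrams`), the whitespace tokenization, and the sample chunks are assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def extract_ngrams(text, n):
    """Map step: count the word n-grams found in one chunk of text.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def merge_counts(left, right):
    """Reduce step: combine the partial counts of two chunks."""
    return left + right


def count_ngrams(chunks, n):
    """Run the map step on every chunk, then fold the partial counts."""
    partial_counts = map(lambda chunk: extract_ngrams(chunk, n), chunks)
    return reduce(merge_counts, partial_counts, Counter())


# Each string stands in for one input file (hypothetical sample data).
chunks = ["the quick brown fox", "the quick red fox"]
bigrams = count_ngrams(chunks, 2)
```

Because each chunk is counted independently, the map step could be distributed across workers. The sketch treats every chunk as a separate file, so n-grams that would span a chunk boundary are not counted.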



Paper Citation


in Harvard Style

Aubakirov S., Trigo P. and Ahmed-Zaki D. (2016). Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction. In Proceedings of the 5th International Conference on Data Management Technologies and Applications - Volume 1: DATA, ISBN 978-989-758-193-9, pages 25-30. DOI: 10.5220/0005943000250030


in Bibtex Style

@conference{data16,
author={Sanzhar Aubakirov and Paulo Trigo and Darhan Ahmed-Zaki},
title={Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction},
booktitle={Proceedings of the 5th International Conference on Data Management Technologies and Applications - Volume 1: DATA},
year={2016},
pages={25-30},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005943000250030},
isbn={978-989-758-193-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Data Management Technologies and Applications - Volume 1: DATA
TI - Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction
SN - 978-989-758-193-9
AU - Aubakirov S.
AU - Trigo P.
AU - Ahmed-Zaki D.
PY - 2016
SP - 25
EP - 30
DO - 10.5220/0005943000250030