MINING ON THE CLOUD - K-means with MapReduce

Ilias K. Savvas, M-Tahar Kechadi

2012

Abstract

The Apache Hadoop software library is a framework for the distributed processing of large data sets. Within it, HDFS is a distributed file system that provides high-throughput access to application data, and MapReduce is a software framework for distributed computation over large data sets. Huge collections of raw data require a fast and accurate mining process in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this paper, we develop a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. The theoretical and experimental results demonstrate the efficiency of the technique.
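
To make the idea concrete, one K-means iteration can be phrased as a map step (assign each point to its nearest centroid) followed by a reduce step (average the points of each cluster). The sketch below simulates this in a single Python process. It is an illustration under stated assumptions, not the implementation described in the paper: it does not use Hadoop or HDFS, and the names `map_point`, `reduce_cluster`, `kmeans_iteration`, and `kmeans` are hypothetical.

```python
from collections import defaultdict
from math import dist  # requires Python 3.8+


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce round, simulated in a single process."""
    groups = defaultdict(list)
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)  # the "shuffle": group values by key
    # Assumption: a centroid that received no points keeps its old position.
    return [
        reduce_cluster(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Repeat iterations until no centroid moves more than tol."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids
```

In a real MapReduce job the map calls would run in parallel over input splits and each iteration would be a separate job, with the updated centroids passed to the next round. For example, `kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` clusters four 2-D points around two initial centroids.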

Paper Citation


in Harvard Style

Savvas, I. K. and Kechadi, M-T. (2012). MINING ON THE CLOUD - K-means with MapReduce. In Proceedings of the 2nd International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-8565-05-1, pages 413-418. DOI: 10.5220/0003927204130418


in Bibtex Style

@conference{closer12,
author={Ilias K. Savvas and M-Tahar Kechadi},
title={MINING ON THE CLOUD - K-means with MapReduce},
booktitle={Proceedings of the 2nd International Conference on Cloud Computing and Services Science - Volume 1: CLOSER},
year={2012},
pages={413-418},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003927204130418},
isbn={978-989-8565-05-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Cloud Computing and Services Science - Volume 1: CLOSER
TI - MINING ON THE CLOUD - K-means with MapReduce
SN - 978-989-8565-05-1
AU - Savvas, Ilias K.
AU - Kechadi, M-Tahar
PY - 2012
SP - 413
EP - 418
DO - 10.5220/0003927204130418