EINCKM: An Enhanced Prototype-based Method for Clustering Evolving Data Streams in Big Data

Ammar Al Abd Alazeez, Sabah Jassim, Hongbo Du

2017

Abstract

Data stream clustering is becoming an active research area in big data. It refers to grouping constantly arriving new data records in large chunks to enable dynamic analysis and updating of the information patterns conveyed by the existing clusters, the outliers, and the newly arriving data chunk. Prototype-based algorithms for this problem are promising because of their simplicity and efficiency. However, existing implementations are limited in the quality of the clusters they produce and in their ability to discover outliers, and they give little consideration to possible new patterns in different chunks. In this paper, a new incremental algorithm called Enhanced Incremental K-Means (EINCKM) is developed. The algorithm is designed to detect new clusters in an incoming data chunk, merge new clusters and existing outliers into the currently existing clusters, and generate modified clusters and outliers ready for the next round. The algorithm applies a heuristic-based method to estimate the number of clusters (K), a radius-based technique to determine and merge overlapping clusters, and a variance-based mechanism to discover the outliers. The algorithm was evaluated on synthetic and real-life datasets. The experimental results indicate improved clustering correctness with time complexity comparable to that of existing methods dealing with the same kind of problems.
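The abstract names a radius-based overlap test and a variance-based outlier mechanism but does not define them. The following is a minimal sketch of what such rules might look like, not the paper's actual definitions. It assumes that a cluster's radius is the largest member-to-centroid distance, that two clusters overlap when their centroids are no farther apart than the sum of their radii, and that an outlier is a point whose distance to the centroid exceeds the mean distance plus `k` standard deviations. All function names and the parameter `k` are hypothetical.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def radius(points, center):
    """Assumed radius definition: largest distance from the center to a member."""
    return max(math.dist(p, center) for p in points)


def overlaps(c1, r1, c2, r2):
    """Assumed overlap rule: centers no farther apart than the sum of the radii."""
    return math.dist(c1, c2) <= r1 + r2


def split_outliers(points, center, k=2.0):
    """Assumed outlier rule: distance to center > mean + k * std of all distances."""
    dists = [math.dist(p, center) for p in points]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    threshold = mean + k * std
    members = [p for p, d in zip(points, dists) if d <= threshold]
    outliers = [p for p, d in zip(points, dists) if d > threshold]
    return members, outliers


# Two nearby clusters and one distant cluster (illustrative data only).
a = [(0, 0), (1, 0), (0, 1), (1, 1)]
b = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
c = [(10, 10), (11, 10), (10, 11), (11, 11)]
ca, cb, cc = centroid(a), centroid(b), centroid(c)
ra, rb, rc = radius(a, ca), radius(b, cb), radius(c, cc)

a_b_overlap = overlaps(ca, ra, cb, rb)  # candidates for merging
a_c_overlap = overlaps(ca, ra, cc, rc)  # kept as separate clusters

# A dense cluster plus one distant point, judged against the center of `a`.
chunk = a + a + [(10, 10)]
members, outliers = split_outliers(chunk, ca)
```

Under these assumptions, clusters `a` and `b` would be merged, `c` would stay separate, and the point `(10, 10)` would be set aside as an outlier for the next round.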

Paper Citation


in Harvard Style

Al Abd Alazeez A., Jassim S. and Du H. (2017). EINCKM: An Enhanced Prototype-based Method for Clustering Evolving Data Streams in Big Data. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-222-6, pages 173-183. DOI: 10.5220/0006196901730183


in Bibtex Style

@conference{icpram17,
author={Ammar Al Abd Alazeez and Sabah Jassim and Hongbo Du},
title={EINCKM: An Enhanced Prototype-based Method for Clustering Evolving Data Streams in Big Data},
booktitle={Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2017},
pages={173-183},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006196901730183},
isbn={978-989-758-222-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - EINCKM: An Enhanced Prototype-based Method for Clustering Evolving Data Streams in Big Data
SN - 978-989-758-222-6
AU - Al Abd Alazeez A.
AU - Jassim S.
AU - Du H.
PY - 2017
SP - 173
EP - 183
DO - 10.5220/0006196901730183