Efficient Evidence Accumulation Clustering for Large Datasets

Diogo Silva, Helena Aidos, Ana Fred

2016

Abstract

The unprecedented collection and storage of data in electronic format has given rise to an interest in automated analysis for the generation of knowledge and new insights. Cluster analysis is a good candidate, since it makes as few assumptions about the data as possible. A vast body of work on clustering methods exists, yet, typically, no single method is able to respond to the specificities of all kinds of data. Evidence Accumulation Clustering (EAC) is a robust, state-of-the-art ensemble algorithm that has shown good results. However, this robustness comes at a higher computational cost, and its application is currently slow or restricted to small datasets. The objective of the present work is to scale EAC, making it applicable to large datasets with technology available at a typical workstation. Three approaches addressing different parts of EAC are presented: a parallel GPU K-Means implementation, a novel strategy to build a sparse CSR matrix specialized to EAC, and Single-Link based on Minimum Spanning Trees using an external-memory sorting algorithm. Combining these approaches, EAC was applied to much larger datasets than previously possible.
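
For illustration, the sketch below outlines the EAC pipeline that the three contributions above accelerate, written with NumPy/SciPy (both of which appear in the references). It is not the authors' implementation: the function name eac, the parameter choices, and the use of scipy.cluster.vq.kmeans2 for the base partitions are assumptions made here purely for exposition; the paper's contributions replace each stage with a scalable counterpart (K-Means on the GPU, a specialized CSR build of the co-association matrix, and MST edges sorted in external memory).

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def eac(data, n_partitions=30, k_range=(10, 30), final_k=3, seed=0):
    """Toy EAC: K-Means ensemble -> sparse co-association matrix -> MST single-link."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    rows, cols = [], []

    # 1) Ensemble generation: K-Means with a random number of clusters per run.
    for _ in range(n_partitions):
        k = int(rng.integers(k_range[0], k_range[1] + 1))
        _, labels = kmeans2(data, k, minit='++', seed=int(rng.integers(2**31)))
        # 2) Evidence accumulation: every pair of samples sharing a cluster
        #    casts one vote; only the upper triangle is stored (sparse).
        for c in range(k):
            idx = np.flatnonzero(labels == c)
            a, b = np.meshgrid(idx, idx, indexing='ij')
            keep = a < b
            rows.append(a[keep])
            cols.append(b[keep])

    rows, cols = np.concatenate(rows), np.concatenate(cols)
    # Duplicate (row, col) votes are summed during the COO -> CSR conversion.
    coassoc = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()

    # 3) Final partition: single-link over dissimilarities via a minimum
    #    spanning tree (Gower and Ross, 1969); pairs that never co-occurred
    #    simply have no edge. Cutting the heaviest MST edges leaves final_k
    #    connected components.
    dissim = coassoc.copy()
    dissim.data = n_partitions + 1 - dissim.data   # strong co-association -> small weight
    mst = minimum_spanning_tree(dissim)
    if final_k > 1:
        heaviest = np.argsort(mst.data)[-(final_k - 1):]
        mst.data[heaviest] = 0
        mst.eliminate_zeros()
    _, final_labels = connected_components(mst, directed=False)
    return final_labels

The sketch never materializes the co-association matrix densely and keeps only its upper triangle, in the same spirit as the sparsity that motivates the paper's CSR construction; for a quick test, labels = eac(np.random.rand(500, 2)) produces a partition of synthetic data.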

References

  1. Alted, F., Vilata, I., and Others (2002). PyTables: Hierarchical Datasets in Python.
  2. Fred, A. (2001). Finding consistent clusters in data partitions. Multiple Classifier Systems, pages 309-318.
  3. Fred, A. N. L. and Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Mach. Intell., 27(6):835-850.
  4. Gower, J. C. and Ross, G. J. S. (1969). Minimum Spanning Trees and Single Linkage Cluster Analysis. Journal of the Royal Statistical Society, 18(1):54-64.
  5. Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8):651-666.
  6. Jones, E., Oliphant, T., Peterson, P., and Others (2001). SciPy: Open source scientific tools for Python.
  7. Lichman, M. (2013). UCI Machine Learning Repository.
  8. Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J. (2008). NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39-55.
  9. Lourenço, A., Fred, A. L. N., and Jain, A. K. (2010). On the scalability of evidence accumulation clustering. Proc. - Int. Conf. on Pattern Recognition, 0:782-785.
  10. Meila, M. (2003). Comparing clusterings by the variation of information. Learning Theory and Kernel Machines, Springer, pages 173-187.
  11. Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30-34.
  12. Sirotković, J., Dujmić, H., and Papić, V. (2012). K-means image segmentation on massively parallel GPU architecture. In MIPRO 2012, pages 489-494.
  13. Strehl, A. and Ghosh, J. (2002). Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions. J. Mach. Learn. Res., 3:583-617.
  14. Zechner, M. and Granitzer, M. (2009). Accelerating K-means on the graphics processor via CUDA. In INTENSIVE 2009, pages 7-15.


Paper Citation


in Harvard Style

Silva D., Aidos H. and Fred A. (2016). Efficient Evidence Accumulation Clustering for Large Datasets. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-173-1, pages 367-374. DOI: 10.5220/0005770803670374


in BibTeX Style

@conference{icpram16,
author={Diogo Silva and Helena Aidos and Ana Fred},
title={Efficient Evidence Accumulation Clustering for Large Datasets},
booktitle={Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2016},
pages={367-374},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005770803670374},
isbn={978-989-758-173-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Efficient Evidence Accumulation Clustering for Large Datasets
SN - 978-989-758-173-1
AU - Silva D.
AU - Aidos H.
AU - Fred A.
PY - 2016
SP - 367
EP - 374
DO - 10.5220/0005770803670374