CLUSTER ENSEMBLE SELECTION - Using Average Cluster Consistency

F. Jorge F. Duarte, João M. M. Duarte, M. Fátima C. Rodrigues, Ana L. N. Fred

2009

Abstract

Several approaches to producing cluster ensembles, and several consensus functions, have been proposed for combining multiple data partitions into a more robust data partition. This range of possibilities raises a new problem: which of the existing approaches, both for producing the cluster ensemble's data partitions and for combining them, best fits a given data set. In this paper, we address this cluster ensemble selection problem. We propose a new measure for selecting the best consensus data partition among a variety of consensus partitions, based on a notion of average cluster consistency between each data partition in the cluster ensemble and a given consensus partition. We compared the proposed measure with other measures for cluster ensemble selection on 9 different data sets. The experimental results showed that the consensus partitions selected by our approach were usually of better quality than those selected by the other measures used in our experiments.
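
The selection idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula for average cluster consistency, so the score below is an assumption made only for illustration: each cluster of an ensemble partition is matched to the consensus cluster it overlaps most, the matched points are counted as a fraction of all points, and the result is averaged over the ensemble. The function names (`cluster_consistency`, `average_cluster_consistency`, `select_consensus`) are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def cluster_consistency(partition: Sequence[Hashable],
                        consensus: Sequence[Hashable]) -> float:
    """Fraction of points lying in the best-matching consensus cluster
    of their own cluster (illustrative score, NOT the paper's formula)."""
    if len(partition) != len(consensus) or len(partition) == 0:
        raise ValueError("label vectors must be non-empty and of equal length")
    # For every cluster of `partition`, count how its points are spread
    # over the clusters of `consensus`.
    overlaps: Dict[Hashable, Counter] = {}
    for p_label, c_label in zip(partition, consensus):
        overlaps.setdefault(p_label, Counter())[c_label] += 1
    # Keep only the largest overlap for each cluster.
    matched = sum(max(counts.values()) for counts in overlaps.values())
    return matched / len(partition)


def average_cluster_consistency(ensemble: Sequence[Sequence[Hashable]],
                                consensus: Sequence[Hashable]) -> float:
    """Mean of the per-partition scores over the whole cluster ensemble."""
    if not ensemble:
        raise ValueError("the ensemble must contain at least one partition")
    return sum(cluster_consistency(p, consensus) for p in ensemble) / len(ensemble)


def select_consensus(ensemble: Sequence[Sequence[Hashable]],
                     candidates: Dict[str, Sequence[Hashable]]) -> str:
    """Return the name of the candidate consensus partition with the highest score."""
    scores = {name: average_cluster_consistency(ensemble, labels)
              for name, labels in candidates.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    ensemble = [[0, 0, 1, 1], [0, 0, 0, 1]]
    candidates = {"A": [0, 0, 1, 1], "B": [0, 1, 0, 1]}
    print(select_consensus(ensemble, candidates))
```

This simplified score has no correction for the number of clusters: a consensus partition that puts every point in one cluster always scores 1. It is therefore suitable only for comparing candidates with a similar number of clusters, and any practical measure would need to handle that case.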


Paper Citation


in Harvard Style

Duarte, F. J. F., Duarte, J. M. M., Rodrigues, M. F. C. and Fred, A. L. N. (2009). CLUSTER ENSEMBLE SELECTION - Using Average Cluster Consistency. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009), ISBN 978-989-674-011-5, pages 85-95. DOI: 10.5220/0002308500850095


in Bibtex Style

@conference{kdir09,
author={F. Jorge F. Duarte and João M. M. Duarte and M. Fátima C. Rodrigues and Ana L. N. Fred},
title={CLUSTER ENSEMBLE SELECTION - Using Average Cluster Consistency},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},
year={2009},
pages={85-95},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002308500850095},
isbn={978-989-674-011-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
TI - CLUSTER ENSEMBLE SELECTION - Using Average Cluster Consistency
SN - 978-989-674-011-5
AU - Duarte, F. Jorge F.
AU - Duarte, João M. M.
AU - Rodrigues, M. Fátima C.
AU - Fred, Ana L. N.
PY - 2009
SP - 85
EP - 95
DO - 10.5220/0002308500850095