ECD Test: An Empirical Way based on the Cumulative Distributions to Evaluate the Number of Clusters for Unsupervised Clustering

Dylan Molinié, Kurosh Madani

2022

Abstract

Unsupervised clustering consists of blindly gathering unknown data into compact and homogeneous groups; it is one of the very first steps of any Machine Learning approach, whether for Data Mining, Knowledge Extraction, Anomaly Detection or System Modeling. Unfortunately, unsupervised clustering has a major drawback: it requires manually set parameters to perform accurately, one of which is the expected number of clusters. This parameter often determines whether or not the clusters will represent the system in a relevant way. The literature offers no universal way to estimate this value; in this paper, we address this problem through a novel approach. To do so, we rely on a single, blind clustering, then we characterize the resulting clusters by their Empirical Cumulative Distributions, which we compare to one another using the Modified Hausdorff Distance, and we finally regroup the clusters by Region Growing, driven by these characteristics. This makes it possible to rebuild the regions of the feature space: the expected number of clusters is the number of regions found. We apply this methodology to both academic and real industrial data, and show that it provides very good estimates of the number of clusters, regardless of the dataset's complexity or the clustering method used.
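
The three steps named in the abstract (Empirical Cumulative Distributions, Modified Hausdorff Distance, Region Growing) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-feature ECDs sampled on a shared grid, the use of the largest per-feature distance, and the fixed `threshold` are all assumptions made for illustration, and the features are assumed to be scaled to [0, 1] so that both axes of each curve are comparable.

```python
import numpy as np


def ecdf_curve(values, grid):
    """Sample the empirical CDF of `values` on `grid`; return points (x, F(x))."""
    sorted_vals = np.sort(values)
    # Fraction of samples that are <= each grid point.
    cdf = np.searchsorted(sorted_vals, grid, side="right") / len(sorted_vals)
    return np.column_stack([grid, cdf])


def modified_hausdorff(a, b):
    """Modified Hausdorff Distance: the larger of the two mean nearest-point distances."""
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    forward = pairwise.min(axis=1).mean()   # mean distance from a to b
    backward = pairwise.min(axis=0).mean()  # mean distance from b to a
    return max(forward, backward)


def count_regions(data, labels, threshold=0.1, grid_size=50):
    """Merge clusters with similar ECDs and return the number of merged regions.

    `threshold` and `grid_size` are hypothetical parameters, not values from the paper.
    """
    cluster_ids = np.unique(labels).tolist()
    n_features = data.shape[1]
    # One shared grid per feature, so all curves are sampled at the same abscissas.
    grids = [
        np.linspace(data[:, j].min(), data[:, j].max(), grid_size)
        for j in range(n_features)
    ]
    curves = {
        c: [ecdf_curve(data[labels == c, j], grids[j]) for j in range(n_features)]
        for c in cluster_ids
    }

    def distance(c1, c2):
        # Assumption: two clusters are compared by their worst (largest) per-feature MHD.
        return max(
            modified_hausdorff(curves[c1][j], curves[c2][j])
            for j in range(n_features)
        )

    # Region growing: start from a seed cluster and absorb every cluster
    # whose ECDs are closer than `threshold` to a cluster already in the region.
    unvisited = set(cluster_ids)
    regions = 0
    while unvisited:
        frontier = [unvisited.pop()]
        regions += 1
        while frontier:
            current = frontier.pop()
            close = [c for c in unvisited if distance(current, c) < threshold]
            for c in close:
                unvisited.remove(c)
                frontier.append(c)
    return regions
```

Here `labels` would come from any initial clustering run with a deliberately large number of clusters; the returned count is then the estimate of the number of clusters. The merging criterion and the threshold would need to be adapted to the data at hand.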

Paper Citation


in Harvard Style

Molinié D. and Madani K. (2022). ECD Test: An Empirical Way based on the Cumulative Distributions to Evaluate the Number of Clusters for Unsupervised Clustering. In Proceedings of the 3rd International Conference on Innovative Intelligent Industrial Production and Logistics - Volume 1: ETCIIM, ISBN 978-989-758-612-5, pages 279-290. DOI: 10.5220/0011562500003329


in Bibtex Style

@conference{etciim22,
author={Dylan Molinié and Kurosh Madani},
title={ECD Test: An Empirical Way based on the Cumulative Distributions to Evaluate the Number of Clusters for Unsupervised Clustering},
booktitle={Proceedings of the 3rd International Conference on Innovative Intelligent Industrial Production and Logistics - Volume 1: ETCIIM},
year={2022},
pages={279-290},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011562500003329},
isbn={978-989-758-612-5},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 3rd International Conference on Innovative Intelligent Industrial Production and Logistics - Volume 1: ETCIIM
TI - ECD Test: An Empirical Way based on the Cumulative Distributions to Evaluate the Number of Clusters for Unsupervised Clustering
SN - 978-989-758-612-5
AU - Molinié D.
AU - Madani K.
PY - 2022
SP - 279
EP - 290
DO - 10.5220/0011562500003329