Index Clustering: A Map-reduce Clustering Approach using Numba

Xinyu Chen, Trilce Estrada

Abstract

Clustering high-dimensional data is a crucial step in many applications. However, the so-called "curse of dimensionality" is a challenge for most clustering algorithms. In high-dimensional spaces, distances between points become less meaningful and the space becomes sparse. This sparsity means that more data points are needed to characterize similarities, so more distance comparisons must be computed. Many approaches have been proposed for dimensionality reduction, such as subspace clustering, random projection clustering, and feature selection techniques. However, such approaches become infeasible when data is geographically distributed or cannot be openly shared across sites. To address these location and privacy issues and to mitigate the cost of distance computation, we propose an index-based clustering algorithm that generates a spatial key for each data point across all dimensions without requiring explicit knowledge of the other data points. It then performs a conceptual Map-Reduce procedure in the index space to form the final clustering assignment. Our results show that this algorithm is linear and can be parallelized and executed independently across points and dimensions. We present a Numba implementation and a preliminary study of the algorithm's capabilities and limitations.
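The abstract does not say how the spatial key is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes equal-width grid cells over fixed, pre-agreed per-dimension bounds, and it uses plain Python rather than Numba. The names `spatial_key`, `index_cluster`, `lows`, `highs`, and `bins` are hypothetical. The map step computes each point's key from that point alone; the reduce step groups points that share a key.

```python
from collections import defaultdict

def spatial_key(point, lows, highs, bins):
    """Map step (assumed scheme): grid-cell key computed from one point only."""
    key = []
    for x, lo, hi in zip(point, lows, highs):
        # Scale the coordinate into bin units using the agreed bounds.
        cell = int((x - lo) / (hi - lo) * bins)
        # Clamp so values at or beyond the bounds fall in the edge bins.
        key.append(min(max(cell, 0), bins - 1))
    return tuple(key)

def index_cluster(points, lows, highs, bins):
    """Reduce step: collect the indices of points that share a key."""
    groups = defaultdict(list)
    for i, p in enumerate(points):
        groups[spatial_key(p, lows, highs, bins)].append(i)
    return dict(groups)

points = [(0.1, 0.2), (0.15, 0.22), (0.9, 0.8), (0.95, 0.85)]
clusters = index_cluster(points, lows=(0.0, 0.0), highs=(1.0, 1.0), bins=4)
print(clusters)
```

In this sketch each key depends only on its own point and the shared bounds, so the map step needs no pairwise distance computation and could run separately at each site, with only keys exchanged for the reduce step.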

Paper Citation


in Harvard Style

Chen X. and Estrada T. (2017). Index Clustering: A Map-reduce Clustering Approach using Numba. In Proceedings of the 6th International Conference on Data Science, Technology and Applications - Volume 1: DATA, ISBN 978-989-758-255-4, pages 233-240. DOI: 10.5220/0006437402330240


in Bibtex Style

@conference{data17,
author={Xinyu Chen and Trilce Estrada},
title={Index Clustering: A Map-reduce Clustering Approach using Numba},
booktitle={Proceedings of the 6th International Conference on Data Science, Technology and Applications - Volume 1: DATA},
year={2017},
pages={233-240},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006437402330240},
isbn={978-989-758-255-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Data Science, Technology and Applications - Volume 1: DATA
TI - Index Clustering: A Map-reduce Clustering Approach using Numba
SN - 978-989-758-255-4
AU - Chen X.
AU - Estrada T.
PY - 2017
SP - 233
EP - 240
DO - 10.5220/0006437402330240