# Index Clustering: A Map-reduce Clustering Approach using Numba

### Xinyu Chen, Trilce Estrada

#### Abstract

Clustering high-dimensional data is often a crucial step in many applications. However, the so-called "curse of dimensionality" poses a challenge for most clustering algorithms. In high-dimensional spaces, distances between points become less meaningful and the data become sparse. Because of this sparsity, more data points are needed to characterize similarities, so more distance comparisons must be computed. Many approaches have been proposed to reduce dimensionality, such as subspace clustering, random projection clustering, and feature selection techniques. However, such approaches become infeasible when data is geographically distributed or cannot be openly shared across sites. To address these location and privacy issues and to mitigate the expensive distance computations, we propose an index-based clustering algorithm that generates a spatial *key* for each data point across all dimensions without requiring explicit knowledge of the other data points. It then performs a conceptual Map-Reduce procedure in the index space to form the final clustering assignment. Our results show that this algorithm is linear and can be parallelized and executed independently across points and dimensions. We present a Numba implementation and a preliminary study of the algorithm's capabilities and limitations.
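The abstract does not spell out how the key is built, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' Numba implementation. It assumes the key is a tuple of grid-cell indices obtained by quantizing each coordinate independently, and that the lower bounds and cell width are agreed on in advance so that no point needs to see any other point. The names `spatial_key`, `index_cluster`, `lower`, and `cell_width` are hypothetical. The sketch uses only the Python standard library to stay self-contained.

```python
from collections import defaultdict
import math


def spatial_key(point, lower, cell_width):
    """Map step (assumed form): quantize each coordinate on its own.

    Each dimension is handled independently, and the key depends only on
    this point plus the pre-agreed bounds, never on other points.
    """
    return tuple(math.floor((x - lo) / cell_width) for x, lo in zip(point, lower))


def index_cluster(points, lower, cell_width):
    """Reduce step (assumed form): group point indices that share a key."""
    groups = defaultdict(list)
    for i, p in enumerate(points):  # one pass over the points
        groups[spatial_key(p, lower, cell_width)].append(i)
    return dict(groups)


# Toy data: two pairs of nearby points in two dimensions.
points = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.95, 0.85)]
clusters = index_cluster(points, lower=(0.0, 0.0), cell_width=0.5)
print(clusters)
```

Because each key is computed from a single point, the map step makes one pass over the data and could run separately at each site, with only the keys being exchanged for the grouping step. Whether the published algorithm merges neighboring cells or uses a different key encoding is not stated in the abstract.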

#### Paper Citation

#### in Harvard Style

Chen X. and Estrada T. (2017). **Index Clustering: A Map-reduce Clustering Approach using Numba**. In *Proceedings of the 6th International Conference on Data Science, Technology and Applications - Volume 1: DATA*, ISBN 978-989-758-255-4, pages 233-240. DOI: 10.5220/0006437402330240

#### in Bibtex Style

@conference{data17,

author={Xinyu Chen and Trilce Estrada},

title={Index Clustering: A Map-reduce Clustering Approach using Numba},

booktitle={Proceedings of the 6th International Conference on Data Science, Technology and Applications - Volume 1: DATA},

year={2017},

pages={233-240},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0006437402330240},

isbn={978-989-758-255-4},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the 6th International Conference on Data Science, Technology and Applications - Volume 1: DATA

TI - Index Clustering: A Map-reduce Clustering Approach using Numba

SN - 978-989-758-255-4

AU - Chen X.

AU - Estrada T.

PY - 2017

SP - 233

EP - 240

DO - 10.5220/0006437402330240