# A COMPARATIVE EVALUATION OF PROXIMITY MEASURES FOR SPECTRAL CLUSTERING

### Nadia Farhanaz Azam, Herna L. Viktor

#### Abstract

A cluster analysis algorithm is considered successful when it partitions the data into meaningful groups, so that objects in the same group are similar and objects in different groups are dissimilar. One such algorithm, spectral clustering, has been deployed across numerous domains, ranging from image processing to the clustering of protein sequences, and across a wide range of data types. Its input is a similarity matrix constructed from the pair-wise similarities between the data objects, which are calculated using a proximity (similarity, dissimilarity or distance) measure. The success of a spectral clustering algorithm therefore depends heavily on the choice of proximity measure. While the majority of prior research on spectral clustering emphasizes algorithm-specific issues, little research has evaluated the performance of the proximity measures themselves. To this end, we perform a comparative and exploratory analysis of several existing proximity measures to evaluate their suitability for the spectral clustering algorithm. Our results indicate that the commonly used Euclidean distance measure may not always be a good choice, especially in domains where the data is highly imbalanced and the correct clustering of boundary objects is crucial. Furthermore, for numeric data, measures based on relative distances often yield better results than measures based on absolute distances, specifically when aiming to cluster boundary objects. When considering mixed data, the measure for numeric data has the highest impact on the final outcome and, again, the use of the Euclidean measure may be inappropriate.
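To make the role of the proximity measure concrete, the following minimal Python sketch builds the similarity matrix that a spectral clustering algorithm would take as input, once from an absolute distance (Euclidean) and once from a relative distance. The Canberra distance, the Gaussian kernel used to turn distances into similarities, the `sigma` parameter, the function names, and the toy points are all illustrative assumptions; the abstract does not state which relative measures or which kernel the authors used.

```python
import math


def euclidean(x, y):
    """Absolute distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def canberra(x, y):
    """Relative distance (assumed example): each coordinate difference is
    scaled by the magnitudes of the two values being compared."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:  # treat a 0/0 term as contributing nothing
            total += abs(a - b) / denom
    return total


def similarity_matrix(points, distance, sigma=1.0):
    """Convert pair-wise distances to similarities with a Gaussian kernel
    (an assumed, commonly used choice; sigma is a hypothetical bandwidth)."""
    n = len(points)
    return [
        [math.exp(-distance(points[i], points[j]) ** 2 / (2 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: two nearby points and two far-away points.
points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]
S_euclid = similarity_matrix(points, euclidean)
S_canberra = similarity_matrix(points, canberra)
```

Swapping the `distance` argument is the only change needed to compare measures, which mirrors the kind of comparison the abstract describes: the clustering step stays fixed while the proximity measure varies.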

#### Paper Citation

#### in Harvard Style

Farhanaz Azam N. and Viktor H. (2011). **A COMPARATIVE EVALUATION OF PROXIMITY MEASURES FOR SPECTRAL CLUSTERING**. In *Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)*, ISBN 978-989-8425-79-9, pages 30-41. DOI: 10.5220/0003649000300041

#### in Bibtex Style

@conference{kdir11,

author={Nadia Farhanaz Azam and Herna L. Viktor},

title={A COMPARATIVE EVALUATION OF PROXIMITY MEASURES FOR SPECTRAL CLUSTERING},

booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},

year={2011},

pages={30-41},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0003649000300041},

isbn={978-989-8425-79-9},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)

TI - A COMPARATIVE EVALUATION OF PROXIMITY MEASURES FOR SPECTRAL CLUSTERING

SN - 978-989-8425-79-9

AU - Farhanaz Azam N.

AU - Viktor H.

PY - 2011

SP - 30

EP - 41

DO - 10.5220/0003649000300041