Decision Trees and Data Preprocessing to Help Clustering Interpretation

Olivier Parisot, Mohammad Ghoniem, Benoît Otjacques

2014

Abstract

Clustering is a popular technique for data mining, knowledge discovery and visual analytics. However, cluster assignments can be difficult for a human analyst to interpret. A common remedy is to use decision trees to explain cluster assignments, but the success of this approach depends on the legibility of the resulting trees. In this work, we propose an evolutionary algorithm that preprocesses the data before clustering in order to obtain clusters that are simpler to interpret with decision trees. A prototype has been implemented and tested to show the benefits of the approach.
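
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that the only preprocessing operation is dropping features, that clustering is plain k-means, that tree legibility is measured by node count, and that the search is a simple elitist (1+1) loop. All function names (`kmeans`, `tree_size`, `fitness`, `evolve`, `make_toy_data`) and the toy data are hypothetical.

```python
import random
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the first k points serve as initial centers."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def tree_size(points, labels, min_leaf=2):
    """Node count of a greedy Gini-based binary tree that predicts the labels."""
    if len(set(labels)) <= 1 or len(points) < 2 * min_leaf:
        return 1
    best = None
    for f in range(len(points[0])):
        for t in sorted({p[f] for p in points})[:-1]:
            left = [l for p, l in zip(points, labels) if p[f] <= t]
            right = [l for p, l in zip(points, labels) if p[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = len(left) * gini(left) + len(right) * gini(right)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:
        return 1
    _, f, t = best
    lo = [(p, l) for p, l in zip(points, labels) if p[f] <= t]
    hi = [(p, l) for p, l in zip(points, labels) if p[f] > t]
    return 1 + tree_size(*zip(*lo), min_leaf) + tree_size(*zip(*hi), min_leaf)

def fitness(points, mask, k):
    """Preprocess (drop masked-out features), cluster, then size the explaining tree."""
    kept = [tuple(v for v, m in zip(p, mask) if m) for p in points]
    return tree_size(kept, kmeans(kept, k))

def evolve(points, k, generations=30, seed=0):
    """Elitist (1+1) search: flip one mask bit, keep the child if the tree is no larger."""
    rng = random.Random(seed)
    n = len(points[0])
    parent = [True] * n
    parent_fit = fitness(points, parent, k)
    for _ in range(generations):
        child = parent[:]
        i = rng.randrange(n)
        child[i] = not child[i]
        if not any(child):  # at least one feature must remain
            continue
        child_fit = fitness(points, child, k)
        if child_fit <= parent_fit:
            parent, parent_fit = child, child_fit
    return parent, parent_fit

def make_toy_data(n=40, seed=1):
    """Hypothetical data: feature 0 separates two groups, features 1 and 2 are noise."""
    rng = random.Random(seed)
    return [(10.0 * (i % 2) + rng.random(), rng.uniform(0, 10), rng.uniform(0, 10))
            for i in range(n)]

points = make_toy_data()
baseline = fitness(points, [True, True, True], k=2)
best_mask, best_size = evolve(points, k=2)
print("tree size without preprocessing:", baseline)
print("selected feature mask:", best_mask, "tree size:", best_size)
```

Because the loop only accepts a child whose tree is no larger than the parent's, the returned tree size can never exceed the baseline. The sketch does not penalize discarding informative features, so a practical fitness function would also need to account for how much the data has been altered.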


Paper Citation


in Harvard Style

Parisot O., Ghoniem M. and Otjacques B. (2014). Decision Trees and Data Preprocessing to Help Clustering Interpretation. In Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA, ISBN 978-989-758-035-2, pages 48-55. DOI: 10.5220/0005001300480055


in Bibtex Style

@conference{data14,
author={Olivier Parisot and Mohammad Ghoniem and Benoît Otjacques},
title={Decision Trees and Data Preprocessing to Help Clustering Interpretation},
booktitle={Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA},
year={2014},
pages={48-55},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005001300480055},
isbn={978-989-758-035-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of 3rd International Conference on Data Management Technologies and Applications - Volume 1: DATA
TI - Decision Trees and Data Preprocessing to Help Clustering Interpretation
SN - 978-989-758-035-2
AU - Parisot O.
AU - Ghoniem M.
AU - Otjacques B.
PY - 2014
SP - 48
EP - 55
DO - 10.5220/0005001300480055