# AN HYBRID APPROACH TO FEATURE SELECTION FOR MIXED CATEGORICAL AND CONTINUOUS DATA

### Gauthier Doquire, Michel Verleysen

#### Abstract

This paper proposes a feature selection algorithm for mixed data. It ranks the categorical and the continuous features independently, then recombines the two rankings according to the accuracy of a classifier. The popular mutual information criterion is used in both ranking procedures. The proposed algorithm thus avoids any similarity measure between samples described by both continuous and categorical attributes, since such measures can be ill-suited to many real-world problems. It effectively detects the most useful features of each type, and its effectiveness is demonstrated experimentally on four real-world data sets.
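The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several pieces are assumptions made only for illustration: the function names are hypothetical, continuous features are handled by crude equal-width binning (a stand-in for whatever continuous mutual information estimator the paper uses), the `accuracy` argument is a hypothetical callback that returns a classifier's accuracy for a given list of feature names, and the greedy interleaving rule is one plausible reading of "recombining them according to the accuracy of a classifier".

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y), in nats, for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def discretize(values, bins=4):
    """Equal-width binning (assumption: a crude stand-in for a
    continuous mutual information estimator)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]


def rank_features(columns, labels, continuous=False):
    """Return feature names sorted by decreasing estimated MI with the labels.

    `columns` maps a feature name to its list of values."""
    scores = {}
    for name, col in columns.items():
        col = discretize(col) if continuous else col
        scores[name] = mutual_information(col, labels)
    return sorted(scores, key=scores.get, reverse=True)


def merge_rankings(cat_rank, cont_rank, accuracy):
    """Greedily interleave two rankings (assumed merging rule).

    At each step, the best remaining feature of each type is tried, and the
    one giving the higher value of the hypothetical `accuracy` callback is
    appended to the selection."""
    cat, cont = list(cat_rank), list(cont_rank)
    selected = []
    while cat or cont:
        candidates = [r for r in (cat, cont) if r]
        best = max(candidates, key=lambda r: accuracy(selected + [r[0]]))
        selected.append(best.pop(0))
    return selected
```

In use, `rank_features` would be called once on the categorical columns and once on the continuous columns with `continuous=True`, and the two resulting lists passed to `merge_rankings` together with a callback that trains and evaluates a classifier (for example by cross-validation) on the candidate subset. Note that no distance between mixed-type samples is needed for the ranking step, which is the point the abstract emphasizes.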

#### Paper Citation

#### in Harvard Style

Doquire G. and Verleysen M. (2011). **AN HYBRID APPROACH TO FEATURE SELECTION FOR MIXED CATEGORICAL AND CONTINUOUS DATA**. In *Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)*, ISBN 978-989-8425-79-9, pages 386-393. DOI: 10.5220/0003634903940401

#### in Bibtex Style

@conference{kdir11,

author={Gauthier Doquire and Michel Verleysen},

title={AN HYBRID APPROACH TO FEATURE SELECTION FOR MIXED CATEGORICAL AND CONTINUOUS DATA},

booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},

year={2011},

pages={386-393},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0003634903940401},

isbn={978-989-8425-79-9},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)

TI - AN HYBRID APPROACH TO FEATURE SELECTION FOR MIXED CATEGORICAL AND CONTINUOUS DATA

SN - 978-989-8425-79-9

AU - Doquire G.

AU - Verleysen M.

PY - 2011

SP - 386

EP - 393

DO - 10.5220/0003634903940401