Novel Feature Selection Methods for High Dimensional Data

Verónica Bolón-Canedo, Noelia Sánchez-Maroño, Amparo Alonso-Betanzos

2014

Abstract

Over the last few years, the dimensionality of datasets involved in data mining applications has increased dramatically. In this situation, feature selection becomes indispensable, as it allows for dimensionality reduction and relevance detection. This paper studies the impact of feature selection on high-dimensional data and presents novel methods. After demonstrating the adequacy of feature selection in real applications, it describes new methods covering topics such as ensemble learning, distributed learning, scalability of algorithms, and cost-based feature selection.



Paper Citation


in Harvard Style

Bolón-Canedo V., Sánchez-Maroño N. and Alonso-Betanzos A. (2014). Novel Feature Selection Methods for High Dimensional Data. In Doctoral Consortium - DCAART, (ICAART 2014) ISBN Not Available, pages 3-14


in Bibtex Style

@conference{dcaart14,
author={Verónica Bolón-Canedo and Noelia Sánchez-Maroño and Amparo Alonso-Betanzos},
title={Novel Feature Selection Methods for High Dimensional Data},
booktitle={Doctoral Consortium - DCAART, (ICAART 2014)},
year={2014},
pages={3-14},
publisher={SciTePress},
organization={INSTICC},
doi={},
isbn={Not Available},
}


in EndNote Style

TY - CONF
JO - Doctoral Consortium - DCAART, (ICAART 2014)
TI - Novel Feature Selection Methods for High Dimensional Data
SN - Not Available
AU - Bolón-Canedo V.
AU - Sánchez-Maroño N.
AU - Alonso-Betanzos A.
PY - 2014
SP - 3
EP - 14
DO -