COST SENSITIVE AND PREPROCESSING FOR CLASSIFICATION WITH IMBALANCED DATA-SETS: SIMILAR BEHAVIOUR AND POTENTIAL HYBRIDIZATIONS

Victoria López, Alberto Fernández, María José del Jesus, Francisco Herrera

2012

Abstract

Classification with imbalanced data-sets has posed a serious challenge for researchers in recent years. The main difficulty stems from the large number of real-world applications in which one of the classes has far fewer examples than the other, making it harder to learn correctly and, most importantly, this minority class is usually the one of highest interest. To address this problem, two main methodologies have been proposed for stressing the significance of the minority class and for achieving a good discrimination for both classes, namely preprocessing of instances and cost-sensitive learning. The former rebalances the class distribution by replicating or creating new instances of the minority class (oversampling) or by removing instances of the majority class (undersampling), whereas the latter assigns higher misclassification costs to samples of the minority class and seeks to minimize the resulting high-cost errors. Both solutions have been shown to be valid for dealing with the class imbalance problem but, to the best of our knowledge, no comparison between the two approaches has ever been performed. In this work, we carry out a full, exhaustive analysis of these two methodologies, also including a hybrid procedure that tries to combine the best of both models. We show, by means of a statistical comparative analysis over a large collection of more than 60 imbalanced data-sets, that no single approach can be highlighted over the rest, and we discuss, as a potential research line, the use of hybridizations for achieving better solutions to the imbalanced data-set problem.
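The two methodologies contrasted in the abstract can be illustrated with a minimal sketch (not the authors' implementation): random oversampling replicates minority-class examples until the classes are balanced, while the cost-sensitive alternative leaves the data untouched and instead assigns each example a weight inversely proportional to its class frequency. The function names below are illustrative, not from the paper.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Preprocessing approach: rebalance the training set by randomly
    replicating minority-class examples until both classes have the
    same number of instances (random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(y)
    (maj_label, n_maj), (min_label, n_min) = counts.most_common(2)
    # indices of minority-class examples available for replication
    min_idx = [i for i, label in enumerate(y) if label == min_label]
    X_new, y_new = list(X), list(y)
    for _ in range(n_maj - n_min):
        i = rng.choice(min_idx)
        X_new.append(X[i])
        y_new.append(y[i])
    return X_new, y_new

def cost_sensitive_weights(y):
    """Cost-sensitive approach: keep the data unchanged and give each
    example a weight inversely proportional to its class frequency,
    so that errors on the minority class incur a higher cost."""
    counts = Counter(y)
    n = len(y)
    return [n / (len(counts) * counts[label]) for label in y]
```

For example, with `y = [0, 0, 0, 0, 1]`, oversampling yields four examples of each class, while the weighting function gives the lone minority example a weight four times that of each majority example; more elaborate variants (e.g. SMOTE, which synthesizes new minority instances rather than replicating them) follow the same two templates.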

References

  1. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. (2011). KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multi-Valued Logic and Soft Computing, 17(2-3):255-287.
  2. Barandela, R., Sánchez, J. S., García, V., and Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3):849-851.
  3. Batista, G. E. A. P. A., Prati, R. C., and Monard, M. C. (2004). A study of the behaviour of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20-29.
  4. Bradford, J. P., Kunz, C., Kohavi, R., Brunk, C., and Brodley, C. E. (1998). Pruning decision trees with misclassification costs. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), pages 131-136.
  5. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357.
  6. Chawla, N. V., Cieslak, D. A., Hall, L. O., and Joshi, A. (2008). Automatically countering imbalance and its empirical relationship to cost. Data Mining and Knowledge Discovery, 17(2):225-252.
  7. Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations, 6(1):1-6.
  8. Chen, X., Fang, T., Huo, H., and Li, D. (2011). Graph-based feature selection for object-oriented classification in VHR airborne imagery. IEEE Transactions on Geoscience and Remote Sensing, 49(1):353-365.
  9. Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30.
  10. Domingos, P. (1999). Metacost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD'99), pages 155-164.
  11. Drown, D. J., Khoshgoftaar, T. M., and Seliya, N. (2009). Evolutionary sampling and software quality modeling of high-assurance systems. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 39(5):1097-1107.
  12. Ducange, P., Lazzerini, B., and Marcelloni, F. (2010). Multi-objective genetic fuzzy classifiers for imbalanced and cost-sensitive datasets. Soft Computing, 14(7):713-728.
  13. Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the 17th IEEE International Joint Conference on Artificial Intelligence (IJCAI'01), pages 973-978.
  14. Estabrooks, A., Jo, T., and Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1):18-36.
  15. Fernández, A., del Jesus, M. J., and Herrera, F. (2010). On the 2-tuples based genetic tuning performance for fuzzy rule based classification systems in imbalanced data-sets. Information Sciences, 180(8):1268-1291.
  16. Fernández, A., García, S., del Jesus, M. J., and Herrera, F. (2008). A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets and Systems, 159(18):2378-2398.
  17. Fernandez, A., García, S., Luengo, J., Bernadó-Mansilla, E., and Herrera, F. (2010). Genetics-based machine learning for rule induction: State of the art, taxonomy and comparative study. IEEE Transactions on Evolutionary Computation, 14(6):913-941.
  18. García, S., Fernández, A., and Herrera, F. (2009). Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems. Applied Soft Computing, 9:1304-1314.
  19. García, S., Fernández, A., Luengo, J., and Herrera, F. (2009). A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Computing, 13(10):959-977.
  20. García, S., Fernández, A., Luengo, J., and Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10):2044-2064.
  21. García, S. and Herrera, F. (2008). An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research, 9:2607-2624.
  22. Guo, X., Dong, Y. Y. C., Yang, G., and Zhou, G. (2008). On the class imbalance problem. In Proceedings of the 4th International Conference on Natural Computation (ICNC'08), volume 4, pages 192-201.
  23. He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284.
  24. Huang, J. and Ling, C. X. (2005). Using AUC and accuracy in evaluating learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 17(3):299-310.
  25. Japkowicz, N. and Stephen, S. (2002). The class imbalance problem: a systematic study. Intelligent Data Analysis Journal, 6(5):429-450.
  26. Kwak, N. (2008). Feature extraction for classification problems and its application to face recognition. Pattern Recognition, 41(5):1718-1734.
  27. Ling, C. X., Yang, Q., Wang, J., and Zhang, S. (2004). Decision trees with minimal costs. In Brodley, C. E., editor, Proceedings of the 21st International Conference on Machine Learning (ICML'04), volume 69 of ACM International Conference Proceeding Series, pages 69-77. ACM.
  28. Lo, H.-Y., Chang, C.-M., Chiang, T.-H., Hsiao, C.-Y., Huang, A., Kuo, T.-T., Lai, W.-C., Yang, M.-H., Yeh, J.-J., Yen, C.-C., and Lin, S.-D. (2008). Learning to improve area-under-FROC for imbalanced medical data classification using an ensemble method. SIGKDD Explorations, 10(2):43-46.
  29. Mazurowski, M. A., Habas, P. A., Zurada, J. M., Lo, J. Y., Baker, J. A., and Tourassi, G. D. (2008). Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Networks, 21(2-3).
  30. Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
  31. Riddle, P., Segal, R., and Etzioni, O. (1994). Representation design and brute-force induction in a Boeing manufacturing domain. Applied Artificial Intelligence, 8:125-147.
  32. Shaffer, J. (1986). Modified sequentially rejective multiple test procedures. Journal of the American Statistical Association, 81(395):826-831.
  33. Sheskin, D. (2006). Handbook of parametric and nonparametric statistical procedures. Chapman & Hall/CRC.
  34. Su, C.-T. and Hsiao, Y.-H. (2007). An evaluation of the robustness of MTS for imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 19(10):1321-1332.
  35. Sun, Y., Kamel, M. S., Wong, A. K. C., and Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358-3378.
  36. Sun, Y., Wong, A. K. C., and Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4):687-719.
  37. Ting, K. M. (2002). An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, 14(3):659-665.
  38. Turney, P. (1995). Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research, 2:369-409.
  39. Wang, B. and Japkowicz, N. (2004). Imbalanced data set learning with synthetic samples. In Proceedings of the IRIS Machine Learning Workshop.
  40. Weiss, G. M. (2004). Mining with rarity: a unifying framework. SIGKDD Explorations, 6(1):7-19.
  41. Weiss, G. M. and Tian, Y. (2008). Maximizing classifier utility when there are data acquisition and modeling costs. Data Mining and Knowledge Discovery, 17(2):253-282.
  42. Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man and Cybernetics, 2(3):408-421.
  43. Wu, X. and Kumar, V., editors (2009). The Top Ten Algorithms in Data Mining. Data Mining and Knowledge Discovery Series. Chapman & Hall/CRC Press.
  44. Yang, Q. and Wu, X. (2006). 10 challenging problems in data mining research. International Journal of Information Technology and Decision Making, 5(4):597-604.
  45. Zadrozny, B. and Elkan, C. (2001). Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining (KDD'01), pages 204-213.
  46. Zadrozny, B., Langford, J., and Abe, N. (2003). Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM'03), pages 435-442.
  47. Zhou, Z.-H. and Liu, X.-Y. (2006). Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63-77.


Paper Citation


in Harvard Style

López V., Fernández A., José del Jesus M. and Herrera F. (2012). COST SENSITIVE AND PREPROCESSING FOR CLASSIFICATION WITH IMBALANCED DATA-SETS: SIMILAR BEHAVIOUR AND POTENTIAL HYBRIDIZATIONS. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM, ISBN 978-989-8425-99-7, pages 98-107. DOI: 10.5220/0003751600980107


in Bibtex Style

@conference{icpram12,
author={Victoria López and Alberto Fernández and María José del Jesus and Francisco Herrera},
title={COST SENSITIVE AND PREPROCESSING FOR CLASSIFICATION WITH IMBALANCED DATA-SETS: SIMILAR BEHAVIOUR AND POTENTIAL HYBRIDIZATIONS},
booktitle={Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM},
year={2012},
pages={98-107},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003751600980107},
isbn={978-989-8425-99-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM
TI - COST SENSITIVE AND PREPROCESSING FOR CLASSIFICATION WITH IMBALANCED DATA-SETS: SIMILAR BEHAVIOUR AND POTENTIAL HYBRIDIZATIONS
SN - 978-989-8425-99-7
AU - López V.
AU - Fernández A.
AU - José del Jesus M.
AU - Herrera F.
PY - 2012
SP - 98
EP - 107
DO - 10.5220/0003751600980107