A LEARNING METHOD FOR IMBALANCED DATA SETS

Jorge de la Calleja, Olac Fuentes, Jesús González, Rita M. Aceves-Pérez

2009

Abstract

Many real-world domains present the problem of imbalanced data sets, where examples of one class significantly outnumber examples of the other classes. This situation makes learning difficult, because learning algorithms that optimize accuracy over all training examples tend to classify every example as belonging to the majority class. In this paper we introduce a method for learning from imbalanced data sets that is composed of three algorithms. Our experimental results show that the method classifies accurately in the presence of significant class imbalance, even when trained on small training sets.
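The bias toward the majority class described above can be illustrated with a minimal sketch. It is not the method introduced in the paper; it only shows, on hypothetical data with an assumed 95:5 class ratio, why overall accuracy is a misleading objective under imbalance. The function names (`majority_baseline`, `accuracy`, `recall`) are chosen here for illustration.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a classifier that always predicts the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return lambda _example: majority_label


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


# Hypothetical data: 95 majority examples (label 0) and 5 minority examples (label 1).
labels = [0] * 95 + [1] * 5

# The baseline ignores its input, so the example passed to it is irrelevant here.
classify = majority_baseline(labels)
predictions = [classify(None) for _ in labels]

overall_accuracy = accuracy(labels, predictions)
minority_recall = recall(labels, predictions, positive=1)
print(f"accuracy={overall_accuracy:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.95, minority recall=0.00
```

A classifier that never detects the minority class still scores 95% accuracy on this data, which is why work on imbalanced learning typically reports per-class measures such as recall alongside accuracy.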



Paper Citation


in Harvard Style

de la Calleja J., Fuentes O., González J. and Aceves-Pérez R. M. (2009). A LEARNING METHOD FOR IMBALANCED DATA SETS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009) ISBN 978-989-674-011-5, pages 307-310. DOI: 10.5220/0002305303070310


in Bibtex Style

@conference{kdir09,
author={Jorge de la Calleja and Olac Fuentes and Jesús González and Rita M. Aceves-Pérez},
title={A LEARNING METHOD FOR IMBALANCED DATA SETS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)},
year={2009},
pages={307-310},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002305303070310},
isbn={978-989-674-011-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2009)
TI - A LEARNING METHOD FOR IMBALANCED DATA SETS
SN - 978-989-674-011-5
AU - de la Calleja J.
AU - Fuentes O.
AU - González J.
AU - Aceves-Pérez R. M.
PY - 2009
SP - 307
EP - 310
DO - 10.5220/0002305303070310