SCUT: Multi-Class Imbalanced Data Classification using SMOTE and Cluster-based Undersampling

Astha Agrawal, Herna L. Viktor, Eric Paquet

2015

Abstract

Class imbalance is a crucial problem in machine learning and occurs in many domains. The two-class problem in particular has received considerable interest from researchers in recent years, leading to solutions for oil spill detection, tumour discovery and fraudulent credit card detection, amongst others. However, handling class imbalance in datasets that contain multiple classes, with varying degrees of imbalance, has received limited attention. In such a multi-class imbalanced dataset, the classification model tends to favour the majority classes and to misclassify instances from the minority classes as belonging to the majority classes, leading to poor predictive accuracy on the minority classes. Further, there is a need both to handle the imbalance between classes and to address the selection of examples within a class (i.e. the so-called within-class imbalance). In this paper, we propose the SCUT hybrid sampling method, which is used to balance the number of training examples in such a multi-class setting. Our SCUT approach oversamples minority class examples through the generation of synthetic examples and employs cluster analysis in order to undersample majority classes. In this way, it handles both within-class and between-class imbalance. Our experimental results on a number of multi-class problems show that, when the SCUT method is used to pre-process the data before classification, we obtain highly accurate models that compare favourably to the state-of-the-art.
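
The hybrid sampling idea described in the abstract can be illustrated with a minimal Python sketch that uses only the standard library. It is a sketch under stated assumptions, not the authors' implementation. The abstract does not say what size each class is balanced to, so the mean class size is assumed as the target. It also does not name the clustering algorithm, so a basic k-means is swapped in here. The SMOTE-style step is a simplified interpolation between an instance and one of its nearest same-class neighbours. All function names and parameters (`scut_like_resample`, `k_neighbours`, `n_clusters`, `seed`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def smote_like(points, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a random
    instance and one of its k nearest neighbours from the same class."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        others = [p for p in points if p is not base]
        if not others:  # single-instance class: can only duplicate
            synthetic.append(list(base))
            continue
        neighbours = sorted(others, key=lambda p: math.dist(p, base))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> nb
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic


def kmeans_labels(points, k, rng, iters=20):
    """Basic k-means (assumption: stands in for the paper's clustering)."""
    centres = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_undersample(points, n_keep, k, rng):
    """Keep n_keep original points, drawn round-robin from the clusters so
    that each cluster contributes roughly the same number of instances."""
    k = min(k, len(points))
    clusters = defaultdict(list)
    for p, l in zip(points, kmeans_labels(points, k, rng)):
        clusters[l].append(p)
    pools = list(clusters.values())
    for members in pools:
        rng.shuffle(members)
    kept = []
    while len(kept) < n_keep:
        for members in pools:
            if members and len(kept) < n_keep:
                kept.append(members.pop())
    return kept


def scut_like_resample(X, y, k_neighbours=5, n_clusters=3, seed=0):
    """Balance every class to the mean class size (assumed target)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, label in zip(X, y):
        by_class[label].append(list(x))
    target = round(sum(len(v) for v in by_class.values()) / len(by_class))
    X_out, y_out = [], []
    for label, pts in by_class.items():
        if len(pts) < target:    # minority class: add synthetic examples
            pts = pts + smote_like(pts, target - len(pts), k_neighbours, rng)
        elif len(pts) > target:  # majority class: cluster, then undersample
            pts = cluster_undersample(pts, target, n_clusters, rng)
        X_out.extend(pts)
        y_out.extend([label] * len(pts))
    return X_out, y_out


# Toy three-class dataset with 60, 20 and 10 instances (made-up data).
data_rng = random.Random(42)
sizes = {"a": 60, "b": 20, "c": 10}
centres = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (0.0, 6.0)}
X, y = [], []
for label, n in sizes.items():
    cx, cy = centres[label]
    for _ in range(n):
        X.append([data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0)])
        y.append(label)

X_bal, y_bal = scut_like_resample(X, y)
print(Counter(y_bal))
```

On this toy data the mean class size is 30, so each of the three classes ends up with 30 instances. Drawing evenly from clusters, rather than at random from the whole majority class, is one way to keep sparse regions of that class represented, which is the within-class aspect the abstract refers to.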

Paper Citation


in Harvard Style

Agrawal A., Viktor H. and Paquet E. (2015). SCUT: Multi-Class Imbalanced Data Classification using SMOTE and Cluster-based Undersampling. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015) ISBN 978-989-758-158-8, pages 226-234. DOI: 10.5220/0005595502260234


in Bibtex Style

@conference{kdir15,
author={Astha Agrawal and Herna L. Viktor and Eric Paquet},
title={SCUT: Multi-Class Imbalanced Data Classification using SMOTE and Cluster-based Undersampling},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)},
year={2015},
pages={226-234},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005595502260234},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2015)
TI - SCUT: Multi-Class Imbalanced Data Classification using SMOTE and Cluster-based Undersampling
SN - 978-989-758-158-8
AU - Agrawal A.
AU - Viktor H.
AU - Paquet E.
PY - 2015
SP - 226
EP - 234
DO - 10.5220/0005595502260234