Authors:
Nada Boudegzdame 1; Karima Sedki 1; Rosy Tspora 2, 3, 4 and Jean-Baptiste Lamy 1
Affiliations:
1 LIMICS, INSERM, Université Sorbonne Paris Nord, Sorbonne Université, France
2 INSERM, Université de Paris Cité, Sorbonne Université, Cordeliers Research Center, France
3 HeKA, INRIA, France
4 Department of Medical Informatics, Hôpital Européen Georges-Pompidou, AP-HP, France
Keyword(s):
Imbalanced Data, Oversampling, SMOTE, Data Augmentation, Class Imbalance, Machine Learning, Neural Networks, Synthetic Data.
Abstract:
Oversampling algorithms are used as a preprocessing step in machine learning when the data are highly imbalanced, in an attempt to balance the number of samples per class and thereby improve the quality of the learned models. While oversampling can improve the performance of classification models on minority classes, it can also introduce several problems. Our work shows that models learn to detect the noise added by the oversampling algorithms instead of the underlying patterns. In this article, we define oversampling and present the most common techniques, before proposing a method for evaluating oversampling algorithms.
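To make the notion of oversampling concrete, the following is a minimal sketch of SMOTE-style interpolation, written for illustration only and not taken from the article. The function name `smote_like_oversample`, its parameters, and the toy data are assumptions; a real study would more likely rely on an established library implementation.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.

    Hypothetical helper for illustration; not the authors' method.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (q - b) for b, q in zip(base, neighbour))
        )
    return synthetic


# Toy imbalanced dataset (assumed values): 95 majority vs. 5 minority samples.
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(95)]
minority = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(5)]

# Generate enough synthetic minority samples to match the majority class size.
synthetic = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + synthetic
print(len(majority), len(balanced_minority))  # 95 95
```

In this sketch every synthetic point lies on a segment between two real minority samples, so the balanced class contains 90 generated points for only 5 observed ones.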