An Approach for Improving Oversampling by Filtering out Unrealistic
Synthetic Data
Nada Boudegzdame (1,a), Karima Sedki (1,b), Rosy Tspora (2,3,4,c) and Jean-Baptiste Lamy (1,d)
1 LIMICS, INSERM, Université Sorbonne Paris Nord, Sorbonne Université, France
2 INSERM, Université de Paris Cité, Sorbonne Université, Cordeliers Research Center, France
3 HeKA, INRIA, France
4 Department of Medical Informatics, Hôpital Européen Georges-Pompidou, AP-HP, France
Keywords:
Imbalanced Data, Oversampling, SMOTE, Class Imbalance, Data Augmentation, Machine Learning, Neural
Networks, Synthetic Data, Synthetic Sample Detector, Generative Adversarial Networks.
Abstract:
Oversampling algorithms are commonly used in machine learning to address class imbalance by generating
new synthetic samples of the minority class. While oversampling can improve classification models’ perfor-
mance on minority classes, our research reveals that models often learn to detect noise generated by oversam-
pling algorithms rather than the underlying patterns. To overcome this issue, this article proposes a method
that involves identifying and filtering out unrealistic synthetic data, using an advanced technique such as a neural network to detect unrealistic synthetic samples. This aims to enhance the quality of the oversampled
datasets and improve machine learning models’ ability to uncover genuine patterns. The effectiveness of the
proposed approach is thoroughly examined and evaluated, demonstrating enhanced model performance.
1 INTRODUCTION
Class imbalance is a common challenge in machine
learning, occurring when one class has significantly
fewer samples than others. To tackle this issue, over-
sampling techniques, such as the Synthetic Minority
Over-sampling Technique (SMOTE) (Chawla et al.,
2002), have been widely employed.
Oversampling, while enhancing model perfor-
mance on minority classes, presents challenges, espe-
cially in highly imbalanced datasets. In such cases,
the resulting oversampled dataset for the minority
class is predominantly synthetic, overshadowing the
original data. This dominance may lead the model
to prioritize predicting the synthetic nature by cap-
turing noise introduced during oversampling, rather
than discerning the genuine underlying patterns. Con-
sequently, it can result in poor generalization and
suboptimal real-world performance (Tarawneh et al.,
2022; Drummond and Holte, 2003; Chen et al., 2004;
Rodríguez-Torres et al., 2022).
a https://orcid.org/0000-0003-1409-6560
b https://orcid.org/0000-0002-2712-5431
c https://orcid.org/0000-0002-9406-5547
d https://orcid.org/0000-0002-5477-180X
This article proposes a methodological approach
to improve synthetic data quality by training a ma-
chine learning model to predict the synthetic status of
each sample. The goal is to identify and filter unreal-
istic synthetic data, thereby improving overall dataset
quality and enhancing the model’s ability to uncover
genuine underlying patterns. Our study comprehen-
sively investigates the proposed approach’s perfor-
mance on diverse datasets, focusing on its effective-
ness in improving synthetic data quality and enhanc-
ing machine learning model performance on oversam-
pled data. The research aims to contribute effective
strategies for handling class imbalance and overcom-
ing detectability issues associated with synthetic data.
2 BACKGROUND
Various oversampling techniques address class im-
balance, with SMOTE being a prominent method
(Chawla et al., 2002). It interpolates between minor-
ity samples to generate new ones along the connecting
line. Over the years, SMOTE has undergone numer-
ous modifications and extensions to enhance its ef-
fectiveness, addressing issues such as overfitting, data
density, and mixed feature types.
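For concreteness, the interpolation step mentioned above is commonly written as follows (standard SMOTE notation, not taken from this paper): given a minority sample x_i and one of its k nearest minority-class neighbours x_zi, a synthetic sample is generated as

x_new = x_i + λ · (x_zi − x_i), with λ drawn uniformly from [0, 1].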
Enhancements to SMOTE include Borderline
SMOTE (Han et al., 2005), focusing on addressing
overfitting by generating synthetic samples near the
decision boundary. ADASYN (He et al., 2008) adjusts
the density distribution to generate more samples for
harder-to-learn instances, while Safe-Level SMOTE
(Bunkhumpornpat et al., 2009) reduces misclassifi-
cation risks by focusing on samples near a safe ma-
jority class. SMOTEN (Chawla et al., 2002) extends
SMOTE for datasets with mixed nominal and continu-
ous features, employing a tailored distance metric for
generating synthetic samples specifically for nominal
features. Additionally, Minority Oversampling Tech-
nique (MOTE) (Huang et al., 2006) generates syn-
thetic samples exclusively for those misclassified by
the current model.
Choosing an oversampling method requires
thoughtful consideration due to differing strengths
and weaknesses. The chosen technique significantly
influences the model’s performance, emphasizing the
need for a methodological approach to address de-
tectability issues and enhance synthetic data quality.
In addition to oversampling techniques, generative adversarial networks (GANs) (Goodfellow et al., 2014) have emerged as an alternative method
for synthetic data generation. GANs employ a com-
petitive training approach, where two neural networks
are trained jointly: one network generates realistic
synthetic data, while the other network discriminates
between real and synthetic data. They have demon-
strated success in generating complex synthetic data,
such as images and text (Mirza and Osindero, 2014;
Reed et al., 2016; Zhang et al., 2017).
Despite their success in generating realistic data,
GANs struggle with categorical synthetic datasets due
to gradient computation limitations on latent categor-
ical variables. Methods like medGAN (Choi et al.,
2017), which transforms categorical data using au-
toencoders, have been developed to address this lim-
itation. However, medGAN is limited to binary
and count data, leading to the development of MC-
MedGAN (Camino et al., 2018) for multi-categorical
variables.
Our generic method, applicable alongside any
oversampling technique, aims to enhance synthetic
data quality without relying on an internal genera-
tive component. Incorporating our approach into the
oversampling process provides a flexible and effective
solution for improving model performance on imbal-
anced datasets.
3 CHALLENGES OF
OVERSAMPLING
Intuitively, an effective oversampling technique in-
creases the representation of the minority class with-
out merely replicating existing instances. For in-
stance, SMOTE generates synthetic instances by in-
terpolating between existing minority class instances
and their k nearest neighbors. However, oversam-
pling, while enhancing model performance on minor-
ity classes, introduces significant challenges.
Firstly, oversampling can induce bias towards the
minority class, causing the model to prioritize it at the
expense of the majority class. This bias can lead to
poor performance on real-world data, where the mi-
nority class is less frequent (Tarawneh et al., 2022;
Drummond and Holte, 2003). Another issue may
arise from potential inconsistencies in data types, as
synthetic data points may deviate from the typical
range or adopt different formats.
Moreover, oversampling is prone to generating
mislabeled samples belonging to the majority class or
creating unrealistic ”noise” samples. It can alter the
data distribution, impacting the representation of dif-
ferent class proportions. Additionally, oversampling
may reduce dataset diversity, potentially leading to
overfitting and hindering generalization to new data
by creating synthetic samples closely resembling ex-
isting ones.
A careful assessment of oversampling’s impact on
dataset distribution and diversity is essential to en-
sure the resulting model accurately reflects the true
nature of the problem. Additionally, consideration
of the computational costs associated with generating
synthetic data is crucial, especially for large datasets,
given the time-consuming and resource-intensive na-
ture of the process (Chen et al., 2004; Rodríguez-Torres et al., 2022).
To gain a more profound understanding of why
machine learning models often lean towards learn-
ing the synthetic nature of data over genuine under-
lying patterns, we examine the data distribution in
oversampled data. In slightly imbalanced data, de-
picted in Figure 1a, where only a few synthetic sam-
ples are required for class balance, the overall class
distribution remains largely unchanged. In highly im-
balanced datasets, oversampling becomes particularly
challenging, given that the minority class constitutes
a very small fraction of the data. The substantial num-
ber of synthetic instances required to balance the data
results in the majority of the oversampled data be-
ing synthetic, while the original data makes up only
a small fraction. As illustrated in Figure 1b, the majority class tends to be equivalent to the original data class, while the minority class tends to be equivalent to the synthetic data class, as the original minority class samples contribute negligibly to the data for that class.

Figure 1: Class Distribution in Oversampled Dataset. (a) Imbalanced Data. (b) Highly Imbalanced Data.
Hence, the quality of synthetic data significantly
impacts machine learning model performance, espe-
cially for the minority class. It is crucial for synthetic
data to accurately mirror real-world data. Inclusion of
unrealistic synthetic data may lead the model to mis-
classify the minority class as synthetic, causing it to
predict these instances rather than capturing genuine
underlying patterns. This situation results in a redef-
inition of the learning problem, shifting the focus to
predicting the synthetic nature of the data.
Addressing the challenge of unrealistic synthetic
data is paramount for enhancing the model’s ability
to discern and leverage essential patterns, ultimately
improving performance on real-world datasets. This
is particularly crucial in highly imbalanced scenarios
where there is an increased likelihood that the model will detect the synthetic nature of the data. Therefore, gen-
erating realistic synthetic data is vital to mitigate this
issue and ensure overall dataset quality.
4 METHOD
Our solution to improve the quality of synthetic data
is iterative and consists of three main steps, as il-
lustrated in Figure 2. First, we generate synthetic
data using the chosen oversampling technique. The
second step, which occurs only during the initial it-
eration, involves building a machine learning model
trained to predict the synthetic status of each sample
in the dataset. In the third step, this model is then em-
ployed to identify and filter out detectable syn-
thetic data. The predictive model is used to flag and
remove samples classified as synthetic and unrealis-
tic.
By eliminating these unrealistic synthetic
samples, our aim is to enhance the overall quality of
the dataset and mitigate the negative impact of the
noise introduced by these samples, enabling the ma-
chine learning model to focus on the genuine under-
lying patterns in the original data. The following sub-
sections provide a detailed explanation of each step.
4.1 Generation of Synthetic Data
The first step of the proposed method consists of gen-
erating synthetic data from the minority class to bal-
ance the overall distribution of classes in the data. For
this step, we can use any existing oversampling tech-
nique. As our method can be adapted to various over-
sampling techniques, it is very flexible, which is useful as the choice of the most appropriate technique
depends on the dataset.
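As an illustration, Step 1 could be implemented with the SMOTE oversampler from the imbalanced-learn library; this is a minimal sketch under that assumption, and any oversampler exposing the same fit_resample interface could be substituted (function and variable names are ours, not the paper's):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE


def generate_synthetic_data(X, y, oversampler=None, random_state=0):
    """Step 1: oversample the minority class and return the rebalanced dataset."""
    if oversampler is None:
        oversampler = SMOTE(random_state=random_state)
    X_resampled, y_resampled = oversampler.fit_resample(X, y)
    print("Class counts after oversampling:", Counter(y_resampled))
    return X_resampled, y_resampled
```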
4.2 Learning the Detectability of
Synthetic Data
In the second step, we aim to assess the detectabil-
ity of synthetic data and identify unrealistic instances
for subsequent filtering. To achieve this, we formulate
a binary classification problem to distinguish between
synthetic and original data samples based on their dis-
tinct characteristics. We first prepare the dataset for
the learning phase:
1. Build the Dataset: Remove samples from the
majority class in the oversampled dataset gener-
ated in STEP 1, as the focus is on detecting the
synthetic nature of data generated from the minor-
ity class.
2. Label the Samples: Assign labels to each in-
stance in the refined dataset, indicating whether
it is synthetic (1) or original (0).
Figure 2: Illustration of the proposed oversampling filtering technique.

This binary classification problem, summarized below, aims to train a machine learning model to distinguish between synthetic and original instances
based on their distinctive characteristics. By captur-
ing the underlying patterns and features that differ-
entiate synthetic data from original data, the model
could learn to predict the synthetic status of each sam-
ple.
Synthetic Sample Detector
Input: oversampled data (original and synthetic samples)
Output: is the instance synthetic or original?
This problem formulation serves as the foundation for exploring the detectability of synthetic data and identifying lower-quality instances. The insights gained from this analysis guide the subsequent step of filtering out detectable synthetic data, which aims to enhance the overall quality of the synthetic dataset and improve the performance of machine learning models on imbalanced datasets.
The second step is only performed at the first iter-
ation. In further iterations, we reuse the same model
generated at the first iteration without learning a new
model.
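As a rough illustration of this step, the detector dataset could be assembled and a detector trained as follows. The snippet assumes that the oversampler appends synthetic samples after the original rows (as imbalanced-learn does) and uses a scikit-learn MLPClassifier as a simple stand-in for the neural network detailed in Section 5.2; all names are illustrative:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def build_detector_dataset(X_original, y_original, X_oversampled, minority_label):
    """Keep only minority-class rows and label them: 0 = original, 1 = synthetic."""
    n_original = len(X_original)
    X_synthetic = X_oversampled[n_original:]               # rows appended by the oversampler
    X_real_minority = X_original[y_original == minority_label]
    X_detector = np.vstack([X_real_minority, X_synthetic])
    y_detector = np.concatenate([np.zeros(len(X_real_minority)),   # original
                                 np.ones(len(X_synthetic))])       # synthetic
    return X_detector, y_detector


def train_detector(X_detector, y_detector):
    """Binary classifier predicting the synthetic status of each sample."""
    detector = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    detector.fit(X_detector, y_detector)
    return detector
```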
4.3 Filtering out Unrealistic Synthetic
Data
In the final step, we employ the synthetic data detector
created in Step 2 to predict the synthetic nature of each
data instance generated in Step 1. Instances identified
as synthetic data are filtered out, retaining only those
closely resembling the characteristics of the original
data. If the remaining data remains imbalanced, we
iteratively generate additional synthetic data samples,
detecting and filtering out unrealistic synthetic data in
each iteration. This process continues until achieving
a desired balance between the minority and majority
classes. It facilitates the progressive enhancement of
synthetic data quality, diminishing the detectability of
synthetic instances, and consequently, improving the
model’s accuracy in predicting the minority class.
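Putting the three steps together, the iterative loop might look as follows. This is a sketch under our assumptions (a pre-trained detector exposing predict_proba, non-negative integer class labels, and an illustrative 0.5 decision threshold), not the authors' exact implementation:

```python
import numpy as np


def filtering_oversampling(X, y, oversampler, detector, minority_label,
                           max_iterations=10, threshold=0.5):
    """Iteratively generate synthetic minority samples and keep only those the
    detector does not flag as synthetic, until the classes are balanced."""
    X_balanced, y_balanced = np.asarray(X), np.asarray(y)
    for _ in range(max_iterations):
        counts = np.bincount(y_balanced)
        if counts[minority_label] >= counts.max():
            break                                    # classes are balanced
        # Step 1: generate synthetic samples (appended after the original rows).
        X_resampled, _ = oversampler.fit_resample(X_balanced, y_balanced)
        X_synthetic = X_resampled[len(X_balanced):]
        # Step 3: keep only the samples the detector considers realistic.
        p_synthetic = detector.predict_proba(X_synthetic)[:, 1]
        X_realistic = X_synthetic[p_synthetic < threshold]
        if len(X_realistic) == 0:
            break                                    # no realistic samples left to add
        X_balanced = np.vstack([X_balanced, X_realistic])
        y_balanced = np.concatenate(
            [y_balanced, np.full(len(X_realistic), minority_label)])
    return X_balanced, y_balanced
```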
5 EXPERIMENTS
5.1 Experimental Databases
In this experimental study, we assessed our data fil-
tering technique on oversampled data from various
techniques, including SMOTE, Borderline SMOTE,
SMOTEN, and ADASYN, across diverse databases. These
databases, carefully selected for their diversity, repre-
sent different real-world problems with varying class
imbalances and domains:
Credit Card Fraud: Highly imbalanced (0.17%
minority class ratio); excellent representation of
real-world financial transactions with infrequent
fraudulent activities.
Car Insurance Claim: Moderately imbalanced
(6.39% minority class ratio); reflecting imbal-
ances in insurance claims between common and
rare cases.
Anomalies in Wafer Manufacturing: Inter-
mediate imbalance (8.11% minority class ratio);
mimics manufacturing scenarios where detecting
anomalies is essential.
Table 1: Database descriptions.
Database | Domain | Input types | Feature number | Minority class ratio
Haemorrhage (MIMIC) | Medical | Boolean | 5317 | 3.00 %
Credit Card Fraud | Finance | Real | 29 | 0.17 %
Student Dropout | Education | Boolean, Real | 36 | 32.12 %
Anomalies in Wafer M. | Manufacturing | Boolean, Real | 1558 | 8.11 %
Car Insurance Claim | Insurance | Boolean, Real | 42 | 6.39 %
Haemorrhage (MIMIC): Significantly imbal-
anced (3% minority class ratio); representing
haemorrhage risk and non-risk instances using
MIMIC database, mirroring real-world medical
scenarios where certain conditions are infrequent.
Student Dropout: The least imbalanced (32.12% minority class ratio); per-
taining to the education sector where dropout
events are infrequent compared to student persis-
tence.
These databases cover diverse domains: finance,
insurance, manufacturing, medical, and education.
They reflect real-world situations where rare events
or anomalies occur less frequently than normal in-
stances. By evaluating the filtering technique across
databases with varying minority class ratios, we
gained insights into its ability to effectively identify
and filter out unrealistic synthetic data. This eval-
uation allowed us to assess the technique’s general-
izability and applicability across different real-world
scenarios.
Table 1 provides a summary of the databases used
in our experiments, including their respective do-
mains, input types, feature numbers, and minority
class ratios.
5.2 Implementation and Performance
Metrics
To assess the performance of our approach, we em-
ployed a neural network with a tailored architecture
for each dataset. The network used LeakyReLU acti-
vation functions to address ”dead” neurons and em-
ployed a sigmoid activation function for the output
layer, well-suited for binary classification tasks. We
optimized training using the ReduceLROnPlateau
technique, dynamically adjusting the learning rate to
enhance efficiency and prevent convergence plateaus.
During evaluation, we considered precision, re-
call, and F1 score to assess the model’s ability to
correctly identify the minority class (He and Gar-
cia, 2009; Powers, 2011). These metrics collectively
offer a comprehensive understanding of the model’s
strengths and limitations, ensuring a holistic assess-
ment, as no single metric alone can provide a complete evaluation. These metrics provided valuable
insights into the model’s ability to address imbalanced
class distributions and improve the overall quality of
oversampled data.
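For completeness, these metrics can be computed with scikit-learn as follows (a small illustrative helper, not the authors' evaluation code):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate(y_true, y_pred):
    """Report the metrics used throughout the experiments."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
    }
```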
Our design choices and selected metrics aimed to
enhance the neural network’s learning capability and
overall effectiveness. The next section presents ex-
perimental results, highlighting our technique’s per-
formance across diverse datasets and varying minor-
ity class ratios. The metrics, together with information
about the imbalance levels in the datasets, will offer
insights into how effectively the technique addresses
imbalanced class distributions.
6 RESULTS
In our experimental study, we rigorously evaluated
the proposed Filtering Oversampling method across
diverse learning problems. We followed a system-
atic approach, assessing machine learning models on
original datasets as a baseline, then evaluating over-
sampled data using various techniques. The final
step involved testing models on filtered oversampled
data, enabling a direct comparison with popular over-
sampling techniques (SMOTE, Borderline SMOTE,
SMOTEN, ADASYN). Summary performance metrics
for each experiment are presented in Table 2.
The Filtering Oversampling method with SMOTE
demonstrated significant improvements in predicting
haemorrhage risk. The F1 score increased from 0.62
to 0.69, recall improved from 0.60 to 0.82, and ac-
curacy rose from 0.63 to 0.88. In contrast, the stan-
dalone use of SMOTE without filtering resulted in a
much lower F1 score of 0.12, recall of 0.21, precision
of 0.09, and accuracy of 0.88, indicating poor real-
world generalization. Other oversampling techniques,
such as Borderline SMOTE, SMOTEN, and ADASYN,
showed varying degrees of effectiveness but fell short
of the enhancements achieved by the Filtering Over-
sampling method, as a substantial portion of all the synthetic data that could be generated was detectable, resulting in an incomplete balance of the
Table 2: Performance Metrics Comparison of the Filtering Oversampling Method.
Oversampled | Filtered | F1 score | Recall | Precision | Accuracy

Haemorrhage Risk Prediction
No | - | 0.62 | 0.60 | 0.65 | 0.63
SMOTE | no | 0.12 | 0.21 | 0.09 | 0.88
SMOTE | yes | 0.69 | 0.82 | 0.60 | 0.88
Borderline SMOTE | no | 0.05 | 0.04 | 0.09 | 0.94
Borderline SMOTE | yes | 0.15 | 0.23 | 0.11 | 0.91
SMOTEN | no | 0.05 | 0.21 | 0.09 | 0.92
SMOTEN | yes | 0.11 | 0.35 | 0.06 | 0.80
ADASYN | no | 0.09 | 0.11 | 0.07 | 0.90
ADASYN | yes | 0.17 | 0.19 | 0.15 | 0.92

Credit Card Fraud Detection
No | - | 0.23 | 0.77 | 0.13 | 0.99
SMOTE | no | 0.0042 | 0.22 | 0.0021 | 0.83
SMOTE | yes | 0.91 | 0.84 | 1.0 | 0.84
Borderline SMOTE | no | 0.23 | 0.90 | 0.13 | 0.99
Borderline SMOTE | yes | 0.62 | 0.81 | 0.5 | 0.98
SMOTEN | no | 0.04 | 0.86 | 0.02 | 0.94
SMOTEN | yes | 0.19 | 0.86 | 0.10 | 0.98
ADASYN | no | 0.08 | 0.94 | 0.04 | 0.96
ADASYN | yes | 0.57 | 0.82 | 0.44 | 0.99

Student Dropout Prediction
No | - | 0.61 | 0.51 | 0.74 | 0.73
SMOTE | no | 0.72 | 0.73 | 0.71 | 0.77
SMOTE | yes | 0.78 | 0.70 | 0.88 | 0.84
Borderline SMOTE | no | 0.80 | 0.74 | 0.86 | 0.85
Borderline SMOTE | yes | 0.85 | 0.86 | 0.85 | 0.88
SMOTEN | no | 0.77 | 0.74 | 0.81 | 0.84
SMOTEN | yes | 0.80 | 0.81 | 0.74 | 0.83
ADASYN | no | 0.79 | 0.70 | 0.91 | 0.86
ADASYN | yes | 0.83 | 0.85 | 0.80 | 0.86

Detecting Anomalies in Wafer Manufacturing
No | - | 0.47 | 0.75 | 0.34 | 0.86
SMOTE | no | 0.50 | 0.74 | 0.37 | 0.89
SMOTE | yes | 0.57 | 0.51 | 0.64 | 0.93
Borderline SMOTE | no | 0.60 | 0.72 | 0.51 | 0.92
Borderline SMOTE | yes | 0.55 | 0.60 | 0.51 | 0.91
SMOTEN | no | 0.52 | 0.64 | 0.44 | 0.88
SMOTEN | yes | 0.73 | 0.90 | 0.61 | 0.94
ADASYN | no | 0.48 | 0.75 | 0.35 | 0.85
ADASYN | yes | 0.68 | 0.64 | 0.72 | 0.93

Car Insurance Claim
No | - | 0.08 | 0.11 | 0.06 | 0.83
SMOTE | no | 0.10 | 0.44 | 0.06 | 0.54
SMOTE | yes | 0.12 | 0.67 | 0.07 | 0.38
Borderline SMOTE | no | 0.12 | 0.66 | 0.06 | 0.41
Borderline SMOTE | yes | 0.11 | 0.60 | 0.06 | 0.36
SMOTEN | no | 0.12 | 0.65 | 0.07 | 0.38
SMOTEN | yes | 0.11 | 0.47 | 0.07 | 0.53
ADASYN | no | 0.12 | 0.61 | 0.06 | 0.43
ADASYN | yes | 0.12 | 0.61 | 0.07 | 0.43
dataset.
For credit card fraud detection, the original dataset
had limited fraud detection capability (F1 score of
0.23). SMOTE without filtering led to a drastic deterioration (F1 score of 0.0042), indicating a redefinition of the learning problem due to the introduced noise. In stark contrast, when combined
with Filtering Oversampling, a substantial perfor-
mance boost was observed, with an F1 score of 0.91,
a recall rate of 0.84, and a precision of 1.0, showcas-
ing its potency, especially in critical applications like
fraud detection. Other methods (Borderline SMOTE,
SMOTEN, ADASYN) exhibited varying degrees of ef-
fectiveness.
In the context of predicting student dropout, the
utilization of Borderline SMOTE initially demonstrated an impressive improvement in the F1 score, from 0.61 to 0.80, which reached its peak at 0.85
when combined with our filtering technique. This
underscores the effectiveness of integrating oversam-
pling with our filtering method, effectively addressing
class imbalance and enhancing predictive accuracy.
In wafer manufacturing anomaly detection, the
baseline F1 score is a modest 0.47. Integrating
ADASYN with our filtering approach leads to a signif-
icant improvement, boosting the F1 score from 0.48
to 0.68. SMOTEN yields an initial F1 score of 0.52, and our filtering approach raises it substantially, reaching 0.73. These results
highlight the effectiveness of our method in enhanc-
ing anomaly detection for wafer manufacturing.
However, in car insurance claims, the initial F1
score is 0.08, indicating potential for improvement.
Applying SMOTE without filtering leads to a slight
increase in the F1 score to 0.10, while the integration
of our filtering approach with SMOTE further boosts
the F1 score to 0.12. Additionally, both SMOTEN and Borderline SMOTE exhibit F1 scores of 0.12 without filtering and 0.11 with filtering. These find-
ings highlight the variable effectiveness of filtering in
enhancing the F1 score, dependent on the oversam-
pling method and dataset.
Overall, our experimental results provide strong
evidence that the proposed filtering oversampling
method consistently outperforms both the original
dataset and the widely used methods (SMOTE, Bor-
derline SMOTE, SMOTEN, ADASYN) across a range
of learning problems. These findings highlight the
effectiveness of our approach in enhancing the per-
formance of imbalanced classification tasks. It is important to acknowledge that the high accuracy observed
when learning on the original data is primarily influ-
enced by the large number of true negatives. Therefore, it is crucial to consider multiple performance
metrics, such as F1 score, precision, and recall, to as-
sess the model’s effectiveness in handling imbalanced
datasets.
7 DISCUSSION
In this study, we proposed a novel approach to
enhance oversampling methods through a filtering
mechanism to eliminate unrealistic synthetic data,
resulting in substantial performance improvements.
Rigorous testing highlights the method’s impact on
capturing genuine patterns in the minority class,
thereby improving generalization and real-world per-
formance. As the model relies less on predicting syn-
thetic instances, it gains robustness to handle chal-
lenges in real-world data.
While our method has been applied to well-known
oversampling techniques such as SMOTE, Borderline
SMOTE, SMOTEN, and ADASYN, it is, in essence,
a generic approach adaptable to other oversampling
techniques. The results showcased that this method
excels with highly imbalanced data, which are most impacted by the noise introduced by oversampling, given that the
majority of the oversampled dataset comprises syn-
thetic instances. The extent of improvement achieved
through our filtering method is not quantified by a
fixed value; it depends on the dataset and on the number and type of features involved. On small datasets with lim-
ited instances and features, the challenge lies in run-
ning out of synthetic samples to effectively balance
the classes.
It’s crucial to note that the model created in Step
2, trained on data where the synthetic class may dom-
inate, faces challenges in detecting synthetic data due
to class imbalance. Consequently, applying our filter-
ing technique may not significantly improve results
in such scenarios. For certain datasets, like the ”Car
Insurance Claim database” in our experiment, lim-
ited improvement was observed due to exhausted syn-
thetic samples and the inability to generate enough
realistic synthetic samples (i.e., undetected synthetic
data) to balance the dataset. This limitation may be at-
tributed to a possible bias toward the unrealistic class,
given its majority representation.
In summary, our proposed method makes a valuable contribution to mitigating the limitations of oversampling techniques and enhancing machine learning model performance on imbalanced datasets. By im-
proving synthetic data quality, it enables more accu-
rate learning and better handling of class imbalance in
real-world applications.
8 CONCLUSION AND
PERSPECTIVES
In conclusion, oversampling techniques provide a
valuable approach to address class imbalance in ma-
chine learning. Nevertheless, their effectiveness can
be hindered by the quality of synthetic data gen-
erated during the oversampling process. To over-
come this limitation, the proposed filtering oversam-
pling method selectively filters out unrealistic syn-
thetic data, thereby enhancing the performance of ma-
chine learning models on imbalanced datasets. This
leads to improved performance on real-world datasets
as the model becomes less reliant on predicting syn-
thetic instances and gains better generalization capa-
bilities beyond the synthetic data distribution.
For future research, promising directions include
incorporating explainability and interpretability as-
pects into the filtering oversampling method. Devel-
oping techniques to understand the impact of filtered
synthetic data on the model’s decision-making pro-
cess can enhance insights and prediction trustworthi-
ness. Additionally, extending the research to multi-
class classification problems, beyond initial binary
classification tasks, will assess the method’s effective-
ness across a broader range of scenarios.
We aim to advance the understanding and capa-
bilities of handling imbalanced datasets by pursuing
these future research directions, ultimately enhancing
the performance of machine learning models in real-
world applications.
ACKNOWLEDGEMENTS
This work was partially funded by the French Na-
tional Research Agency (ANR) through the ABiMed
Project [grant number ANR-20-CE19-0017-02].
REFERENCES
Bunkhumpornpat, C., Sinapiromsaran, K., and Lursinsap,
C. (2009). Safe-level-smote: Safe-level-synthetic mi-
nority over-sampling technique for handling the class
imbalanced problem. In Pacific-Asia Conference on
Knowledge Discovery and Data Mining, pages 475–
482.
Camino, R., Hammerschmidt, C., and State, R. (2018).
Generating multi-categorical samples with generative
adversarial networks. In ICML 2018 Workshop on
Theoretical Foundations and Applications of Deep
Generative Models, pages 1–7.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,
W. P. (2002). Smote: Synthetic minority over-
sampling technique. Journal of Artificial Intelligence
Research, 16(1):321–357.
Chen, C., Liaw, A., and Breiman, L. (2004). Using random
forest to learn imbalanced data. University of Califor-
nia, Berkeley, 110:24–31.
Choi, E., Biswal, S., Malin, B., Duke, J., Stewart, W. F.,
and Sun, J. (2017). Generating multi-label discrete pa-
tient records using generative adversarial networks. In
Machine Learning for Healthcare Conference, pages
286–305.
Drummond, C. and Holte, R. (2003). C4.5, class imbalance,
and cost sensitivity: Why under-sampling beats over-
sampling. In Proceedings of the ICML’03 Workshop
on Learning from Imbalanced Datasets.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In Neural
Information Processing Systems, pages 2672–2680.
Han, H., Wang, W. Y., and Mao, B. H. (2005). Borderline-
smote: A new over-sampling method in imbalanced
data sets learning. In International Conference on In-
telligent Computing, pages 878–887.
He, H., Bai, Y., Garcia, E. A., and Li, S. (2008). Adasyn:
Adaptive synthetic sampling approach for imbalanced
learning. In 2008 IEEE International Joint Confer-
ence on Neural Networks (IEEE World Congress on
Computational Intelligence), pages 1322–1328.
He, H. and Garcia, E. (2009). Learning from imbalanced
data. IEEE Transactions on Knowledge and Data En-
gineering, 21(9):1263–1284.
Huang, G. B., Zhu, Q. Y., and Siew, C. K. (2006). Extreme
learning machine: Theory and applications. Neuro-
computing, 70(1-3):489–501.
Mirza, M. and Osindero, S. (2014). Conditional generative
adversarial nets. arXiv preprint arXiv:1411.1784.
Powers, D. (2011). Evaluation: From precision, recall and
f-factor to roc, informedness, markedness and corre-
lation. Journal of Machine Learning Technologies,
2(1):37–63.
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B.,
and Lee, H. (2016). Generative adversarial text to im-
age synthesis. In International Conference on Ma-
chine Learning, volume 48, pages 1060–1069.
Rodríguez-Torres, F., Martínez-Trinidad, J. F., and
Carrasco-Ochoa, J. A. (2022). An oversampling
method for class imbalance problems on large
datasets. Applied Sciences, 12(7):3424.
Tarawneh, S., Al-Betar, M. A., and Mirjalili, S. (2022). Stop
oversampling for class imbalance learning: A review.
IEEE Transactions on Neural Networks and Learning
Systems, 33(2):340–354.
Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D.,
and Carin, L. (2017). Adversarial feature matching
for text generation. In International Conference on
Machine Learning, pages 4006–4015.