Binary Classification: Counterbalancing Class Imbalance by Applying
Regression Models in Combination with One-sided Label Shifts
Peter Bellmann, Heinke Hihn^a, Daniel A. Braun^b and Friedhelm Schwenker^c
Institute of Neural Information Processing, Ulm University, James-Franck-Ring, 89081 Ulm, Germany
Keywords:
Imbalanced Classification Tasks, Binary Classification, Regression, Support Vector Machines.
Abstract:
In many real-world pattern recognition scenarios, such as in medical applications, the corresponding classifi-
cation tasks can be of an imbalanced nature. In the current study, we focus on binary, imbalanced classification
tasks, i.e. binary classification tasks in which one of the two classes is under-represented (minority class) in
comparison to the other class (majority class). In the literature, many different approaches have been pro-
posed, such as under- or oversampling, to counter class imbalance. In the current work, we introduce a novel
method, which addresses the issues of class imbalance. To this end, we first transfer the binary classification
task to an equivalent regression task. Subsequently, we generate a set of negative and positive target labels,
such that the corresponding regression task becomes balanced, with respect to the redefined target label set.
We evaluate our approach on a number of publicly available data sets in combination with Support Vector
Machines. Moreover, we compare our proposed method to one of the most popular oversampling techniques
(SMOTE). Based on the detailed discussion of the presented outcomes of our experimental evaluation, we
provide promising ideas for future research directions.
1 INTRODUCTION
Imbalanced data sets, i.e. data sets in which the classes are not represented
approximately equally (Chawla, 2005), occur in many areas of machine learning
research, such as fraud detection (Jurgovsky et al., 2018), outlier detection
(Aggarwal, 2015), and medical applications (Fakoor et al., 2013). Imbalanced
data sets can lead to a significant loss of performance in most standard
classifier learning algorithms, as these algorithms assume a balanced class
distribution.
In the past, a great effort has been made to address this issue (Sun et al.,
2009). One can broadly identify the following general solution paradigms
specific to class imbalance (Sun et al., 2009): 1) data-level approaches
include different forms of resampling (Xiaolong et al., 2019) and synthetic
data generation (Chawla et al., 2002) to represent the classes equally;
2) algorithm-level approaches, where the idea is to choose an appropriate
inductive bias, e.g., adaptive penalties (Lin et al., 2002) and to adjust the
decision boundary (Wu and Chang, 2003) of Support Vector Machines (Vapnik,
2013); 3) cost-level approaches adapt the misclassification costs to reflect
the imbalance; and 4) boosting approaches (Galar et al., 2011), which are
common in multi-classifier systems (Hihn and Braun, 2020; Bellmann et al.,
2018; Kuncheva, 2014). A recently proposed example for an algorithm-level
approach is the Pattern-Based Classifier for Class Imbalance Problems
(PBC4cip) introduced by Loyola-González et al. (Loyola-González et al., 2017).
A pattern is a set of relational statements that describe objects. A contrast
pattern is a pattern that appears significantly more often in a class than it
does in the remaining classes. The idea of the PBC4cip approach is to weight
the sum of support for the patterns in each class by taking the class
imbalance level into account.
^a https://orcid.org/0000-0002-3244-3661
^b https://orcid.org/0000-0002-8637-6652
^c https://orcid.org/0000-0001-5118-0812
In this work, we focus on Support Vector Machines (SVMs) and propose a novel
approach, which best fits to the algorithm-level category, to tackle the
difficulties introduced by imbalanced data sets. Note that classification SVMs
can be reformulated as regression SVMs. Based on this property, we derive a
label transformation technique that maps binary labels, $\{-1, +1\}$, to random
unique positive and negative regression targets, respectively. We bound the
target interval to $[-|X^-|, |X^+|]$, where $|X^-|$ denotes the number of
minority class samples and $|X^+|$ the number of majority class samples.
Moreover, we scale the targets such that the corresponding interval becomes
symmetric. In effect, this symmetric target range alleviates some of the
problems caused by the data sets' class imbalance, as we will show empirically.
Bellmann, P., Hihn, H., Braun, D. and Schwenker, F.
Binary Classification: Counterbalancing Class Imbalance by Applying Regression Models in Combination with One-sided Label Shifts.
DOI: 10.5220/0010236307240731
In Proceedings of the 13th International Conference on Agents and Artificial Intelligence (ICAART 2021) - Volume 2, pages 724-731
ISBN: 978-989-758-484-8
Copyright © 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
This work is structured as follows. In Sec. 2, we briefly discuss related
work. We introduce our approach in Sec. 3. In Sec. 4, we describe the data
sets. We present and discuss the results in Sec. 5.
2 RELATED WORK
Data-level techniques are arguably the most popular approaches for dealing
with class imbalance. They can be further sub-categorised into undersampling
and oversampling. Undersampling is a straightforward choice for ensemble-based
classification models, i.e. for multi-classifier systems (MCSs), which consist
of a set of so-called base classifiers. In an MCS, each base classifier is
trained on a subset of the initial training set. Therefore, to overcome the
issue of class imbalance in combination with an MCS, one can train each
ensemble member on a balanced subset of the training set, obtained by
undersampling the majority class of the current classification task.
In contrast, one can also use oversampling techniques, which likewise lead to
a balanced class distribution. One straightforward approach is the generation
of additional artificial minority class data, for example, by simply adding
noise to the initial minority class samples. Depending on the classification
task, one can also apply other data augmentation techniques, such as rotations
or shifts in image-based classification.
One popular oversampling approach is the Synthetic Minority Over-sampling
Technique (SMOTE) proposed by Chawla et al. (Chawla et al., 2002). The SMOTE
method can be briefly summarised as follows. Assume that we want to generate a
new minority class data point based on the data sample $x \in \mathbb{R}^d$,
$d \in \mathbb{N}$. In the first step, one has to determine the $k$ nearest
neighbours of $x$ from the set of minority class samples, based on a
predefined value for $k \in \mathbb{N}$, as well as on a predefined distance
function. Second, one of the $k$ neighbours of $x$ is chosen randomly for
combination with $x$. Let the randomly chosen data sample from the set of
nearest neighbours be denoted by $y \in \mathbb{R}^d$. Then, a synthetically
generated new data point, $\tilde{x} \in \mathbb{R}^d$, is defined as follows,
$$\tilde{x}_i = x_i + r_i \cdot (y_i - x_i), \quad i = 1, \ldots, d,$$
whereby $r_i \in [0,1]$ is a randomly generated number, so that $\tilde{x}$
lies between $x$ and $y$ in every feature.
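The three SMOTE steps summarised above can be sketched in a few lines. This is only an illustration using the standard library: the function name `smote_sample`, the toy points, and the choice of the Euclidean distance are assumptions made here, and the interpolation follows the standard SMOTE rule of moving from $x$ towards the chosen neighbour $y$.

```python
import math
import random


def smote_sample(x, minority, k, rng):
    """Generate one synthetic minority point from x (illustrative sketch)."""
    # Step 1: the k nearest minority-class neighbours of x (Euclidean distance),
    # excluding x itself.
    others = [p for p in minority if p is not x]
    neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
    # Step 2: choose one of the k neighbours at random.
    y = rng.choice(neighbours)
    # Step 3: interpolate feature-wise with an independent r_i in [0, 1).
    return [xi + rng.random() * (yi - xi) for xi, yi in zip(x, y)]


# Hypothetical toy minority class with four 2-dimensional samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(0)
new_point = smote_sample(minority[0], minority, k=2, rng=rng)
```

Because each feature gets its own random factor, the new point lies inside the axis-aligned box spanned by $x$ and $y$, not necessarily on the straight line between them.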
3 METHODOLOGY
In the current section, we first provide the main nota-
tions. Subsequently, we briefly describe the function-
ality of Support Vector Machines, followed by the in-
troduction of our proposed approach. Finally, we list
the performance measures that will be used in our ex-
perimental validation.
3.1 Formalisation
Let $X \subset \mathbb{R}^d$, $d \in \mathbb{N}$, be a $d$-dimensional data set
which constitutes a binary classification task. Further, let the set of labels
be denoted by $\Omega = \{-1, +1\}$. In the current work, we assume that the
classification task $(X, \Omega)$ is imbalanced. Without loss of generality, we
assume that the data samples from the minority and the majority classes are
associated with the class labels $-1$ and $+1$, respectively. Moreover, by
$X^-$ and $X^+$, we respectively define the set of the minority class and
majority class samples, i.e.
$$X^- := \{x \in X : l(x) = -1\}, \qquad X^+ := \{x \in X : l(x) = +1\},$$
whereby $l(x)$ denotes the label of $x$.
3.2 Support Vector Machines
Initially, Support Vector Machines (SVMs) were in-
troduced by Vapnik (Vapnik, 2013) as a classification
model that is trained to separate the classes of a bi-
nary classification task. An SVM provides a unique
decision boundary, which is obtained during the train-
ing phase by combining the following two objectives.
First, an SVM tries to separate the provided two sam-
ple sets. Second, an SVM maximises the so-called
margin. The margin is defined as the width between
the two sample sets, based on the computed hyper-
plane. The data samples, which are near to the op-
posing class, and which are used to define the corre-
sponding hyperplane are denoted as support vectors.
For two linearly inseparable classes, additional slack
variables are introduced, to penalise the misclassifica-
tion of training samples, based on their distance to the
current hyperplane.
Moreover, SVM models were modified to learn
different regression tasks, e.g. (Abe, 2005). In the
current work, we will denote classification SVMs and
regression SVMs by cSVM and rSVM, respectively.
3.3 Binary Regression Approach
The main idea of our approach is based on the trans-
formation of binary classification tasks to a specific
type of regression tasks, as discussed below.
3.3.1 Basic Idea
As briefly discussed above, SVMs can be used as both classification and
regression models. Therefore, for binary classification tasks, one can
equivalently train an rSVM in combination with the label set
$\Omega = \{-1, +1\}$. To classify a (test) sample, $z \in \mathbb{R}^d$, one
then simply has to compute $\mathrm{sgn}(\mathrm{rSVM}(z))$, with
$\mathrm{sgn}(v)$ denoting the signum of the real-valued model output $v$.
3.3.2 Labels Generation
In our experiments, in Section 5, we will show that one can replace the label
set $\Omega = \{-1, +1\}$ by a set of random negative ($x \in X^-$) and
positive ($x \in X^+$) numbers, in combination with rSVM models. Therefore,
during the training phase of rSVM models, we will define the set of target
labels, $\tilde{\Omega}$, as follows,
$$\tilde{\Omega} := \{-|X^-|, \ldots, -1, +1, \ldots, +|X^+|\}, \tag{1}$$
whereby $|\cdot|$ denotes the number of elements in the corresponding set.
Thereby, we randomly assign each $x \in X^-$ to one of the values from the
label subset $\{-|X^-|, \ldots, -1\}$, and each $x \in X^+$ to one of the
labels from the set $\{1, \ldots, |X^+|\}$. Note that each element of
$\tilde{\Omega}$ is assigned to exactly one data sample $x \in X$.
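The random one-to-one assignment described above can be sketched as follows. The helper name `generate_targets` and the toy label vector (5 minority and 10 majority samples) are assumptions for illustration only; shuffling each half of the target set is one simple way to obtain a random bijection.

```python
import random


def generate_targets(labels, rng):
    """Map binary labels in {-1, +1} to unique random regression targets (Eq. 1)."""
    neg_idx = [i for i, label in enumerate(labels) if label == -1]
    pos_idx = [i for i, label in enumerate(labels) if label == +1]
    neg_targets = list(range(-len(neg_idx), 0))     # -|X^-|, ..., -1
    pos_targets = list(range(1, len(pos_idx) + 1))  # +1, ..., +|X^+|
    # Shuffling yields a random one-to-one assignment within each class.
    rng.shuffle(neg_targets)
    rng.shuffle(pos_targets)
    targets = [0] * len(labels)
    for i, t in zip(neg_idx, neg_targets):
        targets[i] = t
    for i, t in zip(pos_idx, pos_targets):
        targets[i] = t
    return targets


# Hypothetical toy task: 5 minority (-1) and 10 majority (+1) samples.
labels = [-1] * 5 + [+1] * 10
targets = generate_targets(labels, random.Random(0))
```

Every target value is used exactly once, and the sign of each target still encodes the original class, so $\mathrm{sgn}$ of a regression output can be read as a class prediction.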
3.3.3 One-sided Labels Shift
To overcome the issues caused by the imbalance of the given class
distribution, we can further modify our proposed approach with focus on the
set $\tilde{\Omega}$, and apply a one-sided labels shift as follows.
The training of an rSVM model, in combination with the set $\tilde{\Omega}$,
leads to a specific function, $t : \mathbb{R}^d \to \mathbb{R}$. However,
during the training, the target function is bounded to the set
$\tilde{\Omega}$, i.e. to the interval $I := [-|X^-|, +|X^+|]$, with
$|X^-| < |X^+|$. Therefore, due to the skewness of $I$, which originates from
the corresponding class imbalance, $\mathrm{sgn}(t)$ is more likely to lead to
the value $+1$. To overcome this issue, we propose to balance the initial
classification task by balancing the interval $I$, i.e. by shifting the
redefined target labels specific to the minority class to the value
$-|X^+|$. More precisely, with $\Delta := |X^+| - |X^-|$, by
$\tilde{\Omega}_s$, we denote the set of shifted target labels, which we
define as follows,
$$\tilde{\Omega}_s := \{-(|X^-| + \Delta), \ldots, -(1 + \Delta), +1, \ldots, +|X^+|\}
= \{-|X^+|, \ldots, -(1 + \Delta), +1, \ldots, +|X^+|\}. \tag{2}$$
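As a small illustration of Eq. (2), the shift can be applied directly to an existing list of targets. The function name `shift_minority_targets` and the toy sizes (5 minority, 10 majority samples) are assumptions; the sketch only subtracts the class-size difference from every negative target and leaves the positive targets untouched.

```python
def shift_minority_targets(targets):
    """Apply the one-sided shift of Eq. (2) to the negative (minority) targets."""
    n_neg = sum(1 for t in targets if t < 0)  # |X^-|
    n_pos = sum(1 for t in targets if t > 0)  # |X^+|
    delta = n_pos - n_neg                     # class-size difference
    return [t - delta if t < 0 else t for t in targets]


# Hypothetical unshifted targets for 5 minority and 10 majority samples.
unshifted = list(range(-5, 0)) + list(range(1, 11))
shifted = shift_minority_targets(unshifted)
```

After the shift, the target interval is symmetric around zero: the most negative target equals the negated most positive one.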
3.3.4 Example & Shift Generalisation
Let us assume that the data set $X = \{x_1, \ldots, x_{15}\} \subset \mathbb{R}^d$
consists of 15 data samples, with $X^- = \{x_1, \ldots, x_5\}$ and
$X^+ = \{x_6, \ldots, x_{15}\}$, i.e. $l(x_i) = -1$, $l(x_j) = +1$,
$i = 1, \ldots, 5$, $j = 6, \ldots, 15$. Then, for the training of an rSVM
model, we define the set of labels as
$$\tilde{\Omega} := \{-5, \ldots, -1, +1, \ldots, +10\},$$
since it holds, $|X^-| = 5$ and $|X^+| = 10$. Note that the data samples
specific to the set $X^-$ ($X^+$) are randomly associated to the negative
(positive) numbers from the set $\tilde{\Omega}$, such that each $x \in X$ has
a unique label, i.e. $l(x_i) \neq l(x_j)$ $\forall i \neq j$. Moreover,
applying the proposed labels shift leads to the following set of target labels,
$$\tilde{\Omega}_s := \{-10, \ldots, -6, +1, \ldots, +10\}.$$
In our experiments, we will focus on the recognition of the minority classes.
Therefore, we will analyse the performance of cSVM and rSVM models in
combination with the sets $\tilde{\Omega}$ and $\tilde{\Omega}_s$. Moreover, we
will analyse the effects of larger shifts. More precisely, the definition of
the set $\tilde{\Omega}_s$ can be generalised by introducing the parameter
$\Delta_m$ which we define as follows,
$$\Delta_m := m \cdot |X^+| - |X^-|, \tag{3}$$
with a multiplier $m$, which we will bound to the interval $[1, 1.5]$, in our
experiments. We denote the corresponding set of target labels by
$\tilde{\Omega}_s^{(m)}$, i.e.
$$\tilde{\Omega}_s^{(m)} := \{-(|X^-| + \Delta_m), \ldots, -(1 + \Delta_m), 1, \ldots, |X^+|\}.$$
The definition of $\tilde{\Omega}_s^{(m)}$ constitutes a generalisation of the
set $\tilde{\Omega}_s$ and hence of the set $\tilde{\Omega}$ for all
$m \geq |X^-| / |X^+|$. Specifically, it holds,
$\tilde{\Omega}_s = \tilde{\Omega}_s^{(1)}$. Moreover, for
$m = |X^-| / |X^+|$, it follows, $\Delta_m = 0$, and therefore it holds,
$\tilde{\Omega} = \tilde{\Omega}_s^{(m)}$.
Let us recall the above example from the current section, i.e. $|X^-| = 5$,
and $|X^+| = 10$. We now set the parameter $m$ to the value 2. Then, it
follows, $\Delta_2 = 2 \cdot 10 - 5 = 15$. Hence, the corresponding shifted set
of target labels is redefined as follows,
$$\tilde{\Omega}_s^{(2)} = \{-20, \ldots, -16, +1, \ldots, +10\}.$$
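The generalised label set can be generated directly from the two class sizes and the multiplier $m$. The sketch below is a minimal illustration; the function name `shifted_label_set` is an assumption, and no rounding is applied, so non-integer values of $m$ may produce non-integer targets (the text does not state how such cases are handled).

```python
def shifted_label_set(n_minority, n_majority, m=1.0):
    """Build the shifted label set for multiplier m (Eq. 3 and the set definition)."""
    delta_m = m * n_majority - n_minority
    # Negative targets: -(|X^-| + delta_m), ..., -(1 + delta_m)
    negatives = [-(k + delta_m) for k in range(n_minority, 0, -1)]
    # Positive targets are never shifted: 1, ..., |X^+|
    positives = list(range(1, n_majority + 1))
    return negatives + positives


# The running example with |X^-| = 5 and |X^+| = 10.
symmetric = shifted_label_set(5, 10, m=1)    # symmetric label set
larger = shifted_label_set(5, 10, m=2)       # negatives become -20, ..., -16
unshifted = shifted_label_set(5, 10, m=0.5)  # m = |X^-|/|X^+| gives no shift
```

Only the minority side depends on $m$, which is why the shift is called one-sided.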
3.4 Performance Measures
There exist different performance measures for imbalanced classification
tasks that are based on the evaluation of so-called confusion matrices, which
we depict in Table 1. Using the entries of the confusion matrix, one can
compute the measures specificity (spe), precision (pre), and recall (rec), as
defined in Table 2. Note that the measure recall is also denoted as
sensitivity. Using the definitions from Tables 1 and 2, in our experiments, we
will compute the geometric mean (G-mean) and the F1-score, which are two
popular performance measures for imbalanced classification tasks, and which
are defined as follows,
$$\text{G-mean} = \sqrt{\mathrm{rec} \cdot \mathrm{spe}}, \qquad
\text{F1-score} = \frac{2 \cdot \mathrm{pre} \cdot \mathrm{rec}}{\mathrm{pre} + \mathrm{rec}}, \tag{4}$$
whereby spe, pre, and rec are defined in Table 2.

Table 1: General confusion table. Rows denote true class labels. Columns denote
predicted class labels. $T^-$: True minority. $F^+$: False majority. $F^-$:
False minority. $T^+$: True majority. Min.: Minority. Maj.: Majority.
true \ predicted    Min. Class    Maj. Class
Minority Class      $T^-$         $F^+$
Majority Class      $F^-$         $T^+$

Table 2: Definition of different measures, based on Table 1.
Specificity (spe)          Precision (pre)            Recall (rec)
$T^+ / (F^- + T^+)$        $T^- / (T^- + F^-)$        $T^- / (T^- + F^+)$
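The measures from Table 2 and Eq. (4) translate directly into code. The following sketch uses argument names that mirror the confusion table entries (`t_min` for true minority, `f_maj` for false majority, `f_min` for false minority, `t_maj` for true majority); these names and the toy counts are assumptions for illustration, not results from the experiments.

```python
import math


def imbalance_scores(t_min, f_maj, f_min, t_maj):
    """Return (G-mean, F1-score) from the four confusion table entries."""
    rec = t_min / (t_min + f_maj)  # recall (sensitivity) of the minority class
    spe = t_maj / (f_min + t_maj)  # specificity
    pre = t_min / (t_min + f_min)  # precision of the minority class
    g_mean = math.sqrt(rec * spe)
    f1 = 2 * pre * rec / (pre + rec)
    return g_mean, f1


# Hypothetical counts: 10 minority samples (8 recognised), 20 majority samples (16 recognised).
g_mean, f1 = imbalance_scores(t_min=8, f_maj=2, f_min=4, t_maj=16)
```

Both measures treat the minority class as the class of interest. The G-mean penalises a model that ignores either class, whereas the F1-score does not use the number of correctly recognised majority samples.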
Note that our proposed approach does not fit completely into any one of the
four main counterbalancing categories from Section 1. There is no additional
generation of synthetic data points, i.e. the size of the training set is left
unchanged, including the initial imbalance between the number of minority and
majority class samples (data-level). Moreover, there is no explicit adaptation
of the cost function (cost-level). Our method is also unrelated to any of the
existing boosting-based ensembles (boosting approaches). Therefore, our
proposed method fits best into the algorithm-level category. However, there is
no explicit shift of the decision boundaries obtained by the corresponding SVM
models. Thus, our proposed method seems to represent an additional meta-level
or target-level category, or a specific algorithm-level subcategory.
4 DATA SETS
In the current work, we will analyse the following four data sets, which are
all publicly available on the UCI Machine Learning Repository^1 (Dua and
Graff, 2017).
^1 Hyperlink: http://archive.ics.uci.edu/ml
The Arrhythmia data set constitutes a task for the distinction between
present and absent cardiac arrhythmia. Initially, this data set was defined as
a 16-class classification task, with one normal class and 15 classes defined
as abnormal, i.e. referring to some kind of cardiac arrhythmia. However, eight
of the 15 classes include fewer than ten samples each. Therefore, this data
set is mostly used in a binary classification setting. From the initial 279
features, we removed five features that included missing values, which leaves
274 features. The provided features were extracted from different domains,
which are denoted by DI, DII, DIII, AVR, AVL, AVF, as well as V1, ..., V6.
Moreover, the data set also includes general patient attributes, such as age,
sex, and height, amongst others.
The UCI repository includes several data sets cov-
ering breast cancer classification tasks. In the current
study, we analyse the data set that is denoted as Breast
Cancer Wisconsin (Original), and which we will sim-
ply denote by Breast Cancer. This data set consists
of nine ordinal-scaled features, ranging from 1 to 10
each. The features provided are, e.g., the clump thick-
ness, the uniformity of cell size, and the uniformity of
cell shape, amongst others.
The Heart Disease data set constitutes a binary classification task for the
distinction between the presence and absence of heart diseases, based on
different patients. As explained on the corresponding UCI repository database
download page, initially, this data set consisted of five classes and 76
features. However, only 13 of the provided features are publicly available.
The provided features include the patients' age, sex, as well as resting blood
pressure, amongst others.
The Ionosphere data set is a binary classifica-
tion task which includes radar returns from the iono-
sphere. The classes are denoted by good and bad. The
first class is composed of samples that were show-
ing evidence of some type of structure in the iono-
sphere. The latter class is composed of the remaining
samples. All of the provided features are continuous.
Technical details specific to the data set are provided
in (Sigillito et al., 1989).
Table 3 summarises the properties of all data sets
described above. Note that we include the Arrhythmia
and Heart Disease data sets to additionally evaluate
the effectiveness of our proposed method in combina-
tion with slightly imbalanced classification tasks.
Table 3: Data set properties. #S: Number of samples. #F: Number of features.
Class Distribution: minority : majority. (P%): P percent of the data belong to
the minority class.
Data Set         #S     #F     Class Distribution
Arrhythmia       452    274    207 : 245 (46%)
Breast Cancer    683    9      239 : 444 (35%)
Heart Disease    270    13     120 : 150 (44%)
Ionosphere       351    34     126 : 225 (36%)
5 RESULTS & DISCUSSION
In the current section, we first provide our experimen-
tal settings, followed by the presentation of our re-
sults. Finally, we discuss the obtained outcomes, and
provide some ideas for future research directions.
5.1 Experimental Settings
We use the MATLAB^2 software (version R2019b) with the default parameters for
linear SVM models, i.e. cSVMs and rSVMs with linear kernels. For evaluation,
we apply a 100 × 20-fold cross validation (CV) to the data sets Arrhythmia and
Breast Cancer. For the data sets Heart Disease and Ionosphere, we apply a
100 × 10-fold CV, due to the low number of data points. Thereby, the class
distribution of each fold reflects the initial class distribution of the
respective data set, i.e. we apply stratified CVs. Note that by setting the
number of evaluation folds to 20 and 10, respectively, we ensure that each of
the test folds includes at least 10 data samples from the minority class.
Moreover, by applying 100 iterations for each of the CV evaluations, we ensure
a fair comparison across all analysed models. To test for statistically
significant differences between the implemented models, we apply the two-sided
Wilcoxon signed-rank test (Wilcoxon, 1945), at a significance level of 5%.
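The claim about at least 10 minority samples per test fold can be checked with simple arithmetic. The sketch below takes the minority class sizes from Table 3 and the fold counts stated above, and assumes that a stratified split distributes the minority samples as evenly as possible, so that the smallest fold receives the integer quotient. The names `settings` and `smallest_minority_count` are illustrative.

```python
# (minority class size, number of CV folds) per data set
settings = {
    "Arrhythmia": (207, 20),
    "Breast Cancer": (239, 20),
    "Heart Disease": (120, 10),
    "Ionosphere": (126, 10),
}


def smallest_minority_count(n_minority, n_folds):
    """Minority samples in the smallest test fold, assuming an even stratified split."""
    return n_minority // n_folds


per_fold = {name: smallest_minority_count(n, k) for name, (n, k) in settings.items()}
```

Under this assumption, Arrhythmia is the tightest case, with 10 minority samples in its smallest fold.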
Note that in the current study, we focus on SVM models, since they are
popular machine learning tools which are widely used in binary classification
scenarios. Moreover, we apply the default parameter settings to focus on the
effectiveness of label shifts in particular.
As we will discuss later in detail, in this work, we provide the basic form of
our proposed balancing technique (see Section 5.5). Therefore, for the
comparison to the state-of-the-art, we focus on the SMOTE method, which is a
very intuitive and easily interpretable, yet hard-to-beat approach.
5.2 Initial Regression Experiments
First, we focus on determining the best regression approach, based on the
following three label set variants. We evaluate the rSVM performances based on
the initial label set, $\Omega = \{-1, +1\}$, the modified label set,
$\tilde{\Omega} = \{-|X^-|, \ldots, +|X^+|\}$, as well as the modified and
symmetric label set,
$\tilde{\Omega}_s^{(1.0)} = \{-|X^+|, \ldots, -(|X^+| - |X^-| + 1), 1, \ldots, |X^+|\}$,
as discussed in Section 3. Table 4 states the corresponding performance values
for all of the four data sets.
From Table 4, we can make the following observations. Based on the Arrhythmia
and Heart Disease data sets, training rSVM models specific to the redefined
label set, $\tilde{\Omega}$, instead of the initial label set, $\Omega$, leads
to an improvement, in combination with both performance measures, G-mean and
F1-score. In contrast, based on the Breast Cancer and Ionosphere data sets,
training rSVM models specific to the redefined label set, $\tilde{\Omega}$,
instead of the initial label set, $\Omega$, leads to a decrease of both
performance measures. However, the training of rSVMs in combination with the
redefined and symmetric label set, $\tilde{\Omega}_s^{(1.0)}$, leads to the
best performance values, based on all data sets and both measures. The
improvement in performance was always statistically significant, according to
the two-sided Wilcoxon signed-rank test, at a 5% significance level, with
respect to both approaches, $\Omega$ and $\tilde{\Omega}$.
^2 https://www.mathworks.com/products/matlab.html
However, in general, binary classification tasks
are not transferred to regression tasks. Therefore, in
the next step, we will compare our proposed approach
to common classification SVM models.
5.3 Classification vs. Regression
In the current section, we will analyse the following two research questions.
First, we will evaluate the effect of further shifting the minority class
labels, such that the redefined label sets become imbalanced again, however
over-representing the minority class. Second, we will compare our proposed
method to a regular cSVM model, which is trained in combination with the SMOTE
method. Thereby, we apply the standard SMOTE approach in combination with the
Euclidean distance, and set the number of nearest neighbours to $k = 10$ (see
Section 2). Note that we apply the SMOTE method such that the current training
set is always equally distributed, i.e.
$|X^-_{\mathrm{train}}| = |X^+_{\mathrm{train}}|$, for each validation fold,
and each data set. Table 5 states the obtained results, including the
performance values for $\tilde{\Omega}_s^{(m)}$, with
$m = 1.1, 1.2, 1.3, 1.4, 1.5$.
From Table 5, we can make the following observations. Additionally shifting
the labels, with respect to $\tilde{\Omega}_s^{(1.0)}$, increased the
performance specific to both measures, G-mean and F1-score, based on the
Arrhythmia and Breast Cancer data sets. In contrast, based on the Heart
Disease and Ionosphere data sets, further labels shifts, with respect to
$\tilde{\Omega}_s^{(1.0)}$, decreased the averaged performance in general, in
combination with both measures. Moreover, our proposed method outperformed the
common cSVM approach, in combination with SMOTE (and the initial binary label
set $\Omega$), based on the Arrhythmia, as well as Breast Cancer data set,
with respect to both performance measures. On the other hand, our proposed
method was outperformed by the cSVM SMOTE approach, for the other two data
sets, also with respect to both performance measures.
Table 4: Averaged G-mean and F1-score results in % based on rSVMs. AR:
Arrhythmia. BC: Breast Cancer. HD: Heart Disease. IO: Ionosphere.
$\Omega$ / $\tilde{\Omega}$ / $\tilde{\Omega}_s^{(1.0)}$: rSVM in combination
with the initial, $\Omega = \{-1, +1\}$, redefined, $\tilde{\Omega}$, redefined
and shifted, $\tilde{\Omega}_s^{(1.0)}$, label sets (see Section 3, for the
definitions). The best performing method is depicted in bold. The improvement
in performance is statistically significant, according to the two-sided
Wilcoxon signed-rank test, at a significance level of 5%, with respect to
$\Omega$ and $\tilde{\Omega}$, for each data set, based on both performance
measures. Standard deviation values are denoted by ±.
                            AR (100×20 Folds)        BC (100×20 Folds)        HD (100×10 Folds)        IO (100×10 Folds)
Label set                   G-mean      F1-score     G-mean      F1-score     G-mean      F1-score     G-mean      F1-score
$\Omega$                    62.9 ±2.6   58.6 ±3.1    94.8 ±0.9   93.9 ±1.0    79.1 ±2.9   76.7 ±3.3    76.7 ±2.8   73.7 ±3.4
$\tilde{\Omega}$            66.5 ±2.7   61.8 ±3.3    91.6 ±1.2   90.7 ±1.3    81.5 ±2.4   79.3 ±2.8    73.1 ±2.5   69.3 ±3.1
$\tilde{\Omega}_s^{(1.0)}$  68.1 ±2.3   64.4 ±2.8    96.5 ±0.8   95.5 ±1.0    82.0 ±2.2   80.1 ±2.5    82.3 ±1.9   78.4 ±2.4
Table 5: Averaged G-mean and F1-score results in %. AR: Arrhythmia. BC: Breast
Cancer. HD: Heart Disease. IO: Ionosphere. SMOTE: cSVM based on
$\Omega = \{-1, +1\}$, in combination with an exactly balanced class
distribution (see Section 2, for details). $\tilde{\Omega}_s^{(m)}$: rSVM in
combination with the redefined and shifted label sets $\tilde{\Omega}_s^{(m)}$
(see Section 3, for the definitions). The best performing methods are depicted
in bold. An asterisk (*) indicates a statistically significant difference
between the best performing rSVM approach and the cSVM SMOTE method, according
to the two-sided Wilcoxon signed-rank test, at a significance level of 5%.
Standard deviation values are denoted by ±.
                            AR (100×20 Folds)        BC (100×20 Folds)        HD (100×10 Folds)        IO (100×10 Folds)
Method                      G-mean      F1-score     G-mean      F1-score     G-mean      F1-score     G-mean      F1-score
SMOTE                       66.5 ±2.7   63.8 ±3.0    97.1 ±0.7   95.8 ±0.9    83.0 ±2.0   81.2 ±2.3    83.0 ±2.2   79.2 ±2.7
$\tilde{\Omega}_s^{(1.0)}$  68.1 ±2.3   64.4 ±2.8    96.5 ±0.8   95.5 ±1.0    82.0 ±2.2   80.1 ±2.5    82.3 ±1.9   78.4 ±2.4
$\tilde{\Omega}_s^{(1.1)}$  68.6 ±2.0   65.5 ±2.3    97.1 ±0.7   96.0 ±0.9    81.9 ±2.3   80.1 ±2.5    81.6 ±1.9   77.2 ±2.4
$\tilde{\Omega}_s^{(1.2)}$  68.7 ±1.9   66.3 ±2.1    97.6 ±0.6   96.4 ±0.8    81.7 ±2.5   80.2 ±2.6    81.2 ±2.0   76.4 ±2.4
$\tilde{\Omega}_s^{(1.3)}$  68.8 ±2.2   67.0 ±2.3    97.7 ±0.6   96.4 ±0.9    81.1 ±2.6   79.8 ±2.5    81.0 ±2.0   76.0 ±2.5
$\tilde{\Omega}_s^{(1.4)}$  68.8 ±2.2   67.5 ±2.2    97.7 ±0.6   96.2 ±0.9    80.4 ±2.4   79.3 ±2.3    80.8 ±1.9   75.7 ±2.4
$\tilde{\Omega}_s^{(1.5)}$  68.7 ±2.2   67.9 ±2.1    97.7 ±0.6   96.1 ±0.9    79.7 ±2.4   78.9 ±2.2    80.7 ±1.9   75.4 ±2.4
Table 6: Averaged confusion tables. SMOTE: cSVM based on
$\Omega = \{-1, +1\}$, in combination with an exactly balanced class
distribution. $\tilde{\Omega}_s^{(m)}$: rSVM in combination with the redefined
and shifted label sets $\tilde{\Omega}_s^{(m)}$ (see Section 3, for the
definitions). Min: Minority class. Maj: Majority class. Rows denote true class
labels. Columns denote predicted class labels.
                                    Arrhythmia     Breast Cancer   Heart Disease   Ionosphere
Method                      true    Min    Maj     Min    Maj      Min    Maj      Min    Maj
SMOTE                       Min     132    75      232    7        97     23       95     31
                            Maj     74     171     13     431      22     128      18     207
$\tilde{\Omega}_s^{(1.0)}$  Min     126    81      228    11       96     24       92     34
                            Maj     56     189     11     433      24     126      17     208
$\tilde{\Omega}_s^{(1.1)}$  Min     132    75      232    7        98     22       93     33
                            Maj     63     182     12     432      27     123      23     202
Table 6 illustrates the effects of increasing the labels shift, based on the
rounded confusion tables. The tables are averaged with respect to the 100
repetitions of the 10-fold and 20-fold cross validations, specific to the
different data sets, respectively. Comparing the cSVM SMOTE method to the
$\tilde{\Omega}_s^{(1.0)}$ based approach leads to the following observation,
with respect to the recognition of the minority class. The cSVM SMOTE method
slightly outperforms the $\tilde{\Omega}_s^{(1.0)}$ based approach on all of
the four data sets. The effect of shifting the minority class labels is also
clearly depicted in Table 6. By increasing $m$ from 1.0 to 1.1, we can observe
that the recognition of the minority class improved, while simultaneously the
recognition of the majority class decreased, based on all four data sets. The
recognition of the minority class, specific to the approaches cSVM SMOTE and
$\tilde{\Omega}_s^{(1.1)}$, is approximately the same.
5.4 Discussion
In the first part of our experiments, we showed that shifting the labels to a
symmetric target interval improves the performance of rSVM models, with
respect to the initial class label set $\Omega$, as well as to the redefined
and asymmetric target label set $\tilde{\Omega}$. Moreover, we showed that
additional one-sided shifts of the labels improve the recognition of the
minority class, while simultaneously decreasing the recognition rate specific
to the majority class. In addition, the performance of our proposed approach
is comparable to the performance of cSVM SMOTE models. One main advantage of
our proposed method is that there is no additional generation of synthetic
data. This property is especially beneficial in classification tasks where the
class imbalance is very large and the feature dimension is very high. In
general, the computational cost of determining a small set of nearest
neighbours has to be taken into account in tasks including high-dimensional
data. Moreover, it is also necessary to store the additionally generated data.
Applying our proposed approach avoids both of the aforementioned issues.
Similar to other counterbalancing methods, it is possible to apply our
proposed method also in multi-class settings, in a straightforward manner. To
this end, one can simply choose one of the existing divide-and-conquer
methods, such as the error-correcting output codes (Dietterich and Bakiri,
1991), including, e.g., the one-versus-one and one-versus-all approaches.
Moreover, one can also apply our proposed method in form of cascaded
classification architectures (Frank and Hall, 2001), if it is possible to
detect an ordinal class structure in the current classification task, as
recently proposed, for instance, in (Bellmann and Schwenker, 2020; Lausser et
al., 2020).
5.5 Future Research Directions
Note that in the current study, we introduced the basic
form of our proposed method. Therefore, in the fol-
lowing, we want to provide some promising ideas for
future research directions.
First, one could include the parameter m in the
process of classifier design. As we showed in the pre-
vious section, the parameter m has a strong influence
on the recognition rate of the minority class. Depend-
ing on the current classification task and performance
measure, one could apply some kind of optimisation
techniques, e.g. grid search, to determine optimal val-
ues for m.
Second, note that in the current study, we proposed to assign the labels
randomly, between the new set of labels $\tilde{\Omega}_s^{(m)}$ and the
corresponding data samples from the majority (only positive labels) and the
minority (only negative labels) classes. Therefore, we have to further analyse
whether the classification performance can be improved by applying specific
target label assigning approaches. To this end, one could define some kind of
ordering function $f$, $f : \mathbb{R}^d \to \mathbb{R}$, and apply $f$
separately to the sets $X^-$ and $X^+$. Subsequently, one can assign the target
labels according to the sorted values of $f$. More precisely, $x \in X^+$ is
assigned to the target label $|X^+|$, if $f(x) > f(y)$,
$\forall y \in X^+ \setminus \{x\}$. Similarly, $x \in X^+$ is assigned to the
target label $1$, if $f(x) < f(y)$, $\forall y \in X^+ \setminus \{x\}$.
Analogously, based on the set $X^-$, the data sample specific to the
lowest/highest value of $f$ is assigned to the lowest/highest negative value of
the current label set $\tilde{\Omega}_s^{(m)}$. We assume that assigning the
target labels appropriately can lead to a smooth regression target function,
which could lead to a better classification performance. In future work, we
want to evaluate different examples for the ordering function $f$, for
instance,
$$f(x) := \sum_{i=1}^{d} |x_i|, \qquad f(x) := \sum_{i=1}^{d} x_i^n,$$
e.g., with $n \in \{1, 2\}$.
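The ordered assignment outlined above could look as follows. This is only a sketch of the idea, not an evaluated method: the helper `assign_ordered_targets`, the toy samples, and the stable tie handling of Python's `sorted` are assumptions, since the text does not specify how equal values of $f$ are treated.

```python
def assign_ordered_targets(samples, label_values, f):
    """Give the r-th smallest label value to the sample with the r-th smallest f."""
    order = sorted(range(len(samples)), key=lambda i: f(samples[i]))
    sorted_values = sorted(label_values)
    targets = [None] * len(samples)
    for rank, i in enumerate(order):
        targets[i] = sorted_values[rank]
    return targets


def abs_sum(x):
    """First ordering function from the text: sum of absolute feature values."""
    return sum(abs(v) for v in x)


# Hypothetical toy samples; f is applied to each class separately.
majority = [[3.0, 0.0], [1.0, 1.0], [0.5, 0.0]]
minority = [[-2.0, 0.0], [0.0, 0.5]]
pos_targets = assign_ordered_targets(majority, [1, 2, 3], abs_sum)
neg_targets = assign_ordered_targets(minority, [-2, -1], abs_sum)
```

Within each class, larger values of $f$ map to larger targets, so the majority sample with the largest $f$ receives $|X^+|$ and the minority sample with the smallest $f$ receives the most negative label.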
Third, our proposed method is not based on any
specific property of SVM models. Similar to SVMs,
decision trees (Breiman et al., 1984), for instance,
can also be implemented as both classification and
regression models. Therefore, we aim to analyse our
proposed method in combination with additional
classification (or even plain regression) models.
Moreover, we aim to adapt our approach to deep models,
for instance, by scaling the target labels to the
interval [-1, 1], to avoid the exploding gradient
problem.
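One possible way to scale the target labels is sketched below. It is an assumption of this sketch, not a result of the study, that the class is decided by the sign of the regression output; under that assumption, dividing by the largest absolute label is preferable to min-max scaling, because it leaves every sign unchanged. The function name `scale_targets` and the example labels are hypothetical.

```python
from typing import List, Sequence


def scale_targets(targets: Sequence[float]) -> List[float]:
    """Scale targets into [-1, 1] by dividing by the largest absolute value.

    Dividing by a positive constant keeps every sign, so a decision rule that
    thresholds the regression output at zero is unaffected.
    """
    max_abs = max((abs(t) for t in targets), default=0.0)
    if max_abs == 0.0:
        return [0.0 for _ in targets]
    return [t / max_abs for t in targets]


# Placeholder labels: two negative (minority) and three positive (majority).
scaled = scale_targets([-4.0, -2.0, 1.0, 2.0, 3.0])
```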
ICAART 2021 - 13th International Conference on Agents and Artificial Intelligence
Fourth, we aim to analyse our proposed method in
combination with ensemble-based classification
systems. For instance, one could design an ensemble in
which each base classifier/regressor is trained on
1) different target label intervals, and/or
2) different assignments between training samples and
target labels. Moreover, one could include different
classification/regression models, or combine our
proposed method with approaches from the different
class imbalance solution categories, which we briefly
discussed in Section 1.
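Variant 2) of the ensemble idea above can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, the shuffling is meant to be applied per class to that class's label set, and averaging the member outputs followed by a sign decision is one possible fusion rule assumed here, not one evaluated in this study.

```python
import random
from typing import Callable, List, Sequence

# A trained base regressor is modelled as a function from a sample to a score.
Regressor = Callable[[Sequence[float]], float]


def random_assignments(labels: Sequence[float], n_members: int,
                       seed: int = 0) -> List[List[float]]:
    """Create one random sample-to-label assignment per ensemble member."""
    rng = random.Random(seed)
    assignments = []
    for _ in range(n_members):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        assignments.append(shuffled)
    return assignments


def ensemble_decision(members: Sequence[Regressor],
                      x: Sequence[float]) -> int:
    """Average the member outputs and map the sign to a class in {-1, +1}."""
    if not members:
        raise ValueError("ensemble must contain at least one member")
    mean_output = sum(member(x) for member in members) / len(members)
    return 1 if mean_output >= 0 else -1


# Three different label assignments for a class with four samples.
assignments = random_assignments([1, 2, 3, 4], n_members=3, seed=42)

# Toy stand-ins for trained base regressors (illustration only).
toy_members = [lambda x: 2.0, lambda x: -0.5, lambda x: -0.5]
decision = ensemble_decision(toy_members, [0.0, 0.0])
```

Each list in `assignments` would serve as the regression targets for one base model; the base models themselves (e.g. support vector regressors) are left abstract in this sketch.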
Finally, after adapting our method according to some
or all of the modifications proposed above, we aim to
provide a detailed comparison to the latest
state-of-the-art techniques, including on highly
imbalanced data sets.
ACKNOWLEDGEMENTS
The work of Friedhelm Schwenker and Peter Bell-
mann is supported by the project Multimodal recogni-
tion of affect over the course of a tutorial learning ex-
periment (SCHW623/7-1) funded by the German Re-
search Foundation (DFG). The work of Daniel Braun
and Heinke Hihn is supported by the European Re-
search Council, grant number ERC-StG-2015-ERC,
Project ID: 678082, BRISC: Bounded Rationality in
Sensorimotor Coordination. We gratefully acknowl-
edge the support of NVIDIA Corporation with the do-
nation of the Tesla K40 GPU used for this research.
REFERENCES
Abe, S. (2005). Support Vector Machines for Pattern Clas-
sification. Advances in Pattern Recognition. Springer,
London, U.K.
Aggarwal, C. C. (2015). Outlier analysis. In Data mining,
pages 237–263. Springer.
Bellmann, P. and Schwenker, F. (2020). Ordinal classifi-
cation: Working definition and detection of ordinal
structures. IEEE Access, 8:164380–164391.
Bellmann, P., Thiam, P., and Schwenker, F. (2018). Multi-
classifier-Systems: Architectures, Algorithms and Ap-
plications, pages 83–113. Springer International Pub-
lishing, Cham.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone,
C. J. (1984). Classification and Regression Trees.
Wadsworth.
Chawla, N. V. (2005). Data mining for imbalanced datasets:
An overview. In The Data Mining and Knowledge
Discovery Handbook, pages 853–867. Springer.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,
W. P. (2002). SMOTE: Synthetic minority over-sampling
technique. Journal of Artificial Intelligence
Research, pages 321–357.
Dietterich, T. G. and Bakiri, G. (1991). Error-correcting
output codes: A general method for improving mul-
ticlass inductive learning programs. In AAAI, pages
572–577. AAAI Press / The MIT Press.
Dua, D. and Graff, C. (2017). UCI machine learning repos-
itory.
Fakoor, R., Ladhak, F., Nazi, A., and Huber, M. (2013).
Using deep learning to enhance cancer diagnosis and
classification. In Proceedings of the international con-
ference on machine learning, volume 28. ACM New
York, USA.
Frank, E. and Hall, M. A. (2001). A simple approach to or-
dinal classification. In ECML, volume 2167 of Lecture
Notes in Computer Science, pages 145–156. Springer.
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H.,
and Herrera, F. (2011). A review on ensembles for
the class imbalance problem: bagging-, boosting-, and
hybrid-based approaches. IEEE Transactions on Sys-
tems, Man, and Cybernetics, Part C (Applications and
Reviews), 42(4):463–484.
Hihn, H. and Braun, D. A. (2020). Specialization in hier-
archical learning systems. Neural Processing Letters,
52:2319–2352.
Jurgovsky, J., Granitzer, M., Ziegler, K., Calabretto, S.,
Portier, P.-E., He-Guelton, L., and Caelen, O. (2018).
Sequence classification for credit-card fraud detec-
tion. Expert Systems with Applications, 100:234–245.
Kuncheva, L. I. (2014). Combining Pattern Classifiers:
Methods and Algorithms. John Wiley & Sons.
Lausser, L., Schäfer, L. M., Kühlwein, S. D., Kestler, A.
M. R., and Kestler, H. A. (2020). Detecting ordinal
subcascades. Neural Processing Letters, 52:2583–2605.
Lin, Y., Lee, Y., and Wahba, G. (2002). Support vector
machines for classification in nonstandard situations.
Machine learning, 46(1-3):191–202.
Loyola-González, O., Medina-Pérez, M. A., Martínez-Trinidad,
J. F., Carrasco-Ochoa, J. A., Monroy, R., and
García-Borroto, M. (2017). PBC4cip: A new contrast
pattern-based classifier for class imbalance problems.
Knowledge-Based Systems, 115:100–109.
Sigillito, V. G., Wing, S. P., Hutton, L. V., and Baker, K. B.
(1989). Classification of radar returns from the iono-
sphere using neural networks. Johns Hopkins APL
Technical Digest, 10(3):262–266.
Sun, Y., Wong, A. K., and Kamel, M. S. (2009). Classifica-
tion of imbalanced data: A review. International jour-
nal of pattern recognition and artificial intelligence,
23(04):687–719.
Vapnik, V. (2013). The nature of statistical learning theory.
Springer science & business media.
Wilcoxon, F. (1945). Individual comparisons by ranking
methods. Biometrics Bulletin, 1(6):80–83.
Wu, G. and Chang, E. Y. (2003). Class-boundary align-
ment for imbalanced dataset learning. In ICML 2003
workshop on learning from imbalanced data sets II,
Washington, DC, pages 49–56.
Xiaolong, X., Wen, C., and Yanfei, S. (2019). Over-
sampling algorithm for imbalanced data classifica-
tion. Journal of Systems Engineering and Electronics,
30(6):1182–1191.