Data Balancing using Deep Convolutional Generative Adversarial Networks (DCGAN) in Patients with Congenital Syndrome by Zika Virus

Erika G. Assis, Mark A. Song, Luis E. Zárate and Cristiane N. Nobre

Department of Computing, Pontifical Catholic University of Minas Gerais, Brazil

Keywords:

Congenital Syndrome, Zika, Generative Adversarial Networks, GAN, DCGAN.

Abstract:

Class imbalance is a common problem in health care data and often degrades the performance of machine learning algorithms. The minority class, generally the one of greatest interest, is learned poorly because training is dominated by the majority class. This article proposes using Deep Convolutional Generative Adversarial Networks (DCGAN) to oversample the minority class by generating synthetic instances. For this, the ’RESP-Microcephaly’ database was used, which records suspected cases of congenital alteration due to Zika virus (ZIKV) infection. The database is unbalanced, with 2904 and 7606 instances with and without congenital alteration, respectively. To evaluate the performance of DCGAN, we compared it with an undersampling approach and an oversampling approach (SMOTE), using three classification algorithms. Balancing with DCGAN yields a significant improvement in the classification metrics, especially for the minority class.

1 INTRODUCTION

In recent years, machine learning techniques have been applied to several domains, especially in health care, in problems involving breast cancer, thyroid disease, and Parkinson’s disease, and in predicting mortality rate and life expectancy (Tomar, 2013) (Herland et al., 2014) (Japkowicz, 2000a) (Batista et al., 2004) (Alsharqi et al., 2018) (Weng et al., 2017) (Green, 2018) (Esteva et al., 2017).

However, these healthcare datasets often suffer

from “rare class” issues, which result in unbalanced

classes in the training datasets (Batista et al., 2004)

(Chawla, 2005) (Milovic and Milovic, 2012). That is,

most health datasets generally have very few cases of

the target disease compared to the number of healthy

patients in the dataset (Chawla, 2005) (Japkowicz,

2000b). In the binary classiﬁcation for medical diag-

nosis, the rare minority class refers to the positive in-

stances or target class. In contrast, the majority class

is represented by the negative cases in the dataset.

Although unbalanced data is frequent in machine learning tasks, it remains very challenging for classification algorithms (Chawla et al., 2003). Traditional machine learning methods applied to unbalanced problems are usually biased in favor of the majority class and perform poorly on the minority class. This happens because, during training, the minority class contributes less to the minimization of the objective function (Chawla et al., 2002).

There are traditional techniques for working with class imbalance. One approach is undersampling, which excludes majority class instances to balance the number of instances in each class. Although this repairs the disproportion, the classifier loses information from the majority class and is also prone to sampling errors on small data sets (Japkowicz, 2000a). Another technique is oversampling, in which the instances of the minority class are augmented until the classes are the same size. This solves the balance problem but may cause the classifier to generalize less well on the minority class, because replicated instances make its particularities appear more common than they are (Japkowicz, 2000a).

A widely used algorithm that follows this approach is SMOTE (Chawla et al., 2002) (Chawla et al., 2003). It increases the number of instances of the minority class by creating synthetic examples. First, a sample belonging to the minority class is randomly selected, and its nearest minority class neighbors are identified. Then, a new sample is built by interpolating between the selected sample and its neighbors: line segments are drawn between the examples that make up the neighborhood, and synthetic points belonging to the minority class are generated on these segments (Chawla et al., 2002). Thus, although these new instances may not accurately reflect the actual distribution of the data, they tend to be close enough to encourage generalization and increase the overall classification accuracy (Chawla et al., 2002).

Assis, É., Song, M., Zárate, L. and Nobre, C.
Data Balancing using Deep Convolutional Generative Adversarial Networks (DCGAN) in Patients with Congenital Syndrome by Zika Virus.
DOI: 10.5220/0010842900003123
In Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2022) - Volume 5: HEALTHINF, pages 93-102
ISBN: 978-989-758-552-4; ISSN: 2184-4305

On the other hand, deep learning methods, such as Generative Adversarial Networks (GANs), have also been used very efficiently for this purpose (Mariani et al., 2018) (Mullick et al., 2019).
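As an illustration of the SMOTE interpolation step described above, the following is a minimal sketch in plain Python. It is not the implementation used in this work: the function names `smote_sample` and `smote`, the default `k=5`, and the toy points are assumptions made only for the example, and at least two minority samples are required.

```python
import random

def smote_sample(minority, k=5, rng=None):
    """Create one synthetic point by interpolating a random minority
    sample toward one of its k nearest minority neighbors."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Sort the other minority samples by squared Euclidean distance to base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbor = rng.choice(others[:k])
    gap = rng.random()  # position along the segment base -> neighbor
    return [a + gap * (b - a) for a, b in zip(base, neighbor)]

def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points (illustrative helper)."""
    rng = random.Random(seed)
    return [smote_sample(minority, k, rng) for _ in range(n_new)]

# Toy minority class: the four corners of the unit square.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(points, n_new=3, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.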

GANs were inspired by game theory: the generator (G) and the discriminator (D) compete with each other until they reach a Nash equilibrium during training (Goodfellow et al., 2014). Both are neural networks with different responsibilities. The discriminator evaluates whether a given sample is real or generated, while the generator produces the samples. The two components are trained in an adversarial way: the discriminator is trained to distinguish real from fake, and at the same time the generator is trained to deceive the discriminator with the content it produces. Through this joint training, both networks improve at their respective tasks (Goodfellow et al., 2014).

This article uses a synthetic oversampling approach based on DCGAN (Deep Convolutional Generative Adversarial Network), a type of GAN that explicitly uses convolutional and transposed convolutional layers in the discriminator and generator, respectively (Salimans et al., 2016).

Generating artificial data through data augmentation (DA) techniques can be an alternative to improve classification. Several works have already applied DA and obtained improved results (Hussain et al., 2018), (Wang et al., 2017) and (Yu et al., 2017).

We used GANs to perform DA and tested the approach on a set of tabular data from the Public Health Event Registry RESP-Microcephaly, which presents a class imbalance on the order of 1:2.6. The majority class (without the syndrome) accounts for 72.37% of the instances, far more than the minority class (with the syndrome), which accounts for 27.63%.

In addition to the two oversampling methods, SMOTE and GANs, we also evaluated the Random Under Sampler (RUS). This undersampling method removes majority class samples at random until the majority class has the same number of samples as the minority class.
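The class proportions quoted above and the effect of random undersampling can be checked with a short sketch. The helper `random_undersample` is a hypothetical name written for this example (it is not the RUS implementation used in the experiments), and the integer ids merely stand in for real instances.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Randomly keep as many majority samples as there are minority samples."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return kept, list(minority)

# Class counts reported for the RESP-Microcephaly data after selection.
n_minority, n_majority = 2904, 7606
total = n_minority + n_majority
minority_share = round(100 * n_minority / total, 2)   # percentage with the syndrome
majority_share = round(100 * n_majority / total, 2)   # percentage without the syndrome
imbalance = round(n_majority / n_minority, 1)         # majority-to-minority ratio

# Placeholder instances (integer ids), used only to show the resampling.
majority = list(range(n_majority))
minority = list(range(n_minority))
kept_majority, kept_minority = random_undersample(majority, minority)
```

With these counts the shares come out to 27.63% and 72.37%, and the ratio to about 2.6, matching the figures in the text; after undersampling both classes have 2904 instances.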

The RESP-Microcephaly is an online form developed by DATASUS-Brazil and instituted by the Ministry of Health (MS), in use since November 19, 2015, to record suspected cases and deaths involving changes in growth and development related to infection by the Zika virus and other infectious etiologies (Brasil et al., 2015). Available at: http://www.resp.saude.gov.br/microcefalia

Thus, this work aims to investigate data balancing

methods in diagnosing newborns and children with

congenital syndrome caused by ZIKV infection. Re-

garding classiﬁers, we use three algorithms: Random

Forest, Decision Tree, and Bagging.

This work is structured as follows: Section 2 presents the background used in the research. Section 3 presents works related to the topic investigated. Section 4 describes the materials and methods used in the experiments. Section 5 presents the results and discussion, and Section 6 presents the final considerations and proposals for future work.

2 BACKGROUND

2.1 Congenital Zika Syndrome

On March 31, 2016, the World Health Organization

(WHO) announced Zika virus infection (ZIKV) as an

emergency public health problem worldwide due to

the association of this arbovirus with the occurrence

of congenital Zika syndrome.

ZIKV is mainly transmitted by the vector Aedes aegypti, which resides in tropical and subtropical regions, as well as by Aedes albopictus, which inhabits the European Mediterranean (Carvalho et al., 2019).

Mothers can transmit the Zika virus to embryos or fetuses during pregnancy or at the time of birth (Zanluca et al., 2017).

Children born to women infected with ZIKV during pregnancy show varying degrees of nervous system impairment, such as microcephaly and other neurodevelopmental lesions (Boeuf et al., 2016).

In additional observational studies, a set of congenital anomalies was identified and linked to ZIKV infection in utero, called Congenital Zika Syndrome (CZS). In addition to microcephaly, this syndrome includes craniofacial disproportion, irritability, spasticity, seizures, feeding difficulties, visual abnormalities, and hearing loss, as well as calcifications, cortical disorders, and fetal cerebral ventricle dilatation (Lima et al., 2019).

This article aims to improve CZS classification by balancing the data with GANs, in order to support early diagnosis and prevention.

2.2 Generative Adversarial Networks - GAN

Generative Adversarial Networks (GANs), proposed by Goodfellow et al. (2014), are deep neural network architectures composed of two networks placed against each other. The authors call this model adversarial networks and train both networks using only the backpropagation and dropout algorithms, with considerable success.

HEALTHINF 2022 - 15th International Conference on Health Informatics

GANs are a kind of differentiable generator network; that is, they can be trained with backpropagation and gradient descent (Goodfellow et al., 2014). This type of model transforms samples of a latent vector z into examples x using a smooth function g(z; θ) (Goodfellow et al., 2014). Essentially, differentiable generator networks are computational procedures for generating samples.

A typical architecture for GANs is illustrated in Figure 1. The generator is a differentiable function G. When z is sampled from some simple prior distribution, G(z) yields a sample x (Goodfellow et al., 2014).

The generator input is a random noise vector z, usually drawn from a uniform or normal distribution. The noise is mapped to the data space by the generator G to obtain a fake sample, G(z), which is a multidimensional vector (Goodfellow et al., 2014) (Pan et al., 2019).

The generative network tries to produce samples that resemble the original data, $x \sim p_{data}$, according to the series of transformations described by the function $x = g(z; \theta^{(g)})$.

The discriminator D is a binary classifier. It takes as input either a real sample from the dataset or a fake sample produced by the generator G, and its output represents the probability that the sample is real.

The discriminating network, in turn, outputs the probability that x is real rather than fake, which is given by a function $d(x; \theta^{(d)})$. D is trained to maximize the probability of correctly labeling a sample as real or fake (coming from G), while G is trained to minimize the probability of being discovered, $\log(1 - D(G(z)))$.

In practice, Goodfellow et al. (2014) observed that minimizing $\log(1 - D(G(z)))$ makes the generator gradients vanish quickly, whereas maximizing $\log D(G(z))$ leads to the same fixed point and provides larger gradients. The optimization problem of GANs is expressed by the objective function presented in Equation 1.

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$

This establishes the game between generator and discriminator: D tries to maximize V(D, G) while G tries to minimize it, and the Nash equilibrium is reached when neither of the two can improve its performance. The function above is therefore used as the loss function of the problem.

Thus, in its simplest form, given the two players (discriminator and generator), the GAN learning problem is solved as a zero-sum game, in which the gain obtained by one participant is equivalent to the loss of the other: the function $r(\theta^{(g)}, \theta^{(d)})$ determines the reward for one of the networks and $-r(\theta^{(g)}, \theta^{(d)})$ the reward for the other.

As the game is zero-sum, at the unique Nash equilibrium we have:

$$g^{*} = \arg\min_{g} \max_{d} r(g, d) \tag{2}$$

where $g^{*}$ is the generating network at the convergence point, optimally capturing the data distribution. The standard choice for r is

$$r(\theta^{(g)}, \theta^{(d)}) = \mathbb{E}_{x \sim p_{data}}[\log d(x)] + \mathbb{E}_{x \sim p_{model}}[\log(1 - d(x))] \tag{3}$$

We can separate the cost function of each network. For the discriminator, computed over a minibatch of m samples, we have

$$-\frac{1}{m} \sum_{i=1}^{m} \left[ \log d(x^{(i)}) + \log(1 - d(g(z^{(i)}))) \right] \tag{4}$$

The second term, $\log(1 - d(g(z^{(i)})))$, concerns the classification of fake samples. Equation 5 gives the cost to be minimized by the generator network.

$$\frac{1}{m} \sum_{i=1}^{m} \log(1 - d(g(z^{(i)}))) \tag{5}$$

Intuitively, these cost functions make it the objective of the discriminating network to maximize the number of samples correctly classified as fake or real, while the generating network tries to minimize these hits.
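To make the two cost functions concrete, the sketch below evaluates them from discriminator outputs. It is an illustration under stated assumptions, not the training code of this work: the function names are hypothetical, averaging over the minibatch is assumed, and `generator_cost_non_saturating` is the alternative of maximizing log d(g(z)) that Goodfellow et al. (2014) recommend in practice.

```python
import math

def discriminator_cost(d_real, d_fake):
    """Equation 4, assuming d_real[i] = d(x_i) and d_fake[i] = d(g(z_i))."""
    m = len(d_real)
    return -sum(math.log(r) + math.log(1.0 - f)
                for r, f in zip(d_real, d_fake)) / m

def generator_cost(d_fake):
    """Equation 5: minibatch average of log(1 - d(g(z_i)))."""
    return sum(math.log(1.0 - f) for f in d_fake) / len(d_fake)

def generator_cost_non_saturating(d_fake):
    """Alternative generator cost: minimize -log d(g(z_i))."""
    return -sum(math.log(f) for f in d_fake) / len(d_fake)

# An undecided discriminator outputs 0.5 for every sample.
undecided = discriminator_cost([0.5, 0.5], [0.5, 0.5])
# A confident, correct discriminator has a lower cost.
confident = discriminator_cost([0.9, 0.9], [0.1, 0.1])
```

When the discriminator is confident that generated samples are fake (outputs near 0), `generator_cost` is nearly flat while `generator_cost_non_saturating` is large, which is why the latter gives the generator a stronger training signal early on.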

3 RELATED WORKS

Generative Adversarial Networks (GANs) have recently been used in machine learning problems involving unbalanced databases (Japkowicz, 2000a). Below we present the main works that used GANs for oversampling the minority class in several areas of knowledge.

Mehta et al. (2019) used GANs to help improve the images of patients who suffered a stroke. Han et al. (2019) investigated magnetic resonance (MR) images for tumor detection. Bhagat and Bhaumik (2019) created synthetic chest X-ray images of pneumonia patients to improve the accuracy of image classification. Bailo et al. (2019) augmented blood smear microscopic image datasets, which consist of images of a thin layer of blood placed on a microscope slide. Asvestopoulou et al. (2019) augmented a speech dataset to improve the classification performance of DysLexML, a dyslexia screening tool. Sheng et al. (2019) created datasets to improve speech recognition in children under noisy conditions (Hu et al., 2018). Haradal et al. (2018) and Fahimi et al. (2020) augmented biosignal datasets (electrocardiogram and electroencephalogram) to improve classification.

Figure 1: GANs architecture. Source: Based on (Goodfellow et al., 2014).

All works obtained promising experimental re-

sults, even with few original data, implying that the

generated data can be used as extended samples to in-

crease a database to improve classiﬁcation tasks in the

most diverse applications.

It was also observed that GAN performance is sensitive to model parameters such as the learning rate and the number of epochs, among others (Karadag and Erdaş Cicek, 2019).

Data augmentation with GANs is also sensitive to the size of the dataset: when the number of images exceeds 60,000, the synthetic images generated by the GAN do not contribute to the classification performance and may even reduce it (Karadag and Erdaş Cicek, 2019).

4 MATERIALS AND METHODS

This section presents the materials and methods used in this work, including an overview of the RESP-Microcephaly database, which records suspected cases of congenital alteration due to Zika virus (ZIKV) infection. A descriptive analysis of the database and its pre-processing are also presented.

4.1 RESP Database

Initially, the database had 17,451 instances. As the interest of this work is to classify children born with congenital alterations due to Zika virus infection, we work only with the notifications of newborns and children. We also excluded all subjects with laboratory confirmation for syphilis or toxoplasmosis.

The instance selection process is summarized in Figure 2. From the 14,144 cases considered, those with confirmed syphilis or toxoplasmosis were excluded, since we only work with congenital changes due to ZIKV and not to other causes. At the end of the selection process, we have 10,510 instances: 2,904 children with congenital alterations and 7,606 children without alterations.

Figure 2: Selection of instances from the RESP.

These children already had Microcephaly before the

Zika outbreak and were included in the RESP under the

guidance of the Secretary of Health to follow up on cases

(Brasil et al., 2015).

Microcephaly may be associated with various environmental and genetic factors. Among the environmental factors are fetal distress and congenital STORCH infections. The acronym is composed of the pathogens most frequently related to these diseases: the bacterium Treponema pallidum, which causes syphilis (S), the protozoan Toxoplasma gondii, which causes toxoplasmosis (TO), the rubella virus (R), cytomegalovirus (C), and herpes simplex virus (H) (Ribeiro et al., 2018).

Thus, the RESP database had 10,510 instances and 43 attributes, organized into nine (9) categories:

1. Notification: the classification of suspected cases of congenital infection (newborn, child, fetus at risk, miscarriage, or stillbirth) and the date of notification

2. Pregnant Woman’s Data: age, race/color, and

state of residence (UF)

3. Information about Live Births: sex, date of birth,

weight (grams), and length (centimeters)

4. Data on Pregnancy and Childbirth: types of congenital changes, when the change was detected (in pregnancy or after delivery), gestational age at detection of microcephaly, type of pregnancy, classification of live birth, head circumference, and date of head circumference measurement. The type of pregnancy is defined as preterm (gestational age less than 37 weeks), term (gestational age between 37 and 41 weeks), or post-term (gestational age greater than 42 weeks)

5. Mother’s Clinical, Epidemiological Data: date of

onset of symptoms, type of symptoms (fever, rash,

itching, conjunctivitis, headache, and neurologi-

cal involvement), Syphilis/Toxoplasmosis test and

result, Zika test results, history of arboviruses, and

congenital malformations

6. Information about Imaging Tests: ultrasound,

transfontanellar ultrasound, computed tomogra-

phy, and magnetic resonance

7. Data about the Health Establishment: municipal-

ity and state

8. Data on Disease Evolution: death and date of

death

9. Fields Restricted to the Manager: Final classi-

ﬁcation of the suspected case of Congenital al-

terations and Conﬁrmation criteria through lab-

oratory tests performed (Zika, Dengue, Chikun-

gunya, Syphilis, and Toxoplasmosis, others and

image)

These categories, together with their attributes, are

shown in Figure 3.

Confirmed occurrences of congenital ZIKV infection registered in the RESP peaked in 2016, with more than 1600 records. From May 2016 onward, the number of cases drops, a behavior also observed in subsequent years, as shown in Figure 4.

The cases are distributed throughout the Brazilian territory, as shown in Figure 5. The states with the highest numbers of positive diagnoses were Bahia, Pernambuco, Rio de Janeiro, Paraíba, Maranhão, Ceará, and Sergipe, with 490, 452, 259, 193, 166, 146, and 134 cases, respectively, in addition to the states of Alagoas and Rio Grande do Norte, both with 130 cases.

Regarding the pregnant women’s region, 60.8% of the records are from the Northeast region, 24.9% from the Southeast, 6.7% from the Midwest, 4.3% from the North, and only 3.3% from the South.

4.2 Preprocessing

Before applying the classification algorithms, simple pre-processing strategies were adopted to obtain a more consistent and unbiased model. The database processing phases were:

• Attribute Binarization: The original RESP was composed of numerical (13%) and categorical (87%) attributes. We performed one-hot encoding to binarize all categorical attributes. At the end of the process, the database had 56 attributes.

• Inconsistent Data: There were 95 instances in which the pregnant woman’s age was recorded as 2 or 3. As this age is physiologically incompatible with conception, we excluded these values and left them blank.
Two instances had a head circumference of 323.3 cm, which does not correspond to a plausible value. The literature reports a mean head circumference of 34.61 cm in typical male newborns (NBs), ranging between 32.14 and 37.08 cm, and a mean of 34.05 cm in typical female NBs, ranging between 31.58 and 36.52 cm (Brasil et al., 2015). As the recorded values do not correspond to values found in the literature, we excluded them, leaving them as missing.

• Missing Data: Missing data is common in health databases. Therefore, the use of proper methods becomes essential to reduce the impact of information loss. About 30% of the original database was missing data. We dealt with missing data by imputing values using the mean and the median.

• Sampling Methods: There are two types of sampling methods: undersampling and oversampling. Undersampling removes elements from the majority class, while oversampling adds elements to the minority class.
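The binarization, inconsistent-value, and missing-data steps listed above can be sketched in a few lines of plain Python. This is only an illustration: the helper names, the toy columns, and the plausibility bounds of 20 to 60 cm are assumptions made for the example and are not taken from the RESP pre-processing itself.

```python
from statistics import median

def one_hot(values):
    """Binarize a categorical column: one 0/1 column per category."""
    categories = sorted({v for v in values if v is not None})
    return {c: [1 if v == c else 0 for v in values] for c in categories}

def blank_out_of_range(values, low, high):
    """Replace implausible numeric values with None (missing)."""
    return [v if v is not None and low <= v <= high else None for v in values]

def impute_median(values):
    """Fill missing entries with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Toy columns; names and plausibility bounds are illustrative only.
head_circumference_cm = [33.0, 323.3, 35.0, None, 34.0]
sex = ["F", "M", "F", "M", "M"]

cleaned = impute_median(blank_out_of_range(head_circumference_cm, 20.0, 60.0))
encoded = one_hot(sex)
```

Here the implausible 323.3 cm entry is first marked as missing and then, together with the originally missing entry, filled with the median of the remaining values (34.0), while the `sex` column becomes two binary columns.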

Random oversampling was implemented using the RandomOverSampler class in Python, with the sampling_strategy argument set to “minority” so that the minority class is automatically balanced with the majority class.

Figure 3: Categories and their respective attributes Registration of Public Health Events - RESP.

Figure 4: Conﬁrmed cases of congenital infection due to

Zika Virus in Brazil between 2015 and 2019.

In this work we use two oversampling methods: 1) SMOTE (Synthetic Minority Over-sampling TEchnique), which generates synthetic cases for the class of interest from existing data. The new data are generated in the neighborhood of the minority class data to enlarge the decision region of this class and increase the generalization power of the resulting classifiers (Chawla et al., 2002); 2) GAN, for which the architecture used was a DCGAN (Deep Convolutional Generative Adversarial Network), which allows training a pair of deep convolutional networks: a generator and a discriminator.

Figure 5: Brazil - Cases of congenital infection due to Zika Virus.

DCGAN is a GAN model that uses deconvolution (transposed convolution) layers in the generator and convolution layers in the discriminator to extract characteristics from the data and build a model that generates the synthetic data. The first layer of the generator receives uniformly distributed N-dimensional noise as input to a fully connected network. The result is then reshaped and passed through a series of four fractionally-strided convolutions (512, 256, 128, 64), as shown in Figure 6.

DCGAN incorporates deep learning techniques that are key to stable GAN training. These techniques include the all-convolutional network and Batch Normalization (BN). Batch normalization, initially introduced by Ioffe and Szegedy (2015), speeds up the training phase of deep neural networks by normalizing the input values of each layer inside the network.

The first technique relies on strided convolutions (rather than pooling layers) for both increasing and decreasing the spatial dimensions of the features. The second normalizes the feature vectors to have zero mean and unit variance across all layers, which helps to stabilize learning and to handle problems caused by poor weight initialization.
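The normalization to zero mean and unit variance can be illustrated for a single feature over one minibatch. This is a minimal sketch of the training-time computation described by Ioffe and Szegedy (2015); the function name, the default `gamma`, `beta`, and `eps` values, and the toy activations are assumptions for the example, and the running statistics used at inference time are omitted.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a minibatch to zero mean and unit
    variance, then apply the learnable scale (gamma) and shift (beta)."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# Toy activations of one neuron over a minibatch of four samples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
```

The small constant `eps` only guards against division by zero, so the normalized values have a mean of zero and a variance very slightly below one.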

The generator network has four convolutional layers, all followed by BN (except for the output layer) and rectified linear unit (ReLU) activations. In addition, the generator receives as input a random vector z (drawn from a normal distribution).

The discriminator is also a 4-layer CNN with BN (except in its input layer) and leaky ReLU activations. Many activation functions work well with this basic GAN architecture. However, leaky ReLUs are very popular because they help gradients flow more easily through the architecture.

A regular ReLU function truncates negative values to zero, which blocks the flow of gradients through the network for those inputs. A leaky ReLU, instead of outputting zero, lets a small negative value pass. That is, the function computes the maximum between the feature value and the same value multiplied by a small factor.
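The difference between the two activations can be written out directly. In this sketch the slope `alpha=0.2` is an assumed value chosen for illustration (the text does not state the factor used), and the max-based form is valid only for 0 < alpha < 1.

```python
def relu(x):
    """Regular ReLU: negative inputs are truncated to zero."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.2):
    """max(x, alpha * x): identity for x >= 0, slope alpha for x < 0."""
    return max(x, alpha * x)

def leaky_relu_grad(x, alpha=0.2):
    """Derivative of leaky ReLU: never exactly zero for negative inputs."""
    return 1.0 if x > 0 else alpha

negative_input = -2.0
outputs = (relu(negative_input), leaky_relu(negative_input))
```

For a negative input, the regular ReLU outputs zero and passes no gradient, while the leaky version outputs a small negative value and keeps a gradient of `alpha`.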

The generator output corresponds to the 55 features of the dataset, plus an extra neuron to enable class discrimination. In the discriminator, the final convolution layer is flattened and then fed to a single sigmoid output.

4.3 Assessment Metrics

In this work, the following performance evaluation

measures were used: precision, recall and F-Score.

Precision (Equation 6) is the proportion of instances classified in a given class that actually belong to that class.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{6}$$

where TP = True Positive, TN = True Negative, FP = False Positive, and FN = False Negative.

Recall (Equation 7) measures the hits in a given class; that is, among all the instances of a given class, the proportion that the classifier actually assigned to that class.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$

F-Score is the harmonic mean of precision and recall and indicates the overall quality of the model, as given by Equation 8.

$$\text{F-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{8}$$
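The three metrics follow directly from the confusion-matrix counts, as the short sketch below shows. The counts used are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this work.

```python
def precision(tp, fp):
    """Equation 6: share of predicted positives that are truly positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Equation 7: share of actual positives that were found."""
    return tp / (tp + fn)

def f_score(p, r):
    """Equation 8: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts, not taken from the paper's results.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)
r = recall(tp, fn)
f = f_score(p, r)
```

With these counts, precision is 90/100 and recall is 90/120, and the F-Score falls between the two, closer to the lower value, as expected of a harmonic mean.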

5 RESULTS AND DISCUSSIONS

To evaluate the performance of GANs, we compared this method with two traditional balancing methods, undersampling and SMOTE, and with the results obtained with the unbalanced classes.

We split the dataset into 80% for model creation and 20% for testing. 10-fold cross-validation was used to create the models.

The unbalanced dataset had 2904 instances of the class “yes” and 7606 instances of the class “no”. With undersampling, the data of the majority class were reduced so that, in the end, the data set was composed of 2336 instances of each class. To avoid a biased model, the oversampling method was applied inside the cross-validation; in other words, in each iteration, oversampling was applied only to the nine training folds, never to the validation fold.
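The idea of oversampling only the training folds can be sketched as follows. This is an illustrative outline, not the experimental pipeline: all function names are hypothetical, the folds are not stratified, and plain random duplication stands in for the SMOTE or GAN sampler.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def oversample_minority(train_idx, labels, rng):
    """Duplicate random minority indices until both classes are equal
    (a stand-in for SMOTE or GAN sampling)."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    small, large = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    return train_idx + extra

def balanced_cv_splits(labels, k=10, seed=0):
    """Yield (balanced training indices, untouched validation indices)."""
    rng = random.Random(seed)
    folds = k_fold_indices(len(labels), k, seed)
    for held_out in range(k):
        valid = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield oversample_minority(train, labels, rng), valid

# Toy labels: 30 minority (1) and 70 majority (0) instances.
toy_labels = [1] * 30 + [0] * 70
splits = list(balanced_cv_splits(toy_labels, k=10))
```

Because the validation fold is set aside before any resampling, no duplicated or synthetic instance derived from it can leak into training, which is what keeps the validation estimate unbiased.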

In addition, we use three learning algorithms:

Bagging, Random Forest, and Decision Tree. The re-

sults obtained are shown in Figure 7.

Analyzing the results, we see that the highest precision for the ‘Yes’ class was 94%, obtained with GAN balancing and the Random Forest classifier. This means that only 6% of the instances predicted as ‘Yes’ were false positives, that is, individuals who were identified as having the congenital syndrome but actually did not have the alteration.

Regarding the precision for the ‘No’ class, the best results were also obtained with GAN balancing, for which the three classifiers had the same 90% performance; this means that 10% of the instances classified as not having the syndrome actually had it.

Regarding recall, the best value for the ‘Yes’ class was 90%, for data balanced with GAN and the Random Forest classifier. This means that 10% of the patients with a congenital alteration were classified as not having it. For the ‘No’ class, the best rate was 94%, for both the unbalanced data and the data balanced with GAN, with Bagging and Random Forest. This means that 6% of the individuals without congenital alterations were classified as having them.

Figure 6: Structure of the deep convolutional generator for DCGAN.

Figure 7: Classification metrics with balanced and unbalanced datasets.

For the F-Score metric, for both classes, the best

rate was 92% in balancing with GAN using Random

Forest classiﬁer.

Regarding the unbalanced data, we found that the algorithms classified the majority class (‘No’) better, which was expected, as there is a significant difference between the class sizes.

With SMOTE, there is a slight improvement in the ‘Yes’ class results. For the ‘No’ class, all metrics improve for all algorithms.

Therefore, we can conclude that the undersampling method significantly improves the classification of the ‘Yes’ minority class, which is expected since we removed samples from the majority class. With SMOTE, there is a slight improvement for the minority class, but the majority class is still classified better. Finally, with GANs there was a significant gain in all metrics for both classes.

Balancing the data with GAN showed a significant improvement in the metrics, especially for the minority class ‘Yes’: with the creation of synthetic data, the minority class gains more importance, and the bias toward the majority class is reduced.

6 FINAL CONSIDERATIONS

This work uses Generative Adversarial Networks to synthesize data for oversampling minority classes in unbalanced datasets and compares the results with other balancing algorithms. The results suggest that GANs can increase classifier performance for all evaluated metrics since, for all metrics (precision, recall, and F-Score) and all classification algorithms, the values were above 90%.

We observed a significant improvement, especially for the minority class, since the creation of synthetic data increased the representation and density of the data.

As most classification models are designed to work with balanced datasets, using GANs for data balancing gives the algorithms greater generalization power, allowing them to detect rare and essential patterns that discriminate the classes of the problem and to establish a reliable decision threshold.

Another point worth mentioning is that data on CZS are scarce: since 2019, the Federal Government has treated the data as confidential and no longer makes this information available to researchers and the general public.

Therefore, the detailed analysis of this dataset and, above all, the significant improvement in the classification process underscore the importance of using GANs for balancing tabular datasets and for generating synthetic data from rare and restricted data, as in our case.

Furthermore, the approach presented in this arti-

cle has the potential for early diagnosis of congenital

syndrome associated with Zika virus infection. Early

diagnosis increases prevention, speeds up treatment,

and reduces the devastating consequences of this ill-

ness for mothers and children.

In future work, we suggest refining the model by applying deep learning techniques to the GAN architecture to deal with different data types, and performing statistical and uncertainty analyses of the results obtained.

ACKNOWLEDGEMENTS

The authors thank the National Council for Scientiﬁc

and Technological Development of Brazil (CNPq),

the Coordination for the Improvement of Higher Ed-

ucation Personnel - Brazil (CAPES), the Founda-

tion for Research Support of Minas Gerais State

(FAPEMIG) and Pontiﬁcal Catholic University of Mi-

nas Gerais (PUC Minas).

The report is available in a form already processed by state at: https://datasus.saude.gov.br/acesso-a-informacao/registro-de-eventos-em-saude-publica-resp-microcefalia/

REFERENCES

Alsharqi, M., Woodward, W., Mumith, J., Markham, D., Upton, R., and Leeson, P. (2018). Artificial intelligence and echocardiography. Echo Research and Practice, 5(4):R115–R125.

Asvestopoulou, T., Manousaki, V., Psistakis, A., Nikolli,

E., Andreadakis, V., Aslanides, I., Pantazis, Y., Smyr-

nakis, I., and Papadopouli, M. (2019). Towards a

robust and accurate screening tool for dyslexia with

data augmentation using gans. In 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE), pages 775–782.

Bailo, O., Ham, D., and Shin, Y. (2019). Red blood cell

image generation for data augmentation using condi-

tional generative adversarial networks. In 2019 IEEE

/ CVF Conference on Computer Vision and Pattern

Recognition Workshops (CVPRW), pages 1039–1048.

Batista, G. E. A. P. A., Prati, R., and Monard, M. C. (2004).

A study of the behavior of several methods for balanc-

ing machine learning training data. SIGKDD Explor.,

6:20–29.

Bhagat, V. and Bhaumik, S. (2019). Data augmentation

using generative adversarial networks for pneumonia

classiﬁcation in chest xrays. In 2019 Fifth Interna-

tional Conference on Image Information Processing

(ICIIP), pages 574–579, Shimla, India, 2019.

Boeuf, P., Drummer, H., Richards, J., Scoullar, M., and

Beeson, J. (2016). The global threat of zika virus

to pregnancy: Epidemiology, clinical perspectives,

mechanisms, and impact. BMC Medicine, 14:112.

Brasil, Ministério da Saúde, Secretaria de Vigilância em Saúde, and Departamento de Vigilância das Doenças Transmissíveis (2015). Protocolo de vigilância e resposta à ocorrência de microcefalia e/ou alterações do sistema nervoso central (SNC): emergência de saúde pública de importância internacional.

Carvalho, I. F., Alencar, P. N. B., Carvalho de Andrade,

M. D., Silva, P. G. d. B., Carvalho, E. D. F., Araújo, L. S., Cavalcante, M. P. M., and Sousa, F. B. (2019). Clinical and x-ray oral evaluation in patients with congenital Zika Virus. Journal of Applied Oral Science: revista FOB, 27:e20180276.

Chawla, N. (2005). Data Mining for Imbalanced Datasets:

An Overview, volume 5, pages 853–867. Springer.

Chawla, N., Japkowicz, N., and Kolcz, A. (2003). Work-

shop learning from imbalanced data sets ii. In

Proceedings of international conference on machine

learning.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,

W. P. (2002). Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M.,

Blau, H. M., and Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639):115–118.

Fahimi, F., Dosen, S., Ang, K. K., Mrachacz-Kersting,

N., and Guan, C. (2020). Generative adversarial

networks-based data augmentation for brain-computer


interface. IEEE Transactions on Neural Networks and

Learning Systems, pages 1–13.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B.,

Warde-Farley, D., Ozair, S., Courville, A., and Ben-

gio, Y. (2014). Generative adversarial nets. In

Proceedings of the 27th International Conference on

Neural Information Processing Systems - Volume 2,

NIPS’14, page 2672–2680, Cambridge, MA, USA.

MIT Press.

Green, M. A. (2018). Use of machine learning approaches

to compare the contribution of different types of data

for predicting an individual’s risk of ill health: an ob-

servational study. The Lancet, 392:p.S40.

Han, C., Murao, K., Noguchi, T., Kawata, Y., Uchiyama,

F., Rundo, L., Nakayama, H., and Satoh, S. (2019).

Learning more with less: Conditional pggan-based

data augmentation for brain metastases detection us-

ing highly-rough annotation on mr images. In Pro-

ceedings of the 28th ACM International Conference

on Information and Knowledge Management, CIKM

’19, page 119–127, New York, NY, USA. Association

for Computing Machinery.

Haradal, S., Hayashi, H., and Uchida, S. (2018). Biosig-

nal data augmentation based on generative adversarial

networks. In 2018 40th Annual International Confer-

ence of the IEEE Engineering in Medicine and Biol-

ogy Society (EMBC), pages 368–371.

Herland, M., Khoshgoftaar, T., and Wald, R. (2014). A re-

view of data mining using big data in health informat-

ics. Journal Of Big Data, 1:2.

Hu, H., Tan, T., and Qian, Y. (2018). Generative adversar-

ial networks based data augmentation for noise robust

speech recognition. In 2018 IEEE International Con-

ference on Acoustics, Speech and Signal Processing

(ICASSP).

Hussain, Z., Gimenez, F., Yi, D., and Rubin, D. (2018).

Differential data augmentation techniques for medical

imaging classiﬁcation tasks. AMIA ... Annual Sympo-

sium proceedings. AMIA Symposium, 2017:979–984.

Ioffe, S. and Szegedy, C. (2015). Batch normalization:

Accelerating deep network training by reducing in-

ternal covariate shift. In Bach, F. and Blei, D., ed-

itors, Proceedings of the 32nd International Confer-

ence on Machine Learning, volume 37 of Proceedings

of Machine Learning Research, pages 448–456, Lille,

France. PMLR.

Japkowicz, N. (2000a). The class imbalance problem: Sig-

niﬁcance and strategies. In Proc. of the Int’l Conf. on

Artiﬁcial Intelligence, volume 56. Citeseer.

Japkowicz, N. (2000b). The class imbalance problem: Sig-

niﬁcance and strategies. Proceedings of the 2000 In-

ternational Conference on Artiﬁcial Intelligence ICAI.

Karadag, O. O. and Erdaş Cicek, O. (2019). Experimental assessment of the performance of data augmentation with generative adversarial networks in the image

classiﬁcation problem. In 2019 Innovations in Intel-

ligent Systems and Applications Conference (ASYU),

pages 1–4.

Lima, G. P., Rozenbaum, D., Pimentel, C., Frota, A. C. C.,

Vivacqua, D., Machado, E. S., das Neves Sztajnbok,

F. C., Abreu, T., Soares, R. A., and Hofer, C. B.

(2019). Factors associated with the development of

congenital zika syndrome: a case-control study. In

BMC Infectious Diseases.

Mariani, G., Scheidegger, F., Istrate, R., Bekas, C., and

Malossi, A. C. I. (2018). Bagan: Data augmentation

with balancing gan. arXiv, pages 1–9.

Mehta, K., Kobti, Z., Pfaff, K., and Fox, S. (2019). Data

augmentation using ca evolved gans. 2019 IEEE Sym-

posium on Computers and Communications (ISCC),

pages 1087–1092.

Milovic, B. and Milovic, M. (2012). Prediction and decision

making in health care using data mining. International

Journal of Public Health Science (IJPHS), 1.

Mullick, S. S., Datta, S., and Das, S. (2019). Generative ad-

versarial minority oversampling. In 2019 IEEE/CVF

International Conference on Computer Vision (ICCV),

pages 1695–1704.

Pan, Z., Yu, W., Yi, X., Khan, A., Yuan, F., and Zheng,

Y. (2019). Recent progress on generative adversarial

networks (gans): A survey. IEEE Access, 7:36322–

36333.

Ribeiro, I. G., Andrade, M. R. d., Silva, J. d. M. S., Silva,

Z. M., Costa, M. A. d. O., Vieira, M. A. d. C. e. S.,

Batista, F. M. d. A., Guimarães, H., Wada, M. Y., and Saad, E. (2018). Microcefalia no Piauí, Brasil: estudo descritivo durante a epidemia do vírus Zika, 2015-2016. Epidemiologia e Serviços de Saúde, 27.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V.,

Radford, A., Chen, X., and Chen, X. (2016). Im-

proved techniques for training gans. In Lee, D.,

Sugiyama, M., Luxburg, U., Guyon, I., and Garnett,

R., editors, Advances in Neural Information Process-

ing Systems, volume 29. Curran Associates, Inc.

Sheng, P., Yang, Z., and Qian, Y. (2019). Gans for children:

A generative data augmentation strategy for children

speech recognition. In 2019 IEEE Automatic Speech

Recognition and Understanding Workshop (ASRU),

pages 129–135.

Tomar, D. (2013). A survey on data mining approaches for

healthcare. International Journal of Bio-Science and Bio-Technology, 5:241–266.

Wang, S., Lv, Y.-D., Sui, Y., Liu, S., Wang, S.-J., and

Zhang, Y.-D. (2017). Alcoholism detection by data

augmentation and convolutional neural network with

stochastic pooling. Journal of Medical Systems, 42.

Weng, S. F., Reps, J., Kai, J., Garibaldi, J. M., and Qureshi, N. (2017). Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE, 12(4):e0174944.

Yu, X., Wu, X., Luo, C., and Ren, P. (2017). Deep learning

in remote sensing scene classiﬁcation: a data augmen-

tation enhanced convolutional neural network frame-

work. GIScience & Remote Sensing, 54:1–18.

Zanluca, C., Noronha, L., and Santos, C. (2017). Maternal-

fetal transmission of the zika virus: An intriguing in-

terplay. Tissue Barriers, 6:00–00.

HEALTHINF 2022 - 15th International Conference on Health Informatics
