Impact of Duplicating Small Training Data on GANs 
Yuki Eizuka¹, Kazuo Hara¹ and Ikumi Suzuki²
¹ Yamagata University, 1-4-12 Kojirakawa-machi, Yamagata City, 990-8560, Japan
² Nagasaki University, 1-14 Bunkyo, Nagasaki City, 852-8521, Japan
Keywords:  Generative Adversarial Networks, Small Training Data, Emoticons. 
Abstract:  Emoticons such as (^_^) are face-shaped symbol sequences used to express emotions in text. However,
the number of available emoticons is small. To increase it, we generated emoticons using
SeqGANs, which are generative adversarial networks for generating sequences. Because so few emoticons
exist, only a small amount of training data is available for SeqGANs, and because SeqGANs underfit small
training data, generating emoticons with them is difficult. To address this
problem, we duplicate the training data. We observed that emoticons can be generated when the duplication
magnification is set to an appropriate value. As a trade-off, however, we also observed that SeqGANs overfit
the training data, i.e., they produce emoticons that are exactly the same as those in the training data.
1  INTRODUCTION 
In recent years, multi-layer neural networks (MNNs)
have contributed to the considerable development of
artificial intelligence. In particular, remarkable
progress has been made in the technology used for
generating data such as images, texts, and music
using MNNs. The most representative of these are
generative adversarial networks (GANs)
(Goodfellow et al., 2014). GANs build a model of the
data at hand that describes the mechanism by which
the data are generated. New images, texts, music, and
other data can then be created using the model.
In this study, we use GANs to create emoticons
(precisely, kaomoji). An emoticon is a sequence of
symbols that makes up the shape of a face. It is often
used to express emotions such as laughter, sadness,
and anger in blogs or social networking site (SNS)
texts. However, the number of emoticons currently
available is small. For example, the SHIMEJI
dictionary¹ contains only 95 emoticons representing
laughter, as shown in Table 1. Nowadays, SNSs are widely
used, and the demand for emotional expressions is
increasing. The objective of this paper is to increase
the number of emoticons by automatically generating
emoticons using GANs.
¹ https://simeji.me/blog/顔文字-一覧/kaomoji
GANs are used to build a generator reproducing
the characteristics of the data at hand, which is called
the original data or training data. To achieve this,
GANs employ a discriminator as a guide. GANs are
algorithms  that  alternately  update  a  generator  and 
discriminator by repeating the following three steps. 
First, in step 1, data are generated using the current
generator; these are called fake data (see Figure 1(a)).
Next, in step 2, we build a discriminator that
distinguishes the fake data from the original data
(Figure 1(b)). The discriminator is used to evaluate
the performance of the generator. If the discriminator
cannot tell the fake data from the original data,
the generator has been built successfully; that
is, data with the characteristics of the original data
are successfully produced. However, if the fake data
and the original data can be discriminated, we
proceed to step 3, where the generator is rebuilt.
Ideally, by repeating the three steps, we obtain a
generator that can produce data indistinguishable
from or very similar to the original data (Figure 1(c)).
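The three steps above can be sketched on a toy problem. The following is only an illustrative sketch, not the SeqGAN setup used in this paper: it assumes one-dimensional numeric data, a generator that merely shifts Gaussian noise by a learnable offset mu, and a logistic discriminator D(x) = sigmoid(w*x + b). All function names, the stopping threshold of 0.55, and the learning rates are hypothetical choices made for this illustration.

```python
import math
import random


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_discriminator(real, fake, steps=100, lr=0.2):
    # Step 2: fit D(x) = sigmoid(w*x + b) by gradient ascent on the
    # log-likelihood, labelling original data 1 and fake data 0.
    labelled = [(x, 1.0) for x in real] + [(x, 0.0) for x in fake]
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in labelled:
            err = y - sigmoid(w * x + b)
            gw += err * x
            gb += err
        w += lr * gw / len(labelled)
        b += lr * gb / len(labelled)
    return w, b


def accuracy(w, b, real, fake):
    # Fraction of samples the discriminator labels correctly.
    hits = sum(sigmoid(w * x + b) > 0.5 for x in real)
    hits += sum(sigmoid(w * x + b) <= 0.5 for x in fake)
    return hits / (len(real) + len(fake))


def train_toy_gan(real, mu=-2.0, rounds=100, lr_g=0.5, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        # Step 1: generate fake data with the current generator.
        fake = [mu + rng.gauss(0.0, 1.0) for _ in real]
        # Step 2: build a discriminator for fake vs. original data.
        w, b = fit_discriminator(real, fake)
        # Stop once the discriminator is close to chance level
        # (0.55 is an assumed threshold, not taken from the paper).
        if accuracy(w, b, real, fake) <= 0.55:
            break
        # Step 3: rebuild the generator by moving mu so that the
        # discriminator scores the fake data higher, i.e. gradient
        # ascent on mean log D(fake); d/dmu log D(x) = (1 - D(x)) * w.
        grad = sum((1.0 - sigmoid(w * x + b)) * w for x in fake) / len(fake)
        mu += lr_g * grad
    return mu


# Toy "original data": 64 samples centred at 2.0.
data_rng = random.Random(1)
real_data = [data_rng.gauss(2.0, 1.0) for _ in range(64)]
final_mu = train_toy_gan(real_data)
print("generator offset after training:", final_mu)
```

In this sketch the generator starts at an offset of -2.0, far from the original data, and each round moves it in the direction the discriminator indicates until the two data sets are hard to tell apart.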
In GANs, the main role of the discriminator is to
guide the rebuilding of the generator. That is, the
discriminator guides the construction of a new
generator so that the generator produces fake
data that are difficult for the discriminator to
distinguish from the original data.
More specifically, the discriminator divides the
data space into two regions:
• a positive region, where the original data are likely
to be distributed, and
• a negative region, where the original data are
unlikely to be distributed.
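As a minimal illustration of this division, the sketch below labels points of a one-dimensional data space as belonging to the positive or negative region. It assumes a hypothetical logistic discriminator D(x) = sigmoid(w*x + b) with arbitrarily chosen parameters w = 2.0 and b = -3.0, and a cut-off of 0.5 on the discriminator output; none of these values come from the paper.

```python
import math

# Hypothetical discriminator parameters, chosen only for illustration.
W, B = 2.0, -3.0


def discriminator(x):
    # Estimated probability that x belongs to the original data.
    return 1.0 / (1.0 + math.exp(-(W * x + B)))


def region(x, threshold=0.5):
    # Positive region: original data judged likely; negative otherwise.
    return "positive" if discriminator(x) > threshold else "negative"


for point in (1.0, 2.0):
    print(point, region(point))
```

With these assumed parameters the boundary between the two regions lies at x = 1.5, where w*x + b = 0, so the point 1.0 falls in the negative region and the point 2.0 in the positive region.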