Diversifying Image Synthesis using Data Classification
Yuta Suzuki, Fumihiko Sakaue and Jun Sato
Nagoya Institute of Technology, Japan
Keywords:
Diverse Image, Data Classification, GAN.
Abstract:
In this paper, we propose a method for generating highly diverse images in GAN-based image generation. In recent years, GANs that generate diverse images, such as MSGAN and BicycleGAN, have been proposed. These methods can generate a variety of images to some extent, but compared with the variety of the training images, their outputs are still less diverse. That is, generating truly varied images remains a difficult problem, even when a wide variety of training images is used for training. Thus, in this paper, we propose a new GAN structure that enables us to generate more diverse images than the existing methods. Our method estimates the distribution of the training images in advance and learns to imitate their diversity. The effectiveness of the proposed method is shown by comparative experiments with the existing methods.
1 INTRODUCTION
In recent years, research on GAN (Goodfellow et al.,
2014), which generates highly realistic images by
deep learning, has been progressing. While general
GAN can generate a realistic image from random
noise, conditional GAN (cGAN) (Mirza and Osin-
dero, 2014; Isola et al., 2017) can generate images
according to the given labels.
In many conditional GAN studies, the emphasis was on producing realistic images from the given labels, and the diversity of the generated images was given little attention. However, in many image generation tasks, there are many valid output images for a single label. For example, in the "edges-to-shoes" task, we humans can imagine various kinds of shoes, such as red shoes and blue shoes, from a single edge image, as shown in Fig. 1. However, when we use pix2pix (Isola et al., 2017) for this task, it is not possible to generate images with such a variety of colors and shapes.
On the other hand, in recent years, some new types
of GANs, such as BicycleGAN (Zhu et al., 2017) and
MSGAN (Mao et al., 2019), succeeded in diversifying
the generated images by incorporating losses which
evaluate the diversity of generated images. However,
even in these GANs, the diversity of generated images
is still not high compared to the diversity of training
images.
Therefore, in this paper, we propose a method
for generating a variety of images equal to or greater
than the variety of training images.
Figure 1: Generation of diverse images.
In the existing methods, the network training was conducted without knowing the distribution of training images. On
the other hand, in our method, we estimate the dis-
tribution of training images in advance, and learn to
imitate the diversity of training images. For estimat-
ing the distribution of training images, we perform k-
means clustering of the training images by using fea-
ture vectors extracted from a pre-trained classification
network. In this way, our GAN can generate a wide
variety of images that are closer to the ground truth
distribution of training images.
We perform a comparative experiment between
the proposed method and the existing methods using
quantitative evaluation, and show that the proposed
method can generate a wider variety of images.
2 RELATED WORK
GAN (Goodfellow et al., 2014) is a learning model
in which the Generator that generates images and the
Discriminator that distinguishes between generated data and training data perform adversarial training to enhance each other's performance, as shown in Fig. 2.
Figure 2: Structure of general GANs.
The Generator tries to generate data similar to the training data and learns to generate images that fool the Discriminator. On the other hand, the Discriminator tries to distinguish the authenticity of the input data; that is, the Discriminator classifies the input data as True if it is training data and as False if it was generated by the Generator. By training the Generator and the Discriminator adversarially, the Generator gradually becomes able to generate realistic images, until finally it can generate data that cannot be distinguished by the Discriminator. GANs have been actively researched in recent years since they can generate highly realistic images that are indistinguishable from real images.
While general GAN can generate a realistic image
from random noise, conditional GAN (cGAN) (Mirza
and Osindero, 2014; Isola et al., 2017) can generate
images according to the given labels. However, in these GANs, the realism of the images was often emphasized, and the diversity of the generated images received little attention.
On the other hand, MSGAN (Mao et al., 2019) tackled the diversification of the generated images.
As shown in Fig. 3, the training of MSGAN is performed so that the difference $d_I$ between the outputs $G(y, z_1)$ and $G(y, z_2)$, generated from two random noises $z_1$ and $z_2$, becomes large even if the difference $d_z(z_1, z_2)$ between $z_1$ and $z_2$ is small, by adding a new regularization term called the MS loss. The MS loss $L_{MS}$ is the ratio between the distance $d_z$ of the input noises and the distance $d_I$ of the generated images as follows:

$$L_{MS} = \max_G \frac{d_I(G(y, z_1), G(y, z_2))}{d_z(z_1, z_2)} \quad (1)$$

The total loss $L_{MSGAN}$ of MSGAN is defined by adding the MS loss $L_{MS}$ to the loss $L_{ori}$ of the original GAN as follows:

$$L_{MSGAN} = L_{ori} + \lambda_{MS} L_{MS} \quad (2)$$

where $\lambda_{MS}$ represents the weight of the MS loss.
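As a reference, a minimal PyTorch-style sketch of this regularization is shown below. The use of L1 distances for $d_I$ and $d_z$, the epsilon term, and negating the ratio for minimization are illustrative assumptions rather than the exact formulation used by MSGAN.

```python
import torch

def ms_loss(G, y, z1, z2, eps=1e-5):
    # Mode-seeking term of Eq. (1): ratio of image distance d_I to noise
    # distance d_z. The ratio is negated so that minimizing this loss with
    # gradient descent maximizes the ratio, pushing nearby noise vectors
    # toward dissimilar generated images.
    img1, img2 = G(y, z1), G(y, z2)
    d_i = torch.mean(torch.abs(img1 - img2))   # d_I: mean absolute difference of outputs
    d_z = torch.mean(torch.abs(z1 - z2))       # d_z: mean absolute difference of noises
    return -d_i / (d_z + eps)
```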
Figure 3: Training of MSGAN.
Figure 4: Feature extraction.
Although MSGAN succeeded in improving the di-
versity, it is still not sufficiently high compared to the
diversity of training images.
3 PROPOSED METHOD
In this paper, we propose a method for generating im-
ages with diversity close to that of the training im-
ages. The proposed method estimates the distribution
of the training images in advance, and learns so that
the distribution of the generated images is close to the
distribution of the training images.
3.1 Estimation of Training Image
Distribution
In our method, the distribution of training images
is estimated first by using k-means clustering before
training GAN.
For this objective, a pre-trained classification net-
work is used as an encoder to extract the feature vec-
tor of the image. In this research, the feature vec-
tor is extracted by using Efficient-Net-b0 (Tan and
Le, 2019), which had been pre-trained with Ima-
geNet (Deng et al., 2009). We first extract a 7 × 7 × 1280 feature map by using Efficient-Net-b0, and then average the obtained features over the channel dimension, yielding a 7 × 7 = 49 dimensional feature vector as shown in Fig. 4.
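A minimal sketch of this feature extraction is shown below, assuming a timm-style EfficientNet-b0 backbone whose forward_features() returns a 1280 × 7 × 7 feature map for a 224 × 224 input; the paper does not specify which implementation is used, so this is only one possible realization.

```python
import torch
import timm

# Pre-trained EfficientNet-b0 used as a frozen feature extractor
# (assumption: the timm implementation; the paper only states ImageNet pre-training).
encoder = timm.create_model('efficientnet_b0', pretrained=True).eval()

def extract_feature(image):
    # image: (3, 224, 224) tensor, normalized as expected by the backbone
    with torch.no_grad():
        fmap = encoder.forward_features(image.unsqueeze(0))  # (1, 1280, 7, 7)
    fvec = fmap.mean(dim=1)              # average over the 1280 channels -> (1, 7, 7)
    return fvec.flatten(1).squeeze(0)    # 49-dimensional feature vector
```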
Then, k-means clustering of the training images is
performed by using the extracted feature vectors, and
the training images are classified into k clusters.
Figure 5: Network structure of the proposed method. G, D, and E represent Generator, Discriminator, and Encoder respectively. G and D are similar to those of the conventional GAN, and the Encoder is a pre-trained feature extractor. In the proposed method, a part for extracting the features of the generated image is added. We compute the loss by comparing the mean $\mu$ and the covariance matrix $\Sigma$ of the extracted features with those $\{\mu_i, \Sigma_i\}$ of cluster $i$.
The distribution of the training images is estimated by computing the mean vector $\mu_i$ $(i = 1, \dots, k)$ and the covariance matrix $\Sigma_i$ $(i = 1, \dots, k)$ of the feature vectors in each cluster.
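Estimating these per-cluster statistics could look like the sketch below, where features is an (N, 49) array of the feature vectors described above; the function name and the use of scikit-learn are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_distribution(features, k=10):
    # Cluster the training-image feature vectors into k clusters and
    # estimate the mean vector and covariance matrix of each cluster.
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(features)
    means, covs = [], []
    for i in range(k):
        cluster_feats = features[labels == i]
        means.append(cluster_feats.mean(axis=0))
        covs.append(np.cov(cluster_feats, rowvar=False))
    return labels, np.stack(means), np.stack(covs)
```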
In the following sections, we describe the GAN
network structure and learning method using the esti-
mated distribution of training images.
3.2 Network Structure
In this research, we input a cluster number to the Generator, and the Generator generates images with different characteristics according to the input cluster number. This enables us to generate a variety of images from a single Generator.
For this objective, we add the cluster number i to the input of the conventional GAN as shown in Fig. 5. The cluster number is also input to the Discriminator, and the Discriminator determines not only whether the image is real, but also whether the generated image matches the given cluster number.
E in Fig. 5 is a pre-trained feature extractor. The
training of the network is performed so that the fea-
ture vector obtained through this feature extractor ap-
proaches the feature vector of the image of cluster i.
The proposed network structure can be applied to
various tasks, such as image2image and text2image.
In these cases, not only the cluster number and noise but also the condition image or condition text is added to the inputs of the Generator and Discriminator.
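One possible way to feed the cluster number into the Generator is to concatenate a learned cluster embedding with the noise (and, for conditional tasks, with the condition features); the layer sizes and output resolution below are illustrative assumptions, not the architecture used in the paper. The Discriminator can be conditioned in the same way by concatenating the same embedding with its input.

```python
import torch
import torch.nn as nn

class ClusterConditionedGenerator(nn.Module):
    # Illustrative generator: noise z and cluster id i are mapped through a
    # cluster embedding and a small fully connected head to a 64x64 RGB image.
    def __init__(self, z_dim=64, n_clusters=10, embed_dim=32, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_clusters, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(z_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 * 64 * 64), nn.Tanh(),
        )

    def forward(self, z, cluster_id):
        h = torch.cat([z, self.embed(cluster_id)], dim=1)
        return self.net(h).view(-1, 3, 64, 64)
```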
3.3 Training
At the training stage, the cluster number to which the
training image belongs is input to the Generator, and
the Generator is trained so as to generate images with
the characteristics of the cluster.
The Discriminator judges whether the Generator is able to generate an image that matches the input cluster number. However, the clusters obtained here are not as easily distinguishable by humans as, for example, a dog cluster and a car cluster would be. Therefore, in order to determine whether the generated image includes the characteristics of the cluster, the following two losses are introduced for training the network.
$$L_d = \left| \mu_i - E(G(z, i)) \right| \quad (3)$$

$$L_m = \sqrt{\left( E(G(z, i)) - \mu_i \right)^\top \Sigma_i^{-1} \left( E(G(z, i)) - \mu_i \right)} \quad (4)$$
where $\mu_i$ is the mean feature vector of cluster $i$, and $\Sigma_i$ is the covariance matrix of the feature vectors of cluster $i$. $L_d$ is the L1 distance between the feature of the generated image and the center of cluster $i$, and $L_m$ is the Mahalanobis distance between the feature of the generated image and cluster $i$. By using the Mahalanobis distance $L_m$, the variance is taken into consideration, and learning is performed so as to generate images containing more cluster-specific features. The L1 distance $L_d$ is used for stabilizing the training.
In this research, the above two losses are used together with the MS loss and the original GAN loss, so the total loss is as follows:

$$L_{new} = L_{ori} + \lambda_{MS} L_{MS} + \lambda_d L_d + \lambda_m L_m \quad (5)$$
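A PyTorch sketch of the two losses and the combined objective of Eq. (5) might look as follows; mu and cov_inv are stacked per-cluster means and precomputed inverse covariance matrices from Section 3.1, E is the frozen feature extractor, and all names are assumptions for illustration.

```python
import torch

def cluster_losses(E, fake_images, cluster_ids, mu, cov_inv):
    # L_d: L1 distance between the generated images' features and the cluster mean.
    # L_m: Mahalanobis distance using the cluster's inverse covariance matrix.
    feats = E(fake_images)                       # (B, 49) features of generated images
    diff = feats - mu[cluster_ids]               # (B, 49)
    l_d = diff.abs().sum(dim=1).mean()
    maha_sq = torch.einsum('bi,bij,bj->b', diff, cov_inv[cluster_ids], diff)
    l_m = torch.sqrt(maha_sq.clamp(min=1e-8)).mean()
    return l_d, l_m

# Combined objective of Eq. (5) with the weights reported in Section 4:
# loss = loss_ori + 1.0 * loss_ms + 5.0 * l_d + 0.2 * l_m
```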
Figure 6: Results of clustering the training images in the proposed method. The figure shows 7 example images classified into
each of the 10 clusters. In the proposed method, the training was performed using the results of the clustering.
By using the MS loss together with $L_m$ and $L_d$, it is expected that diversity will be produced in the same cluster, where similar images are gathered, and that more diverse images can be generated.
Depending on the number of images and the number of clusters used for training, the dimension of the feature vector may be larger than the number of images n belonging to the smallest cluster. In such a case, the covariance matrix cannot be computed properly. Therefore, in this research, the variance of each feature over the training images is computed, and the top n − 1 features with the largest variance are used to compute the Mahalanobis distance. This is because a feature with a larger variance over the training images can be considered a more important feature that changes greatly depending on the image.
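The variance-based feature selection described above might be implemented as in the sketch below, where features is the (N, 49) matrix of training features and labels are the k-means cluster assignments; the function name is hypothetical.

```python
import numpy as np

def select_high_variance_dims(features, labels, k):
    # n is the number of images in the smallest cluster; keep only the
    # top (n - 1) highest-variance feature dimensions so that each cluster's
    # covariance matrix can be estimated (and inverted) properly.
    n = min(int(np.sum(labels == i)) for i in range(k))
    variances = features.var(axis=0)
    keep = np.argsort(variances)[::-1][:n - 1]
    return np.sort(keep)
```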
3.4 Image Generation
The trained Generator can be used for generating images. In the proposed method, a cluster number must be input to generate an image. At training time, the cluster number of the training image is input, while at test time the cluster number is chosen randomly according to the cluster size ratio in the training set. For example, if cluster 1 contains 12% of the training images, cluster number 1 is selected and input to the Generator with a probability of 12%. This makes it possible to generate images whose distribution is close to that of the training dataset.
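At test time, sampling the cluster number in proportion to cluster size can be done as in the sketch below; labels are the k-means assignments of the training images.

```python
import numpy as np

def sample_cluster(labels, k, rng=np.random.default_rng()):
    # Draw a cluster id with probability equal to its share of the training set,
    # e.g. a cluster holding 12% of the training images is chosen 12% of the time.
    sizes = np.array([np.sum(labels == i) for i in range(k)], dtype=float)
    return int(rng.choice(k, p=sizes / sizes.sum()))
```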
4 EXPERIMENTS
We next show the experimental results obtained with the proposed method. In our experiments, we generated images using the pix2pix facades dataset (Yu and Grauman, 2014; Zhu et al., 2016). First, training was performed using the training data, and then image generation and quantitative evaluation were performed using the test data.
The proposed method was trained with the number of clusters $k = 10$ and the weights $\lambda_{MS} = 1.0$, $\lambda_d = 5.0$, and $\lambda_m = 0.2$. With this setting, training was performed for 400 epochs using 400 training images. Since the training becomes unstable when the Mahalanobis distance is used from the beginning, it is introduced from epoch 100, when the training has progressed to some extent.
We next show the results of comparative evalua-
tion between the proposed method and the existing
methods.
Figure 7: Generated images in the proposed method and existing methods. The proposed method can generate more diverse images than the existing methods. For the proposed method, the cluster number is also shown.
Figure 8: Generated images in the proposed method and existing methods. The proposed method can generate more diverse images than the existing methods. For the proposed method, the cluster number is also shown.
Fig. 6 shows the result of clustering the
training images in the proposed method. This figure
shows 7 example images classified for each cluster.
In the proposed method, the training was performed
using the results of this clustering.
Fig. 7 and Fig. 8 show the results of image gener-
ation by the proposed method and the existing meth-
ods. For the proposed method, the entered cluster
number is also shown in the figure. From these re-
sults, we find that the proposed method can gener-
ate more diverse images than the existing methods for
both input images. In particular, MSGAN did not generate many dark-colored buildings, whereas the proposed method was able to generate dark orange and yellow building images, indicating that a wider variety of images could be generated.
4.1 Quantitative Evaluation
Finally, the results of quantitative evaluation using
FID (Heusel et al., 2017), NDB, JSD (Richardson
and Weiss, 2018), and LPIPS (Zhang et al., 2018) are
shown. We first explain each evaluation metric.
FID
FID measures how close the distribution of features of the generated images is to that of the training images, where the features are obtained by passing each image through Inception Net (Szegedy et al., 2015). We use FID to evaluate the realism of the generated images.
NDB and JSD
NDB and JSD are indexes to measure whether the
distribution of training images and the distribution of
generated images are similar by using bin-based met-
rics. The training images are first clustered into bins
by k-means clustering, and then the generated images
are assigned to bins of nearest clusters. Then, the sim-
ilarity of the cluster distribution of the training images
and that of the generated images is measured. It can
Table 1: Quantitative evaluation.

Metric   pix2pix   MSGAN    Ours
FID      98.23     88.84    92.27
NDB      11        11       9
JSD      0.0812    0.0559   0.0300
LPIPS    0.0621    0.3752   0.4444
be said that the lower these two indicators are, the closer the diversity of the generated images is to that of the real data.
LPIPS
LPIPS measures the distance between two images in a deep feature space. In this research, we evaluated diversity by measuring the average LPIPS over pairs of generated images. The higher the LPIPS value, the more successful the generation of diverse images.
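For reference, the mean pairwise LPIPS over a set of generated samples can be computed with the lpips package roughly as follows (assuming the images are tensors scaled to [-1, 1], as that package expects):

```python
import itertools
import torch
import lpips

def mean_pairwise_lpips(images, net='alex'):
    # images: list of (1, 3, H, W) tensors in [-1, 1]
    metric = lpips.LPIPS(net=net)
    with torch.no_grad():
        dists = [metric(a, b).item() for a, b in itertools.combinations(images, 2)]
    return sum(dists) / len(dists)
```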
Results
Table 1 shows the results of the quantitative evaluation of the existing methods and the proposed method. From this table, we find that the proposed method achieves the best score in all metrics except FID, and that it can generate diverse images close to the ground-truth distribution. Regarding FID, although the score of the proposed method is slightly worse than that of MSGAN, the degradation is small, and we find that diversification was achieved while maintaining the quality of the generated images.
5 CONCLUSION
In this research, we proposed a method for generat-
ing more diverse images in GAN. In particular, we
proposed a method that estimates the distribution of
training images in advance and uses it for learning to
generate diverse images. We demonstrated its effec-
tiveness by conducting comparative experiments with
the existing methods. The results show that the proposed method can effectively generate more diverse images.
REFERENCES
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Li, F.-
F. (2009). Imagenet: A large-scale hierarchical image
database. In CVPR, pages 248–255. IEEE Computer
Society.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In
Advances in neural information processing systems,
pages 2672–2680.
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and
Hochreiter, S. (2017). Gans trained by a two time-
scale update rule converge to a local nash equilibrium.
NIPS.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017).
Image-to-image translation with conditional adversar-
ial networks. CVPR.
Mao, Q., Lee, H.-Y., Tseng, H.-Y., Ma, S., and Yang, M.-H.
(2019). Mode seeking generative adversarial networks
for diverse image synthesis. In IEEE Conference on
Computer Vision and Pattern Recognition.
Mirza, M. and Osindero, S. (2014). Conditional generative
adversarial nets. arXiv preprint arXiv:1411.1784.
Richardson, E. and Weiss, Y. (2018). On gans and gmms. In
Advances in Neural Information Processing Systems.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9.
Tan, M. and Le, Q. (2019). EfficientNet: Rethinking model
scaling for convolutional neural networks. In Chaud-
huri, K. and Salakhutdinov, R., editors, Proceedings of
the 36th International Conference on Machine Learn-
ing, volume 97 of Proceedings of Machine Learning
Research, pages 6105–6114. PMLR.
Yu, A. and Grauman, K. (2014). Fine-grained visual com-
parisons with local learning. In Computer Vision and
Pattern Recognition (CVPR).
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang,
O. (2018). The unreasonable effectiveness of deep
features as a perceptual metric. In CVPR.
Zhu, J.-Y., Krähenbühl, P., Shechtman, E., and Efros, A. A. (2016). Generative visual manipulation on the natural image manifold. In Proceedings of European Conference on Computer Vision (ECCV).
Zhu, J.-Y., Zhang, R., Pathak, D., Darrell, T., Efros, A. A.,
Wang, O., and Shechtman, E. (2017). Toward multi-
modal image-to-image translation. NIPS.