DEff-GAN: Diverse Attribute Transfer for Few-Shot Image Synthesis
Rajiv Kumar (https://orcid.org/0000-0003-4174-8587) and G. Sivakumar (https://orcid.org/0000-0003-2890-6421)
Department of CSE, IIT Bombay, Mumbai, India
Keywords:
One-shot Learning, Few-shot Learning, Generative Modelling, Adversarial Learning, Data Efficient GAN.
Abstract:
The requirement for large amounts of data is a major difficulty in training many GANs. Data-efficient GANs involve
fitting a generator's continuous target distribution with a limited discrete set of data samples, which is a
difficult task. Single-image methods have focused on modelling the internal distribution of a single image
and generating its samples. While single-image methods can synthesize image samples with diversity, they
do not model multiple images or capture the inherent relationships between them. Given only
a handful of images, we are interested in generating samples while exploiting the commonalities in the
input images. In this work, we extend the single-image GAN approach to model multiple images for sample
synthesis. We modify the discriminator with an auxiliary classifier branch, which helps to generate a wide
variety of samples and to classify the input labels. Our Data-Efficient GAN (DEff-GAN) generates excellent
results when similarities and correspondences can be drawn between the input images/classes.
1 INTRODUCTION
Most modern deep learning based methods depend on large datasets and need long training times
(Donahue and Simonyan, 2019), (Karras et al., 2020) to achieve high performance and state-of-the-art
results. This trend continues and is observed even in some few-shot learning tasks (Liu et al.,
2019). However, there are use cases and scenarios where obtaining even a handful of images is
difficult for reasons of privacy, security and ethics. Though Generative Adversarial Networks (GANs)
are able to generate realistic images of high quality (Donahue and Simonyan, 2019), (Kumar et al., 2021),
(Karras et al., 2020), this is only possible with the availability of large and diverse training
datasets (Tundia et al., 2021) that prevent memorization problems. In most cases, the amount of data
needed for training or adapting a GAN is in the order of hundreds, if not thousands, of images,
leaving little purpose in generating more of the same data. In the few-shot realm, training GANs
directly on small datasets leads to severe quality degradation, memorization issues, or both. It
therefore becomes essential to prevent mode collapse and overfitting in order to generate samples with
diversity.
Recently, there has been interest in single image
generative models that synthesize image samples of various scales and sizes. Single-image GAN
models (Shocher et al., 2019), (Shaham et al., 2019), (Hinz et al., 2020), (Sushko et al., 2021), etc.,
overcome the overfitting and mode collapse issues by learning from the internal distribution of patches
from a single image. However, the synthesized image samples make little to no sense when they lack
coherence, and efforts to improve diversity in a few-shot setting lead to artifacts, poor realism, and
incoherence in images. With only a single image modelled by a GAN, applications like image
super-resolution, harmonization, etc., are possible by using patches from the input image itself.
Modelling multiple images can result in generalization as well as learning the underlying semantic
relations between the images. This creates the potential for learning relations between patches
from multiple images, which opens up possibilities such as style transfer, content transfer, image
compositing and image blending. In a few-shot scenario, novel sample synthesis is possible by
transferring visual attributes like color, tone, texture or style from one image to another and by
combining features from different inputs. For unsupervised image synthesis, the visual attributes can
come from different images without any guidance on how the features should be combined.
In this paper, we illustrate that single-image
GANs can be adapted for multi-class image synthe-
sis in a few-shot setting for similar classes. Let us consider the case of two face images, where
correspondences can be drawn between common facial features like the eyes, nose, lips, hair, etc. These
correspondences give rise to similarities and relations at the local patch level, which can be
leveraged for novel sample synthesis. To this end, we propose changes to an existing single-image GAN
(Hinz et al., 2020) to adapt it for multi-class few-shot image synthesis. Previous methods like SinGAN
and ConSinGAN focused only on generating samples of a single image. We propose changes to generate
samples with attributes from multiple images through an auxiliary classifier branch in the
discriminator, which outputs class probabilities in addition to the real/generated label. The
discriminator objective then includes a classifier loss that minimizes the cross-entropy between
the predicted labels of generated images and the class labels. We also modify the training procedure
to model multiple images and to speed up training, whereas single-image GAN methods model a single
image at a time. As a result, for images with similar semantics and underlying content, our method
synthesizes novel samples in a few-shot setting. In the case of face images and textures, our method
achieves diverse sample synthesis, generating hundreds of variations while retaining the semantics,
from a single image each of two different faces. The paper contributions are as follows:
We introduce DEff-GAN, a pretraining-free few-shot image synthesis method that adapts single-image
GAN methods to multiple images for diverse novel sample synthesis.
We briefly cover related works in Section 2, the methodology in Section 3, implementation details in
Section 4, experiments and evaluation in Section 5, results and analysis in Section 6, and the
conclusion and future scope in Section 7.
2 RELATED WORKS
There are various approaches to few-shot generation, from direct training on few-shot image datasets
to few-shot test-time generalization. In the former case, a generative model is trained directly on a
small dataset with a handful of images, without adapting a pre-trained model or training on a large
number of base categories. In the latter case, generative models are trained on a set of base
categories with long training schedules and later applied to novel categories with optimization
(Clouâtre and Demers, 2019), (Liang et al., 2020) or finetuning. In some cases, no optimization is
involved, as with fusion-based methods (Hong et al., 2020a), (Hong et al., 2020b), (Gu et al.) or
transformation-based methods (Hong et al., 2022), (Ding et al., 2022). One way to transfer knowledge is
to take pre-trained models from related domains and adapt them using only a few input images. However,
the resulting network can still be large and can easily overfit the data, since the number of samples
is very small.
Recent works (Shocher et al., 2017) perform various tasks (Ruiz et al., 2020), (Tritrong et al., 2021)
using very few data samples (Yang et al., 2019) and even a single image (Shaham et al., 2019), (Shocher
et al., 2019), (Hinz et al., 2020). We briefly explain the similarities, differences and drawbacks of
single-image GAN methods. InGAN (Shocher et al., 2019) focuses on the completeness and coherence of the
generated images with an encoder-decoder architecture that generates sample images of various shapes,
sizes and aspect ratios. SinGAN (Shaham et al., 2019) is a single-image GAN framework for image
harmonization, image editing, super-resolution tasks, etc. ConSinGAN (Hinz et al., 2020) goes one step
further by improving the training speed of SinGAN and reducing the number of stages required to
generate an image of the required resolution. InGAN and rcGAN (Arantes et al., 2020) learn the
distribution of image patches of multiple images in the same model and fill in patches from the
training image for image manipulations and downstream tasks. SA-SinGAN (Chen et al., 2021) uses a
self-attention mechanism in a single-image model to improve image quality by capturing the global
structure, and also improves the training time. While the above methods generate appealing results,
most single-image methods have not been adapted or shown to work with multiple images/classes.
In the setting of learning from a single video, One-shot GAN (Sushko et al., 2021) uses a two-branch
discriminator that assesses the internal content separately from the scene layout with dedicated
content and layout branches. In a few-shot setting, one method (Liu et al., 2021) works with dataset
sizes up to 100 images but fails for fewer images (around 10) in terms of sample diversity, as
generated samples become limited to reconstructions of the input images. Another method
(Ojha et al., 2021) can adapt a pre-trained GAN with as few as 10 images by learning cross-domain
correspondences. However, it is difficult to find a GAN pre-trained on a related domain, and only the
style parameters of the pre-trained GAN are altered, which prevents capturing the underlying semantics
of the target domain. For pretraining-free few-shot image
synthesis, one method (Kong et al., 2021) proposes
a mixup-based distance regularization on the feature
space of both the generator and discriminator to en-
hance both fidelity and diversity.
3 METHODOLOGY
3.1 Problem Formulation
For the few-shot image synthesis task, we consider two images, $x_1$ and $x_2$, belonging to the same
class as the base case. The goal is to learn a generative model that can generate samples of large
diversity with visual attributes from the two input images. Similarly, for the multi-class image
synthesis problem, we consider a set of $k$ images, $\{x_1, x_2, \dots, x_k\}$, belonging to related
classes. Given this set of $k$ images, where $k$ is usually small ($k < 5$), our goal is to learn a
model that can generate samples of the $k$ related classes using a single image of each class.
3.2 Proposed Framework
For modelling a small number of images with a generative model, training a lightweight model is
preferable to adapting a pretrained model that was trained on a large dataset. Single-image sample
synthesis is generally based on progressively growing architectures with multi-stage, multi-resolution
training. This gives greater control over the image generation process and its quality compared to
end-to-end training of the whole network, which may otherwise overfit to the input images. Receptive
fields at varying scales are captured by a cascade of patch-GANs with a progressively growing field of
view, each capturing the patch distribution at its scale while scaling up through image sizes. An
unconditional generative model is learned as a growing generator by adding new layers, keeping the
previous stages frozen or training them at small learning rates. To this end, we detail the design of
our framework for one-shot multi-class image synthesis and few-shot image synthesis.
3.3 Design
In principle, we could adapt the architecture of SinGAN (Shaham et al., 2019) or that of ConSinGAN
(Hinz et al., 2020). We adapt the ConSinGAN architecture for our method due to its faster training
speed and concurrent training of multiple stages. Hence, our method shares commonalities in design,
architecture and implementation with ConSinGAN (Hinz et al., 2020). Also, our method passes features
between the generator stages, rather than image outputs from the previous stage generators. We employ
a pyramid of fully convolutional patch-GANs, which consists of generator stages
$\{G_N, G_{N-1}, \dots, G_0\}$ and discriminators $\{D_N, D_{N-1}, \dots, D_0\}$. We associate each
generator stage $G_i$ with a discriminator $D_i$, for $i \in \{N, N-1, \dots, 0\}$ (refer to Figure 1).
Generator stage $G_0$ corresponds to the image at the coarsest scale, while generator stage $G_N$
corresponds to the generator dealing with the finest details.
Let us consider the training of the generator at stage $i$, for $i < N$. During training stage $i$,
the generator stage $G_i$ and discriminator $D_i$ are trained. Generator training at any stage $i$
requires only fixed noise maps and noise samples fed to the unconditional generator at the coarsest
scale, with features from the lower stages propagated to the higher stages. Once the generator at
scale $i$ is fully trained, training proceeds to generator stage $i + 1$, and so on. Different sets of
images are involved in training at any scale $i$, namely the real images and the generated images at
scale $i$. The growing generator learns through adversarial training, generating images and minimizing
the reconstruction loss between generated and real images.
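To make the feature-passing design concrete, the sketch below shows one way such a growing generator
could be organised in PyTorch. It is a minimal illustration under our assumptions: the module names
(`GenStage`, `GrowingGenerator`), channel counts and residual connections are placeholders rather than
the exact DEff-GAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenStage(nn.Module):
    """One generator stage: a small convolutional block that refines feature maps.
    Layer count and width are illustrative; the paper uses 3-6 conv layers per stage."""
    def __init__(self, channels=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.PReLU(),
        )

    def forward(self, feats):
        return feats + self.block(feats)  # residual refinement of the features


class GrowingGenerator(nn.Module):
    """Maps a 3-channel noise map to an image; features (not images) are passed
    between stages and upsampled to the spatial size of the next scale."""
    def __init__(self, channels=64):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.stages = nn.ModuleList()
        self.tail = nn.Conv2d(channels, 3, 3, padding=1)

    def add_stage(self, channels=64):
        self.stages.append(GenStage(channels))  # grow the generator by one stage

    def forward(self, z, scale_sizes):
        # z: noise map at the coarsest scale; scale_sizes: (H, W) for each stage.
        feats = self.head(z)
        for stage, size in zip(self.stages, scale_sizes):
            feats = F.interpolate(feats, size=size, mode='bilinear', align_corners=False)
            feats = stage(feats)
        return torch.tanh(self.tail(feats))
```

In this reading, a new stage would be appended at the start of each training stage, with earlier stages
frozen or fine-tuned at reduced learning rates as described in Section 4.1.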
3.4 Objective Function
For adversarial training, we consider a set of real and fake images, which correspond to the dataset
images and the generated images, respectively. In our method, the discriminator is modified to have an
auxiliary classifier branch that classifies the inputs in addition to the discriminator's real/fake
output used for adversarial learning. However, our generator differs from that of AC-GAN: it depends
only on the noise samples and is independent of the class labels, while the generator used in AC-GAN
takes the class label along with the noise samples when generating. We do not condition the generator
on an input image or class label, hence our generator is unconditional, while our discriminator has an
auxiliary classifier branch. The growing generator $G$ is a lightweight, iteratively optimized network
that learns to map randomly sampled noise $z \in Z$ to the output space of images, $G : Z \rightarrow X$.
$l(\cdot)$ is a distance metric in the image space, which can be either the $l_1$ or $l_2$ norm. We
consider a Mean Squared Error (MSE) pixel reconstruction loss enforced between the real images and the
reconstructed images for samples generated using fixed noise maps, as given
in Equation 1. The generator's objective is to fool the discriminator into identifying the generated
images as real and to reduce the reconstruction loss. The generator objective thus involves an
adversarial loss and a reconstruction loss, as given in Equation 2. Importantly, we do not enforce an
adversarial or supportive classifier loss as part of the generator's objective.
$$\mathcal{L}_{rec}(G_n) = \lVert G_n(z) - x_n \rVert_2^2 . \quad (1)$$

$$\min_{G_n} \max_{D_n} \; \mathcal{L}_{adv}(G_n, D_n) + \alpha \mathcal{L}_{rec}(G_n). \quad (2)$$
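As a minimal, hedged illustration of Equations 1 and 2, the function below computes a generator update
term that combines a WGAN-style adversarial loss with the MSE reconstruction loss weighted by α. The
assumption that the discriminator returns a (score, class-logits) pair and the argument names are ours,
not taken from the released code.

```python
import torch
import torch.nn.functional as F

def generator_loss(discriminator, generated, reconstructed, real, alpha=10.0):
    """Sketch of the generator objective (Eq. 2): an adversarial term asking the
    discriminator to score generated images as real, plus the pixel reconstruction
    loss of Eq. 1 computed on the fixed-noise reconstructions."""
    adv_score, _class_logits = discriminator(generated)  # class logits are ignored by G
    loss_adv = -adv_score.mean()                          # WGAN generator loss: raise the critic score
    loss_rec = F.mse_loss(reconstructed, real)            # Eq. 1: ||G_n(z) - x_n||^2
    return loss_adv + alpha * loss_rec
```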
The discriminator objective function consists of two parts: the log-likelihood of the correct source,
$\mathcal{L}_S$, given in Equation 3, and the log-likelihood of the correct class, $\mathcal{L}_C$,
given in Equation 4. The discriminator gives a probability distribution over sources (real/generated),
$P(S \mid X)$, and a probability distribution over the class labels, $P(C \mid X) = D(X)$.
$$\mathcal{L}_S = E[\log P(S = real \mid X_{real})] + E[\log P(S = fake \mid X_{fake})]. \quad (3)$$

$$\mathcal{L}_C = E[\log P(C = c \mid X_{real})] + E[\log P(C = c \mid X_{fake})]. \quad (4)$$
The discriminator is trained to maximize $\mathcal{L}_S$ and $\mathcal{L}_C$, while $G$ is trained to
maximize $\mathcal{L}_S$. For the real input images, a cross-entropy loss is enforced between the class
labels of the randomly ordered training batch and the classifier outputs of the discriminator.
For the fake images, it is desirable to have attributes from multiple inputs for attribute transfer,
hence we assign class labels in a random fashion, causing the generated images to take on attributes
from other classes. The discriminator is provided with input images labelled as real and generated
images labelled as fake. A gradient penalty is computed between the real and fake images and the
gradients are backpropagated using the WGAN-GP (Gulrajani et al., 2017) adversarial loss. To prevent
mode collapse and to capture the complete set of real images, each training batch comprises the whole
set of input images; batch sizes smaller than the full set of real images may leave some modes
uncaptured. Also, the input images are fixed before the critic operations and shuffled in a random
order at each critic iteration. To be memory-efficient while handling multiple images, the fixed noise
maps are generated only for the coarsest scale, whereas previous methods maintained a pyramid of fixed
maps per image, one for each scale. Consequently, we abstain from adding noise after the upsampling
step at each stage, after observing that it had little to no effect in our implementation.
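The discriminator update described above can be sketched as follows. This is a hedged illustration,
assuming the discriminator returns a (real/fake score, class logits) pair; the helper names, the
gradient-penalty weight (the standard WGAN-GP value of 10) and the random label assignment for fakes
are our own choices, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def gradient_penalty(disc, real, fake):
    """Standard WGAN-GP penalty on interpolates between real and fake batches."""
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    inter = (eps * real + (1 - eps) * fake).requires_grad_(True)
    score, _ = disc(inter)
    grads = torch.autograd.grad(outputs=score.sum(), inputs=inter, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def discriminator_loss(disc, real, fake, real_labels, num_classes, gp_weight=10.0):
    """Sketch of the discriminator objective: a WGAN critic loss with gradient
    penalty (the source term L_S) plus the auxiliary classification loss (L_C).
    Real images use their true class labels; fake images are assigned random
    labels to encourage attribute mixing across classes."""
    real_score, real_logits = disc(real)
    fake_score, fake_logits = disc(fake.detach())
    loss_adv = fake_score.mean() - real_score.mean()          # critic loss to minimize
    loss_gp = gradient_penalty(disc, real, fake.detach())
    rand_labels = torch.randint(0, num_classes, (fake.size(0),), device=fake.device)
    loss_cls = F.cross_entropy(real_logits, real_labels) + \
               F.cross_entropy(fake_logits, rand_labels)
    return loss_adv + gp_weight * loss_gp + loss_cls
```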
Figure 1: Layout diagram illustrating the different stages and the relation between the real and
generated images. The right section illustrates the relation between the various images, the generator
and discriminator networks.
Figure 2: Multi-class image synthesis on the cat and dog classes. The leftmost two columns are the real
images, while the rest are generated images.
4 IMPLEMENTATION
The implementation of our method involves a growing generator and a pyramid of discriminators. The
generator and discriminator start with the same number of convolutional layers. As training proceeds,
the generator is progressively grown by concatenating the latest stage, which captures the patch
distribution at that scale, to the previously trained stages. We suggest concurrent training of at
least two stages, with learning rates exponentially decayed along the stages so as to fine-tune the
network weights of the previous stages. We use a pyramid of scaled versions of each real image, one per
training stage. In each iteration, we randomly select one of the k images and its associated pyramid,
and train on that image. A fixed noise map is a random noise map that is assigned at the beginning of
training, kept fixed for each training image at the coarsest scale, and used for reconstructing the
input images. For WGAN-GP, the number of critic iterations per generator iteration is usually fixed
between 3 and 5. Differentiable Augmentation
fixed between 3 and 5. Differentiable Augmentation
(Zhao et al., 2020) is an augmentation technique that
improves the data efficiency of GANs for both un-
conditional and class-conditional generation, by im-
posing various types of differentiable augmentations
on both real and fake samples. We observe that dif-
ferentiable augmentation with color helps to improve
the quality of the generated images, while cutout and
translation have detrimental effects in some cases.
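For reference, the snippet below sketches a differentiable colour policy in the spirit of DiffAugment
(Zhao et al., 2020); it is our own minimal re-implementation of the brightness, saturation and contrast
transforms, not the authors' code, and the jitter ranges follow commonly used defaults.

```python
import torch

def diff_color_augment(x):
    """Minimal differentiable colour jitter in the spirit of DiffAugment
    (Zhao et al., 2020): random brightness, saturation and contrast shifts
    that keep gradients flowing back to the generator."""
    b = x.size(0)
    # brightness: additive shift in [-0.5, 0.5)
    x = x + (torch.rand(b, 1, 1, 1, device=x.device) - 0.5)
    # saturation: scale deviation from the per-pixel channel mean by [0, 2)
    mean_c = x.mean(dim=1, keepdim=True)
    x = (x - mean_c) * (torch.rand(b, 1, 1, 1, device=x.device) * 2) + mean_c
    # contrast: scale deviation from the per-image mean by [0.5, 1.5)
    mean_i = x.mean(dim=[1, 2, 3], keepdim=True)
    x = (x - mean_i) * (torch.rand(b, 1, 1, 1, device=x.device) + 0.5) + mean_i
    return x
```

The same augmentation is applied to both real and generated batches before they are passed to the
discriminator, so gradients still reach the generator.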
Figure 3: Few-shot image synthesis of two face images. The leftmost column has the inputs and the rest
of the images are generated images (256 x 256).
Figure 4: Generated samples (128x128) from ConSinGAN
on modelling two inputs for different sets of faces. Samples
are affected by mode collapse and are incoherent.
4.1 Architecture
We train our method with the following hyperparameters. When multiple generator stages are trained
concurrently, we use a learning rate scaling of 0.5 between any stage and its previous stages. The
number of training stages varies between 6 and 8 for training images of up to 256 x 256 pixels. The
number of input channels is 3, and the number of filters in the convolution layers is 64 or 128 in most
cases; using a larger number of filters comes at the expense of more GPU memory consumption. We use
PReLU as the activation function and set α, the weight of the reconstruction loss, to 10. We use the
Adam optimizer with betas of 0.5 and 0.999. During training, we explicitly set the learning rate of the
discriminator to 0.00025, half that of the generator at 0.0005. We use a multi-step learning rate
scheduler with a gamma value of 0.1 and a milestone at 0.8 times the number of images times the number
of iterations per image. The penultimate and last stages can be trained for extended iterations to
further improve the quality of the generated images. The number of convolutional layers in each stage
can be varied from 3 to 6 depending on the number of images being modelled.
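As a concrete reading of these settings, the helper below builds the optimizers and schedulers
accordingly. It is a sketch under our assumptions: the function name, the `concurrent` argument and the
way the per-stage learning-rate scaling is applied are illustrative rather than the exact released
implementation.

```python
import torch

def build_optimizers(gen_stages, discriminator, num_images, iters_per_image,
                     g_lr=5e-4, d_lr=2.5e-4, lr_scale=0.5, concurrent=2):
    """Sketch of the optimizer setup in Section 4.1: only the newest `concurrent`
    generator stages are updated, earlier ones at learning rates decayed by
    `lr_scale`; the discriminator uses half the generator learning rate, and a
    multi-step scheduler drops both by 10x at 80% of the stage's iterations."""
    param_groups = []
    active = list(gen_stages)[-concurrent:]            # newest stages only
    for depth, stage in enumerate(reversed(active)):   # depth 0 = newest stage
        param_groups.append({"params": stage.parameters(),
                             "lr": g_lr * (lr_scale ** depth)})
    opt_g = torch.optim.Adam(param_groups, lr=g_lr, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=d_lr, betas=(0.5, 0.999))
    milestone = int(0.8 * num_images * iters_per_image)
    sched_g = torch.optim.lr_scheduler.MultiStepLR(opt_g, milestones=[milestone], gamma=0.1)
    sched_d = torch.optim.lr_scheduler.MultiStepLR(opt_d, milestones=[milestone], gamma=0.1)
    return opt_g, opt_d, sched_g, sched_d
```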
5 EXPERIMENTS AND
EVALUATION
Most related few-shot generative methods require pre-training on a large number of base classes and
long training schedules, since they focus on few-shot test-time generalization. We have not considered
these methods as baselines, since they benefit from prior knowledge of previously seen data and have an
unfair advantage in a true few-shot setting. We choose the mixup-based distance learning method
(Kong et al., 2021) as the baseline for comparing our method on few-shot image generation.
5.1 Datasets
We consider images from multiple datasets to illustrate the flexibility of our method. The
inputs/classes used in our experiments include human faces, cat and dog faces, etc. The face images are
sourced from the CelebA (Liu et al., 2015) and anime datasets. We source the flower images from the
Oxford 102-flowers dataset for the few-shot image synthesis task. We also consider a subset of 10
images from the 100-shot-Obama dataset for few-shot image synthesis.
5.2 Evaluation Metric
The image synthesis quality of GAN-generated samples is usually evaluated by the Inception Score (IS),
the Fréchet Inception Distance (FID) and the Learned Perceptual Image Patch Similarity (LPIPS). We
assess the quality of the generated images using LPIPS and SIFID (Shaham et al., 2019). While SIFID
compares one input image against the set of generated images, the FID metric is designed to compute the
distance between comparable numbers of real and generated images. However, the number of input images
is limited in few-shot multi-class image synthesis, while the generated samples are diverse and large
in number. A recent work (Sushko et al., 2021) points out that SIFID tends to penalize diversity and
favours overfitting, and hence may not be the best metric for evaluating diverse images. The diversity
of the generated samples can be measured by the LPIPS metric (Dosovitskiy and Brox, 2016).
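For reference, the diversity measure reported in our tables can be computed roughly as below, assuming
the `lpips` Python package; the function name and the consecutive-pair scheme follow our description
and are a sketch rather than the exact evaluation script.

```python
import torch
import lpips  # pip install lpips; assumes the standard LPIPS package

def pairwise_lpips_diversity(images):
    """Average LPIPS distance between consecutive pairs of generated images.
    `images` is a tensor of shape (N, 3, H, W) scaled to [-1, 1]."""
    loss_fn = lpips.LPIPS(net='alex')
    with torch.no_grad():
        dists = [loss_fn(images[i:i + 1], images[i + 1:i + 2]).item()
                 for i in range(images.size(0) - 1)]
    return sum(dists) / len(dists)
```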
5.3 Experiments
Unlike single-image GANs, the images generated by our method in the one-shot multi-class image
synthesis and few-shot image synthesis tasks have features and attributes from multiple input
classes/images. Since each generated sample has attributes from multiple inputs/classes, we compute the
SIFID metric on the complete set of generated images against each input image. For all FID and SIFID
computations, 100 images were generated against two, three, five or ten input images for our method.
For our baseline (Mixdl), checkpoints at 10K intervals were used to generate 5000 images, from which
1000 images were considered for computing the above metrics.
Table 1: LPIPS metric computed over all generated images between consecutive image pairs for five
inputs of flower images, and two inputs each of male and female faces, for the few-shot image synthesis
task, compared between Mixdl* (columns 2-5) and our method (columns 6-11).
LPIPS 30K* 40K* 50K* 60K* 3000(6) 3500(6) 4000(6) 4500(6) 5000(6) 5500(6)
Flowers-5 0.48 0.39 0.53 0.59 0.59 0.58 0.59
Male faces 0.41 0.38 0.36 0.28 0.28 0.26 0.27
Female faces 0.27 0.32 0.32 0.27 0.27 0.27 0.27
Table 2: LPIPS values computed between each input im-
age and generated images for five inputs compared between
Mixdl* (columns 2-4) and our method (columns 5-8).
LPIPS 30K* 40K* 50K* 60K* 3000(6) 3500(6) 4000(6) 4500(6)
Flower-1 0.64 0.62 0.64 0.64 0.65 0.64 0.65
Flower-2 0.52 0.44 0.63 0.63 0.62 0.62 0.63
Flower-3 0.66 0.65 0.67 0.64 0.64 0.63 0.64
Flower-4 0.63 0.60 0.64 0.69 0.68 0.69 0.69
Flower-5 0.68 0.66 0.51 0.63 0.63 0.63 0.63
Male-1 0.42 0.44 0.41 0.30 0.27 0.28 0.28
Male-2 0.47 0.43 0.44 0.30 0.29 0.31 0.28
Female-1 0.33 0.36 0.37 0.26 0.26 0.27 0.27
Female-2 0.46 0.45 0.42 0.29 0.28 0.28 0.28
For the different experiments, images of size 128 or 256 were generated, while all experiments that
compare our method with Mixdl (Kong et al., 2021) use an image size of 256 x 256.
We conducted image synthesis experiments for
one-shot multi-class image synthesis on cat and dog
faces for two inputs (refer Fig.2), human faces for two
inputs of male (refer Fig.8) and female faces (refer
Fig.3), three inputs of female faces (refer Fig.9) and
five inputs of flower images (refer Fig.7). The gen-
erated images for the few-shot image synthesis for a
selected set of ten images from the 100-shot-Obama
dataset are shown in Figure 6. The LPIPS values for these are given in Table 1 and Table 2, where the
topmost row denotes the iteration number at which the images were generated using Mixdl, while for our
method the number of stages is mentioned alongside the training iterations, which vary between 2k and
6k per stage. We also report the FID values of our method compared to the baseline
(Kong et al., 2021) (Mixdl) in Table 3 and the SIFID values of our method in Table 4. The SIFID values
of Mixdl are very large in comparison to our method and have been omitted from the table. The results
of the generation of cat and dog images are given in
Figure 2. The results of one-shot image synthesis on
non-facial texture images of polka dot are given in
Figure 5.
6 RESULTS AND ANALYSIS
Initially, we considered ConSinGAN as a baseline for modelling multiple images, with a single input
randomly selected in each iteration from the small set of images and kept fixed throughout the critic
operations, but the generated images were affected by mode collapse. The mode-collapsed images obtained
when modelling two face images, for three image
Figure 5: Few-shot image synthesis on polka dot texture
class. The top row leftmost two images are the inputs and
the rest are generated images (256x256).
Table 3: FID values between input images and the gener-
ated images for few-shot image synthesis task using Mixdl*
(columns 2-4) and our method (columns 5-10).
FID 30K* 40K* 50K* 3000 3500 4000 4500 5000 5500
Flowers 200.94 219.63 244.73 236.52 238.56 244.33
Male faces 238.03 205.94 201.05 217.78 202.74 205.14 189.16
Female faces 167.90 140.29 130.09 128.76 121.78 119.97 128.11
Table 4: SIFID values between each input image and gen-
erated images for two inputs (rows 2-3, 4-5) and three in-
puts (rows 6-8) for few-shot image synthesis task using our
method.
SIFID 3000 3500 4000 4500 5000 5500
Male-1 0.196 0.142 0.177 0.173
Male-2 0.174 0.197 0.199 0.169
2-Female-1 0.149 0.143 0.148 0.156
2-Female-2 0.226 0.205 0.199 0.194
3-Female-1 0.593 0.611 0.619 0.644
3-Female-2 0.569 0.419 0.456 0.445
3-Female-3 0.810 0.850 0.826 0.826
pairs can be seen in Figure 4. We do not evaluate the mode-collapsed images, as all generated images
are the same. We conjecture that the mode collapse could be due to smaller batch sizes that do not
cover all input images, together with the image remaining fixed throughout the critic operations.
6.1 Quantitative Results
Table 3 compares the FID scores computed between the input images and the generated images. We can
observe that our method has lower FID scores, implying better quality than the baseline, for two inputs
of human faces, both male and female. Table 2 compares the LPIPS scores computed between each input
image and the set of generated images for our method and Mixdl. We can observe that, for the 2-input
case, our method achieves better LPIPS scores than the baseline. For five inputs of flower images, our
method falls behind the baseline on both LPIPS and FID. We conjecture that this could be due to fewer
common correspondences and only rough alignment of the input images. It is easier to align
correspondences with fewer, similar images but difficult when the classes are numerous and different,
leading to less coherent samples. To summarize, training Mixdl is inefficient in a few-shot setting due
to its large number of network
Figure 6: Few-shot synthesis on ten selected Obama face images. The top two rows are the input images,
while the rest are generated images (128x128).
parameters and checkpoint size. Table 4 compares the SIFID values of our method for various input
images. SIFID scores below 1 indicate that the generated images are similar to and share features with
the input images. Since we compute the SIFID score of each input against all generated images, small
SIFID scores imply that the generated samples have features from multiple input images, which implies
the transfer of visual attributes. The baseline method had very high SIFID scores, which could be due
to poor transfer of attributes/features from multiple inputs.
6.2 Observations and Analysis
The selection of input images is crucial for data-efficient few-shot GANs, since the inputs shape the
decision boundary of the discriminator. Abrupt changes in the generator's output arise from
discontinuities in the latent space and are a possible reason for the degradation of few-shot GANs. The
assumption behind novel image synthesis is that a generated image should have a global layout similar
to that of the original images, with possible attribute transfer from the other input images.
From the results of few-shot face synthesis, we can observe that our method is able to faithfully
generate a diverse set of images. Our method also extends to non-facial classes such as texture images
or the flower class. Furthermore, our method extends to related classes as one-shot multi-class image
synthesis,
Figure 7: Few-shot image synthesis on five flower images. The first row shows the input images, while
the rest are generated images (256x256).
Figure 8: One-shot face synthesis on two male face images.
The leftmost two images are the input images and the rest
of the images are generated images (128x128).
as in the case of dog and cat images. We can observe that the generated images with the largest
diversity were those whose inputs were similar in texture, shape and color, as seen for the polka-dot
texture images. We can infer that neither FID nor SIFID is a good evaluation metric for one-shot
multi-class image synthesis. FID computation requires far more input images and gives large values when
only a few input images are available while the generated images are large in number. On the other
hand, SIFID is designed for a single image and does not take multiple input images and feature transfer
into consideration. As a limitation in comparison to other GAN methods, the generated samples of our
method also tend to have non-smooth interpolations to other samples. One can always find sets of input
images that are inherently difficult for our method to model, leading to reduced semantic coherence.
Figure 9: Few-shot face synthesis using three face images.
The top row leftmost three images are the inputs and the rest
are generated images (256x256).
7 CONCLUSION AND FUTURE
SCOPE
In this work, we extended the capabilities of single-image models to accommodate multiple images. This
is possible with simple assumptions of similarity in the underlying content, together with a modified
discriminator architecture and objective function. When we consider two face images that are roughly
aligned but differ in other aspects like texture, color and light intensity, our method learns a
distribution over the patches that arise from the natural composition of the input images. The idea
extends to multiple images, assuming that the images are roughly aligned and share similar underlying
content layouts. Our method generates a diverse set of hundreds of data samples by training on just two
input images. Future work can focus on improving control over style at the global and local levels.
REFERENCES
Arantes, R. B., Vogiatzis, G., and Faria, D. R. (2020). rcgan:
Learning a generative model for arbitrary size image
generation. In Bebis, G., Yin, Z., Kim, E., Bender,
J., Subr, K., Kwon, B. C., Zhao, J., Kalkofen, D., and
Baciu, G., editors, Advances in Visual Computing.
Chen, X., Zhao, H., Yang, D., Li, Y., Kang, Q., and Lu,
H. (2021). Sa-singan: self-attention for single-image
generation adversarial networks. Machine Vision and
Applications, 32(4):104.
Clouâtre, L. and Demers, M. (2019). FIGR: Few-shot image generation with reptile. CoRR, abs/1901.02199.
Ding, G., Han, X., Wang, S., Wu, S., Jin, X., Tu, D., and
Huang, Q. (2022). Attribute group editing for reliable
few-shot image generation. 2022 IEEE CVPR.
Donahue, J. and Simonyan, K. (2019). Large scale adver-
sarial representation learning. CoRR, abs/1907.02544.
Dosovitskiy, A. and Brox, T. (2016). Generating images
with perceptual similarity metrics based on deep net-
works. CoRR, abs/1602.02644.
Gu, Z., Li, W., Huo, J., Wang, L., and Gao, Y. Lofgan:
Fusing local representations for few-shot image gen-
eration. In Proceedings of the IEEE/CVF ICCV.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and
Courville, A. C. (2017). Improved training of wasser-
stein gans. CoRR, abs/1704.00028.
Hinz, T., Fisher, M., Wang, O., and Wermter, S. (2020).
Improved techniques for training single-image gans.
CoRR, abs/2003.11512.
Hong, Y., Niu, L., Zhang, J., and Zhang, L. (2020a). Match-
inggan: Matching-based few-shot image generation.
CoRR, abs/2003.03497.
Hong, Y., Niu, L., Zhang, J., and Zhang, L. (2022). Delta-
gan: Towards diverse few-shot image generation with
sample-specific delta. In ECCV.
Hong, Y., Niu, L., Zhang, J., Zhao, W., Fu, C., and Zhang,
L. (2020b). F2GAN: fusing-and-filling GAN for few-
shot image generation. CoRR, abs/2008.01999.
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J.,
and Aila, T. (2020). Analyzing and improving the im-
age quality of StyleGAN. In Proc. CVPR.
Kong, C., Kim, J., Han, D., and Kwak, N. (2021). Smooth-
ing the generative latent space with mixup-based dis-
tance learning. CoRR, abs/2111.11672.
Kumar, R., Dabral, R., and Sivakumar, G. (2021).
Learning unsupervised cross-domain image-to-image
translation using a shared discriminator. CoRR,
abs/2102.04699.
Liang, W., Liu, Z., and Liu, C. (2020). DAWSON: A do-
main adaptive few shot generation framework. CoRR,
abs/2001.00576.
Liu, B., Zhu, Y., Song, K., and Elgammal, A. (2021).
Towards faster and stabilized GAN training for
high-fidelity few-shot image synthesis. CoRR,
abs/2101.04775.
Liu, M.-Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., and Kautz, J. (2019). Few-shot
unsupervised image-to-image translation. In arXiv.
Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learn-
ing face attributes in the wild. In In ICCV.
Ojha, U., Li, Y., Lu, J., Efros, A. A., Lee, Y. J., Shecht-
man, E., and Zhang, R. (2021). Few-shot image
generation via cross-domain correspondence. CoRR,
abs/2104.06820.
Ruiz, N., Theobald, B., Ranjan, A., Abdelaziz, A. H., and
Apostoloff, N. (2020). Morphgan: One-shot face syn-
thesis GAN for detecting recognition bias. CoRR,
abs/2012.05225.
Shaham, T. R., Dekel, T., and Michaeli, T. (2019). Singan:
Learning a generative model from a single natural im-
age. CoRR, abs/1905.01164.
Shocher, A., Bagon, S., Isola, P., and Irani, M. (2019). In-
gan: Capturing and retargeting the ”dna” of a natural
image. In The IEEE ICCV.
Shocher, A., Cohen, N., and Irani, M. (2017). ”zero-shot”
super-resolution using deep internal learning.
Sushko, V., Gall, J., and Khoreva, A. (2021). One-shot
GAN: learning to generate samples from single im-
ages and videos. CoRR, abs/2103.13389.
Tritrong, N., Rewatbowornwong, P., and Suwajanakorn, S.
(2021). Repurposing gans for one-shot semantic part
segmentation. In IEEE CVPR.
Tundia, C., Kumar, R., Damani, O. P., and Sivakumar,
G. (2021). The MIS check-dam dataset for object
detection and instance segmentation tasks. CoRR,
abs/2111.15613.
Yang, W., Zhang, X., Tian, Y., Wang, W., Xue, J.-H., and
Liao, Q. (2019). Deep learning for single image super-
resolution: A brief review. IEEE Transactions on Mul-
timedia, 21(12):3106–3121.
Zhao, S., Liu, Z., Lin, J., Zhu, J., and Han, S. (2020). Dif-
ferentiable augmentation for data-efficient GAN train-
ing. CoRR, abs/2006.10738.