Generation of Human Images with Clothing using Advanced Conditional
Generative Adversarial Networks
Sheela Raju Kurupathi¹,²ᵃ, Pramod Murthy²ᵇ and Didier Stricker¹,²
¹Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
²Augmented Vision, German Research Center for Artificial Intelligence, Kaiserslautern, Germany
ᵃ https://orcid.org/0000-0003-4530-9717
ᵇ https://orcid.org/0000-0002-8016-8537
Keywords:
Conditional GANs, Human Pose, Market-1501, DeepFashion.
Abstract:
One of the main challenges of human-image generation is synthesizing a person together with accurate pose and clothing details. The task remains difficult due to challenging backgrounds and large appearance variation. Recently, various deep learning models such as Stacked Hourglass networks, Variational Auto-Encoders (VAEs), and Generative Adversarial Networks (GANs) have been applied to this problem, yet they still do not generalize well qualitatively to real-world human-image generation. Our main goal is to use the Spectral Normalization (SN) technique when training GANs to synthesize human images with faithful pose and appearance details. In this paper, we investigate how Conditional GANs with Spectral Normalization (SN) can synthesize a new image of a target person, given an image of that person and a desired (novel) target pose. The model uses 2D keypoints to represent human poses. We also use an adversarial hinge loss and present an ablation study. The proposed model variants generate promising results on both the Market-1501 and DeepFashion datasets. We support our claims by benchmarking the proposed model against recent state-of-the-art models. Finally, we show how the Spectral Normalization (SN) technique influences the process of human-image synthesis.
1 INTRODUCTION
The idea of generating realistic human images has gained great value in recent times due to its varied applications, ranging from e-commerce and fashion shopping to synthesizing training data for person detection and person identification (Chen et al., 2019). With the rapid advances in Artificial Intelligence (AI), human-image generation has become one of the most crucial tasks studied in this field. Two problems need to be addressed when generating human images: one is the representation of the human pose, and the other is the generation of appearance details such as clothing textures.
There are various ways to represent poses, such as 2D skeletons (stick figures), segmentation masks, 3D pose skeletons, and dense pose, as shown in Figure 1. For generating clothing textures, warping and clothing segmentation techniques can be used. A wide range of deep learning models such as PG² (Ma et al., 2017), Pix2pixHD (Wang et al., 2018), and Deformable GANs
Figure 1: Different pose representations: (a) 2D skeleton, (b) segmented map, (c) dense pose, (d) 3D skeleton, (e) dense pose, (f) 3D skeleton.
(Siarohin et al., 2018) have been used to generate human poses along with clothing. However,
these models still struggle to generate human images with accurate pose and clothing due to the large variations in texture, appearance, and shape.
Figure 2: The conditional pose generation task. The input
image source: DeepFashion (Liu et al., 2016b).
Data availability for training deep learning models is scarce in both the 2D and 3D domains. With the proposed models, we can generate human images in rare poses, which can serve as synthetic human datasets for further research. To generate a human image with accurate pose and appearance details, the model needs information about the human body pose. To avoid expensive pose annotations, we represent the pose as 18 2D keypoints, i.e., the 2D coordinates of each joint in the image, estimated with the Human Pose Estimator (HPE) of (Cao et al., 2017). Using these keypoints, the model learns the positions of the joints in the human body. Other pose representations can also be used depending on the application.
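To make this representation concrete, the following sketch (our own illustration, not the exact preprocessing of the HPE pipeline) rasterizes 18 (x, y) keypoints into per-joint Gaussian heatmap channels, a common way to feed pose information to a convolutional generator; the function name and the sigma value are hypothetical choices.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=6.0):
    """Render 18 (x, y) joint coordinates as Gaussian heatmap channels.

    keypoints: array of shape (18, 2); joints marked (-1, -1) are missing.
    Returns an array of shape (height, width, 18) with values in [0, 1].
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((height, width, len(keypoints)), dtype=np.float32)
    for j, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:          # joint not detected by the HPE
            continue
        maps[..., j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps

# Example: a single detected joint at pixel (x=32, y=64) in a 128x64 image.
pose = -np.ones((18, 2), dtype=np.float32)
pose[0] = [32, 64]
heatmaps = keypoints_to_heatmaps(pose, height=128, width=64)
print(heatmaps.shape)  # (128, 64, 18)
```

Such an 18-channel map can, for example, be concatenated with the RGB input image along the channel axis before entering the encoder.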
We use Conditional GANs together with SN to address the problem of generating humans with pose and clothing details. The main aim is illustrated in Figure 2: given an input image of a person, the pose of that person, and a target pose, the Generator outputs an image (target image) of the person in the target pose. However, training these networks requires high computational power and a sufficient amount of training data to achieve the desired results (Stewart, 2019). In this paper, we make no assumptions about backgrounds, objects, etc., and we do not use any explicit clothing representation such as segmentation, which forces the network to learn the different clothing textures by itself.
2 RELATED WORK
Recently, deep learning models have shown substantial improvements in neural image synthesis (Isola et al., 2017), enabling numerous applications in virtual reality and gaming. One emerging class of models that has been widely researched and studied in recent years is Generative Adversarial Networks (GANs) (Goodfellow et al., 2014). GANs aim to generate novel data with characteristics similar to those of real-world data. The main idea of our proposed models is to guide the generation process explicitly with an appropriate pose representation, such as 2D stick figures, to enable direct control over the generation.
Many deep models have been proposed for the task of human image generation. The most commonly used deep learning architectures include AlexNet (Krizhevsky et al., 2012), ResNet (He et al., 2016), etc. We use a Generative Adversarial Network architecture to address the human-image generation problem. GANs (Goodfellow et al., 2014) are a particular class of algorithms comprising two neural networks, a Generator G and a Discriminator D, which play a zero-sum game. Applications of GANs include neural image synthesis, image in-painting, super-resolution, semi-supervised learning, and more. For our problem, we adopt image-to-image translation, in which a GAN takes a human image from one domain as input and translates it into another domain without any alignment between the domains. Conditional pose generation synthesizes a new image of a person, given a reference image of the person and a target pose. Most existing models focus on detection, pose, and shape estimation of people from images. A more critical task is to transfer a pose from one person to another, or to render the same person in a different pose. The subjects can have different deformable objects in the foreground and background, which makes the model difficult to train. It is also challenging to learn the pose and the clothing details simultaneously. Isola et al. (Isola et al., 2017) proposed a conditional GAN for image-to-image translation, where a given scene representation from one domain is translated into another representation.
Recently Siarohin et al. (Siarohin et al., 2018)
proposed a person image generation method con-
ditioned on a given image of the person and the
novel (target) pose of the person to synthesize the
new image of that same person in the novel pose.
Jetchev et al. (Jetchev and Bergmann, 2017) proposed
the Conditional Analogy Generative Adversarial Net-
work (CAGAN), which learns to swap the clothing of a person and to paint realistic-looking images with a target clothing article, given pairs of humans and clothes. In contrast, we focus on human image generation along with pose and clothing. Neverova et al. (Neverova et al., 2018) adopted DensePose (Alp Güler et al., 2018) as the pose representation for human pose transfer. Ma et al. (Ma et al., 2017) proposed a more general approach to synthesize person images in any arbitrary (random) pose. Similarly, a conditioning image of the person and a target pose defined by 18 joint locations form the input to our proposed models. The generation process is divided into two stages: pose generation and texture refinement. Horiuchi et al. (Horiuchi et al., 2019) addressed the problem of human image generation by using deformable skip connections, self-attention, and spectral normalization in a GAN. In-painting modules are used in recent models to achieve a considerable level of detail in both image resolution and cloth texture. There are also 3D clothing models that automatically capture real clothing to estimate body shape and pose and to generate new body shapes. U-Net based architectures are commonly used for pose-based person-image generation tasks (Ma et al., 2017), (Lassner et al., 2017). Because the local information in the input and output images is not aligned, standard skip connections are not well-suited to handle large spatial deformations. In contrast, the proposed models use deformable skip connections to deal with this misalignment problem and to share local information from the encoder to the decoder.
3 PROPOSED MODELS
We address the problem of transferring the person’s
appearance from a given pose to the desired target
pose using a Deformable GAN (Siarohin et al., 2018)
architecture with Spectral Normalization (SN) and
adversarial hinge loss in the generator and discrimina-
tor. We also use warping in the discriminator and skip
connections in the generator. We discuss the effect of
architectural changes on the performance of human-
image generation by comparing different variants.
3.1 Approach
Inspired by neural image synthesis (Isola et al., 2017), the proposed human-image generation technique uses the Deformable GAN (DGAN) (Siarohin et al., 2018) architecture as the base model, with a similar overall structure, as shown in Figure 3. We discuss the different variants of our proposed architecture in detail. In variant-1, Spectral Normalization (SN) is integrated into both the Generator G and the Discriminator D. In variant-2, the Generator G and Discriminator D losses are modified by adding the adversarial hinge loss along with Spectral Normalization (SN). In variant-3, the Generator G with skip connections and the Discriminator D with Warping (W), together with the RMSprop optimizer, are used. In variant-4, all three variants are combined to observe their joint effect on the generation of the target image. The models need to preserve appearance details such as texture from the input image along with the pose information from the target pose. The model first extracts the pose of the person as a 2D skeleton with an HPE (Cao et al., 2017) model. All four variants are evaluated on the Market-1501 and DeepFashion datasets, and results are reported in the next section to show how the proposed model performs compared to existing state-of-the-art approaches, PG² (Ma et al., 2017) and DGAN (Siarohin et al., 2018), in conditional image synthesis.
3.2 Variants
3.2.1 Variant-1: Spectral Normalization (SN)
GANs are well known to be unstable during training and sensitive to the choice of hyperparameters. One of the challenges in training GANs is controlling the performance of the discriminator. In higher-dimensional spaces, the density-ratio estimation performed by the discriminator is often inaccurate and unstable during training; consequently, the generator network fails to learn the multi-modal structure of the target data distribution. A novel weight normalization technique known as Spectral Normalization (SN) is used to stabilize the training of the generator and discriminator of a GAN. SN has become one of the most popular recent normalization techniques (Miyato et al., 2018); it stabilizes training, avoids unwanted gradients, and prevents the parameters from exploding. To ensure that the generated image has better resolution without loss of texture details, we use SN (Miyato et al., 2018) in adversarial training. Spectral normalization (Miyato et al., 2018)
normalizes the spectral norm of the weight matrix $W$ of a convolution layer such that it satisfies the Lipschitz constraint $\sigma(W) = 1$:

$W_{SN}(W) = W / \sigma(W)$    (1)

where $\sigma(W) = u^{T} W v$. For each layer, the vectors $u$ and $v$ are randomly initialized. So, it replaces every
weight $W$ by $W/\sigma(W)$; the task is to compute the value of $\sigma(W)$ efficiently. Applying singular value decomposition naively at each step to compute $\sigma(W)$ would be computationally expensive, so the power iteration method is used to estimate the spectral norm $\sigma(W)$ of each layer. The Lipschitz constant is the only hyperparameter that needs to be tuned, and the model does not require intensive hyperparameter tuning to improve performance. The technique is computationally cheap and easy to incorporate into existing implementations. Spectrally Normalized GANs (SN-GANs) (Miyato et al., 2018) are capable of generating images of better or equal quality compared to previous training stabilization techniques. In the proposed model, for a given input image of a person and a target pose, DGAN (Siarohin et al., 2018) initially extracts the pose of the person in the form of a 2D skeleton with a Human Pose Estimator (HPE) (Cao et al., 2017). The data is then processed with a fully convolutional encoder to obtain a feature representation of the input image, the extracted pose, and the target pose. Later, using the pose information for each specific body part, an affine transformation is computed and applied to move the feature-map content corresponding to that body part. Spectral normalization is used in both the encoder and decoder of the discriminator and generator networks, as shown in Figure 3. It improves the results by reducing the degradation of the error signal during back-propagation. As the weights change slowly, only a single power iteration per learning step needs to be performed. Therefore, spectral normalization for GANs is more computationally efficient than other regularization techniques such as weight clipping (Arjovsky et al., 2017) and gradient penalty (Gulrajani et al., 2017).
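As an illustration of Eq. (1) and the power-iteration estimate, the following PyTorch sketch (our own, not the authors' implementation) performs one power-iteration step and rescales a weight tensor by the resulting estimate of $\sigma(W)$; in practice, the built-in torch.nn.utils.spectral_norm wrapper can be used instead.

```python
import torch

def spectral_normalize(weight, u, n_iters=1, eps=1e-12):
    """Estimate sigma(W) with power iteration and return W / sigma(W).

    weight: parameter tensor; conv kernels are flattened to 2D (out, in*k*k).
    u:      running estimate of the leading left singular vector, shape (out,).
    """
    w = weight.reshape(weight.size(0), -1)
    with torch.no_grad():
        for _ in range(n_iters):
            v = torch.nn.functional.normalize(w.t() @ u, dim=0, eps=eps)
            u = torch.nn.functional.normalize(w @ v, dim=0, eps=eps)
    sigma = torch.dot(u, w @ v)   # u^T W v, the spectral-norm estimate from Eq. (1)
    return weight / sigma, u

# Usage sketch: normalize a conv layer's weight before a forward pass.
conv = torch.nn.Conv2d(3, 64, kernel_size=3)
u = torch.randn(64)                      # u is randomly initialized, as in the paper
w_sn, u = spectral_normalize(conv.weight.data, u)
# The built-in wrapper torch.nn.utils.spectral_norm(conv) applies the same idea each step.
```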
3.2.2 Variant-2: Spectral Normalization with
Hinge Loss (SN + H)
A discriminator is trained with conditional adversarial losses to classify whether a given input is real or fake, using a negative log-likelihood objective. Adversarial hinge losses are used to optimize the probability that given real data is more realistic than randomly sampled generated (fake) data, and vice versa. This leads to more stable and robust training. Variant-2 makes use of this additional loss, the adversarial hinge loss. The adversarial hinge losses for the Generator G and Discriminator D are calculated as follows:
$L^{H}_{G}(G,D) = \mathbb{E}_{x \sim X_I}[\max(0, 1 - D(\hat{x}))] + \mathbb{E}_{x' \sim X_T}[\max(0, 1 + D(x'))]$    (2)

$L^{H}_{D}(G,D) = \mathbb{E}_{x' \sim X_T}[\max(0, 1 - D(x'))] + \mathbb{E}_{x \sim X_I}[\max(0, 1 + D(\hat{x}))]$    (3)

Figure 3: Simple architecture of the proposed model. SN represents Spectral Normalization after each convolutional layer. $L^{H}$: adversarial hinge loss, $L^{NN}$: nearest-neighbour loss.
These adversarial hinge loss functions are integrated into the generator and discriminator of the model, along with the nearest-neighbour loss, to minimize the overall objective function. The relativistic hinge loss (Jolicoeur-Martineau, 2018), which modifies the output of the discriminator, is not used in our proposed model. Relativistic adversarial losses (Jolicoeur-Martineau, 2018) optimize the probability that given real data is more realistic than randomly generated fake data, and vice versa. The relativistic average hinge loss used in (Horiuchi et al., 2019) leads to a min-min problem instead of the normal min-max problem of adversarial training. The proposed model is trained with spectral normalization along with the adversarial hinge loss on the Market-1501 and DeepFashion datasets.
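To illustrate equations (2) and (3), a minimal PyTorch sketch follows (our own reading of the formulas; the tensor names are hypothetical), computing the two hinge terms from discriminator scores on real target images and generated images.

```python
import torch

def hinge_d_loss(d_real, d_fake):
    """Discriminator hinge loss, Eq. (3): push real scores above +1, fake below -1."""
    return torch.relu(1.0 - d_real).mean() + torch.relu(1.0 + d_fake).mean()

def hinge_g_loss(d_real, d_fake):
    """Generator-side hinge term as written in Eq. (2) of the paper."""
    return torch.relu(1.0 - d_fake).mean() + torch.relu(1.0 + d_real).mean()

# Usage sketch with hypothetical discriminator outputs (one score per image).
d_real = torch.randn(8)   # D(x'), scores for real target images
d_fake = torch.randn(8)   # D(x_hat), scores for generated images
loss_d = hinge_d_loss(d_real, d_fake)
loss_g = hinge_g_loss(d_real, d_fake)
```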
3.2.3 Variant-3: Discriminator with Warping
(W)
In variant-3, the architecture of the Discriminator D is modified. The discriminator uses an affine transformation layer between the first two convolution layers. We performed an ablation study to analyze the effects of the different components of the proposed model, evaluating all the methods by removing parts of the
full model, similar to the base model (Siarohin et al., 2018). With this Discriminator D architecture, the qualitative and quantitative results are better than those of the benchmark model (Siarohin et al., 2018), as reported in the next section.
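To give a concrete idea of feature warping, the sketch below (a minimal illustration under our own assumptions, not the exact layer used in the paper) applies a 2D affine transformation to a feature map with PyTorch's affine_grid and grid_sample, the kind of operation used to move body-part features between the condition pose and the target pose.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, theta):
    """Apply a per-sample 2x3 affine transform to a feature map.

    feat:  (N, C, H, W) feature maps.
    theta: (N, 2, 3) affine matrices mapping output coordinates to input coordinates.
    """
    grid = F.affine_grid(theta, feat.size(), align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

# Usage sketch: shift a feature map a quarter of its width to the right.
feat = torch.randn(1, 64, 32, 16)
theta = torch.tensor([[[1.0, 0.0, -0.5],
                       [0.0, 1.0,  0.0]]])   # hypothetical translation in normalized coords
warped = warp_features(feat, theta)
print(warped.shape)  # torch.Size([1, 64, 32, 16])
```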
3.2.4 Variant-4: Spectral Normalization with
Hinge Loss and Discriminator with
Warping (SN + H + W)
In variant-4, the variants SN + H and W are combined in order to assess the effectiveness of these components for human-image generation. The use of SN with the adversarial hinge loss stabilizes training, whereas the use of the discriminator with Warping (W) in the Full model helps to generate human images with appearance details close to the target image. The model is trained with both the nearest-neighbour loss $L_{NN}$ and the adversarial hinge loss $L^{H}$. SN is applied after every convolutional and fully connected layer in both the Generator G and the Discriminator D. The layers of the Generator G and Discriminator D are the same as in variant-3, with SN and the adversarial hinge loss added. The model is trained on both the Market-1501 and DeepFashion datasets. Both the qualitative and quantitative results of the model, in comparison to the other variants and the benchmark models DGAN (Siarohin et al., 2018) and PG² (Ma et al., 2017), are described in detail in the next section.
3.3 Loss Functions
Recently, with advances in the Generative Adversarial framework, different loss functions have been used to tackle different optimization problems. The Generator G and Discriminator D are trained using the standard conditional adversarial loss $L_{cGAN}$ along with the nearest-neighbour loss $L_{NN}$, similar to the base model (Siarohin et al., 2018). We have integrated the adversarial hinge loss into variant-2. We train the model using the nearest-neighbour loss $L_{NN}$ in conjunction with the adversarial hinge loss $L^{H}$ by minimizing the final objective function given below:
$G^{*} = \arg\min_{G} \max_{D} \; L_{cGAN}(G,D) + \lambda L_{NN}(G) + L^{H}_{G}(G,D) + L^{H}_{D}(G,D)$    (4)
where we use $\lambda = 0.01$ in all our experiments. Since larger values of $\lambda$ lead to the generation of artifacts and blurry results, the value of $\lambda$ is kept small (Ma et al., 2017). A detailed description of $L_{NN}$ and the other losses can be found in (Siarohin et al., 2018).
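A minimal sketch of how the weighted objective in Eq. (4) could be assembled in a generator training step is shown below; the individual loss values and the function names are hypothetical placeholders, not the authors' code.

```python
import torch

LAMBDA_NN = 0.01  # lambda from Eq. (4), kept small as in the paper

def generator_objective(loss_cgan, loss_nn, loss_hinge_g, loss_hinge_d):
    """Combine the terms of Eq. (4) into the scalar minimized over G."""
    return loss_cgan + LAMBDA_NN * loss_nn + loss_hinge_g + loss_hinge_d

# Usage sketch with hypothetical scalar losses computed elsewhere in the training loop.
loss_cgan = torch.tensor(0.7)     # conditional adversarial loss L_cGAN(G, D)
loss_nn = torch.tensor(12.3)      # nearest-neighbour loss L_NN(G)
loss_hinge_g = torch.tensor(0.9)  # L^H_G(G, D) from Eq. (2)
loss_hinge_d = torch.tensor(1.1)  # L^H_D(G, D) from Eq. (3)
total = generator_objective(loss_cgan, loss_nn, loss_hinge_g, loss_hinge_d)
```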
4 EXPERIMENTS AND RESULTS
In this Section, we present both quantitative and
qualitative evaluation results on publicly available
Market-1501 and DeepFashion datasets. We also
compare our proposed models with other state-of-the-
art methods. We describe the datasets and the evaluation metrics used in detail, along with the implementation details of the proposed model.
4.1 Datasets
We used the most common publicly available human datasets, Market-1501 (Zheng et al., 2015) and DeepFashion (Liu et al., 2016b). Person Re-identification dataset (Market-1501): This dataset contains images of 1,501 persons captured from 6 different surveillance cameras, totaling 32,668 images. It is a very challenging dataset as it contains low-resolution images of size 128 × 64 with high diversity in illumination, poses, and backgrounds. The data is divided into train and test sets with 12,936 and 19,732 images, respectively. In the train set, we have 439,420 pairs, each composed of images of the same person in different poses. Then, 12,800 pairs are randomly selected from the test set for testing. DeepFashion dataset (In-shop Clothes Retrieval Benchmark): This dataset comprises 52,712 clothing images, leading to 200,000 pairs of the same clothes with two different poses and scales of the persons wearing them. The images have a resolution of 256 × 256 pixels and are less noisy than those in the Market-1501 dataset. The dataset contains images with no background, where the clothing patterns can contain text, graphical logos, etc. Again, 12,800 pairs are randomly selected from the test set for testing.
4.2 Evaluation Metrics
To benchmark the performance of the proposed model, widely used evaluation metrics for human image generation are calculated: Structural Similarity (SSIM) (Wang et al., 2004), Inception Score (IS) (Salimans et al., 2016) (based on the entropy of classification neurons), masked SSIM (m-SSIM), masked IS (m-IS), and Detection Scores (DS). Although these measures are widely accepted, we also show qualitative results of the adversarially trained generator that are visually appealing compared to present state-of-the-art approaches. The masked scores are obtained by masking out the background of the images before feeding them to the Generator G. Another metric, the Detection Score (DS), is based on the
detection outcome of the SSD (Single Shot multi-box Detector) object detector (Liu et al., 2016a). The detector is trained on the challenging Pascal VOC 2007 dataset (Everingham et al., 2007) and used without fine-tuning on our datasets. During testing, an SSD detection score is computed for each generated image, and the final DS is obtained by averaging these scores. DS therefore reflects the confidence that a person is present in the image.
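For illustration, the following sketch (our own, using scikit-image; the mask-handling convention is an assumption rather than the authors' exact procedure) computes SSIM and a masked SSIM between a generated image and its ground-truth target.

```python
import numpy as np
from skimage.metrics import structural_similarity  # channel_axis needs scikit-image >= 0.19

def ssim_score(generated, target):
    """SSIM between two uint8 RGB images of the same size."""
    return structural_similarity(generated, target, channel_axis=-1, data_range=255)

def masked_ssim_score(generated, target, mask):
    """Masked SSIM: zero out the background in both images before comparing."""
    m = mask[..., None].astype(np.uint8)       # (H, W, 1) binary foreground mask
    return structural_similarity(generated * m, target * m, channel_axis=-1, data_range=255)

# Usage sketch with random stand-in images and a full-foreground mask.
gen = np.random.randint(0, 256, (128, 64, 3), dtype=np.uint8)
tgt = np.random.randint(0, 256, (128, 64, 3), dtype=np.uint8)
mask = np.ones((128, 64), dtype=np.uint8)
print(ssim_score(gen, tgt), masked_ssim_score(gen, tgt, mask))
```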
4.3 Implementation Details
In all four variants of the proposed model, both G and D are trained with the RMSprop optimizer (learning rate 0.0002, ρ = 0.9). Because of the higher resolution of the DeepFashion dataset, the generator for DeepFashion in all four variants has one extra convolution block with 512 kernels and a stride of 2 in both the encoder and the decoder. The ReLU of the last layer of the Discriminator D is replaced with a sigmoid activation function in all four variants. Dropout is used only at training time.
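For concreteness, the stated optimizer settings might look as follows in PyTorch (a sketch assuming the paper's ρ corresponds to RMSprop's smoothing constant alpha; the networks here are stand-ins, not the actual generator and discriminator).

```python
import torch

# Stand-in networks; the real G and D are the convolutional models described above.
generator = torch.nn.Sequential(torch.nn.Conv2d(3, 64, 3, padding=1))
discriminator = torch.nn.Sequential(torch.nn.Conv2d(3, 64, 3, padding=1))

# RMSprop with learning rate 0.0002; we assume rho maps to RMSprop's alpha parameter.
opt_g = torch.optim.RMSprop(generator.parameters(), lr=2e-4, alpha=0.9)
opt_d = torch.optim.RMSprop(discriminator.parameters(), lr=2e-4, alpha=0.9)
```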
4.4 Quantitative Results
We present the results of the ablation study for variant-3 (W) on the Market-1501 and DeepFashion datasets. We also compare the four variants of our proposed model with the other state-of-the-art models, DGAN (Siarohin et al., 2018) and PG² (Ma et al., 2017).
4.4.1 Ablation Study
In variant-3, we evaluate four different methods, obtained by removing parts from the Full method described earlier to see the impact of each part. The architecture of the Discriminator D includes warping and is the same for all four methods. Table 1 reports the evaluation metrics SSIM, IS, m-SSIM, m-IS, and DS for the Market-1501 dataset and SSIM, IS, and DS for the DeepFashion dataset, giving the respective scores of the base model (Siarohin et al., 2018) and the proposed model for the four methods. The scores in Table 1 show a clear progressive improvement from the Baseline method to the Full method. We can observe that the combination of all components gives a large boost in IS and improves the SSIM scores. The bold scores represent the highest values obtained for each metric across all models. For example, on the Market-1501 dataset, the highest SSIM value observed is 0.293, compared with the SSIM scores of all the other methods. On DeepFashion, the proposed model obtains the highest scores for each of the four methods when compared with the respective methods of the base model, as reported in Table 1. The DS scores for all four methods on DeepFashion are 0.97, which is close to the real-data score computed on ground-truth images from the dataset.
4.4.2 Comparison of the Model with Other
State-of-the-Art Models
All four variants of the proposed model are compared with the other state-of-the-art models, PG² (Ma et al., 2017) and Siarohin et al. (Siarohin et al., 2018), in Table 2. The scores for the PG² model (Ma et al., 2017) were taken from the benchmark paper (Siarohin et al., 2018) and were computed using the code and network weights released for PG² (Ma et al., 2017). From Table 2, the variant SN improves the IS scores when compared to the Deformable GAN (DGAN) model proposed by Siarohin et al. (Siarohin et al., 2018), whereas the SSIM scores are slightly lower. Thus, the use of SN in the generator and discriminator increases the IS and m-IS scores, while the SSIM scores slightly decrease. Compared to the PG² model (Ma et al., 2017), all scores except IS are improved, showing that the variant SN outperforms the PG² model. The second variant, SN + H, outperforms the DGAN model (Siarohin et al., 2018), and only its IS score is lower than that of the PG² model (Ma et al., 2017). It can be noticed that the use of SN significantly increases the DS scores in the first and second variants. The inclusion of the adversarial hinge loss along with SN in the GAN framework improves the results, which shows the importance of the adversarial hinge loss for human-image generation. The third variant (W), which represents the Full model of the ablation study, has higher scores than the DGAN model (Siarohin et al., 2018) except for the DS score. In comparison to the PG² model (Ma et al., 2017), this variant performs better on all metrics except IS. Conversely, on the DeepFashion dataset, the Full model significantly improves the IS and DS values over the other two benchmark models, PG² (Ma et al., 2017) and DGAN (Siarohin et al., 2018), but returns a slightly lower SSIM value. The fourth variant, SN + H + W, obtains better quantitative results than the PG² model (Ma et al., 2017) except for the IS metric on the Market-1501 dataset. In the case of the DeepFashion dataset, it has significantly
Table 1: Quantitative results: Ablation study results on the Market-1501 and the DeepFashion datasets for the variant-3 of the
proposed model. The best results are highlighted in bold. For all the measures, higher is better.
Models Market-1501 DeepFashion
SSIM IS m-SSIM m-IS DS SSIM IS DS
Baseline (Siarohin et al., 2018) 0.256 3.188 0.784 3.580 0.595 0.754 3.351 0.96
Ours-Baseline 0.265 3.232 0.787 3.631 0.625 0.759 3.452 0.97
DSC (Siarohin et al., 2018) 0.272 3.442 0.796 3.666 0.629 0.754 3.352 0.96
Ours-DSC 0.277 3.433 0.797 3.684 0.599 0.764 3.385 0.97
PercLoss (Siarohin et al., 2018) 0.276 3.342 0.788 3.519 0.603 0.744 3.271 0.96
Ours-PercLoss 0.279 3.318 0.802 3.537 0.671 0.762 3.400 0.97
Full (Siarohin et al., 2018) 0.290 3.185 0.805 3.502 0.720 0.756 3.439 0.96
Ours-Full 0.293 3.354 0.805 3.540 0.571 0.759 3.584 0.97
Real-Data 1.00 3.86 1.00 3.36 0.74 1.000 3.898 0.98
Table 2: Quantitative results: Comparison of four variants of the proposed model with other state-of-the-art models on Market-
1501 and the DeepFashion datasets. SN represents Spectral Normalization, and H represents Adversarial Hinge loss. For all
the measures, higher is better.
Models Market-1501 DeepFashion
SSIM IS m-SSIM m-IS DS SSIM IS DS
(Ma et al., 2017) 0.253 3.460 0.792 3.435 0.39 0.762 3.090 0.95
(Siarohin et al., 2018) 0.290 3.185 0.805 3.502 0.72 0.756 3.439 0.96
Ours SN 0.280 3.300 0.797 3.528 0.64 0.764 3.461 0.97
Ours SN + H 0.291 3.239 0.804 3.592 0.69 0.763 3.444 0.97
Ours W 0.293 3.354 0.805 3.540 0.57 0.759 3.584 0.97
Ours SN + H + W 0.291 3.192 0.805 3.551 0.72 0.756 3.473 0.97
Real-Data 1.00 3.86 1.00 3.36 0.74 1.000 3.898 0.98
higher values for IS and DS than the PG² model (Ma et al., 2017). When compared to the DGAN model (Siarohin et al., 2018) on the Market-1501 dataset, the values for SSIM and m-SSIM are the same for both models, whereas all the other scores are higher for Siarohin et al. (Siarohin et al., 2018) on this dataset. It can be seen that the DS score increases considerably when SN + H is combined with the Full model of variant-3 (W), compared to all the other variants. On the DeepFashion dataset, the values for IS and DS are higher for SN + H + W than for the benchmark models PG² (Ma et al., 2017) and DGAN (Siarohin et al., 2018). The fourth variant, SN + H + W, obtained a higher IS score than all the other models except the Full model. Apart from this, the proposed model reports better results than the two benchmark models PG² (Ma et al., 2017) and DGAN (Siarohin et al., 2018).
Table 3 compares two of our variants with the state-of-the-art model proposed by (Horiuchi et al., 2019). The SSIM and L1 scores are computed between the generated and ground-truth images. The SSIM values of our variants are slightly lower than the SSIM scores of the variants SN and SN + RH proposed by (Horiuchi et al., 2019). The IS scores are comparatively higher than those of the proposed
Table 3: Quantitative results: Comparison of variants SN and SN + H with the state-of-the-art model (Horiuchi et al., 2019) on the Market-1501 dataset. SN: Spectral Normalization, H: Adversarial Hinge loss, RH: Relativistic Hinge loss. For SSIM and IS, higher is better; for L1, lower is better.
Models Market-1501
SSIM IS L1
Horiuchi et al.-SN 0.289 3.066 0.289
Ours SN 0.280 3.300 0.289
Horiuchi et al.-SN + RH 0.296 2.973 0.288
Ours SN + H 0.291 3.239 0.288
Real-Data 1.00 3.86 0.00
model (Horiuchi et al., 2019). The reported IS is 3.300 for SN, compared to 3.066 for the benchmark model (Horiuchi et al., 2019). For SN with the adversarial hinge loss, the IS value is 3.239, which is higher than that of the SN + RH variant of Horiuchi et al. (Horiuchi et al., 2019).
4.5 Qualitative Results
Quantitative measures alone are not enough to show how good the model is; qualitative results are equally important for judging how realistic the generated images look from a human point of view.
Figure 4: Qualitative results: Comparison of all four variants (SN, SN + H, SN + H + W, W) and the benchmark model (Siarohin et al., 2018) on the DeepFashion dataset. Columns: condition image, target image, Siarohin et al. 2018, Ours SN, Ours SN+H, Ours SN+H+W, Ours W.
Figure 5: Qualitative results: Comparison of some of the challenging results for SN, SN + H and the benchmark model (Siarohin et al., 2018) on the DeepFashion dataset. Columns: condition image, target image, Siarohin et al. 2018, Ours SN, Ours SN+H.
4.5.1 Comparison of Four Variants with
Benchmark Model (Siarohin et al., 2018):
DeepFashion
Figure 4 compares the qualitative results of all four variants SN, SN + H, SN + H + W, and W with the benchmark model (Siarohin et al., 2018) on the DeepFashion dataset. The results from the four variants are qualitatively better and sharper than those of the benchmark model (Siarohin et al., 2018). In the first row, the result from Siarohin et al. (Siarohin et al., 2018) lacks texture details and the generated image is blurry, whereas the results from the four variants look sharper in texture and appearance details. Even the color of the shirt is similar to that of the target image for all four variants. In the third and fourth rows, the results generated by the benchmark model (Siarohin et al., 2018) miss the texture details and contain empty regions, whereas our four variants produce crisp appearance details. This shows that the inclusion of SN leads to realistic images with sharper texture and appearance details.
4.5.2 Comparison of SN and SN+H with
Benchmark Model (Siarohin et al., 2018):
DeepFashion
Figure 5 shows a qualitative comparison of the results of the benchmark model (Siarohin et al., 2018) and the first two variants, SN and SN + H, on the DeepFashion dataset. The images in Figure 5 represent challenging cases where (Siarohin et al., 2018) fails to generate images that look similar to the target image. In the first row, the benchmark model (Siarohin et al., 2018) generates the image with a black patch covering the shirt instead of restricting it to the hands, as seen in the images from SN and SN + H. In the second row, the texture of the cloth is generated with gaps; in contrast, the variants SN and SN + H generate a texture similar to the target image. These results show how our variants are qualitatively better than the benchmark model (Siarohin et al., 2018).
Figure 6: Qualitative results: Ablation study results of the Baseline, DSC, PercLoss, and Full methods of the benchmark model (Siarohin et al., 2018) on the DeepFashion dataset. Columns: condition image, target image, Baseline, DSC, PercLoss, Full.
Figure 7: Qualitative results: Ablation study results of the Baseline, DSC, PercLoss, and Full methods of variant-3 of the proposed model on the DeepFashion dataset. Columns: condition image, target image, Ours-Baseline, Ours-DSC, Ours-PercLoss, Ours-Full.
4.5.3 Ablation Study Results: DeepFashion
Dataset
Figures 6 and 7 depict the qualitative results of the ablation study for the benchmark model (Siarohin et al., 2018) and variant-3 of the proposed model, respectively, on the DeepFashion dataset. In both figures, there is a progressive improvement of the results from the Baseline to the Full model. Comparing the Baseline results from both figures, the baseline model (Siarohin et al., 2018) does not capture all the appearance details of the target, and its clothing texture is blurred. In contrast, the images generated by the Ours-Baseline model in Figure 7 have better texture and appearance details than those of the Baseline model (Siarohin et al., 2018). The DSC results are better than the Baseline, as the images are sharper and the texture details are improved in both Figures 6 and 7. Comparing the DSC results in both figures, the facial features of the Ours-DSC model appear sharper. With PercLoss, the results are more refined than with the DSC model. The results from the Ours-DSC model also show the generation of shoes, which are missing in DSC (Siarohin et al., 2018). In most cases, the results obtained by the Full model are better than those of the PercLoss model in both Figures 6 and 7. The results generated by the Ours-Full model are better than those of the Full model (Siarohin et al., 2018), which can be seen from the texture generated for the images in the last row.
4.5.4 Comparison of Four Variants with
Benchmark Model (Siarohin et al., 2018):
Market-1501
Figure 8 shows the comparison between the four variants SN, SN + H, SN + H + W, and W, and the Siarohin et al. (Siarohin et al., 2018) model.
Figure 8: Qualitative results: Comparison of all four variants (SN, SN + H, SN + H + W, W) and the benchmark model (Siarohin et al., 2018) on the Market-1501 dataset. Columns: condition image, target image, Siarohin et al. 2018, Ours SN, Ours SN+H, Ours SN+H+W, Ours W.
Since the images in the Market-1501 dataset have challenging backgrounds and are of low resolution, the generated images are still blurry. In these examples, the results of Siarohin et al. (Siarohin et al., 2018) differ from the target image in the color of the shirt, whereas the variants with SN generate a color appropriate to the target. The background of the images generated by the SN variants also seems better than that of the base model (Siarohin et al., 2018). However, the generated results are still not as sharp in facial features and texture details as the target image.
4.5.5 Ablation Study Results: Market-1501
Dataset
Figures 9 and 10 depict the qualitative results of the ablation study for the benchmark model (Siarohin et al., 2018) and variant-3 of the proposed model, respectively, on the Market-1501 dataset. In both figures, there is a progressive improvement of the results from the Baseline to the Full model. The baseline model (Siarohin et al., 2018) does not capture all the texture details of the target; this can be observed in the first and second rows of Figure 9, where the color of the shirt has not been generated correctly. The DSC results are better than the Baseline, as the appearance details are improved in both Figures 9 and 10. Comparing the DSC results in both figures, the appearance details, including the clothing color, of the Ours-DSC model appear to be
Figure 9: Qualitative results: Ablation study of the Baseline, DSC, PercLoss, and Full methods of the benchmark model (Siarohin et al., 2018) on the Market-1501 dataset. Columns: condition image, target image, Baseline, DSC, PercLoss, Full.
better, as seen in the images. With PercLoss, the results are more refined than with the DSC model, but since the improvements are minor, it is very challenging to differentiate the images. In most cases, the results obtained from the Full model are better than those of the PercLoss model in both Figures 9 and 10, and the pose information is generated well by all the methods. The results generated by the Ours-Full model are better than those of the Full model (Siarohin et al., 2018), which can be observed from the generated images in Figure 10. Note that the condition image in the second row is too blurry, which provides only vague information for training the model.
Figure 10: Qualitative results: Ablation study of the Baseline, DSC, PercLoss, and Full methods of variant-3 of the proposed model on the Market-1501 dataset. Columns: condition image, target image, Ours-Baseline, Ours-DSC, Ours-PercLoss, Ours-Full.
Figure 11: Qualitative results: Results of the SN, SN+H, and W variants of the proposed model in comparison with the benchmark models (Siarohin et al., 2018) and (Horiuchi et al., 2019) on the Market-1501 dataset. Columns: condition image, target image, Siarohin et al., Horiuchi et al., Ours SN, Ours SN + H, Ours W.
4.5.6 Comparison of SN, SN+H and W with
Benchmark Models (Siarohin et al., 2018),
(Horiuchi et al., 2019): Market-1501
Figure 11 shows the qualitative results of Siarohin et al. (Siarohin et al., 2018), Horiuchi et al. (Horiuchi et al., 2019), and our SN, SN + H, and W models. The clothing color and the shoes of the person in the second row are not generated correctly by Siarohin et al. (Siarohin et al., 2018) and Horiuchi et al. (Horiuchi et al., 2019) when compared to the target image.
5 CONCLUSION
We presented conditional generative models that exploit the power of deep neural networks to transfer various poses of a person to the same person while preserving appearance and clothing texture details. The model uses Spectral Normalization (SN) and an adversarial hinge loss. The use of SN stabilizes the training of GANs and helps to generate high-quality images, while the adversarial hinge loss has been shown to improve the generation of images with sharper details. The model was divided into four variants as part of an ablation study to assess the importance of each component for human-image generation. We showed how the four variants differ from each other, both qualitatively and quantitatively, on the Market-1501 and DeepFashion datasets. Without any data augmentation, the proposed models converge faster and also generalize well to unseen test data. We showed how the proposed model outperforms recent state-of-the-art models for person image generation with pose and clothing details by providing benchmark results on the publicly available Market-1501 and DeepFashion datasets. Based on our experiments, it can be concluded that the previous state-of-the-art models generate results with a loss of appearance and clothing details. We also reported results showing that the proposed models offer a qualitative improvement over other methods. Finally, we demonstrated that generative modeling with SN for person image generation, conditioned on pose and appearance in a supervised setting, provides good generalization capabilities.
5.1 Future Work
Many GAN models have been designed to tackle the problem of human image generation along with pose and clothing. Different models focus on different pose representations, but the combination of a good pose representation with a suitable objective function would improve person-image generation. In future work, we would like to focus on using other pose representations, such as 3D keypoints, together with the existing Spectral Normalization (SN) and hinge loss, which we believe would bring further improvements in this area of research. Also, self-attention modules (Zhang et al., 2018) could be integrated into the GAN architectures along with different pose representations for person-image generation. These attention modules would attend to the important features of the person to generate accurate, high-quality appearance details. A multi-stage model with SN and attention mechanisms could further improve human-image generation.
REFERENCES
Alp Güler, R., Neverova, N., and Kokkinos, I. (2018).
Densepose: Dense human pose estimation in the wild.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 7297–7306.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasser-
stein generative adversarial networks. In International
conference on machine learning, pages 214–223.
Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. (2017). Real-
time multi-person 2d pose estimation using part affin-
ity fields. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
7291–7299.
Chen, X., Song, J., and Hilliges, O. (2019). Un-
paired pose guided human image generation. CoRR,
abs/1901.02284.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J.,
and Zisserman, A. (2007). The pascal visual object
classes challenge 2007 (voc2007) results.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In
Advances in neural information processing systems,
pages 2672–2680.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and
Courville, A. C. (2017). Improved training of wasser-
stein gans. In Advances in neural information pro-
cessing systems, pages 5767–5777.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Horiuchi, Y., Iizuka, S., Simo-Serra, E., and Ishikawa, H.
(2019). Spectral normalization and relativistic ad-
versarial training for conditional pose generation with
self-attention. In 2019 16th International Conference
on Machine Vision Applications (MVA), pages 1–5.
IEEE.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017).
Image-to-image translation with conditional adversar-
ial networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages
1125–1134.
Jetchev, N. and Bergmann, U. (2017). The conditional anal-
ogy gan: Swapping fashion articles on people images.
Jolicoeur-Martineau, A. (2018). The relativistic discrimina-
tor: a key element missing from standard gan. arXiv
preprint arXiv:1807.00734.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Lassner, C., Pons-Moll, G., and Gehler, P. V. (2017). A gen-
erative model of people in clothing. In Proceedings of
the IEEE International Conference on Computer Vi-
sion, pages 853–862.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016a). Ssd: Single shot
multibox detector. In European conference on com-
puter vision, pages 21–37. Springer.
Liu, Z., Luo, P., Qiu, S., Wang, X., and Tang, X. (2016b).
Deepfashion: Powering robust clothes recognition and
retrieval with rich annotations. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 1096–1104.
Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., and
Van Gool, L. (2017). Pose guided person image gener-
ation. In Advances in Neural Information Processing
Systems, pages 406–416.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y.
(2018). Spectral normalization for generative adver-
sarial networks. arXiv preprint arXiv:1802.05957.
Neverova, N., Alp Guler, R., and Kokkinos, I. (2018).
Dense pose transfer. In The European Conference on
Computer Vision (ECCV).
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V.,
Radford, A., and Chen, X. (2016). Improved tech-
niques for training gans. In Advances in neural infor-
mation processing systems, pages 2234–2242.
Siarohin, A., Sangineto, E., Lathuilière, S., and Sebe, N.
(2018). Deformable gans for pose-based human image
generation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
3408–3416.
Stewart, M. (2019 (accessed May 8, 2019)). Advanced
Topics in Generative Adversarial Networks (GANs).
https://towardsdatascience.com/comprehensive-
introduction-to-turing-learning-and-gans-part-2-
fd8e4a70775.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and
Catanzaro, B. (2018). High-resolution image synthe-
sis and semantic manipulation with conditional gans.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 8798–8807.
Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P.,
et al. (2004). Image quality assessment: from error
visibility to structural similarity. IEEE transactions
on image processing, 13(4):600–612.
Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A.
(2018). Self-attention generative adversarial net-
works. arXiv preprint arXiv:1805.08318.
Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian,
Q. (2015). Scalable person re-identification: A bench-
mark. In Proceedings of the IEEE International Con-
ference on Computer Vision, pages 1116–1124.