Ambient Lighting Generation for Flash Images with Guided Conditional
Adversarial Networks
José Chávez¹, Rensso Mora¹ and Edward Cayllahua-Cahuina²
¹Department of Computer Science, Universidad Católica San Pablo, Arequipa, Peru
²LIGM, Université Paris-Est, Champs-sur-Marne, France
Keywords:
Flash Images, Ambient Images, Illumination, Generative Adversarial Networks, Attention Map.
Abstract:
To cope with the challenges that low light conditions produce in images, photographers tend to use the light provided by the camera flash to get better illumination. Nevertheless, harsh shadows and non-uniform illumination can arise from using a camera flash, especially in low light conditions. Previous studies have focused on normalizing the lighting on flash images; however, to the best of our knowledge, no prior studies have examined sideways shadow removal, the reconstruction of overexposed areas, or the generation of synthetic ambient shadows and the natural tone of scene objects. To provide more natural illumination on flash images and ensure high-frequency details, we propose a generative adversarial network in a guided conditional mode. We show that this approach not only generates natural illumination but also attenuates harsh shadows, while simultaneously generating synthetic ambient shadows. Our approach achieves promising results on a custom subset of the FAID dataset, outperforming our baseline studies. We also analyze the components of our proposal and how they affect the overall performance, and discuss opportunities for future work.
1 INTRODUCTION
Scenes with low light conditions are challenging in photography: cameras usually produce noisy and/or blurry images. In these situations, people usually resort to an external device such as a camera flash, thus creating flash images. However, when the light from the flash points directly at the object, it can be too harsh for the scene and create non-uniform illumination. Comparing a flash image with its respective image under ambient illumination, it is clear that the ambient illumination is more natural and uniform because the available light is more evenly distributed (see Figure 1).
Researchers have studied the enhancement of flash images (Petschnigg et al., 2004; Eisemann and Durand, 2004; Agrawal et al., 2005; Capece et al., 2019), producing enhanced images by combining ambient and flash images, or normalizing the illumination of a flash image in a controlled environment (backdrop and studio lighting), but without replicating the natural skin tone of people. However, in a real scenario with low light conditions, there is no information about what the ambient image looks like. Moreover, in scenarios without a backdrop, objects far from the camera receive very little illumination, creating dark areas in the image, given that the only light comes from the camera flash. Consequently, in a real scenario with low light conditions, creating ambient images from flash images is a very challenging problem.
Figure 1: A comparison of a flash image (a) and an ambient image (b). (a) Image with camera flash illumination; the image suffers from harsh shadows, dark areas, and bright areas. (b) Image with available ambient illumination; the illumination is more uniform and natural, and the image has no sideways shadows. Images extracted from FAID (Aksoy et al., 2018).
Prior works handle the enhancement of low light images, where a scene is underexposed; however, in flash images, objects close to the camera tend to be bright, and these techniques overexpose such regions. Our method attenuates the illumination close to the camera and illuminates the underexposed regions at the same time. Since flash and ambient images represent the same scene, researchers (Capece et al., 2019) have studied lighting normalization on flash images by learning the relationship between both images; the estimated relationship is then added to the respective flash image in a subsequent step, normalizing the illumination of the flash image while maintaining high-frequency information. This approach is not effective at restoring overexposed areas, because those regions are still required to compute the final result.
In this article, we propose a conditional adversarial network in a guided mode, which follows two objective functions. First, the reconstruction loss generates uniform illumination and synthetic ambient shadows. Second, the adversarial loss, which represents the objective function of GANs (Goodfellow et al., 2014), forces the model to produce high-frequency details in the output image and a more natural illumination. Both loss functions are guided through the attention mechanism, which is performed by attention maps based on the input image and the ground truth. The attention mechanism allows the model to be more robust to the overexposed areas and sideways shadows present in flash images. It also improves the robustness of the model to inconsistent scene matches between pairs of flash and ambient images, since the two are usually not perfectly aligned at the moment of capture. We compare against state-of-the-art enhancement techniques for low light images (Fu et al., 2016; Guo et al., 2017) and flash images (Capece et al., 2019). Ablation studies are also performed on the architecture.
The major contribution of this article is an attention mechanism that guides a conditional adversarial network on the task of translating flash images to ambient images. It gives robustness against the overexposed areas and shadows present in flash and ambient images, and against the scene misalignment between the two images. This mechanism guides the adversarial loss to avoid blurry results in such regions by discriminating these cases. It also guides the reconstruction loss to be robust to high-frequency details through the texture information that the attention map provides.
2 RELATED WORK
2.1 Low Light Image Enhancement
Prior works (Petschnigg et al., 2004; Eisemann and Durand, 2004; Agrawal et al., 2005) combine the advantages of both ambient and flash images. These image processing techniques use the information of the image with the available illumination (ambient image) and the image with light from the camera flash (flash image) and create an enhanced image based on both. In contrast with these techniques, our model enhances the flash image without any kind of information from the ambient image.
In SRIE (Fu et al., 2016), the reflectance and illumination are estimated with a weighted variational model, and the images are then enhanced with the reflectance and illumination components. LIME (Guo et al., 2017), on the other hand, enhances the images by estimating their illumination maps. More specifically, the illumination of each pixel is first estimated individually by finding the maximum value among the R, G, and B channels; the illumination map is then refined by imposing a structure prior, which preserves smooth texture details. Neither SRIE nor LIME addresses sideways shadow removal, the reconstruction of overexposed areas, or the generation of synthetic ambient shadows.
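For reference, LIME's initial per-pixel illumination estimate can be sketched as follows; the structure-prior refinement is not shown, and the function name and the assumption of [0, 1] pixel values are ours:

```python
import numpy as np

def initial_illumination_map(image: np.ndarray) -> np.ndarray:
    """Initial illumination estimate in the spirit of LIME (Guo et al., 2017):
    for each pixel, take the maximum over the R, G, and B channels.
    `image` is an HxWx3 array with values in [0, 1] (our assumption)."""
    return image.max(axis=2)
```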
2.2 Image-to-Image Translation
Prior works use symmetric encoder-decoder networks (Ronneberger et al., 2015; Isola et al., 2017; Chen et al., 2018) for image-to-image translation tasks such as image segmentation, photo synthesis, and low light image enhancement. These networks are composed of several convolutional layers, where the input is encoded into a latent space representation and then decoded to estimate the desired output. Inspired by the U-Net architecture (Ronneberger et al., 2015), our model employs skip connections to share information between the encoder and the decoder and to recover spatial information lost by downsampling operations.
In (Capece et al., 2019), a deep learning model turns a smartphone flash selfie into a studio portrait. The model generates uniform illumination, but does not reproduce the skin tone the person would have under studio lighting. The encoder part of the network corresponds to the first 13 convolutional blocks of VGG-16 (Simonyan and Zisserman, 2015), and the weights of the encoder are initialized with a model pre-trained for face recognition (Parkhi et al., 2015). The inputs and target of this network are given in filtered form, so that the network estimates an image with low-frequency details, which represents the relationship in illumination between the ambient and flash image. This pre-processing step is the drawback of this model, because it prevents learning a high-quality illumination relationship between the flash and the ambient image. This step also adds computation time, since the model uses a bilateral filter.
We exploit the transfer learning approach of this model, but we propose an end-to-end architecture where the encoder path is initialized with VGG-16 pre-trained on the ImageNet dataset (Deng et al., 2009), making our model applicable to general scenes, not only to faces. The decoder part is symmetric with respect to the encoder. The end-to-end architecture also avoids an additional pre-processing step.
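As a rough sketch of this design, the following PyTorch snippet builds a generator with a VGG-16 encoder pre-trained on ImageNet and a symmetric decoder connected by skip connections. The stage splits follow the standard VGG-16 layout, but the decoder widths, the bilinear upsampling, and the sigmoid output are illustrative assumptions rather than our exact configuration:

```python
import torch
import torch.nn as nn
import torchvision

class FlashToAmbientGenerator(nn.Module):
    """Encoder-decoder generator sketch: VGG-16 encoder pre-trained on
    ImageNet and a symmetric decoder with skip connections."""

    def __init__(self):
        super().__init__()
        # On older torchvision versions, use vgg16(pretrained=True) instead.
        vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
        # Split the 13 convolutional blocks of VGG-16 into five stages,
        # cutting at each max-pooling layer.
        self.enc1 = vgg[:4]     # 64 channels, full resolution
        self.enc2 = vgg[4:9]    # 128 channels, 1/2 resolution
        self.enc3 = vgg[9:16]   # 256 channels, 1/4 resolution
        self.enc4 = vgg[16:23]  # 512 channels, 1/8 resolution
        self.enc5 = vgg[23:30]  # 512 channels, 1/16 resolution

        def up(cin, cout):
            # Upsample, then convolve (illustrative choice).
            return nn.Sequential(
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(cin, cout, 3, padding=1),
                nn.ReLU(inplace=True))

        self.dec4 = up(512, 512)
        self.dec3 = up(512 + 512, 256)
        self.dec2 = up(256 + 256, 128)
        self.dec1 = up(128 + 128, 64)
        self.out = nn.Sequential(nn.Conv2d(64 + 64, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        e5 = self.enc5(e4)
        # Skip connections: concatenate encoder features with the
        # upsampled decoder features at matching resolutions.
        d4 = self.dec4(e5)
        d3 = self.dec3(torch.cat([d4, e4], dim=1))
        d2 = self.dec2(torch.cat([d3, e3], dim=1))
        d1 = self.dec1(torch.cat([d2, e2], dim=1))
        return self.out(torch.cat([d1, e1], dim=1))
```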
2.3 Conditional GANs
Conditional GANs (Mirza and Osindero, 2014) have been proposed as a general-purpose solution for image-to-image translation (Isola et al., 2017). A cGAN is composed of two architectures, the generator and the discriminator. Both architectures are fully convolutional networks (Long et al., 2015). The generator, which is an encoder-decoder network, is mainly composed of convolutional layers at each step of the encoder and decoder. The generator G and discriminator D are conditioned on some type of information such as images, labels, or text. In our case, this information is the flash image $I_f$, and our cGAN learns to map from flash images $I_f$ to ambient images $I_a$. Thus, the generator synthesizes ambient images $\hat{I}_a$ that cannot be distinguished from the real ambient images $I_a$, while the discriminator is trained adversarially with respect to the generator to distinguish between $I_a$ and $\hat{I}_a$. As shown in the pix2pix model (Isola et al., 2017), this min-max game ensures the learning of high-frequency details, unlike using only a reconstruction loss such as the MAE (Mean Absolute Error), which outputs smoothed results.
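Our discriminator is described here only as a fully convolutional network; the following is a minimal pix2pix-style sketch whose depth, channel widths, normalization, and patch-wise logit output are illustrative assumptions:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Fully convolutional discriminator sketch: each stage halves the
    spatial resolution; the final 1-channel map scores patches of the
    input as real or fake (raw logits, no sigmoid)."""

    def __init__(self, in_channels=3):
        super().__init__()

        def block(cin, cout, norm=True):
            layers = [nn.Conv2d(cin, cout, 4, stride=2, padding=1)]
            if norm:
                layers.append(nn.BatchNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.model = nn.Sequential(
            *block(in_channels, 64, norm=False),
            *block(64, 128),
            *block(128, 256),
            *block(256, 512),
            nn.Conv2d(512, 1, 4, padding=1),
        )

    def forward(self, x):
        return self.model(x)
```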
3 PROPOSED METHOD
Our model is composed of two architectures, a generator G and a discriminator D, and translates from flash images $I_f$ to ambient images $I_a$. The training procedure follows two objectives: the reconstruction loss R, which aims to minimize the distance between the image generated from the input ($I_f$) and the target image ($I_a$); and the adversarial loss A, which represents the objective of the cGAN (Isola et al., 2017). Figure 2 illustrates an overview of our architecture.
Figure 2: Network architecture. The generator takes the flash image $I_f$ as input and outputs the synthetic ambient image $\hat{I}_a$. The discriminator network learns, through the adversarial loss A, to classify between the real ambient image $I_a$ (the ambient image that belongs to the training set) and the synthetic ambient image $\hat{I}_a$. We also set the reconstruction loss R between $I_a$ and $\hat{I}_a$. All attention maps are computed from $I_f$ and $I_a$.
Both the reconstruction loss R and the adversarial loss A are guided by our attention mechanism to ensure a better learning procedure. The attention mechanism is applied to the inputs of R and A, that is, the ambient image $I_a$ and the synthetic ambient image $\hat{I}_a$ first pass through the attention map before R and A are computed.
3.1 Attention Mechanism
The attention mechanism that we propose aims to guide the reconstruction and adversarial losses. The mechanism is simple but efficient: we guide both R and A with an attention map based on the flash image $I_f$ and the ambient image $I_a$. We define the attention map M as:
map M as:
M (i, j) = 1
1
C
C
k=1
| I
a
(i, j, k) I
f
(i, j, k) | .
(1)
In Equation 1, C represents the number of channels, M(i, j) the value of the attention map at position (i, j), and I(i, j, k) the pixel value at (i, j) and channel k. Then, $I_a$ and $\hat{I}_a$ pass through the attention map before the reconstruction loss R and the adversarial loss A are computed,
$$I_a := I_a \odot M, \qquad \hat{I}_a := \hat{I}_a \odot M \qquad (2)$$
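A minimal PyTorch sketch of Equations 1 and 2; the (B, C, H, W) tensor layout, the [0, 1] value range, and the function names are our own assumptions:

```python
import torch

def attention_map(ambient: torch.Tensor, flash: torch.Tensor) -> torch.Tensor:
    """Equation 1: M(i, j) = 1 - (1/C) * sum_k |I_a(i, j, k) - I_f(i, j, k)|.
    Inputs are (B, C, H, W) tensors with values in [0, 1]."""
    return 1.0 - (ambient - flash).abs().mean(dim=1, keepdim=True)

def apply_attention(image: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """Equation 2: element-wise multiplication with the attention map,
    broadcast across the channel dimension."""
    return image * m
```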
The operation $\odot$ denotes element-wise multiplication. Equation 2 guides A and R towards a better learning procedure through the discrimination of overexposed areas, shadows, and scene misalignment between $I_f$ and $I_a$. Then R, which represents the L1 distance, and A are defined as:
$$R(G) = \mathbb{E}_{I_f \sim p_{data},\, I_a \sim p_{data}} \left[ \left\| I_a - G(I_f) \right\|_1 \right]$$
$$A(D, G) = \mathbb{E}_{I_a \sim p_{data}} \left[ \log D(I_a) \right] + \mathbb{E}_{I_f \sim p_{data}} \left[ \log\left(1 - D(G(I_f))\right) \right] \qquad (3)$$
Through this operation, the reconstruction loss R is led to learn the normalization of the lighting while being robust to high-frequency details, because the attention map M provides this texture information through the element-wise multiplication. M also guides R to be robust to the misaligned scene between flash and ambient images. On the other hand, the adversarial loss A is focused on generating realism and high-frequency details in the regions indicated by M. A does not allow blurry outputs where the attention map M indicates, because all blurry regions are classified as fake and the adversarial loss tries to correct them by generating high-frequency details in these regions.
Finally, our full objective L is a combination of the reconstruction loss and the adversarial loss, maintaining the relevance of the reconstruction loss and scaling the adversarial loss by the hyperparameter λ. Equation 4 determines to what extent the adversarial loss A influences L, thus controlling the generation of artifacts in the output images.
$$\mathcal{L}(G, D) = R(G) + \lambda \cdot A(G, D) \qquad (4)$$
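Putting the pieces together, a minimal PyTorch sketch of the guided objective is shown below; it reuses attention_map and apply_attention from the sketch above, and the logit-based losses, the non-saturating generator term, and the omission of optimizer steps and detaching are our own simplifications:

```python
import torch
import torch.nn.functional as F

def guided_objective(generator, discriminator, flash, ambient, lam=1.0):
    """Sketch of the guided losses R and A (Equation 3) and the full
    objective L = R + lambda * A (Equation 4)."""
    fake = generator(flash)
    m = attention_map(ambient, flash)       # Equation 1
    real_m = apply_attention(ambient, m)    # Equation 2
    fake_m = apply_attention(fake, m)

    # Reconstruction loss R: L1 distance on the masked images.
    r = F.l1_loss(fake_m, real_m)

    # Adversarial loss A, also computed on the masked images.
    d_real = discriminator(real_m)
    d_fake = discriminator(fake_m)
    a_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    # Generator side of A (non-saturating form, a common practical choice).
    a_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

    # Full objective from the generator's side, plus the discriminator loss.
    g_loss = r + lam * a_g
    return g_loss, a_d
```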
We perform ablation studies on the architecture and verify the improvements obtained with our proposed attention mechanism. Our ablation studies also consider whether or not a pre-trained model is used in the generator.
4 EXPERIMENTS
In this section, we describe the Flash and Ambient Il-
lumination Dataset (FAID) and the custom set of these
images that we use. We present the training protocol
that we followed and show the quantitative and qual-
itative results that validate our proposal. Finally, we
present the controlled experiments that we perform to
determine how the components of our architecture af-
fect the overall performance.
4.1 Dataset
Figure 3: Ambient images from FAID (Aksoy et al., 2018)
with low illumination, reflections, and shadows from exter-
nal objects.
Introduced by (Aksoy et al., 2018), the FAID (Flash and Ambient Illumination Dataset) is a collection of pairs of flash and ambient images covering six categories: People, Shelves, Plants, Toys, Rooms, and Objects, for a total of 2775 pairs. We inspected each image in the dataset and found that some ambient images have problems such as low illumination, shadows from external objects, or even reflections. Therefore, we used a reduced set of the entire FAID dataset for our experiments. Our custom dataset has 969 pairs of images for training and 116 for testing, and all images were resized to 320×240 or 240×320 depending on their orientation.
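As an illustration of this pre-processing step, a small Pillow sketch; the function name and the default resampling filter are assumptions:

```python
from PIL import Image

def load_and_resize(path: str) -> Image.Image:
    """Resize an image to 320x240 (landscape) or 240x320 (portrait),
    matching the resolution used for our custom FAID subset."""
    img = Image.open(path).convert("RGB")
    width, height = img.size
    target = (320, 240) if width >= height else (240, 320)  # PIL: (width, height)
    return img.resize(target)
```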
Figure 4: Qualitative comparison (columns: Input, SRIE, LIME, DeepFlash, Ours, Target). Enhancement of low-illuminated areas (red), and estimation of natural skin and hair tone of people (green). We compare with SRIE (Fu et al., 2016), LIME (Guo et al., 2017), and DeepFlash (Capece et al., 2019).
4.2 Training
We freeze all convolutional layers in the encoder part of the generator and train our model using the Adam optimizer (Kingma and Ba, 2015) with β₁ = 0.5, following (Isola et al., 2017). We use learning rates of 2·10⁻⁵ for the generator and 2·10⁻⁶ for the discriminator; an equal or higher learning rate for the discriminator with respect to the generator results in divergence. To regularize the adversarial loss A, we set λ = 1; lower values of λ result in blurry outputs and higher values of λ result in many artifacts. The training procedure uses random crops of 224 × 224 and random horizontal flipping for data augmentation. Our architecture is implemented in PyTorch, and the training process takes approximately one day on an NVIDIA GeForce GTX 1070 graphics card.
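A minimal sketch of this training setup, assuming generator and discriminator modules as in the earlier sketches; the attribute names and the single-image augmentation pipeline are assumptions:

```python
import torch
from torchvision import transforms

# Freeze all convolutional layers of the generator's encoder; the "enc"
# attribute prefix matches the generator sketch above (an assumption).
for name, param in generator.named_parameters():
    if name.startswith("enc"):
        param.requires_grad = False

# Adam with beta_1 = 0.5; learning rates of 2e-5 (generator) and
# 2e-6 (discriminator), as reported above.
g_opt = torch.optim.Adam((p for p in generator.parameters() if p.requires_grad),
                         lr=2e-5, betas=(0.5, 0.999))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-6, betas=(0.5, 0.999))

# Data augmentation: random 224x224 crops and random horizontal flips.
# In practice the same crop and flip must be applied to both images of a
# flash/ambient pair; this single-image pipeline is only illustrative.
augment = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```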
4.3 Quantitative and Qualitative Validation
We use the PSNR (Peak Signal-to-Noise Ratio) and the SSIM (Structural Similarity) to measure the performance of our quantitative results. Table 1 reports the mean PSNR and the mean SSIM on the test set after 1000 epochs. All hyperparameters are set in the same way for (Capece et al., 2019), and its encoder-decoder network was pre-trained on the ImageNet dataset (Deng et al., 2009) instead of on a model used for face recognition (Parkhi et al., 2015). Our quantitative results do not significantly outperform the state-of-the-art image enhancement methods, but they at least show improvements on the flash image enhancement task.
Table 1: Mean PSNR and mean SSIM compared with SRIE (Fu et al., 2016), LIME (Guo et al., 2017), and DeepFlash (Capece et al., 2019).
Method PSNR SSIM
LIME 12.38 0.611
SRIE 14.09 0.659
DeepFlash 15.39 0.671
Ours 15.67 0.684
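For reference, the mean PSNR and mean SSIM over a test set can be computed as in the following sketch using scikit-image; the [0, 1] value range and the channel_axis argument (scikit-image ≥ 0.19) are assumptions:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pairs):
    """Mean PSNR and mean SSIM over (estimate, target) pairs given as
    HxWx3 float arrays in [0, 1]."""
    psnrs, ssims = [], []
    for estimate, target in pairs:
        psnrs.append(peak_signal_noise_ratio(target, estimate, data_range=1.0))
        ssims.append(structural_similarity(target, estimate,
                                           channel_axis=2, data_range=1.0))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```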
The estimation of the skin tone of people is shown in Figure 4, where the illumination map created by LIME (Guo et al., 2017) leads to brightening and overexposing the flash images. LIME cannot distinguish the natural color of dark objects and tends to brighten them. The results of SRIE (Fu et al., 2016) do not show considerable changes with respect to the flash images in these kinds of scenes.
Figure 5: Qualitative comparison (columns: Input, SRIE, LIME, DeepFlash, Ours, Target). Generation of ambient shadows (green), attenuation of overexposed areas (red), and sideways shadow removal (orange). We compare with SRIE (Fu et al., 2016), LIME (Guo et al., 2017), and DeepFlash (Capece et al., 2019).
DeepFlash (Capece et al., 2019) produces non-uniform illumination on flash images of people, apparently because it tries to simulate shadows. In the case of flash images that contain both low-illuminated and highly illuminated areas, such as the Rubik's Cube in Figure 4, (Capece et al., 2019) produces meaningless illumination, while our method shows considerably better results; that is, our result looks much more similar to the ground truth.
Figure 4 also reveals some aspects of the generation of ambient lighting on people. Note the synthetic shadows in the mouth and under the chin. Almost all ambient images in the training data were taken with a light source coming from above, as is typical of household lighting. Therefore, the model learns to generate synthetic ambient lighting that simulates a light source coming from above.
Figure 5 shows that our model synthesizes ambient shadows on flash images such as those of shelves, but struggles to restore overexposed areas produced by the camera flash. LIME (Guo et al., 2017) and SRIE (Fu et al., 2016) neither attenuate overexposed areas nor synthesize ambient shadows in these types of scenes; these methods do not handle this kind of issue in flash images. The DeepFlash architecture (Capece et al., 2019) produces weak ambient shadows, attenuates overexposed areas without restoring them, and outputs many artifacts in its results. In the case of sideways shadow removal, all models fail (including ours).
4.4 Ablation Study
We perform different experiments to validate the final configuration of our architecture. Table 2 reports the quantitative comparison between our controlled experiments. Furthermore, Figure 6 shows qualitative comparisons between the conditions in Table 2.
Table 2: Controlled experiments. This table reports the mean PSNR and the mean SSIM for distinct architecture configurations.

Condition PSNR SSIM
1. Default (R_M + A_M) 15.67 0.684
2. R + A 15.55 0.676
3. R 15.64 0.681
4. U-Net 14.81 0.643
Our quantitative assessment shows that using a pre-trained model significantly improves over the model trained from scratch (condition 4). The other configurations have similar results. This is because models that use the MAE as the objective function (condition 3) generate blurry results to minimize the error between the estimated images and the targets. Condition 2, which corresponds to the default model without the attention mechanism, has lower quantitative values than condition 3 because the adversarial loss introduces some sharpness in the output images.
Figure 6: Qualitative comparison for each condition in our controlled experiments on the loss function, the attention map, and the network architecture (columns: Input, Target, Default, R + A, R, U-Net).
We explore our qualitative results (Figure 6) for the different loss functions, the attention map, and the network architectures.
Loss Function. Table 2 reports the influence of the adversarial loss. Condition 3 uses the same generator structure without the adversarial loss, i.e., just an encoder-decoder network without a discriminator. This architecture presents blurred results compared with our default model; in this case the reconstruction loss R is not enough to generate high-frequency details, note the blurry image of the headphones (Figure 6). The adversarial loss A ensures better quality thanks to the deep discriminator network, which classifies blurry results as fake. Condition 2 also presents blurry results; however, its output images present more uniform illumination due to the adversarial loss.
Attention Map. Condition 1, which represents our default model, presents uniform illumination and high-frequency details (note the sharpness of the headphones with respect to the other conditions). Our attention mechanism guides the reconstruction and adversarial losses to obtain uniform illumination and sharp results with fewer artifacts. However, due to its robustness to overexposed areas and shadows, our model cannot re-light dark areas with high-frequency details. We believe that a better formulation of the attention mechanism could address this problem.
Network Architecture. As reported in Table 2, we evaluate the well-known U-Net (Ronneberger et al., 2015) architecture as condition 4. We adopt the model proposed by (Chen et al., 2018) for enhancing extreme low light images and train it from scratch. U-Net presents blurry output images and non-uniform illumination. Our default model, which uses transfer learning, achieves better quantitative and qualitative results. We believe this is due to the few samples in the training set.
5 CONCLUSIONS
Ambient lighting generation is a challenging problem, even more so for flash images taken under low light conditions. Shadows in the flash image have to be removed, overexposed areas should be reconstructed, and ambient shadows must be synthesized as part of the simulation of an ambient light source. In this paper, we propose a model with a guided reconstruction loss for normalizing the illumination and a guided adversarial loss for modeling high-frequency illumination details on flash images. Our results show that our guided mechanism estimates high-frequency details without introducing visual artifacts in our synthetic ambient images. The guided adversarial loss also produces more realistic ambient illumination on flash images than the state-of-the-art methods. Our current results are promising; nonetheless, there are cases where our model fails, such as restoring overexposed areas, normalizing the lighting of flash images under extreme low light conditions, and sideways shadow removal on flash images (see Figure 4). We believe that a more dedicated formulation of the adversarial loss would be useful to address these issues.
Other methods based on intrinsic image decomposition (Shen et al., 2013) could also be useful: recovering the albedo (reflectance) and shading of the flash image and then directly modifying the shading component to obtain the ambient image. As we show in this article, some cases need a more dedicated treatment. We aim to further study these cases and evaluate new techniques to improve ambient lighting generation for flash images in such situations.
ACKNOWLEDGEMENTS
This work was supported by grant 234-2015-FONDECYT (Master Program) from Cienciactiva of the National Council for Science, Technology and Technological Innovation (CONCYTEC-PERU). I thank all the people who directly or indirectly helped me with this work.
REFERENCES
Agrawal, A., Raskar, R., Nayar, S. K., and Li, Y. (2005).
Removing photography artifacts using gradient pro-
jection and flash-exposure sampling. ACM Trans.
Graph., 24(3):828–835.
Aksoy, Y., Kim, C., Kellnhofer, P., Paris, S., Elgharib, M.,
Pollefeys, M., and Matusik, W. (2018). A dataset of
flash and ambient illumination pairs from the crowd.
In Proceedings of the European Conference on Com-
puter Vision (ECCV), pages 634–649.
Capece, N., Banterle, F., Cignoni, P., Ganovelli, F.,
Scopigno, R., and Erra, U. (2019). Deepflash: Turning
a flash selfie into a studio portrait. Signal Processing:
Image Communication, 77:28 – 39.
Chen, C., Chen, Q., Xu, J., and Koltun, V. (2018). Learn-
ing to see in the dark. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE conference on com-
puter vision and pattern recognition, pages 248–255.
Ieee.
Eisemann, E. and Durand, F. (2004). Flash photography
enhancement via intrinsic relighting. ACM Trans.
Graph., 23(3):673–678.
Fu, X., Zeng, D., Huang, Y., Zhang, X.-P., and Ding, X.
(2016). A weighted variational model for simultane-
ous reflectance and illumination estimation. In The
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In
Advances in neural information processing systems,
pages 2672–2680.
Guo, X., Li, Y., and Ling, H. (2017). Lime: Low-light
image enhancement via illumination map estimation.
IEEE Transactions on Image Processing, 26(2):982–
993.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017).
Image-to-image translation with conditional adversar-
ial networks. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
Kingma, D. P. and Ba, J. (2015). Adam: A method for
stochastic optimization. In 3rd International Confer-
ence on Learning Representations, ICLR 2015, San
Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-
volutional networks for semantic segmentation. In
2015 IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 3431–3440.
Mirza, M. and Osindero, S. (2014). Conditional generative
adversarial nets. arXiv preprint arXiv:1411.1784.
Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015). Deep
face recognition. In Xianghua Xie, M. W. J. and
Tam, G. K. L., editors, Proceedings of the British Ma-
chine Vision Conference (BMVC), pages 41.1–41.12.
BMVA Press.
Petschnigg, G., Szeliski, R., Agrawala, M., Cohen, M.,
Hoppe, H., and Toyama, K. (2004). Digital photogra-
phy with flash and no-flash image pairs. ACM Trans.
Graph., 23(3):664–672.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net:
Convolutional networks for biomedical image seg-
mentation. In International Conference on Medical
image computing and computer-assisted intervention,
pages 234–241. Springer.
Shen, L., Yeo, C., and Hua, B. (2013). Intrinsic image
decomposition using a sparse representation of re-
flectance. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 35(12):2904–2915.
Simonyan, K. and Zisserman, A. (2015). Very deep con-
volutional networks for large-scale image recognition.
In International Conference on Learning Representa-
tions.