Toward a Thermal Image-Like Representation

Patricia L. Suárez¹,ᵃ and Angel D. Sappa¹,²,ᵇ

¹ Escuela Superior Politécnica del Litoral, ESPOL, Facultad de Ingeniería en Electricidad y Computación, CIDIS, Campus Gustavo Galindo Km. 30.5 Vía Perimetral, P.O. Box 09-01-5863, Guayaquil, Ecuador

² Computer Vision Center, Edifici O, Campus UAB, 08193 Bellaterra, Barcelona, Spain

Keywords:

Contrastive Loss, Relativistic Standard GAN Loss, Spectral Normalization.

Abstract:

This paper proposes a novel model to obtain thermal image-like representations to be used as input in any thermal image compressive sensing approach (e.g., thermal image filtering, enhancing, super-resolution). Thermal images offer interesting information about the objects in the scene, in addition to their temperature. Unfortunately, in most cases thermal cameras acquire low-resolution, low-quality images. Hence, in order to improve these images, several state-of-the-art (SOTA) approaches exploit complementary information from a low-cost channel (visible image) to increase the image quality of an expensive channel (infrared image). In these SOTA approaches, visible images are fused at different levels without taking into account that the images capture information at different bands of the spectrum. In this paper, a novel approach is proposed to generate thermal image-like representations from low-cost visible images by means of a contrastive cycled GAN network. The obtained representations (synthetic thermal images) can later be used to improve the low-quality thermal image of the same scene. Experimental results on different datasets are presented.

1 INTRODUCTION

In recent years, thermal imaging has increasingly been used in a range of different fields in industry, which has lately led to the manufacturing of low-cost thermal vision sensors. Low-cost thermal sensors are fast becoming available, and they are making their way into applications other than heavy industrial usage, such as surveillance, criminal investigation, military use, medical research, and building maintenance. Exploiting these alternate perspectives has the potential to play a significant role in computer vision by improving the accuracy of existing conventional digital vision systems.

Although thermal sensors have come a long way, there is still a bottleneck related to their poor resolution, since the cost of thermal cameras grows exponentially with resolution. In general, their resolutions are substantially lower than those of regular digital cameras working in the visible spectrum. Hence, it is critical to discover ways to leverage the information from sensors working at different spectral bands and combine them to maximize their advantages.

Recently, some approaches have been proposed that use visible image information to enhance images from the other domain and hence produce images with a higher quality, closer to human perception, at a lower cost. These approaches are referred to in the literature as Guidance Image Processing (e.g., (Kopf et al., 2007), (Hui et al., 2016), (Barron and Poole, 2016)). Most guidance-based methods fuse the provided information at different levels, but in almost all cases the given images (e.g., thermal and visible) are fed to the model without any concern for the difference in their nature: they capture information from different spectral bands.

ᵃ https://orcid.org/0000-0002-3684-0656

ᵇ https://orcid.org/0000-0003-2468-0031

There are some approaches where the guidance is not performed at the image pixel level but at a feature level; for instance, edges from one image are used to enhance the other image. The use of edge-based guidance facilitates the reconstruction of higher-frequency features (e.g., (Xie et al., 2015), (Zhou et al., 2018)). As mentioned before, the different nature of the provided images reduces the possibility of taking the best from each representation. In the current work we propose to generate thermal image-like representations from visible spectrum images in order to facilitate the subsequent guided process, since both images will be represented in a closer domain (the thermal image domain).

Suárez, P. and Sappa, A.
Toward a Thermal Image-Like Representation.
DOI: 10.5220/0011795200003417
In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 4: VISAPP, pages 133-140
ISBN: 978-989-758-634-7; ISSN: 2184-4321
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

Most of the approaches mentioned above are deep learning based solutions, which by means of efficient Convolutional Neural Networks (CNNs) significantly outperform traditional methods. A fundamental aspect of CNNs is that a large volume of data is required for their training. Furthermore, in cases like the one tackled in the current work, paired images (thermal and visible spectrum) are required. Having in mind these drawbacks (the large amount of data and the need for a paired set), the current work proposes a model capable of generating synthetic thermal images from their counterparts in the visible spectrum. The model is trained with an unpaired set of thermal and visible images, which represents a great advantage with respect to most state-of-the-art approaches. The contributions of this paper can be summarized as follows:

• The usage of a contrastive loss (Liu et al., 2021) to enhance the feature extraction of the generator. With the introduction of this loss, the model learns from similar regions of the images, which are contrasted with latent representations of low affinity. In the proposed model, this loss is combined with identity and adversarial losses.

• The modification of the architecture to use spectral normalization instead of batch normalization, in order to improve the stylization of the images and avoid vanishing gradients (Miyato et al., 2018).

• The implementation of a relativistic GAN, which pushes the generated samples closer to the decision boundary of the model. This allows the model to generalize more quickly and improves the quality of the images.

The manuscript is organized as follows. Section 2 presents works related to the generation of synthetic images to solve related problems. Section 3 presents the proposed modified cycled GAN architecture. Experimental results and comparisons with different implementations are given in Section 4. Finally, conclusions are presented in Section 5.

2 RELATED WORK

Several approaches have been proposed to generate synthetic images to be used in the training of models that need thermal spectrum data sets, or to reinforce the training of other techniques that solve problems related to control, detection, and classification, among others. In (Guo et al., 2019), an approach for pedestrian detection in thermal imaging scenarios is proposed. It tackles the limitations of current data sets by generating synthetic thermal images from their widely available visible counterparts by applying domain matching. To generate the synthetic data set, a component has been created that performs the transformation from the visible to the far infrared domain together with the bounding boxes of the detected pedestrians. It is implemented through a cycled GAN network that serves as a data augmenter when training the pedestrian detection model in the thermal domain. Although interesting results are obtained, the approach is not focused on the quality of the generated synthetic thermal images but on the pedestrian detection application.

Another use of synthetic thermal images generated by CNN models is presented in (Zhang et al., 2018), where the synthetic images are used to train a tracking model. The authors apply the transformation to both paired and unpaired images. With these images, the results of the tracking model with thermal images are improved. Given the evolution of autonomous driving based on LiDAR sensors, some related approaches have also been proposed. In (Lu and Lu, 2021), the authors have designed a scheme to estimate the depth of the scene based on synthetic thermal images generated from RGB. The synthetic images are obtained by means of a cycled GAN network that performs the translation from the visible to the thermal domain, together with a disparity map to maintain the consistency of the generated images.

In (Liu et al., 2021), the authors propose a method to improve scene context for night vision applications. This model uses synthetic images generated from visible spectrum images. A GAN network is used for mapping context information, generating synthetic images with higher quality. It improves the capture of fine details in the synthetic images, which are used to enhance the context of scenes. In another approach, proposed in (Li et al., 2020), a semantic image segmentation technique is presented. The authors propose to overcome lighting and environmental limitations by using images from both real and synthetic thermal infrared cameras to guide the contour extraction. A synthetic image dataset has been generated by a modified version of the pix2pix image transformation proposed by (Isola et al., 2017). These synthetic images improve the training results in the cases in which the visible images present limitations.

Another technique that uses thermal images is the one presented in (Kniaz et al., 2018), where a cross-modal generative network is proposed to generate synthetic thermal images that support the training of a person re-identification model. This model introduces object annotations to improve the re-identification results. Another application of synthetic data is presented in (Saleh et al., 2019), where the use of point cloud data from 3D LiDAR sensors for safety-critical tasks in autonomous vehicle systems is introduced. Given the existing techniques to generate synthetic images, the authors propose to generate the point cloud from synthetic images using a cycle GAN, combined with the real images obtained by 3D LiDAR sensors. This improves vehicle detection results from a bird's eye view.

VISAPP 2023 - 18th International Conference on Computer Vision Theory and Applications

In the context of synthetic image generation, Generative Adversarial Networks (GANs) have facilitated the process and have had a large impact in the area of computer vision. The transformation of information between domains is the main function that these generative networks achieve. There are currently many types of generative networks, but we mention only the ones most related to the problem addressed in this paper. These networks typically learn through a training process that uses a correctly registered data set. However, obtaining paired information is sometimes not possible or, when it is possible, it can be difficult to acquire enough data to train the network. These limitations have motivated works such as the one presented in (Zhu et al., 2017), which generates images from one domain to another without the need for them to be registered; the authors introduce the cycle consistency loss, which allows the unpaired image transformation known as cycled GAN. To address our problem, we use a set of images from the visible spectrum and map them to the thermal spectrum. Therefore, the mapping function is G: X → Y, such that the image distribution of G(X) is indistinguishable from the distribution of Y using an adversarial loss, originally proposed in (Goodfellow et al., 2014). However, for the domain translation proposed in (Zhu et al., 2017) this mapping is very loosely constrained, so an inverse mapping F: Y → X is necessary, and a cycle consistency loss is introduced to enforce F(G(X)) ≈ X (and vice versa). The next section details the changes with respect to the original Cycled GAN (Zhu et al., 2017) implemented in the current work to enhance the translation of information. Our proposal is motivated by the fact that the mapping must include not only shapes but also textures, and should simulate the temperatures of objects, which makes the design of the architecture more challenging. The pre-processing applied to the dataset for the training process is also presented.

3 PROPOSED APPROACH

The proposed approach is a combination of several state-of-the-art techniques adapted to the problem of obtaining synthetic thermal images from visible spectrum images. The architecture is based on the Cycled GAN presented in (Zhu et al., 2017), which performs the transfer between unpaired domains. To achieve a translation closer to the pixel intensities of the far infrared spectrum, a contrastive loss is used, which improves the quality of the images. The inclusion of this contrastive loss, presented in (Liu et al., 2021), allows the proposed architecture to focus on determining the relationship between input embeddings from regions close to the region being processed. This method tries to predict the missing information based on its surroundings instead of predicting the values per pixel. To determine the similarity of nearby regions, cosine similarity is used, which measures the difference based on orientation and not just on magnitude, as the L1 loss does. It must be considered that the discriminator of the model maximizes the score of the replicas closest to the real image and minimizes the differences between the target image and the various embeddings of the processed nearby image regions. In this way, the discriminator is updated by minimizing the distances between the real image and the embeddings of the same class, and maximizing them otherwise. By forcing the embeddings to relate through the cosine similarity loss, the discriminator can learn detailed representations of real images. Similarly, the generator exploits the knowledge of the discriminator, such as intra-class features and higher-order representations of the real images, to generate more realistic images. The use of this kind of contrastive loss is already quite widespread, especially in the field of computer vision, where different approaches apply it (e.g., (Yu et al., 2021), (Liu et al., 2021), (Suárez et al., 2019), (Park et al., 2020)) given the good results obtained in improving the quality of the images.
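The contrast between cosine similarity and an L1 distance mentioned above can be illustrated with a small, dependency-free sketch. The vectors below are hypothetical feature embeddings chosen only for illustration; they are not taken from the model.

```python
import math

def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two feature vectors (orientation only)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def l1_distance(u, v):
    """Sum of absolute differences (sensitive to magnitude)."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical embeddings, for illustration only.
anchor = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]     # same direction, twice the magnitude
rotated = [3.0, -1.0, 0.5]   # different direction

sim_scaled = cosine_similarity(anchor, scaled)
sim_rotated = cosine_similarity(anchor, rotated)
dist_scaled = l1_distance(anchor, scaled)
```

Here `scaled` is treated as identical to `anchor` by the cosine measure, while the L1 distance penalizes the change in magnitude; `rotated` receives a much lower cosine similarity.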

It is important to emphasize that, given the results obtained in the experiments carried out, the color space of the model's input images had to be changed from RGB to HSV. With the H channel of this color space, the transformation of the visible information to thermal could be performed with greater accuracy in the simulation of the temperatures of the objects present in the images. This allows the generated synthetic thermal images not only to reflect good contours and details, but also to represent temperatures with high fidelity.
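As a minimal sketch of this pre-processing step, the hue channel can be extracted with Python's standard `colorsys` module. The image layout (a list of rows of 8-bit RGB tuples) and the function name `hue_channel` are assumptions made for this example; the paper does not state which library was used.

```python
import colorsys

def hue_channel(rgb_image):
    """Return the H channel (values in [0, 1)) of an RGB image.

    `rgb_image` is a list of rows; each pixel is an (r, g, b) tuple of
    integers in 0..255 (an assumed layout for this sketch).
    """
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)[0] for (r, g, b) in row]
        for row in rgb_image
    ]

# Tiny 2x2 example image: red, green, blue, and mid gray.
tiny = [[(255, 0, 0), (0, 255, 0)],
        [(0, 0, 255), (128, 128, 128)]]
hue = hue_channel(tiny)
```

Pure red, green, and blue map to hues 0, 1/3, and 2/3, and achromatic pixels such as gray are assigned hue 0 by `colorsys`.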

Additionally, in this paper we use a relativistic GAN loss (Jolicoeur-Martineau, 2018) instead of the standard GAN loss proposed by (Goodfellow et al., 2020). The relativistic GAN loss assumes that in each mini-batch at least half of the data are fake, which can be observed by minimizing the divergence. We include this relativistic loss because it allows the discriminator to estimate whether, in a mini-batch, the real data are more realistic than randomly sampled generated data. The authors of the relativistic loss argue that the probability of real data being real, D(x_r), should decrease as the probability of fake data being real, D(x_f), increases. This forces the generated samples to be closer to the real ones, avoiding model saturation and also accelerating the training process. For these reasons, this loss is well suited to the transformation from visible to thermal and, according to the results we obtained, it was adopted in our cycled transformation model. The standard GAN loss function is replaced with the relativistic standard GAN loss for the discriminator and the generator, which are respectively defined as:

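$$\mathcal{L}_{D}^{RGAN} = \mathbb{E}_{(x_r, x_f) \sim (\mathbb{P}, \mathbb{Q})} \left[ f_1\left( C(x_r) - C(x_f) \right) \right] \tag{1}$$

$$\mathcal{L}_{G}^{RGAN} = \mathbb{E}_{(x_r, x_f) \sim (\mathbb{P}, \mathbb{Q})} \left[ f_1\left( C(x_f) - C(x_r) \right) \right] \tag{2}$$

where $f_1$ is a function mapping a scalar input to another scalar, $C(\cdot)$ is the non-transformed output of the discriminator, and $x_r$ and $x_f$ are the real and fake images, respectively.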
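The two losses in Eqs. 1 and 2 can be sketched in plain Python for the relativistic standard GAN case, where the scalar function is f_1(z) = -log(sigmoid(z)) as in (Jolicoeur-Martineau, 2018). The critic outputs below are made-up numbers, and pairing real and fake samples by position in the mini-batch is an assumption of this sketch.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def f1(z):
    """-log(sigmoid(z)), which equals softplus(-z)."""
    return softplus(-z)

def rsgan_losses(critic_real, critic_fake):
    """Return (discriminator loss, generator loss).

    critic_real / critic_fake hold the raw critic outputs C(x_r) and C(x_f),
    paired by position within the mini-batch.
    """
    pairs = list(zip(critic_real, critic_fake))
    loss_d = sum(f1(cr - cf) for cr, cf in pairs) / len(pairs)
    loss_g = sum(f1(cf - cr) for cr, cf in pairs) / len(pairs)
    return loss_d, loss_g

# Hypothetical critic outputs: real samples currently score higher than fakes.
loss_d, loss_g = rsgan_losses([2.0, 1.5, 3.0], [-1.0, 0.0, 0.5])
```

When the critic ranks real samples above fake ones, the discriminator loss is small and the generator loss is large, so the generator gradient pushes fake scores up relative to real ones.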

To replace the cycle consistency loss of the GAN network (see Eq. 3), a contrastive loss has been implemented in our architecture:

$$\mathcal{L}_{CYCLE}(G, F) = \mathbb{E}_{x \sim p_{data}(x)} \left[ \| F(G(x)) - x \|_1 \right] + \mathbb{E}_{y \sim p_{data}(y)} \left[ \| G(F(y)) - y \|_1 \right] \tag{3}$$

This loss allows the model to be trained by learning the similarity of the latent spaces produced by the generator network. According to (Andonian et al., 2021), for each model input image $x$, contrastive learning approaches only need to define the similarity distribution to sample a positive input $x^{+} \sim p^{+}(\cdot \mid x)$, and a data distribution for a negative input $x^{-} \sim p^{-}(\cdot \mid x)$, with respect to a sample input $x$.

Furthermore, they argue that the shape of the tensor $V_l \in \mathbb{R}^{S_l \times D_l}$ is determined by the architecture of the network, where $S_l$ is the number of spatial locations of the tensor. Therefore, the tensor is indexed with the notation $v_l^s \in \mathbb{R}^{D_l}$, which is the $D_l$-dimensional feature vector at spatial location $s$. The collection of feature vectors at all other spatial locations is denoted $\bar{v}_l^s \in \mathbb{R}^{(S_l - 1) \times D_l}$. According to (Andonian et al., 2021), this loss can be written as follows:

$$\mathcal{L}_{contrastive}(\hat{Y}, Y) = \sum_{l=1}^{L} \sum_{s=1}^{S_l} \ell_{contr}\left( \hat{v}_l^s, \bar{v}_l^s \right)$$

The main objective of this type of training based on a contrastive loss is that the model learns to distinguish similar from dissimilar samples. Similar samples, called positives, should be mapped as close together as possible, whereas negative (dissimilar) pairs must be mapped farther away from the positives in the latent space. In this way, similar representations are pulled together, while dissimilar ones are pushed away from the latent region of the positive pairs.
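The pull-together, push-apart behaviour described above can be sketched for a single feature layer. The exact form of the per-location term is not given in the text, so this example assumes an InfoNCE-style cross-entropy over cosine similarities with a temperature `tau` of 0.07. The positive for each location is taken to be the reference feature at the same location, and the negatives are the reference features at all other locations. All names and values here are illustrative.

```python
import math

def cosine(u, v, eps=1e-12):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / max(nu * nv, eps)

def info_nce(query, positive, negatives, tau=0.07):
    """-log softmax of the positive logit among positive and negative logits."""
    logits = [cosine(query, positive) / tau]
    logits += [cosine(query, n) / tau for n in negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

def patch_contrastive_loss(generated_feats, reference_feats, tau=0.07):
    """Sum of the per-location losses over all spatial locations of one layer."""
    total = 0.0
    for s, query in enumerate(generated_feats):
        positive = reference_feats[s]
        negatives = reference_feats[:s] + reference_feats[s + 1:]
        total += info_nce(query, positive, negatives, tau)
    return total

# Three spatial locations with 2-dimensional features (toy values).
reference = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
shuffled = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]

aligned_loss = patch_contrastive_loss(aligned, reference)
shuffled_loss = patch_contrastive_loss(shuffled, reference)
```

Features that match their own location yield a loss near zero, whereas features that resemble a different location are penalized heavily.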

In addition, the model also implements the identity loss function so that the intensity levels of the pixels do not go outside the bounds of the target domain during the transformation of the data. This implies that the generator network must preserve the most relevant characteristics, learning the level of thermal intensity and the shape of the objects, which helps maintain the stability of the training. That is, F(x) ≈ x and G(y) ≈ y should hold. λ is a weighting term that defines the relative importance of the cycle and identity losses compared to the GAN loss:

$$\mathcal{L}_{identity}(G, F) = \mathbb{E}_{c \sim P_{data}(c)} \left[ \| F(c) - c \|_1 \right] + \mathbb{E}_{n \sim P_{data}(n)} \left[ \| G(n) - n \|_1 \right]$$

Finally, the multiple loss function implemented in our model is defined as:

$$\mathcal{L}_{total} = \mathcal{L}_{RGAN}(G, D, X, Y) + \lambda \mathcal{L}_{contrastive}(G, H, X) + \lambda \mathcal{L}_{contrastive}(G, H, Y) + \gamma \mathcal{L}_{Identity}(G, F) \tag{4}$$

where λ and γ are the weights of the contrastive and identity loss functions, respectively, which have been defined empirically according to the results of the experiments.
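The structure of Eq. 4 amounts to a weighted sum of the individual terms, which can be sketched as follows. The weights and loss values are hypothetical placeholders, since the empirically chosen values of λ and γ are not reported in the text; the mean absolute difference stands in for the L1 identity term on a flattened image.

```python
def l1_mean(a, b):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def total_loss(l_rgan, l_contr_x, l_contr_y, l_identity, lam, gamma):
    """Weighted sum following the structure of Eq. 4."""
    return l_rgan + lam * (l_contr_x + l_contr_y) + gamma * l_identity

# Hypothetical values, for illustration only.
thermal = [0.2, 0.4, 0.9]
g_of_thermal = [0.25, 0.35, 0.9]  # stand-in for G applied to a thermal image
l_identity = l1_mean(g_of_thermal, thermal)
loss = total_loss(l_rgan=0.7, l_contr_x=1.2, l_contr_y=1.0,
                  l_identity=l_identity, lam=1.0, gamma=0.5)
```

Raising `gamma` makes the generator more conservative about changing images that already belong to the target domain, while raising `lam` emphasizes the patch-wise correspondence terms.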

The architecture (see Fig. 1) also includes spectral normalization to improve the quality of the generated synthetic thermal images. This normalization has been implemented because a key challenge in training a GAN network is controlling the performance of the generator and discriminator networks. The main goal is to avoid vanishing gradients, so that the training of the model does not collapse. To improve the multimodal translation from the visible to the far infrared spectrum, spectral normalization has been introduced in the discriminator. This improves control over this network, avoiding learning instability. It also helps the model generalize in less time, because the discriminator becomes more efficient at distinguishing the target distribution pattern. This normalization prevents the derivative of the discriminator network from becoming zero, which would stop the learning of the generator network. It acts as a Lipschitz constraint implemented in the discriminator network.

Figure 1: Cycle GAN proposed architecture.
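Spectral normalization (Miyato et al., 2018) divides a weight matrix by an estimate of its largest singular value, obtained by power iteration. The following dependency-free sketch shows the idea on a small matrix. It runs many iterations from scratch for clarity, whereas practical implementations usually perform a single iteration per training step and reuse the previous vector. The matrix and helper names are illustrative only.

```python
import math

def matvec(W, v):
    """Compute W v for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matvec_t(W, u):
    """Compute W^T u."""
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(W[0]))]

def normalize(v, eps=1e-12):
    n = math.sqrt(sum(x * x for x in v))
    return [x / max(n, eps) for x in v]

def spectral_norm(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration."""
    u = normalize([1.0] * len(W))
    v = normalize([1.0] * len(W[0]))
    for _ in range(n_iters):
        v = normalize(matvec_t(W, u))
        u = normalize(matvec(W, v))
    return sum(ui * wi for ui, wi in zip(u, matvec(W, v)))  # u^T W v

def spectrally_normalize(W, n_iters=50):
    """Rescale W so that its largest singular value is approximately 1."""
    sigma = spectral_norm(W, n_iters)
    return [[w / sigma for w in row] for row in W]

W = [[3.0, 0.0], [0.0, 1.0]]
sigma = spectral_norm(W)
W_sn = spectrally_normalize(W)
```

After normalization the layer cannot amplify any input direction by more than a factor of about one, which is what bounds the Lipschitz constant of a discriminator built from such layers.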

4 EXPERIMENTAL RESULTS

This section presents quantitative and qualitative results obtained with the proposed approach. It also includes a description of the data sets used for training and the pre-processing applied to the images. Finally, a comparative analysis is carried out using metrics of structural similarity and peak signal-to-noise ratio computed on the synthetic images.

4.1 Datasets

The designed model has been trained with the M3FD data set (Liu et al., 2022), which was acquired with a binocular optical and infrared sensor. The data set contains 4,500 image pairs of recorded outdoor scenes; from these, 3,000 pairs were used for training, 890 pairs for testing, and the remaining pairs for validation of the trained model. It is worth mentioning that the images were pre-processed to generate the most realistic synthetic images possible in the far infrared spectrum: the images were converted to the HSV color space, and the H (hue) channel was selected as the input to train the model. Additionally, to validate the robustness of the model, a proprietary data set, referred to as Thermal Stereo, with 200 pairs of registered visible-thermal images has been considered. The results obtained on it with the model trained on the M3FD data set are included in the comparisons.

4.2 Training Settings

In order to train the model, the visible images have been converted to the HSV color space, and only the H channel has been considered. In addition, during training the images have been resized to 256 × 256 pixels. The traditional GAN loss has been replaced with the relativistic GAN loss, so that the discriminator globally evaluates the random samples against the real input data of the model; with this loss, a better quality of the generated images is achieved. The learning rate has been set to 0.000273. The Adam optimizer has been used, with β₁ and β₂ set to 0.85 and 0.99, respectively. For quantitative evaluation, the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) metrics have been used. A TITAN V GPU has been used for training, which takes about 96 hours.

Figure 2: Experimental results: (1st row) results with the state-of-the-art technique (Zhu et al., 2017); (2nd row) results from the proposed approach; (3rd row) ground truth images from the M3FD and Thermal Stereo datasets.
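For reference, the two evaluation metrics can be sketched in a few lines of plain Python. PSNR follows its usual definition in decibels. The SSIM function below is a simplified, single-window version computed over the whole image, whereas the standard SSIM averages the same expression over local windows, so its values will generally differ from those of a full implementation. The pixel values are made up for the example and are assumed to be 8-bit grayscale.

```python
import math

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized flat images."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def global_ssim(x, y, max_value=255.0):
    """Single-window SSIM over the whole image (a simplification)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Hypothetical flattened 2x2 images.
real = [10.0, 50.0, 90.0, 130.0]
synthetic = [12.0, 48.0, 95.0, 128.0]

psnr_value = psnr(real, synthetic)
ssim_value = global_ssim(real, synthetic)
```

Higher values are better for both metrics: PSNR is unbounded for identical images, and SSIM reaches 1 only when the two images coincide.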

4.3 Comparisons

The proposed approach has been evaluated by comparing it with the state-of-the-art model for unpaired image translation presented in (Zhu et al., 2017). That model generates synthetic images from visible spectrum images to another, unpaired domain. Based on this concept, we have modified the loss functions and preprocessed the input images in order to generate synthetic images of the thermal spectrum. Table 1 presents the average results obtained with (Zhu et al., 2017) and with the approach proposed in the current work. The model has been validated with samples from the M3FD data set and from our own dataset, taken from outdoor scenes. Figure 2 shows some examples of the synthetic thermal images obtained from these validation sets. Additionally, for the purposes of quality comparison, Tables 2 and 3 show the best and worst SSIM results obtained with each data set; the corresponding PSNR values are also provided. Furthermore, to illustrate these comparisons, Fig. 3 and Fig. 4 present the images that correspond to the metrics shown in these tables.

Table 1: Average results from the validation sets (M3FD - Thermal Stereo). Best results in bold.

Approaches            M3FD                Thermal Stereo
                      PSNR      SSIM      PSNR      SSIM
(Zhu et al., 2017)    12.589    0.501     11.939    0.419
Prop. Approach        14.734    0.772     17.0989   0.733

Table 2: Best and Worst SSIM results from the M3FD validation set. Best results in bold.

Approaches            M3FD
                      BEST                WORST
                      PSNR      SSIM      PSNR      SSIM
(Zhu et al., 2017)    17.381    0.631     8.90      0.3042
Prop. Approach        22.899    0.869     11.279    0.638


Figure 3: BEST and WORST results obtained with (Zhu et al., 2017): (1st row) images from the M3FD dataset; (2nd row) images from the Thermal Stereo dataset. Panel columns, left to right: best results, ground truths, worst results, ground truths.

Figure 4: BEST and WORST results obtained with the proposed approach: (1st row) images from the M3FD dataset; (2nd row) images from the Thermal Stereo dataset. Panel columns, left to right: best results, ground truths, worst results, ground truths.

Table 3: Best and Worst SSIM results from the Thermal Stereo validation set. Best results in bold.

Approaches            THERMAL STEREO
                      BEST                WORST
                      PSNR      SSIM      PSNR      SSIM
(Zhu et al., 2017)    17.434    0.641     5.56      0.0682
Prop. Approach        31.502    0.950     11.538    0.4324

5 CONCLUSIONS

This paper improves the domain transformation mechanism by generating synthetic images of the far infrared (thermal) spectrum from visible spectrum images. In order to transfer not only shape but also make the model simulate the temperature and texture of thermal images, the unpaired cycled GAN network has been modified in terms of its loss and normalization functions. As future work, we will explore other state-of-the-art techniques based on transformers or diffusion models. The idea is to evaluate the generalization of the model when generating synthetic images with better quality.


ACKNOWLEDGEMENTS

This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-22-1-0261; and partially supported by the ESPOL project CIDIS-12-2022; the Spanish Government under Project PID2021-128945NB-I00; and the "CERCA Programme / Generalitat de Catalunya". The authors gratefully acknowledge the NVIDIA Corporation for the donation of a Titan V GPU used for this research.

REFERENCES

Andonian, A., Park, T., Russell, B., Isola, P., Zhu, J.-Y., and Zhang, R. (2021). Contrastive feature loss for image prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1934–1943.

Barron, J. T. and Poole, B. (2016). The fast bilateral solver. In European Conference on Computer Vision, pages 617–632. Springer.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11):139–144.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, volume 2, pages 2672–2680.

Guo, T., Huynh, C. P., and Solh, M. (2019). Domain-adaptive pedestrian detection in thermal images. In 2019 IEEE International Conference on Image Processing (ICIP), pages 1660–1664. IEEE.

Hui, T.-W., Loy, C. C., and Tang, X. (2016). Depth map super-resolution by deep multi-scale guidance. In European Conference on Computer Vision, pages 353–369. Springer.

Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Jolicoeur-Martineau, A. (2018). The relativistic discriminator: a key element missing from standard GAN. arXiv preprint arXiv:1807.00734.

Kniaz, V. V., Knyaz, V. A., Hladuvka, J., Kropatsch, W. G., and Mizginov, V. (2018). ThermalGAN: Multimodal color-to-thermal image translation for person re-identification in multispectral dataset. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.

Kopf, J., Cohen, M. F., Lischinski, D., and Uyttendaele, M. (2007). Joint bilateral upsampling. ACM Transactions on Graphics (ToG), 26(3):96–es.

Li, C., Xia, W., Yan, Y., Luo, B., and Tang, J. (2020). Segmenting objects in day and night: Edge-conditioned CNN for thermal image semantic segmentation. IEEE Transactions on Neural Networks and Learning Systems, 32(7):3069–3082.

Liu, J., Fan, X., Huang, Z., Wu, G., Liu, R., Zhong, W., and Luo, Z. (2022). Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5811.

Liu, R., Ge, Y., Choi, C. L., Wang, X., and Li, H. (2021). DivCo: Diverse conditional image synthesis via contrastive generative adversarial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16377–16386.

Lu, Y. and Lu, G. (2021). An alternative of LiDAR in nighttime: Unsupervised depth estimation based on single thermal image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3833–3843.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.

Park, T., Efros, A. A., Zhang, R., and Zhu, J.-Y. (2020). Contrastive learning for conditional image synthesis. In ECCV.

Saleh, K., Abobakr, A., Attia, M., Iskander, J., Nahavandi, D., Hossny, M., and Nahavandi, S. (2019). Domain adaptation for vehicle detection from bird's eye view LiDAR point cloud data. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.

Suárez, P. L., Sappa, A. D., and Vintimilla, B. X. (2019). Image patch similarity through a meta-learning metric based approach. In 2019 15th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), pages 511–517. IEEE.

Xie, J., Feris, R. S., and Sun, M.-T. (2015). Edge-guided single depth image super resolution. IEEE Transactions on Image Processing, 25(1):428–438.

Yu, N., Liu, G., Dundar, A., Tao, A., Catanzaro, B., Davis, L. S., and Fritz, M. (2021). Dual contrastive loss and attention for GANs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6731–6742.

Zhang, L., Gonzalez-Garcia, A., Van De Weijer, J., Danelljan, M., and Khan, F. S. (2018). Synthetic data generation for end-to-end thermal infrared tracking. IEEE Transactions on Image Processing, 28(4):1837–1850.

Zhou, D., Wang, R., Lu, J., and Zhang, Q. (2018). Depth image super resolution based on edge-guided method. Applied Sciences, 8(2):298.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232.
