Aerial to Street View Image Translation using Cascaded Conditional
GANs
Kshitij Singh, Alexia Briassouli (https://orcid.org/0000-0002-0545-3215) and Mirela Popa
Department of Data Science and Knowledge Engineering, Maastricht University, The Netherlands
mirela.popa@maastrichtuniversity.nl
Keywords:
Cross View Image Translation, Conditional GANs, Semantic Segmentation, U-net.
Abstract:
Cross view image translation is a challenging case of viewpoint translation which involves generating the
street view image when the aerial view image is given and vice versa. As there is no overlap in the two
views, a single stage generation network fails to capture the complex scene structure of objects in these two
views. Our work aims to tackle the task of generating street level view images from aerial view images on the benchmark CVUSA dataset via a cascade pipeline consisting of three smaller stages: street view image generation, semantic segmentation map generation, and image refinement, trained together in a constrained manner in a Conditional GAN (CGAN) framework. Our contributions are twofold: (1) The first stage of our pipeline examines the use of alternative architectures, ResNet and ResU-Net++, in a framework similar to the current State-of-the-Art (SoA), leading to useful insights and comparable or improved results in some cases. (2) In the third stage, ResU-Net++ is used for the first time for image refinement. U-Net performs the best for street
view image generation and semantic map generation as a result of the skip connections between encoders and
decoders, while ResU-Net++ performs the best for image refinement because of the presence of the attention
module in the decoders. Qualitative and quantitative comparisons with existing methods show that our model
outperforms all others on the KL Divergence metric and ranks amongst the best for other metrics.
1 INTRODUCTION
The task of generating outdoor scenes from a variety
of viewpoints is a challenging one that is gaining a
lot of attention recently with applications in domains
like autonomous driving, virtual reality, geo-tagging
etc. Generation of a novel viewpoint involves trans-
forming objects in a scene from a given view to the
desired view in a natural setting, while maintaining
the photo-realism of the transformation.
Cross view image translation is a special case of
viewpoint translation, where the desired view has no
overlap with the given view (aerial to street or vice
versa). This is much more challenging due to occlu-
sion and the large degree of deformation while trans-
forming from one view to another. Moreover, when
transforming from aerial view to street view, there is
uncertainty in the orientation in which the street view
will be synthesized. Existing methods (Zhai et al.,
2017) (Regmi and Borji, 2018) (Tang et al., 2019)
show that a single stage image translation model fails
to transfer fine details of the objects. Thus, a multi-step process is needed, with image refinement after
street view image generation (Tang et al., 2019). A se-
mantic segmentation map generator, for comparison
with the ground truth semantic map, is added to the
multi-step process, to guide image generation. The
final pipeline consists of 3 steps, where a street view
image is first generated, and is then provided to image
refinement and semantic map generation networks.
Our work builds upon (Tang et al., 2019), investi-
gating which architectures are best suited for each of
the 3 steps/subtasks (street view image generation, se-
mantic map generation, and image refinement). Our
contributions are: (1) Stage 1 of our pipeline examines alternative SoA architectures, ResNet and ResU-Net++, in a framework similar to (Tang et al., 2019), leading to useful insights and comparable or improved results. (2) In stage 3, ResU-Net++ is used for the first time for
image refinement.
Conditional GANs (CGAN), proven to be very ef-
fective in image translation (Isola et al., 2017), are
used as the framework for each step. In addition to U-Net, which is the standard generator for image translation CGANs, ResNet and ResU-Net++ (Jha et al., 2019) are
studied for their suitability in each subtask: they are
trained together in a constrained manner in a cascade
network framework, to study the effect of each one of
them in the final street view image generation process.
The investigation of these generators provides impor-
tant insights into the suitability of network features
like skip connections and attention for each subtask.
This paper is structured as follows. Sec. 2 presents
the background of this work, Sec. 3 provides details
on the architectures and methods used. Experiments
and results are presented in Sec. 4, with qualitative
and quantitative comparisons of the proposed pipeline
against existing methods. Finally, conclusions and
plans for future work are presented in Sec. 5.
2 BACKGROUND, RELATED
WORK
Conditional GANs (CGANs) (Mirza and Osindero,
2014) generate domain-specific images in a con-
trolled setting, so they have been used for image to
image translation (Isola et al., 2017), (Odena et al.,
2017), (Choi et al., 2018). In order to make the task
feasible, semantic segmentation maps of the opposite
view are provided to the CGAN generator network,
allowing the network to focus on reproducing color,
texture, and structure. (Regmi and Borji, 2018) pro-
posed X-Fork and X-Seq for cross view image trans-
lation, using CGANs. The SoA in this task is Se-
lectionGAN (Tang et al., 2019) which used CGAN,
similarly to (Regmi and Borji, 2018), but with multi
channel attention. This multi channel attention is
handcrafted to a certain extent, which potentially con-
strains its generalizability. Recently, (Toker et al.,
2021) introduced a novel multi-task architecture with
joint image synthesis and retrieval, achieving SoA
cross-view image based geo-localization.
We examine the following research aspects:
(1) how newer architectures perform in cascaded
CGANs, (2) how our system compares to the SoA on
quantitative metrics used in (Tang et al., 2019), (3) the
effect of adding tasks (like segmentation map gener-
ation, image refinement), while training concurrently
with cross view image synthesis, (4) the role of atten-
tion in image generation. We focus on street level view generation when the corresponding aerial (top) view is given as an input, which is more difficult than
the inverse (generating aerial level images from street
view), as it requires generating a view which has more
details using an input view which has fewer details.
Figure 1: Diagram of the pipeline. Dashed lines indicate the
different adversarial and L1 losses used to train the network.
3 METHODOLOGY
Motivation for the Chosen Pipeline: Objects in
aerial and street views have different dimensions and
orientation, leading to ambiguity when transferring
from one view to another. A single stage image to
image translation model cannot capture details, espe-
cially when the degree of overlap between the views
is small (as in our case). This issue is addressed by
using a cascade CGANs pipeline (Tang et al., 2019),
with 3 generators trained together in an adversarial
manner, guided by pixel losses (Figure 1). The three generators have different tasks, described below; a short code sketch of the resulting data flow follows the list.
1. Generator 1 (coarse street view): This generator
deals with coarse street view image generation.
The aerial view and semantic segmentation map
of the street view are input and the network gen-
erates a coarse reconstruction of the street view.
2. Generator 2 (street view semantic map): This gen-
erator deals with street view semantic map gen-
eration. It takes the coarse street view (i.e. the output of Generator 1) as input and generates the semantic segmentation map of the street view.
3. Generator 3 (coarse to fine street view): This generator deals with the task of image refinement. Similar to Generator 2, it takes the coarse street view (i.e. the output of Generator 1) as input and generates a fine version of the street view. This re-
fined street level view is the final output.
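To make the data flow concrete, the following PyTorch-style sketch shows how the three generators could be chained; the function and argument names (cascade_forward, G1, G2, G3) are illustrative placeholders rather than the authors' released code.

```python
import torch

def cascade_forward(G1, G2, G3, aerial, street_semantic_map):
    """Sketch of the three-stage cascade (assumed interfaces).

    aerial:              (B, 3, 256, 256) aerial view image
    street_semantic_map: (B, 3, 256, 256) semantic map of the street view, used as guidance
    """
    # Stage 1: coarse street view from the aerial image and the street-view semantic map.
    coarse_street = G1(torch.cat([aerial, street_semantic_map], dim=1))

    # Stage 2: semantic map predicted from the coarse street view; it is only
    # compared against the ground-truth map via an L1 loss to guide training.
    predicted_semantic = G2(coarse_street)

    # Stage 3: refined street view, the final output of the pipeline.
    refined_street = G3(coarse_street)

    return coarse_street, predicted_semantic, refined_street
```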
Generator Architecture Choices:
U-Net: CGANs for image to image translation (Isola
et al., 2017) used U-Nets, as they share low level fea-
tures between the input and output views by skip con-
nections between layer i and layer n − i of the stan-
dard encoder-decoder module. This is appropriate for
image to image translation, where input and output
images share low level features such as color, shape,
dimensions, orientation, etc. We use the implementa-
tion of U-Net by (Isola et al., 2017), which can also
be used with fewer filters for coarse image generation,
greatly reducing the number of parameters of the net-
work and saving on training time.
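A heavily downsized sketch of such a U-Net generator is shown below; it keeps only three encoder/decoder levels to highlight the skip connections between layer i and layer n − i, whereas the actual network of (Isola et al., 2017) is deeper, so all layer widths here are assumptions.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style generator (illustrative only): encoder features from
    layer i are concatenated with decoder features at layer n - i."""
    def __init__(self, in_ch=3, out_ch=3, nf=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, nf, 4, 2, 1), nn.LeakyReLU(0.2, True))
        self.enc2 = nn.Sequential(nn.Conv2d(nf, nf * 2, 4, 2, 1), nn.InstanceNorm2d(nf * 2), nn.LeakyReLU(0.2, True))
        self.enc3 = nn.Sequential(nn.Conv2d(nf * 2, nf * 4, 4, 2, 1), nn.InstanceNorm2d(nf * 4), nn.LeakyReLU(0.2, True))
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(nf * 4, nf * 2, 4, 2, 1), nn.InstanceNorm2d(nf * 2), nn.ReLU(True))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(nf * 4, nf, 4, 2, 1), nn.InstanceNorm2d(nf), nn.ReLU(True))
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(nf * 2, out_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)    # 256x256 -> 128x128
        e2 = self.enc2(e1)   # -> 64x64
        e3 = self.enc3(e2)   # -> 32x32
        d3 = self.dec3(e3)                              # 32x32 -> 64x64
        d2 = self.dec2(torch.cat([d3, e2], dim=1))      # skip from enc2, -> 128x128
        d1 = self.dec1(torch.cat([d2, e1], dim=1))      # skip from enc1, -> 256x256
        return d1
```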
ResNet: Residual networks for image classification
(He et al., 2016) replaced the layers in the middle
of the network by blocks with skip connections, con-
necting alternate blocks. In our two image transla-
tion tasks (street view image generation and coarse to
fine image refinement) the input and output images
have high level of structural similarities, so we use
a ResNet style architecture. There are 3 blocks, each
downsampling by a factor of 2, so a 256x256 image is
downsampled to 64x64 with 256 channels. It is then
passed through 6 ResNet blocks, containing a convo-
lutional block that does not change the image dimen-
sions. The output of block i is added to the output of block i+1, which is then given as input to block i+2. After the ResNet blocks, the image is passed to 3 upsampling blocks to bring the output back to a 256x256 image.
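The block below sketches a ResNet-style generator of this kind; the exact number of downsampling stages and the channel widths are assumptions in the spirit of the commonly used image-translation ResNet generator, not a faithful reproduction of our networks.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """Residual block that keeps spatial size and channel count fixed;
    the block's input is added to its output (shortcut over two convolutions)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)

class ResnetGenerator(nn.Module):
    """Illustrative ResNet-style generator (assumed layout): downsample
    256x256 -> 64x64 at 256 channels, 6 residual blocks, then upsample back."""
    def __init__(self, in_ch=3, out_ch=3, nf=64, n_blocks=6):
        super().__init__()
        layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_ch, nf, 7), nn.InstanceNorm2d(nf), nn.ReLU(True)]
        # stride-2 convolutions: 256x256 -> 128x128 -> 64x64, channels 64 -> 128 -> 256
        for i in range(2):
            layers += [nn.Conv2d(nf * 2**i, nf * 2**(i + 1), 3, stride=2, padding=1),
                       nn.InstanceNorm2d(nf * 2**(i + 1)), nn.ReLU(True)]
        layers += [ResnetBlock(nf * 4) for _ in range(n_blocks)]
        # transposed convolutions bring the feature map back to 256x256
        for i in range(2, 0, -1):
            layers += [nn.ConvTranspose2d(nf * 2**i, nf * 2**(i - 1), 3, stride=2, padding=1, output_padding=1),
                       nn.InstanceNorm2d(nf * 2**(i - 1)), nn.ReLU(True)]
        layers += [nn.ReflectionPad2d(3), nn.Conv2d(nf, out_ch, 7), nn.Tanh()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```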
ResU-Net++: ResU-Net++ (Jha et al., 2019) com-
bines key features of U-Net and ResNet. It has skip
connections between block i and block n − i simi-
lar to U-Net, and skip connections between consec-
utive blocks similar to ResNet. Attention has recently
been proven useful for image translation (Zhang et al.,
2019), (Kim et al., 2019), and is explored here in the
context of cross view image translation and coarse to
fine image refinement. ResU-Net++ has never been
used previously for image translation related tasks.
We examine it for our three subtasks, since U-Net and
ResNet have already proven to be effective for image
related tasks using CGAN, and its inclusion of atten-
tion is expected to show promising results.
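As an illustration of the kind of attention used in the decoders, the sketch below shows a simplified additive attention gate that re-weights encoder skip features with a decoder-derived spatial mask before concatenation. This is an assumption-laden simplification: the actual ResU-Net++ block of (Jha et al., 2019) is more elaborate (it also uses squeeze-and-excitation and ASPP modules), and the formulation here is closer to a generic additive attention gate.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Simplified decoder attention gate (illustrative, not the exact ResU-Net++ block):
    decoder features produce a spatial weighting that is applied to the encoder skip
    features before concatenation, letting the decoder focus on relevant regions."""
    def __init__(self, skip_ch, dec_ch, inter_ch=64):
        super().__init__()
        self.project_skip = nn.Conv2d(skip_ch, inter_ch, 1)
        self.project_dec = nn.Conv2d(dec_ch, inter_ch, 1)
        self.score = nn.Sequential(nn.ReLU(True), nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, skip_feat, dec_feat):
        # dec_feat is assumed to be upsampled to the spatial size of skip_feat
        attn = self.score(self.project_skip(skip_feat) + self.project_dec(dec_feat))  # (B,1,H,W)
        gated_skip = skip_feat * attn            # re-weighted encoder features
        return torch.cat([gated_skip, dec_feat], dim=1)
```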
PatchGAN Discriminator: (Pathak et al., 2016) in-
troduced pixel loss (L1 or L2) in addition to adver-
sarial GAN loss. We use the L1 loss, as it leads to less blurring than the L2 loss (Zhao et al., 2016), (Larsen et al., 2016). (Isola et al., 2017) argued that the L1 loss captures low frequencies but fails to capture high frequencies, so we use a 70x70 PatchGAN discriminator, which gives the sharpest images in both the spatial and spectral domains.
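For reference, a 70x70 PatchGAN discriminator in the usual pix2pix layout is sketched below; the channel widths and the use of instance normalization are assumptions based on common implementations.

```python
import torch
import torch.nn as nn

class PatchGAN70(nn.Module):
    """70x70 PatchGAN discriminator in the style of (Isola et al., 2017): each value
    of the 30x30 output map scores a 70x70 patch of the input pair as real or fake."""
    def __init__(self, in_ch=6, nf=64):   # in_ch = 6: aerial view concatenated with a street view
        super().__init__()
        def block(ic, oc, stride, norm=True):
            layers = [nn.Conv2d(ic, oc, 4, stride, 1)]
            if norm:
                layers.append(nn.InstanceNorm2d(oc))
            layers.append(nn.LeakyReLU(0.2, True))
            return layers
        self.net = nn.Sequential(
            *block(in_ch, nf, 2, norm=False),   # 256 -> 128
            *block(nf, nf * 2, 2),              # 128 -> 64
            *block(nf * 2, nf * 4, 2),          # 64 -> 32
            *block(nf * 4, nf * 8, 1),          # 32 -> 31
            nn.Conv2d(nf * 8, 1, 4, 1, 1),      # 31 -> 30, patch-level real/fake logits
        )

    def forward(self, conditioning, image):
        return self.net(torch.cat([conditioning, image], dim=1))
```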
Loss Function Formulation: In CGANs, both the
discriminator and generator receive the conditioning
variable (aerial view image in this case). For the dis-
criminator, we first construct fake sample pairs using
the aerial view and the street view from the generator,
pass this through the discriminator, and calculate the
loss based on the prediction by the discriminator (2nd
component in Eqs. 1, 2). Next, we do the same with
real sample pairs (aerial view and ground truth street
view) (1st component in Eqs. 1, 2). The discriminator's combined adversarial loss $L_{adv1}$ for distinguishing the generated coarse street view ($I_g'$) from the ground truth street view ($I_g$) is the sum of these two terms:

$$L_{adv1} = \min_G \max_D L_{cGAN}(G, D_1) = \mathbb{E}_{I_a, I_g}[\log D(I_a, I_g)] + \mathbb{E}_{I_a, I_g'}[\log(1 - D(I_a, I_g'))] \quad (1)$$
The discriminator adversarial loss $L_{adv2}$ for the refined generated street view image ($I_g''$) is:

$$L_{adv2} = \min_G \max_D L_{cGAN}(G, D_2) = \mathbb{E}_{I_a, I_g}[\log D(I_a, I_g)] + \mathbb{E}_{I_a, I_g''}[\log(1 - D(I_a, I_g''))] \quad (2)$$
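In code, these adversarial objectives can be evaluated as in the minimal sketch below. The discriminator is assumed to output logits (as in the PatchGAN sketch above), so the binary cross-entropy-with-logits form is used, which is equivalent to the log-likelihood terms of Eqs. 1-2; the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def discriminator_adv_loss(D, aerial, real_street, fake_street):
    """Conditional adversarial loss of Eqs. (1)-(2) for one discriminator."""
    # Real pairs (aerial, ground-truth street view) should be classified as 1.
    pred_real = D(aerial, real_street)
    loss_real = F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
    # Fake pairs (aerial, generated street view) should be classified as 0;
    # the generator output is detached when updating the discriminator.
    pred_fake = D(aerial, fake_street.detach())
    loss_fake = F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
    return loss_real + loss_fake

def generator_adv_loss(D, aerial, fake_street):
    """Generator side of the same objective: fool D into predicting 'real'."""
    pred_fake = D(aerial, fake_street)
    return F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
```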
L1 Loss: Each of the 3 generators produces an image
which is compared to the ground truth with L1 loss.
Between the coarse ($I_g'$) and ground truth street view image ($I_g$):

$$L1_{gen1} = \min_G L1(G) = \mathbb{E}_{I_g, I_g'}[\|I_g - I_g'\|_1] \quad (3)$$
Between the generated semantic map ($I_s'$) and the ground truth semantic map of the street view ($I_s$):

$$L1_{gen2} = \min_G L1(G) = \mathbb{E}_{I_s, I_s'}[\|I_s - I_s'\|_1] \quad (4)$$
Between the refined ($I_g''$) and ground truth street view image ($I_g$):

$$L1_{gen3} = \min_G L1(G) = \mathbb{E}_{I_g, I_g''}[\|I_g - I_g''\|_1] \quad (5)$$
The total L1 loss is then defined as:

$$L_{L1} = \min_G (L1_{gen1} + L1_{gen2} + L1_{gen3}) \quad (6)$$
Overall Loss: The overall loss is the weighted sum of the adversarial and L1 losses. The weighting factors $\lambda_1$ and $\lambda_2$ are hyper-parameters which can be tuned:

$$\min_G \max_D L = L_{adv1} + \lambda_1 L_{adv2} + \lambda_2 L_{L1} \quad (7)$$
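A minimal sketch of how the individual terms are combined according to Eqs. 6-7 is shown below; the default λ values are the ones selected in the following paragraphs, and the function name is illustrative.

```python
def total_generator_loss(adv1, adv2, l1_gen1, l1_gen2, l1_gen3,
                         lam1=5.0, lam2=100.0):
    """Weighted generator objective of Eq. (7); lam1 and lam2 follow the
    values chosen below (assumed helper, not the authors' code)."""
    l1_total = l1_gen1 + l1_gen2 + l1_gen3           # Eq. (6)
    return adv1 + lam1 * adv2 + lam2 * l1_total      # Eq. (7)
```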
Selecting $\lambda_1$ and $\lambda_2$: In Eq. 7, the regularization term $\lambda_1$ denotes the importance of reducing the adversarial loss in coarse to fine image refinement, in comparison to coarse street view generation. Higher values indicate that discriminating the refined street view from the ground truth is more difficult than discriminating the coarse view, so $\lambda_1 > 1$ seems logical. Experiments with $\lambda_1 \in \{0.5, 1, 5, 10\}$ showed $\lambda_1 = 5$ to be optimal.
$\lambda_2$ is the regularization term for reducing the L1 loss compared to the other terms in the overall loss. (Isola et al., 2017) observed that, for image to image translation, the discriminator trains faster than the generator (i.e. it can identify fake images before the generator can keep up with generating realistic images, so the generator fails to train efficiently). Also, color is very important in image translation, especially during viewpoint
translation, where objects in the scene (mountains, sky, etc.) have a narrow range of realistic spectral values for humans. This motivates using a high value of $\lambda_2$ ($\lambda_2 = 10$). For cross view image translation, (Regmi and Borji, 2018) found it should be even higher and used $\lambda_2 = 100$, which is also used here. Tests using higher or lower values of $\lambda_2$ validated this choice.
Network Training: In order to optimize the network,
we perform gradient descent on the discriminator and
the generator alternately. We first train the discrim-
inator and update weights via backpropagation, keep-
ing the 3 generators fixed. Then we fix the discrimi-
nator weights, train the 3 generators and update their
weights via backpropagation. The process is end-to-
end and continues for 30 epochs. Similarly to (Regmi
and Borji, 2018) and (Tang et al., 2019), an Adam op-
timizer is used for both discriminator and generator
training with initial learning rate = 0.0002 and mo-
ment parameters β
1
= 0.5 and β
2
= 0.999. For each
epoch, the intermediate model is saved so results can
be compared and progress can be monitored.
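One alternating optimization step could look like the sketch below, which reuses the illustrative helpers defined earlier (cascade_forward, discriminator_adv_loss, generator_adv_loss); the structure follows the description above, but the code itself is an assumption, not the released implementation.

```python
import itertools
import torch

def make_optimizers(D1, D2, G1, G2, G3):
    """Adam optimizers with the hyper-parameters stated above."""
    opt_d = torch.optim.Adam(itertools.chain(D1.parameters(), D2.parameters()),
                             lr=2e-4, betas=(0.5, 0.999))
    opt_g = torch.optim.Adam(itertools.chain(G1.parameters(), G2.parameters(), G3.parameters()),
                             lr=2e-4, betas=(0.5, 0.999))
    return opt_d, opt_g

def train_step(D1, D2, G1, G2, G3, opt_d, opt_g,
               aerial, street_gt, sem_gt, lam1=5.0, lam2=100.0):
    """One alternating update: discriminators first, then the three generators."""
    coarse, sem_pred, refined = cascade_forward(G1, G2, G3, aerial, sem_gt)

    # Discriminator update (generator outputs are detached inside the loss helper).
    opt_d.zero_grad()
    d_loss = (discriminator_adv_loss(D1, aerial, street_gt, coarse)
              + discriminator_adv_loss(D2, aerial, street_gt, refined))
    d_loss.backward()
    opt_d.step()

    # Generator update (discriminator weights stay fixed: opt_g does not hold them).
    opt_g.zero_grad()
    adv1 = generator_adv_loss(D1, aerial, coarse)
    adv2 = generator_adv_loss(D2, aerial, refined)
    l1_total = ((street_gt - coarse).abs().mean()
                + (sem_gt - sem_pred).abs().mean()
                + (street_gt - refined).abs().mean())
    g_loss = adv1 + lam1 * adv2 + lam2 * l1_total    # Eq. (7)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```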
4 EXPERIMENTS, RESULTS
Dataset: The dataset used is Crossview USA
(CVUSA) (Workman et al., 2015), with millions of
pairs of aerial images obtained from Bing Maps and
ground-level images from Google Street view. For
our task of cross view image translation, we use a
subset of CVUSA created by (Zhai et al., 2017), with
35,552 training and 8,884 testing tuples. This exact
split and distribution of images is used to compare our
results with the SoA. The semantic segmentation map
of the ground-level view is included with the aerial
and ground-level images, to guide the image genera-
tion process. As shown in Figure 2, a sample image
consists of a tuple (aerial view, street view, semantic
segmentation of street view). The resolution is 768 × 256 (i.e. each image in the tuple has a resolution of 256 × 256). In order to train the model, the tuples have to be cropped and converted into separate paired images, as sketched below.
Figure 2: The entire image is divided into 3 windows of
equal width and height (256x256) showing different views
of the same place.
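A possible way to split each 768 × 256 training image into its three views is sketched below; the left-to-right ordering (aerial, street view, semantic map) is assumed to match the tuple described above.

```python
from PIL import Image

def split_cvusa_tuple(path):
    """Split one 768x256 CVUSA image into three 256x256 views.

    Assumes the views are laid out left to right as
    (aerial view, street view, semantic segmentation of the street view).
    """
    img = Image.open(path)
    w, h = img.size                                # expected (768, 256)
    aerial = img.crop((0, 0, w // 3, h))           # left third
    street = img.crop((w // 3, 0, 2 * w // 3, h))  # middle third
    semantic = img.crop((2 * w // 3, 0, w, h))     # right third
    return aerial, street, semantic
```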
In this dataset, the street level views are consistent
in attributes like height from the ground and direction,
so the network can learn to generate street level im-
ages for this consistent set of attribute values. There-
fore data augmentation techniques like flipping, rota-
tion etc are not applied, as they would change the di-
rection attribute, breaking the consistency. However,
all existing methods for this task using this dataset
perform random cropping from 256x256 to 224x224
and then resizing back to 256x256. Since we want to
compare the performance of our proposed model with
the SoA, we applied this cropping and resizing.
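With torchvision, this crop-and-resize step (and the absence of flips or rotations) can be expressed as follows; the normalization to [-1, 1] is an assumption matching Tanh generator outputs.

```python
import torchvision.transforms as T

# Random 224x224 crop followed by a resize back to 256x256, as used by prior
# work on CVUSA; no flipping or rotation, to preserve the viewing direction.
train_transform = T.Compose([
    T.RandomCrop(224),
    T.Resize(256),
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # assumed [-1, 1] scaling
])
```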
Qualitative Evaluation of Generated Images: To visually assess the performance of the GAN generators in producing photo-realistic images of a viewpoint, we inspect structure, color, and textures.
Quantitative Evaluation of Generated Images:
Qualitative assessment is subjective, time consuming
and labor intensive, so the quantitative metrics are
also measured. They are also used in the SoA, al-
lowing us to quantitatively compare our results to it.
Pixel Level Similarity Measures: Peak Signal to
Noise Ratio (PSNR) between a reconstructed and in-
put view: high PSNR indicates better reconstruction.
Sharpness Difference (SD), for loss in sharpness dur-
ing viewpoint reconstruction. Structural-Similarity
(SSIM), between the ground truth and generated im-
age based on structure, luminance, and contrast. A
higher value of SSIM indicates higher similarity.
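PSNR and SSIM can be computed with scikit-image as sketched below (SD is omitted here, since its exact formulation is not given in this paper); the channel_axis argument assumes a recent scikit-image version.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def pixel_metrics(generated, ground_truth):
    """PSNR and SSIM between a generated street view and its ground truth,
    both given as uint8 HxWx3 arrays."""
    psnr = peak_signal_noise_ratio(ground_truth, generated, data_range=255)
    ssim = structural_similarity(ground_truth, generated,
                                 channel_axis=-1, data_range=255)
    return psnr, ssim
```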
High Level Feature Similarity Measures: Incep-
tion Score (IS), to evaluate the diversity of a generated
image within a class while simultaneously being rep-
resentative of that class. We use AlexNet (Krizhevsky
et al., 2012) trained on Places (Zhou et al., 2017) with
365 categories, as in (Regmi and Borji, 2018). Then
we calculate the probability of the generated street
view image belonging to each class, for Top-1 and
Top-5 Inception Scores. Kullback-Leibler divergence
(KL) is also used. The probability distribution is cal-
culated by passing the image through the pre-trained
model used for the IS calculation.
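The KL score can then be computed from the two predicted class distributions as sketched below; the direction of the divergence (generated relative to real) follows common practice for this metric but is an assumption here.

```python
import numpy as np

def kl_score(p_generated, p_real, eps=1e-12):
    """KL divergence between the 365-way class distributions predicted by the
    pre-trained Places model for a generated image and for its ground truth."""
    p = np.clip(np.asarray(p_generated, dtype=np.float64), eps, 1.0)
    q = np.clip(np.asarray(p_real, dtype=np.float64), eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```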
Figure 3: Generated semantic segmentation vs ground truth.
Architecture Selection: U-Net performs extremely
well in the semantic segmentation of Generator 2
(Figure 3). Generator 2 generates a semantic segmen-
tation map of the street view, trained using an L1 loss
and added to the overall loss of the generator to help
the other two generators train faster.
Semantic segmentation is only used for loss guid-
ance, while the other two generators actually produce
street level views, so we fix the architecture of Gen-
erator 2 as U-Net, reducing the number of configu-
rations to be tried from 27 to 9. The 9 configura-
tions, with U-Net as Generator 2, contain the follow-
ing combinations for Generators 1 and 3 (G1, G3): (a)
U-Net, U-Net, (b) U-Net, ResNet, (c) U-Net, ResU-
Net++, (d) ResNet, U-Net, (e) ResNet, ResNet, (f)
ResNet, ResU-Net++, (g) ResU-Net++, U-Net, (h)
ResU-Net++, ResNet, (i) ResU-Net++, ResU-Net++.
Figure 4: Ground truth image; Left window is the aerial
view and the right window is the street view.
Figure 5: Generated refined street view for models (a)-(i).
Figure 5 shows the generated street view of these
configurations for the ground truth image in Figure 4.
Some observations on our results are listed below.
1. The color reproduction of all the tried configurations seems accurate compared to the ground truth. This shows that the L1 loss works well.
2. ResNet performs poorly as G1, which deals with
coarse street views, i.e. structure. ResNet lacks
skip connections between encoder and decoder
(unlike U-Net, ResU-Net++), which are important
for structure in viewpoint translation (Isola et al.,
2017). The quality of images generated in G3 is also lower; hence, ResNet is not ideal for viewpoint translation or image refinement.

Table 1: High level feature quantitative metrics for the 9 configurations (higher is better, except for KL Div.).

Config | Top-1 Acc. (%) | Top-5 Acc. (%) | IS (All) | IS (Top-1) | IS (Top-5) | KL Div.
(a) | 50.45 | 79.84 | 3.33 | 2.38 | 3.58 | 4.98 ± 1.25
(b) | 47.29 | 75.92 | 3.23 | 2.30 | 3.47 | 9.87 ± 1.60
(c) | 58.23 | 87.91 | 3.76 | 2.67 | 3.89 | 2.84 ± 0.93
(d) | 31.82 | 57.65 | 3.16 | 2.28 | 3.36 | 14.09 ± 1.65
(e) | 25.65 | 54.32 | 3.09 | 2.12 | 3.23 | 17.61 ± 1.63
(f) | 33.21 | 62.39 | 3.13 | 2.22 | 3.26 | 13.29 ± 1.66
(g) | 57.49 | 88.51 | 3.32 | 2.39 | 3.449 | 6.27 ± 1.52
(h) | 49.45 | 77.49 | 3.29 | 2.29 | 3.32 | 8.27 ± 1.56
(i) | 55.06 | 85.23 | 3.58 | 2.54 | 3.66 | 3.55 ± 1.05
3. Best Configuration, Role of Attention: The best
results are obtained with (c) (Generator 1: U-Net, Generator 3: ResU-Net++) and (i) (Generator 1: ResU-Net++, Generator 3: ResU-Net++), so they are evaluated quantitatively. ResU-Net++ has skip connections between encoders and decoders with an attention module in the decoders. Attention improves the ability of the network to convert coarse generated images into finer-grained ones by focusing on relevant parts of the image.
4. The output using configuration (g) has structure
and details comparable to the other 2 selected
configurations. However, on closer inspection of multiple test images, it appears to introduce random, brightly colored artifacts (Figure 6).
Figure 6: Artifacts with G1: ResU-Net++, G3: U-Net.
Quantitative Assessment of Generated Images: Ta-
bles 1 and 2 show quantitative results for all 9 config-
urations. The two configurations with the best quali-
tative results also show the best quantitative results.
Configuration (g) had some comparable/better met-
rics (Top-5 accuracy and SD), but introduces artifacts
(Figure 6), so we do not examine it further.
Table 2: Pixel level quantitative metrics of the 9 configurations.

Config | PSNR | SSIM | SD
(a) | 21.549 | 0.479 | 18.707
(b) | 21.267 | 0.469 | 18.732
(c) | 22.355 | 0.509 | 19.459
(d) | 19.728 | 0.456 | 17.170
(e) | 18.521 | 0.428 | 16.091
(f) | 19.853 | 0.449 | 17.670
(g) | 20.473 | 0.483 | 19.893
(h) | 21.161 | 0.474 | 18.876
(i) | 21.769 | 0.498 | 19.482
The Two Best Configurations: As seen from the
qualitative and quantitative assessment, configuration (c) (U-Net, ResU-Net++), which we will call Model 1, and configuration (i) (ResU-Net++, ResU-Net++), which we will call Model 2, give the best street view transformation images. Figure 7 shows that the generated refined street views of both models are similar to the ground truth for different test samples, encompassing various viewpoints. The semantic map
encompassing various viewpoints. The semantic map
provided is not detailed, so some inaccuracies in it
carry over to the generated street view images.
Figure 7: Model 1, 2 refined generated street view.
Coarse vs Refined: We first generate a coarse in-
termediate image with Generator 1 and refine it with
Generator 3. Figure 8 shows that the coarse output of Model 1 lacks detail compared to Model 2. This can be attributed to the fact that, for Generator 1, Model 1 uses U-Net and Model 2 uses ResU-Net++, the latter being more complex despite having fewer parameters. For Model 2, the coarse image displays the same artifacts that appeared in the refined image of configuration (g); these are mostly resolved in Model 2's refined image, which also adds a few details. Model 1, on the other hand, recovers more details
in the image refinement process resulting in a starker
difference between the coarse and refined images.
In conclusion, the refined output of Model 1 is
slightly better than Model 2, so Model 1 (G1: U-Net,
G2: U-Net, G3: ResU-Net++) is our final chosen
model. However, it should be noted that the coarse in-
termediate output of Model 2 (G1: ResU-Net++, G2:
U-Net, G3: ResU-Net++) shows promising results, so
there is room for further improvement if a suitable
image refinement process can be designed for it.
Figure 8: Model 1, 2 coarse vs refined street view.
Figure 9: Comparison of refined street view generated for
Model 1 with and without semantic segmentation map.
Removing Semantic Guidance: We investigate the
effects of removing semantic segmentation from the
input, to obtain insights into the limitations of this
task. In Figure 9, we observe that we usually ob-
tain photo-realistic images (or parts that are photo-
realistic) of the street view even when semantic maps
are not used. This shows the model learns to generate
the structure of objects in the street, although there re-
mains a significant difference when compared to the
ground truth, as there is no way of knowing the ex-
act dimensions of the objects (e.g. the height of trees
or the design of the windows/doors of houses). How-
ever, there are still some improvements that can be
made with regard to the general layout of the objects.
Figure 10: Ablation study: Comparison of 4 baselines.
Table 3: High level feature quantitative metrics for the ablation study (higher is better, except for KL Div.).

Baseline | Top-1 Acc. (%) | Top-5 Acc. (%) | IS (All) | IS (Top-1) | IS (Top-5) | KL Div.
A | 44.18 | 81.50 | 3.12 | 2.28 | 3.25 | 9.47 ± 1.83
B | 44.91 | 82.98 | 3.17 | 2.33 | 3.29 | 9.11 ± 1.85
C | 57.44 | 87.17 | 3.70 | 2.64 | 3.79 | 3.59 ± 1.13
D | 58.23 | 87.91 | 3.76 | 2.67 | 3.89 | 2.84 ± 0.93
Ablation Study: We perform an ablation study to see
how each of the networks in the pipeline affects the
quality of images generated, and their qualitative and
quantitative relative importance. We compare 4 base-
lines: Baseline A uses G1. Baseline B uses G1 and
G2. Baseline C uses G1 and G3. Baseline D uses all
3 generators. From the results of the ablation study
(Figure 10, Tables 3, 4), Baselines A and B are similar to each other, with B being slightly better. Although adding the semantic segmentation generator to the pipeline (Baseline B) improves performance, the impact is not as significant as in Baseline C, where coarse to fine image refinement is added to coarse street view generation. The metrics of Baselines C and D are quite similar and much better than those of Baselines A and B. This again emphasizes the importance of image refinement in this pipeline.
Comparison with SoA: In this section, we compare
the results from Model 1 and Model 2 with existing
methods: SelectionGAN (Tang et al., 2019), X-Fork
and X-Seq (Regmi and Borji, 2018), Zhai et al. (Zhai
et al., 2017), Pix2Pix (Isola et al., 2017).
Table 4: Pixel level quantitative metrics for the ablation study.

Baseline (Generators used) | PSNR | SSIM | SD
A (1) | 21.098 | 0.456 | 18.809
B (1 and 2) | 21.032 | 0.461 | 19.016
C (1 and 3) | 22.090 | 0.498 | 19.312
D (All 3) | 22.355 | 0.509 | 19.459
Table 5: High level feature metrics for existing methods (higher is better, except for KL Div.).

Method | Top-1 Acc. (%) | Top-5 Acc. (%) | IS (All) | IS (Top-1) | IS (Top-5) | KL Div.
Pix2Pix | 41.87 | 72.87 | 3.26 | 2.42 | 3.51 | 9.47 ± 1.69
Zhai et al. | 14.03 | 52.29 | 1.84 | 1.52 | 1.87 | 27.43 ± 1.63
X-Fork | 49.65 | 81.16 | 3.38 | 2.54 | 3.57 | 7.18 ± 1.56
X-Seq | 54.61 | 83.46 | 3.82 | 2.67 | 4.01 | 5.19 ± 1.31
SelectionGAN | 65.51 | 89.66 | 3.81 | 2.72 | 3.92 | 2.96 ± 0.97
Our Model 1 | 58.23 | 87.91 | 3.76 | 2.67 | 3.89 | 2.84 ± 0.93
Table 6: Pixel level metrics for existing methods.

Method | PSNR | SSIM | SD
Pix2Pix | 21.57 | 0.46 | 18.90
Zhai et al. | 17.49 | 0.42 | 16.62
X-Fork | 21.65 | 0.48 | 18.99
X-Seq | 21.67 | 0.47 | 18.99
SelectionGAN | 23.15 | 0.53 | 19.61
Proposed Model 1 | 22.36 | 0.51 | 19.46
Quantitative Comparison: Tables 5 and 6 show that our Model 1 outperforms all existing models except SelectionGAN on all metrics (the only exception being the Inception Score, where it is below X-Seq). Compared to SelectionGAN, our proposed Model 1 has comparable performance on all metrics and beats SelectionGAN on KL Divergence.
Figure 11: Comparison of our Model 1 with SoA methods.
Qualitative Comparison: In Figure 11 we qualita-
tively compare the results of Model 1 with existing
methods for a test image. The output images of the
other models are taken from (Tang et al., 2019), so it is only fair to compare structure, not image resolution. We see that our model performs
comparably well to the SoA SelectionGAN, maintain-
ing the structure of objects. The other models do not
maintain structure that well, and produce significant
distortion, or miss some objects in the scene com-
pletely (for example the house in the background).
5 CONCLUSIONS, FUTURE
WORK
In this paper we examine cross view image transla-
tion, generating a street view from the correspond-
ing aerial view using a cascade pipeline, where coarse
street view image generation, semantic segmentation,
and image refinement are combined and trained together. We tested the SoA generator models U-Net, ResNet, and ResU-Net++ and found that the best results were obtained for the configuration (Generator 1: U-Net, Generator 2: U-Net, Generator 3: ResU-Net++). This demonstrates the importance of skip connections for street view generation and of attention for
image refinement. The role of each of the 3 subtasks
in the pipeline was studied and it was concluded that
each subtask improved overall performance qualita-
tively and quantitatively. Future work includes inves-
tigating appropriate networks for further refinement
of the output images to address artifacts related to per-
spective projection, and how to incorporate varying
sources of input data (such as aerial input data from
drones at varying heights, or video input).
REFERENCES
Choi, Y., Choi, M., Kim, M., Ha, J.-W., Kim, S., and Choo,
J. (2018). Stargan: Unified generative adversarial net-
works for multi-domain image-to-image translation.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 8789–8797.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017).
Image-to-image translation with conditional adversar-
ial networks. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages
1125–1134.
Jha, D., Smedsrud, P. H., Riegler, M. A., Johansen, D.,
De Lange, T., Halvorsen, P., and Johansen, H. D.
(2019). Resunet++: An advanced architecture for
medical image segmentation. In 2019 IEEE Interna-
tional Symposium on Multimedia (ISM), pages 225–
2255. IEEE.
Kim, J., Kim, M., Kang, H., and Lee, K. (2019). U-gat-
it: Unsupervised generative attentional networks with
adaptive layer-instance normalization for image-to-
image translation. arXiv preprint arXiv:1907.10830.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. Advances in neural information processing
systems, 25:1097–1105.
Larsen, A. B. L., Sønderby, S. K., Larochelle, H., and
Winther, O. (2016). Autoencoding beyond pixels
using a learned similarity metric. In International
conference on machine learning, pages 1558–1566.
PMLR.
Mirza, M. and Osindero, S. (2014). Conditional generative
adversarial nets. arXiv preprint arXiv:1411.1784.
Odena, A., Olah, C., and Shlens, J. (2017). Conditional
image synthesis with auxiliary classifier gans. In In-
ternational conference on machine learning, pages
2642–2651. PMLR.
Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., and
Efros, A. A. (2016). Context encoders: Feature learn-
ing by inpainting. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2536–2544.
Regmi, K. and Borji, A. (2018). Cross-view image synthe-
sis using conditional gans. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 3501–3510.
Tang, H., Xu, D., Sebe, N., Wang, Y., Corso, J. J., and
Yan, Y. (2019). Multi-channel attention selection gan
with cascaded semantic guidance for cross-view im-
age translation. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 2417–2426.
Toker, A., Zhou, Q., Maximov, M., and Leal-Taixe, L.
(2021). Coming down to earth: Satellite-to-street view
synthesis for geo-localization. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 6488–6497.
Workman, S., Souvenir, R., and Jacobs, N. (2015). Wide-
area image geolocalization with aerial reference im-
agery. In Proceedings of the IEEE International Con-
ference on Computer Vision, pages 3961–3969.
Zhai, M., Bessinger, Z., Workman, S., and Jacobs, N.
(2017). Predicting ground-level scene layout from
aerial imagery. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition,
pages 867–875.
Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A.
(2019). Self-attention generative adversarial net-
works. In International conference on machine learn-
ing, pages 7354–7363. PMLR.
Zhao, H., Gallo, O., Frosio, I., and Kautz, J. (2016).
Loss functions for image restoration with neural net-
works. IEEE Transactions on computational imaging,
3(1):47–57.
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Tor-
ralba, A. (2017). Places: A 10 million image database
for scene recognition. IEEE transactions on pattern
analysis and machine intelligence, 40(6):1452–1464.