Towards Generating 3D City Models with GAN and Computer Vision Methods
Sarun Poolkrajang (https://orcid.org/0009-0004-1723-8258) and Anand Bhojan (https://orcid.org/0000-0001-8105-1739)
School of Computing, National University of Singapore, Singapore
Keywords:
Generative Adversarial Networks, Neural Networks, Computer Vision, City Generation.
Abstract:
City generation for video games is a resource- and time-consuming task. With the increasing popularity of open-world games, studies on building virtual environments have become increasingly important for the game research community and industry. The game development team must engage in urban planning, designate important locations, create population assets, integrate game design, and assemble these elements into a cohesive-looking city. To the best of our knowledge and based on our survey, we are the first to propose a holistic approach that integrates all features in generating a city, including the natural features surrounding it. We employ a generative adversarial network architecture to create a realistic layout of an entire city from scratch. Subsequently, we utilize classical computer vision techniques to post-process the layout into separate features. The chosen model is a simple Convolutional GAN, trained on a modest dataset of 2x2 km snippets from over two thousand cities around the world. Although the method is somewhat constrained by the resolution of the images, the results indicate that it can serve as a solid foundation for building realistic 3D cities.
1 INTRODUCTION
Open-world games have been a popular genre of big-budget, "AAA" (Triple-A) games for the past few years now. Though the setting of the games ranges from medieval to futuristic, there remains a common denominator: most of them feature some form of city or settlement area. Be it for story or atmospheric purposes, there is no question it is an integral part of the genre. However, due to the amount of assets required to fully realize a city in-game, many developers find building them difficult and tedious, while players are often left with small two-block villages being labeled cities.
Despite the large leaps in consumer-grade graphical performance, city sizes in video games have yet to approach those of their contemporary real-life counterparts. Among the biggest sits Cyberpunk 2077, estimated at around 100-130 km². For comparison, the Central Region of Singapore alone spans 132 km². As player appetite for more immersive game worlds increases, developers will need more tools to work with in order to match their audience's expectations. In this paper, we explore the use of a generative adversarial network, which contains a generator (G) with learned weights to generate all the features present in a city from scratch, and a discriminator (D) with learned weights to distinguish between layouts of real cities and fake ones.
1.1 Contributions
We gathered over 10 thousand 2x2 km snippets from two thousand real cities and color-coded the desired features into rasterized images. A Generative Adversarial Network is then trained on the dataset(s) of such images to generate entire maps of new cities. After new maps are generated, we use computer vision techniques to separate the features from the resulting images. Then, we use procedural techniques to generate roads and buildings of different sizes on top of said map in a 3D graphics engine. We inferred that such a method, as opposed to simulation methods, would not require significant GPU resources. We felt this was important, as having the inference process be more resource-intensive than rendering the actual city in 3D may limit practical usage.
1.2 Novelty
Currently, there are very few holistic learning-based
methods for city generation available on the market.
Many proposed methods focused on one aspect of
city generation, be it generating the road network, the
building footprints or textures, or the terrain itself.
Some have proposed simulation methods, which we
believe may yield the best results, but are also very
computationally expensive. Many of the adopted solutions use procedural methods, which we believe may be more limiting compared to deep learning solutions, as proposed procedural methods often result in grid-like shapes, while real cities have more variety.
We propose a holistic framework as a solution.
Every feature relevant to an urban city, be it roads, buildings, houses, parks, or rivers, can be generated together. We believe there is merit to generating all these features together, as the relationships between them affect how cities look in real life.
We also believe a neural network that can run inference on relatively low-end hardware is important. Otherwise, if the city generation process is resource-intensive, smaller developers, who may not have a large team to build a city manually, would be locked out simply because they may not have the budget for a render farm with the latest Quadro/RTX 6000 cards. We trained our model mainly on Google's Cloud TPU services, and performed inference and rendering on an RTX 3070.
2 RELATED WORKS
This section reviews existing works focusing on ur-
ban features and model synthesis. The first part of
this section reviews the currently available tools on
major graphical engines for this task. The second part
explores procedural approaches to the task, and the
third part reviews the approaches that used deep learn-
ing techniques. The final part of this section discusses
existing methods for generating the cities in 3D model
form.
2.1 Industry Methods
In the video game industry, creating cities as back-
drops or foregrounds is a big undertaking (O’Sullivan,
2021). In order to have a fully realized and cohe-
sive city, both the 3D assets and the layout of the
cities must be realistic to a degree. In many in-
stances, the developers/modelers had to make the
cities from scratch, meaning both the layouts and the
minute details throughout the city must also be man-
ually crafted, making it a very time-consuming pro-
cess. Several projects scanned real cities and built their in-world 3D cities on top of the scans, saving time on the urban design process and the overall architecture. For games that are not open-world, the de-
velopers may opt to build only part of a city in detail,
and leave the non-playable areas messy and incoher-
ent. For open-world games, in many instances, they
must build out the whole city. The Grand Theft Auto
franchise is perhaps the most famous one for such
an undertaking, building entire fictional cities, though
they are often based on real ones (Brady, 2023).
Aside from manually constructing an entire city
or scanning a real city for further modifications, pro-
cedural generation is another popular way to gener-
ate cities from scratch. Houdini, a popular procedu-
ral/simulation software, has several rule-sets that can
be used to generate urban features. Recent versions of Unreal Engine adopted a plugin from Houdini to procedurally build a 3D city (Inc., 2023). This
method first generates a layout for roads with one pro-
cedural rule-set, then asks the user to define zones
in the city, then fills up the gaps between the roads
with procedurally generated buildings, based on an-
other rule-set.
Figure 1: Unreal Engine 5’s Tools (Inc., 2023).
Blender also has plugins for reading Geograph-
ical Information System (GIS) data and generating
meshes on top (vvoovv, 2023). This allows for a rel-
atively easy way to generate a 3D model recreation
of a real city. The building shapes are based on data
recorded in OpenStreetMap, including the footprint
shape recorded as vectors. The geometry nodes in the
program also allow for rule-sets to generate cities or buildings from scratch.
Outside the aforementioned industries, there are
a few procedural methods used for urban planning.
ArcGIS claims to have a procedural rule-set that can
be used on top of real GIS data, generating improved
urban plans for real cities (ESRI, 2022).
Figure 2: vvoovv's Blender-OSM tool (vvoovv, 2023).
2.2 Related Literature: Procedural
Methods
There have been several research papers about city
generation using procedural methods. Most of them
can be grouped into two categories: road network
generation and building generation, with a few excep-
tions covering both.
2.2.1 Generating City Layouts
Most procedural methods involve generating a grid,
which would then form the basis of a city, serving as
locations for roads and buildings. These grids are then
broken up with some form of noise or texture, giving
some randomness to the city (see Figure 3). The
buildings are then drawn on top of the generated grid,
and the building models themselves are also gener-
ated with another procedural algorithm.
Figure 3: Many proposed procedural methods involve gen-
erating a straight grid first, before adding noise to it to create
a more realistic road network.
Gao et al. (Gao et al., 2022) proposed a proce-
dural generation method to generate urban road net-
works based on data from OpenStreetMap (OSM).
They used the topological networks recorded within
OSM’s database to represent the road networks, and
proposed a rule-set to generate such networks so that
they can be used to recreate the roads in 3D. Beneš et al. (Beneš et al., 2014) proposed a different method that takes into account rivers and traffic. This method slowly generates smaller concentrations of roads first, mirroring how settlements in the past formed independently before slowly being joined over time into mega-cities. After the roads are generated, they simulated the traffic flow within and between the settlements, and used that as a condition to generate more roads. The iteration repeats and more roads are generated on top, mimicking how, in real life, cities expand with more roads to serve higher traffic demands.
Weber et al. (Weber et al., 2009) proposed what we believe is currently the most detailed method: starting from one user-defined road, they repeatedly generated new roads and buildings on top of the old, modeled as a time series. This mirrors how many city-builder
games start out, with one road cutting through a piece
of land for the player to build around. The method
seems to require natural features, such as a river, to
have been predefined in the user’s input, lest they be
entirely omitted; a limitation we hope to overcome.
2.2.2 Generating Building Models
In terms of generating the buildings, Biljecki et al. (Filip Biljecki, 2016) proposed a procedural method for generating 3D LODs for buildings in a city by acquiring each footprint shape and reducing a generalized building model to fit it. Oliva
(Oliva, 2023) also released a procedural building
generation plug-in for Blender. The plug-in uses a
few pre-existing assets, and then generates the build-
ings based on a given shape of the footprint, with the
number of floors and windows as parameters.
2.2.3 Combined Methods
In combining the two sides, Kelly et al. (Kelly and
McCabe, 2006) surveyed several methods to generate
cities, from the use of Voronoi textures to help generate a road network, to using geometric primitives
to create buildings. Kim et al. (Joon-Seok Kim, 2018) proposed a procedural method to generate cities
starting from a base terrain, then generating a city lay-
out and road networks on top of that. Lin et al. (Bo Lin, 2020) proposed a method to generate smaller video game maps with procedural methods using wave function collapse (WFC). Wave function collapse is an algorithm inspired by quantum mechanics, where a wave function represents the probability distribution of particles in a physical system; it uses a set of input patterns or tiles to generate an output pattern, analyzing local neighborhoods and determining the remaining possible configurations. They also
proposed using simple spatial mappings to generate a
3D voxel model representing the generated city. This
is the closest research to what we are looking for, as
the paper explored both generating a city layout and a
3D post-processing portion. However, their methods
also did not cover the generation of natural features.
2.3 Related Literature: Learning
Methods
Unlike procedural methods, existing papers on this
topic using deep learning or other learning methods
mainly focus on generating city layouts, and less so
on generating the buildings. This section will focus
on the former aspect.
Most papers proposing a learning method use some form of neural network. Zhang et al. proposed MetroGAN (Weiyu Zhang, 2022) for generating a
satellite imagery-like morphology of cities by split-
ting the geographical features into different condi-
tions for the generator. However, the end results
largely contain road networks and little else, making
them unsuitable for post-processing in a 3D program.
Albert et al. (Albert et al., 2018) similarly generated
urban patterns using GANs, resulting in a heatmap-
like image of the buildings. This mainly captures the
shapes of building densities across a city, and is in fact
evaluated with building density patterns, but the re-
sulting images do not have detailed building footprints.
Bachl et al. (Bachl and Ferreira, 2020) proposed a
learning method for urban styles when viewed at the
ground level, performing a style transfer task to gen-
erate a city with textures of another overlaid on top.
Song et al. (Jieqiong Song, 2021) proposed MapGen-
GAN, which focused on generating map-like images
from real satellite images, but did not generate new
cities. Shen et al. (Jiaqi Shen, 2020) performed a
style-transfer task, where an input road network was
filled with building blocks. They used GAN for the
task, and trained the generator on a dataset of real
cities in China with erased building footprints. The
generator outputs a map with the same road network
as the input data, but with building footprints filled
in. The discriminator is then trained to distinguish
between that output and the ground truth. There was
no road generation component to this paper, however.
Fedorova (Fedorova, 2021) tackled a similar task, but instead of filling in all the buildings, they focused on generating a missing block. Maps of real cities with one block removed serve as the input, and the output is that same city, but with the missing block
filled. This paper was highly focused on getting the
shapes of that generated block right, based on popu-
lation density of the surrounding areas.
2.4 Current Limitations
As discussed, most papers are focused on generating
city layouts, with the main features being roads and
buildings. However, none of the papers covered generating features such as rivers, lakes, and parks. In addition, most of the past research was not conducted with the intent of being converted into a 3D model, which resulted in layouts with highly detailed roads but not detailed building footprints, or vice versa. Many procedurally generated road networks did not include any building generation at all, simply assuming the negative space between the roads would be filled with buildings.

Figure 4: Samples captured from London, UK.
3 DATA COLLECTION AND
PRE-PROCESSING
3.1 Data Source
Not many databases on the internet collect detailed data about cities around the world. Among the most famous open-source ones is OpenStreetMap (OSM). With a well-built API for accessing its database, we decided to build our dataset from it; a minimal sketch of one such query follows.
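The snippet below illustrates pulling one feature layer from OSM through the public Overpass API. The endpoint and Overpass QL syntax are standard; the tag and coordinates shown are example values, not our exact collection script.

```python
# Illustrative sketch: fetching one OSM feature layer via the Overpass API.
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

def fetch_ways(south, west, north, east, tag):
    # Overpass bounding boxes are ordered (south, west, north, east).
    query = f"""
    [out:json][timeout:60];
    way[{tag}]({south},{west},{north},{east});
    out geom;
    """
    response = requests.post(OVERPASS_URL, data={"data": query})
    response.raise_for_status()
    return response.json()["elements"]  # each way carries its node geometry

# Example: road network of a small area in central Singapore.
roads = fetch_ways(1.28, 103.84, 1.30, 103.86, '"highway"')
```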
We sampled rectangular snippets of cities covering a 2x2 km area. The features we collected are the road network, rivers, lakes, canals, ocean, natural parks, meadows, and building footprints, which were grouped into landed housing, apartments, and commercial buildings.
As for our cities, we selected from a list of the top 700 most populated cities in the world, a list of cities with at least 1 million population, a list of the biggest North American cities, and a list of the biggest European cities. We collected five snippets from each city.
Collecting five snippets from over two thousand cities gave us roughly ten thousand samples before cleaning out samples that contained nothing, or only a few roads. After cleaning, our biggest dataset contains 6591 samples.
3.2 Pre-Processing
OpenStreetMap's API allows us to extract each urban feature individually. We then color-coded them onto a black canvas and stored them as 256x256 images. The colors are described in Table 1; we chose each color based on what we inferred would be easily separated with computer vision methods (see the rasterization sketch following Table 1).
Table 1: Hex Color codes for each feature.
Feature Color Code
Meadow e0ff8a
Natural 54ff60
Grassland b8ffbd
Water 0f02fa
River 54a1ff
Canal 54e3ff
Lake 54a1ff
Apartment fff700
House fa0297
Commercial fa4c02
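The following is a minimal sketch of the rasterization step, assuming Pillow and feature geometry already projected into pixel coordinates; COLORS shows only a subset of Table 1.

```python
# Minimal rasterization sketch: draw color-coded features onto a 256x256
# black canvas. Polygons are assumed to be in pixel coordinates already.
from PIL import Image, ImageDraw

COLORS = {"water": "#0f02fa", "apartment": "#fff700",
          "house": "#fa0297", "commercial": "#fa4c02"}  # subset of Table 1

def rasterize(features, size=256):
    """features: iterable of (feature_name, [(x, y), ...] pixel polygons)."""
    img = Image.new("RGB", (size, size), "black")
    draw = ImageDraw.Draw(img)
    for name, polygon in features:
        if name in COLORS:
            draw.polygon(polygon, fill=COLORS[name])
    # Roads would be drawn last as white polylines, e.g.:
    # draw.line(road_points, fill="white", width=1)
    return img
```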
The cities are first sampled based on the latitude and longitude coordinates of the city centers on OpenStreetMap; four additional samples are then created by setting an offset to the north, south, east, and west to capture the adjacent areas, as sketched below.
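A minimal sketch of this five-snippet scheme, using the standard ~111 km-per-degree-latitude approximation; the helper name is ours, for illustration.

```python
# One 2x2 km bounding box at the city center plus four boxes offset a full
# snippet to the north, south, east, and west.
import math

def snippet_bboxes(center_lat, center_lon, size_km=2.0):
    half_lat = (size_km / 2) / 111.0                          # km -> deg latitude
    half_lon = half_lat / math.cos(math.radians(center_lat))  # widen with latitude
    offsets = [(0, 0),                                        # center
               (2 * half_lat, 0), (-2 * half_lat, 0),         # north, south
               (0, 2 * half_lon), (0, -2 * half_lon)]         # east, west
    boxes = []
    for dlat, dlon in offsets:
        lat, lon = center_lat + dlat, center_lon + dlon
        boxes.append((lat - half_lat, lon - half_lon,         # south, west
                      lat + half_lat, lon + half_lon))        # north, east
    return boxes
```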
Figure 5: Samples captured from Bangkok (left) and Singa-
pore (right). Both have quite a lot of missing details.
Our datasets, however, are limited by two factors. First, a size of 256x256 is often too small to fully capture 2x2 km snippets of dense cities, often resulting in pixelation of some features. This can lead to noise and artifacts in the generated images, and subsequently, in the 3D model. Second, there is the limitation of OpenStreetMap's database, which is crowd-sourced and requires people to manually input data about the places they are from. This creates a selection bias, as smaller cities have less data on them. On top of that, since OpenStreetMap is an English-speaking platform, data from non-English-speaking countries is also far more limited. One reason we chose to specifically sample North American and European cities is that they contain more data; we therefore expected this to bias the generator toward the grid-styled cities common in the West.
4 TRAINING
4.1 Choosing the Generative Model
One of the aims of this project is to generate everything from scratch, without even a starting road or terrain, so we avoided methods that require some form of user input. Thus, GAN models meant for style transfer tasks, such as pix2pixGAN or CycleGAN, are not appropriate for this task.
We chose a simple Deep Convolutional GAN (DCGAN) for the task. A simple DCGAN is proven and versatile, and since it requires no user input, we had more flexibility in how we color-code the features of a city into an image (Dragan et al., 2022).
4.2 Model Architecture
We use four deconvolutional blocks in the generator and four convolutional blocks in the discriminator, with both types of block using a Leaky ReLU layer. The generator upsamples random noise from 16x16 to 256x256, and the discriminator downsamples from 256x256 down to 16x16 before flattening it. The output activation of the generator is a tanh function; the discriminator uses a dense layer to perform the final classification. A minimal sketch of this architecture follows.
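A minimal TensorFlow/Keras sketch of the architecture described above. The spatial sizes follow the text; the filter counts, kernel sizes, and latent channel depth are illustrative assumptions, not our exact hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_SHAPE = (16, 16, 64)  # assumed channel depth for the 16x16 noise input

def build_generator():
    model = tf.keras.Sequential(name="generator")
    model.add(layers.Input(shape=LATENT_SHAPE))
    for filters in (256, 128, 64, 32):  # four deconvolutional blocks: 16 -> 256
        model.add(layers.Conv2DTranspose(filters, 4, strides=2, padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.LeakyReLU(0.2))
    # tanh maps output pixels to [-1, 1], per the text
    model.add(layers.Conv2D(3, 3, padding="same", activation="tanh"))
    return model

def build_discriminator():
    model = tf.keras.Sequential(name="discriminator")
    model.add(layers.Input(shape=(256, 256, 3)))
    for filters in (32, 64, 128, 256):  # four convolutional blocks: 256 -> 16
        model.add(layers.Conv2D(filters, 4, strides=2, padding="same"))
        model.add(layers.LeakyReLU(0.2))
    model.add(layers.Flatten())  # 16x16 feature map flattened
    model.add(layers.Dense(1))   # final real/fake classification
    return model
```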
The model was trained on Google's Cloud TPU v2-8. There were eight TPU devices in total, though we did not parallelize the training process. The training script was written in Python, using the TensorFlow library, as it was developed with TPUs in mind. The dataset was also hosted on a Google Cloud virtual machine during training. Inference after training was done on an Nvidia RTX 3070 via the CUDA platform, also using TensorFlow. A sketch of the adversarial training step is shown below.
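For completeness, one standard DCGAN training step in TensorFlow. The non-saturating cross-entropy losses and Adam settings are conventional choices, not necessarily our exact configuration.

```python
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
d_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)

@tf.function
def train_step(real_images, generator, discriminator, batch_size):
    noise = tf.random.normal([batch_size, 16, 16, 64])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # D learns to separate real layouts from generated ones
        d_loss = (cross_entropy(tf.ones_like(real_logits), real_logits) +
                  cross_entropy(tf.zeros_like(fake_logits), fake_logits))
        # G learns to make D label its outputs as real
        g_loss = cross_entropy(tf.ones_like(fake_logits), fake_logits)
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss
```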
We first trained the model on our largest dataset and quickly found the results unusable. We then experimented with four more datasets created from subsets of the previous dataset. We found that the model trained on the set with 3197 samples, which used only cities from the list of the world's biggest, performed the best. The FID scores are shown in Table 2.
Table 2: FID scores for each dataset.
Criteria FID Score
Top 200 Cities by population 533.2098
Top 500 Cities by population 427.9565
Cities with over 1M population 413.7835
The above set + Top 500 NA Cities 520.8111
The above set + Top 500 EU Cities 475.0358
The FID scores judge only the quality of the generated images; we used them as a comparative measure to evaluate which dataset performs best (a sketch of the computation is given below). We will perform further post-processing before evaluating the city layouts.
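For reference, the standard FID computation, assuming feature vectors have already been extracted from real and generated images with an Inception network.

```python
import numpy as np
from scipy import linalg

def fid(real_feats, fake_feats):
    """real_feats, fake_feats: (N, D) arrays of Inception features."""
    mu1, mu2 = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    c1 = np.cov(real_feats, rowvar=False)
    c2 = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))
```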
5 POST PROCESSING WITH
COMPUTER VISION
TECHNIQUES
We used classical computer vision methods to separate the city features encoded in the images. These methods produce separate images for each feature, allowing our 3D engine to easily create 3D equivalents from the given layout.
5.1 Hue-Saturation-Value Masking
Figure 6: Original output (left) and masked output (right).
Features such as nature and water are encoded with striking blue and green coloring. Since both are distinct hues, and we know their exact hex values, they can be easily separated using a simple HSV mask (see the sketch below). A by-product of this method is that it also removes noise from these features: a lone blue pixel near a road will not have the same HSV value due to compression, for example, and can be filtered out.
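A minimal sketch of this masking step with OpenCV. The hue tolerance and the saturation/value floors are illustrative values that would be tuned per feature color.

```python
import cv2
import numpy as np

def extract_feature(img_bgr, hex_color, hue_tol=10):
    """Isolate one color-coded feature (e.g. water, 0f02fa) from the map."""
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    hsv_target = cv2.cvtColor(np.uint8([[[b, g, r]]]), cv2.COLOR_BGR2HSV)[0, 0]
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    # Wide saturation/value bounds, tight hue bounds: compression noise
    # shifts S and V far more than it shifts hue.
    lower = np.array([max(int(hsv_target[0]) - hue_tol, 0), 80, 80])
    upper = np.array([min(int(hsv_target[0]) + hue_tol, 179), 255, 255])
    return cv2.inRange(hsv, lower, upper)  # 255 where the feature is present

water_mask = extract_feature(cv2.imread("generated_city.png"), "0f02fa")
```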
5.2 Bandpass Filter
Figure 7: Original output (left), a Bandpass-filtered output
(right).
The road network, encoded in white pixels, acts like a texture in our images. It has a specific frequency in the image, making a bandpass filter the perfect tool for isolating it. Using an implementation with a pair of Butterworth low- and high-pass filters, we obtain an isolated road network.
We targeted the frequencies of the image for this task and experimented with several parameters. We found that passing the higher frequencies would end up capturing the buildings as part of the road network, as their color often dominates the map. By keeping the frequencies low, we are able to filter out the buildings and also smooth out the noise in the road networks. The filtered image has more consistent roads than the original image, where some roads end abruptly due to noise. A minimal sketch of the filter follows.
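A frequency-domain Butterworth bandpass of this kind can be sketched as below; the cutoff radii and filter order are illustrative assumptions, not our tuned parameters.

```python
import numpy as np

def butterworth_lowpass(shape, cutoff, order=2):
    rows, cols = shape
    u = np.arange(rows) - rows / 2
    v = np.arange(cols) - cols / 2
    D = np.sqrt(u[:, None] ** 2 + v[None, :] ** 2)  # distance from center
    return 1.0 / (1.0 + (D / cutoff) ** (2 * order))

def bandpass_roads(gray, low_cut=5, high_cut=40):
    """gray: 2D float array of the generated map (grayscale)."""
    F = np.fft.fftshift(np.fft.fft2(gray))
    # Keep frequencies between the two cutoffs: a high-pass at low_cut
    # composed with a low-pass at high_cut.
    H = (1.0 - butterworth_lowpass(gray.shape, low_cut)) * \
        butterworth_lowpass(gray.shape, high_cut)
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * H)))
```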
5.3 Bandreject Filter
Figure 8: Original output (left), and a Bandreject-filtered
output (right).
The building footprints are the dominant color in the maps. They are the complement of the roads striking through them; thus, intuitively, a bandreject filter should allow us to isolate the building footprints. The previously extracted road network can then be subtracted from the bandreject result, preventing building footprints from spawning on top of roads (see the companion sketch below).
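A companion sketch, reusing butterworth_lowpass from the previous snippet: the complement of the bandpass transfer function keeps the building mass, and the extracted road mask (assumed to be a binary array) is carved out afterwards.

```python
def bandreject_buildings(gray, road_mask, low_cut=5, high_cut=40):
    F = np.fft.fftshift(np.fft.fft2(gray))
    H_pass = (1.0 - butterworth_lowpass(gray.shape, low_cut)) * \
             butterworth_lowpass(gray.shape, high_cut)
    # 1 - H_pass rejects the road band and keeps everything else
    buildings = np.real(np.fft.ifft2(np.fft.ifftshift(F * (1.0 - H_pass))))
    buildings[road_mask > 0] = 0  # subtract the road network
    return buildings
```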
5.4 Reading into a 3D Engine
We used Blender and its Python library to perform the conversion to 3D. Similar to the methods used by Lin et al. (Bo Lin, 2020), we create a voxel model first before performing further processing. We first read the image map and its supplementary filtered images. We then read the map, layer by layer, and create simple meshes on top of them. For the building layers, we may also need to multiply the colors of the original output with the filtered output, so that we can still tell how tall each building should be relative to one another. We set the commercial buildings to be the tallest, followed by apartment buildings, and then landed housing. Each class of building has a height range from which its height is randomly sampled.
The buildings themselves can be represented with simple cube meshes or more complex procedurally generated models. Their positions are based on the white spots in the bandreject output; a minimal placement sketch is given below. Figure 9 shows some of the final output we achieved.
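A minimal placement sketch using Blender's Python API. The per-class height ranges are illustrative assumptions, and the pixel-to-meter scale follows the 2x2 km / 256 px mapping described earlier.

```python
import random
import bpy

HEIGHT_RANGES = {"commercial": (30, 90), "apartment": (15, 40), "house": (4, 10)}
METERS_PER_PIXEL = 2000 / 256  # ~7.8 m per pixel

def place_building(px, py, building_class):
    height = random.uniform(*HEIGHT_RANGES[building_class])
    bpy.ops.mesh.primitive_cube_add(
        size=1.0,
        location=(px * METERS_PER_PIXEL, py * METERS_PER_PIXEL, height / 2),
    )
    cube = bpy.context.active_object
    # The default cube is 1 m per side; scale one pixel's footprint
    # horizontally and the sampled height vertically.
    cube.scale = (METERS_PER_PIXEL, METERS_PER_PIXEL, height)
```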
Figure 9: Final outputs rendered in different styles. A primitive mesh-only render (top left), a procedurally generated building render (top right), and a textured render (bottom).
6 EVALUATION AND FURTHER
RESEARCH
Bloomberg (O'Sullivan, 2021) once interviewed game designers on the topic of city generation and what makes for a realistic one. It was said that, while audiences may not notice bad (as in fake) urban planning, they will notice a contradiction: anything that makes the city look incoherent. For example, a non-circular roundabout or an out-of-era building will stand out to the player. We evaluated our model based on this observation.
6.1 Subjective Evaluation Survey
We surveyed 30 participants with 10 pairs of cities compared side by side. The participants were asked which city they prefer at first glance, without being told that one is a fake layout and the other is based on a real city. We thought it best not to tell the participants that there is a fake, as there are some imperfections in our 3D models (e.g., floating top building floors), and we did not want those nitpicks to sway the participants. Rather, we want their impressions of a city at first glance, and whether they feel it is familiar (real) or not (fake).
We paired the fake and real cities with similar features. Four out of ten pairs on our survey resulted in
Table 3: Surveyed Pairs.
Category Selected Fake (%)
River through city 17
Dual Avenue 47
Intersections 27
Intersections 63
Intersections 13
Main Roads w/ Branches 80
Main Roads w/ Branches 73
River vs landed 7
River vs landed 90
River through city 26
the majority of participants choosing the fake outputs
over the real ones. This suggests that there are some
areas where the generated cities are good enough to
fool people. Looking into our categories, we see that the fake cities performed very well for areas consisting of a simple main road branching off into smaller roads. For intersections, only one pair won, and when it comes to rivers, the fake model won only once.
Looking at the renders subjectively, we believe this is because the bodies of water generated by our model do not form a line that would read as a river; rather, they are often lakes scattered around the map. And when it comes to intersections, the model struggles to create coherent intersection shapes. The only pair the fake city won is the one where the real map had some graphical issues with its roads.
6.2 Comparisons to Past Research
Among the papers reviewed above, most are excellent at generating one aspect of the city, though there is no previous model that can generate all the features. In this section, we cherry-picked the best generated features from each paper and compared them with our results.
In terms of road layouts, many past papers were able to generate detailed road networks, either by procedural or deep learning methods. The past results were recorded onto images with better fidelity than ours; however, they do not cover as large an area. Most of them are also restricted to grid-like layouts, or are only capable of urban city-center roads. Our model is capable of generating a suburban-like area, where the urban concentration tapers off to one or two roads leading out.
In terms of building footprints, our model's outputs are often too concentrated, and the footprints are consequently not detailed enough. Compared to other papers, some that focused on smaller blocks were able to generate more detailed footprints. However, compared to papers generating larger city blocks, our level of detail is not far behind.
Figure 10: Comparison of generated road layouts of a past
paper (Bo Lin, 2020) and our own.
Figure 11: Comparison of generated building footprints of
a past paper with a smaller area (left) (Jiaqi Shen, 2020),
a similar area (middle) (Weiyu Zhang, 2022) and our own
(right).
7 SUGGESTIONS FOR FUTURE
RESEARCH
For future research, we suggest training and generating each feature of the city separately instead. Based on the method used for MetroGAN (Weiyu Zhang, 2022), giving each training feature a different loss function can result in a more accurate layout. The methods suggested by Lin et al. (Bo Lin, 2020) also involved splitting city features before combining them to get the final results. If we generate the terrain and water, then the road network, then the building footprints, one feature at a time, the resulting image for each feature should suffer less from pixelation and eliminate the issue of noise caused by the extraction methods altogether.
Image resolution is another limiting factor. We observed some noise caused by the pixelation of the images, which caused some of the 3D renders to be imperfect. The process of separating each feature for the 3D render also created some noise, resulting in some undesired renders. Performing this task at a larger resolution may help. Diffusion models have since proven robust, so perhaps a future iteration of this idea could utilize them, as newer diffusion models can generate higher-resolution images. Exploring the use of a super-resolution model may also be beneficial, as that may allow the end result to be upscaled and matched 1:1 with its real size. Right now, a 2x2 km plot mapped onto a 256x256 image results in approximately 8x8 meters per pixel. A super-resolution model that can guarantee 1x1 meter per pixel should also help alleviate the compression issues.
Over the past year, improvements in diffusion methods have proven robust, especially in visual quality (Rombach et al., 2022). It may be worth exploring such methods combined with other map generation techniques, as such hybrid approaches may be able to leverage the better visual quality of diffusion methods (Stability.ai, 2022). Combining diffusion methods with our map generation methods will be one of our future works.
We initially explored the idea of using OSM's vector information directly, instead of flattening everything into rasterized images. On paper, using the vectors should produce more precise results and would be far easier to post-process. However, due to the way OSM is set up, taking a rectangular snapshot will not "crop" the vectors. Instead, certain features will continue out of bounds (e.g., a snippet of London may contain a segment of the Thames three times the length of the snippet itself). Combined with the incredibly dense amount of data included, the data processing would be far more complicated, and we deemed it infeasible for the scope of this work.
8 CONCLUSION
In this paper, we have explored several aspects of generating a 3D city from scratch. Specifically, we collected snippets from over two thousand cities around the world and created layout maps from them. We then trained a GAN that learnt to generate such maps from scratch and used computer vision techniques to extract the data of each feature from the generated maps before rendering them in Blender.
The experiments show that the model performs best when the data is somewhat balanced, and that smaller datasets can be sufficient for training the model. We do feel the process is rather limited by the resolution of the images, as the pixelation already present in the training dataset is then taught to the GAN model. The generated cities also lack some features: most generated water bodies are lakes, and rarely a river. The resulting 3D renders are still lacking some assets and features: the buildings are not oriented toward the roads they are supposed to face, the roads going over bodies of water are floating roads instead of bridges, and the terrain is completely flat.
Compared to past papers, we introduced a novel method to generate whole cities, with all their features, instead of focusing on one or two features such as roads or buildings. We also proposed a comprehensive method to generate 3D models on top of the image layout, using a mix of procedural and computer vision methods.
ACKNOWLEDGEMENTS
This work is supported by the Singapore Min-
istry of Education Academic Research grant T1
251RES2205, “Realtime Distributed Hybrid Render-
ing with 5G Edge Computing for Realistic Graphics
in Mobile Games and Metaverse Applications”.
REFERENCES
Albert, A., Strano, E., Kaur, J., and Gonzalez, M. (2018).
Modeling urbanization patterns with generative adver-
sarial networks.
Bachl, M. and Ferreira, D. C. (2020). City-gan: Learning
architectural styles using a custom conditional gan ar-
chitecture.
Beneš, J., Wilkie, A., and Křivánek, J. (2014). Procedural modelling of urban road networks. Computer Graphics Forum, 33.
Bo Lin, Wassim Jabi, R. D. (2020). Urban space simu-
lation based on wave function collapse and convolu-
tional neural network. SimAUD ’20: Proceedings of
the 11th Annual Symposium on Simulation for Archi-
tecture and Urban Design.
Brady, R. (Retrieved 2023). Architecture in video games:
The satirical urban commentaries of grand theft auto.
Dragan, C.-M., Saad, M. M., Rehmani, M. H., and
O’Reilly, R. (2022). Evaluating the quality and diver-
sity of dcgan-based generatively synthesized diabetic
retinopathy imagery. MEDAL23: Advances in Deep
Generative Models for Medical Artificial Intelligence
(Springer Nature series).
ESRI (2022). About arcgis.
Fedorova, S. (2021). Gans for urban design. SimAUD 2021.
Filip Biljecki, H. Ledoux, J. S. (2016). Generation of multi-lod 3d city models in citygml with the procedural modelling engine random3dcity. ISPRS Ann. Photogramm. Remote Sens. Spatial Inf. Sci., Volume IV-4/W1.
Gao, T., Gao, Q., and Yu, T. (2022). A procedural gen-
eration method of urban roads based on osm. pages
242–248.
Inc., E. G. (Retrieved 2023). City sample quick start - gen-
erating a city and freeway in unreal engine 5.
Jiaqi Shen, Chuan Liu, Y. R. e. a. (2020). Machine learn-
ing assisted urban filling. Proceedings of the 25th In-
ternational Conference on Computer-Aided Architec-
tural Design Research in Asia (CAADRIA).
Jieqiong Song, Jun Li, H. C. e. a. (2021). Mapgen-gan: A
fast translator for remote sensing image to map via un-
supervised adversarial learning. IEEE Journal of Se-
lected Topics in Applied Earth Observations and Re-
mote Sensing.
Joon-Seok Kim, Hamdi Kavak, A. C. (2018). Procedural
city generation beyond game development. SIGSPA-
TIAL Special 10(2):34-41.
Kelly, G. and McCabe, H. (2006). A survey of procedural
techniques for city generation. 14.
Oliva, P. (Retrieved 2023). Buildify 1.0.
O’Sullivan, F. (2021). What designers of video game cities
understand about real cities.
Rombach, R., Blattmann, A., Lorenz, D., and Esser, P. (2022). High-resolution image synthesis with latent diffusion models.
Stability.ai (2022). Stable diffusion public release.
vvoovv (Retrieved 2023). blender-osm: Openstreetmap and
terrain for blender.
Weber, B., Müller, P., Wonka, P., and Gross, M. (2009). Interactive geometric simulation of 4d cities. EUROGRAPHICS 2009.
Weiyu Zhang, Yiyang Ma, D. Z. e. a. (2022). Metrogan:
Simulating urban morphology with generative adver-
sarial network. arXiv:2207.02590 [cs.CY].