Application of U-Net and Auto-Encoder to the Road/Non-road

Classiﬁcation of Aerial Imagery in Urban Environments

Amanda Spolti

, Vitor C. Guizilini

, Caio C. T. Mendes

, Matheus D. Croce

Andr

e R. de Geus

, Henrique C. Oliveira

, Andr

e R. Backes

and Jefferson R. Souza

School of Computer Science, Federal University of Uberl

andia, Uberl

andia, MG, Brazil

Faculty of Civil Engineering, Architecture and Urbanism, State University of Campinas, Campinas, SP, Brazil

School of Information Technologies, University of Sydney, Sydney, Australia

Department of Computer Science, Federal University of S

ao Carlos, S

ao Carlos, SP, Brazil

Institute of Mathematics and Computer Science, University of S

ao Paulo, S

ao Paulo, SP, Brazil

arbackes@yahoo.com.br, {geus.andre, jrsouza}@ufu.br

Keywords:

Road Detection, Deep Learning, Auto-Enconder, U-Net.

Abstract:

One of the challenges in extracting road network from aerial images is an enormous amount of different car-

tographic features interacting with each other. This paper presents a methodology to detect the road network

from aerial images. The methodology applies a Deep Learning (DL) architecture named U-Net and a fully

convolutional Auto-Encoder for comparison. High-resolution RGB images of an urban area were obtained

from a conventional photogrammetric mission. The experiments show that both architectures achieve satisfac-

tory results for detecting road network while maintaining low inference time once DL networks are trained.

1 INTRODUCTION

In the past few years, the usage of high-resolution

aerial images along with machine learning techniques

applications have been extensively used for carto-

graphic features extraction, mainly road and street

geometry. However, extracting such features is not

a trivial task considering the large number of diffe-

rent objects interacting with the roads (Mendes and

Dal Poz, 2011). Several studies from the most diverse

ﬁelds are being developed such as route optimization

(Silva et al., 2016), landscape changes monitoring and

natural resources preservation (Galo, 2000).

Over the years, artiﬁcial intelligence problems

that had previously been solved using very compli-

cated code began to be explored by Machine Lear-

ning (ML). This technique is based on the principle

of extracting patterns and data features to feed the

algorithm that learns automatically. The introduc-

tion of ML allowed solution using real-world data to

help in the process of decision-making in several are-

as (Goodfellow et al., 2016).

Considering the evolution of this computational

area, the simple learning classiﬁers are being replaced

by more effective methods that better represent the

human brain functionality, such as neural networks.

Recently, the Deep Learning (DL) architectures have

gained attention due to the current capacity of storage

and processing large amounts of data, despite being

developed a few decades ago. Nowadays, the scien-

tiﬁc world has much more ability to manipulate such

amount of data, and therefore several studies are be-

ing designed and applying DL. It can attack the most

diverse problems since a DL architecture can be very

ﬂexible and created to ﬁt a speciﬁc dataset.

One of the branches of DL is called Convolutional

Neural Network (CNN). The CNN also can be ap-

plied in the Natural Language Processing (NLP) pro-

blems (Goodfellow et al., 2016). Classiﬁcation is

a task where convolution is vastly used, especially

in visual recognition. In 2015, Ronnerberger (Ron-

neberger et al., 2015) built a fully convolutional net-

work to attack the problem of biomedical image seg-

mentation. However, the architecture was modiﬁed to

ﬁt the particularities of the task named U-Net. One

of these particularities is the fact that thousands of

biomedical training images are quite unlikely to have.

In this work the U-Net deep neural networks ar-

chitecture was applied over the problem of road net-

work detection to verify its accuracy in a dataset,

where there are two class labels (road or non-road),

but the context is equally important. We compared

Spolti, A., Guizilini, V., Mendes, C., Croce, M., R. de Geus, A., Oliveira, H., Backes, A. and Souza, J.

Application of U-Net and Auto-Encoder to the Road/Non-road Classiﬁcation of Aerial Imagery in Urban Environments.

DOI: 10.5220/0009337306070614

In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020) - Volume 4: VISAPP, pages

607-614

ISBN: 978-989-758-402-2; ISSN: 2184-4321

607

the results obtained with a fully convolutional Auto-

Encoder. The dataset is composed of portions of

aerial images with dimension 256 × 256 pixels from

an urban area Figure 1. Also, the class labels are black

for representing roads and white representing other

objects. As mentioned, the context is crucial since

many different objects can be easily mixed with roads

such as roofs, cars, parking lots, sidewalks, etc.

Figure 1: The problem. The left image is acquired from an

airplane, where it has road and non-road objects. The right

image is the ground truth image.

The remainder of this paper is organized as fol-

lows: Section II discusses related works; Section III

presents the proposed methodology, detailing the U-

Net along with the architectures used for compari-

son; the experimental setup, results, and analysis are

shown in Section IV; ﬁnally, Section V draws the con-

clusions and suggests directions for future work.

2 RELATED WORK

The work of (Jordan and Ana-Maria, 2013) proposed

a new strategy to identify buildings from aerial im-

age analysis. The training stage was carried out by an

operator who selected points of interest in the image,

structures, and aspects of non-interest, such as vege-

tation and streets. A combination of shape attributes

with an automated region growth technique and Sup-

port Vector Machine is used to rank between points

of interest and insigniﬁcant points. This strategy was

tested by manipulating 20 portions of aerial images

with a resolution of 256 × 256 pixels that contained

different topologies. This strategy obtained higher re-

sults (97,8% accuracy) when compared to other tech-

niques discussed in the literature. Thus, it concludes

that the shape descriptors encode an appropriate set of

features that allow identifying objects of interest cor-

rectly in most images, despite the different shapes and

sizes of roofs.

Using deep learning architecture, the method pre-

sented in (Mendes et al., 2016) considered the prob-

lem of road detection, where given an image, the goal

was to classify each pixel of the image in road or

non-road. The architecture was a Convolutional Neu-

ral Network, targeting a model that employs a size-

able contextual window while maintaining efﬁcient

inference properties. For that, a Network-in-Network

(NiN) architecture was applied, and the model was

converted into a fully convoluted network after the

training. The KITTI Vision Benchmark Suite was

used for training and evaluation of the proposed ar-

chitecture. This database offers 289 training images

along with its ground truth and 290 test images. A

precision of 92% was obtained both using the NiN ar-

chitecture and without, however, with the NiN archi-

tecture, the time of inference was signiﬁcantly lower.

The results revealed that the inference time of the pro-

posed approach is unique at this level of accuracy,

with two orders of magnitude faster than other meth-

ods with similar performance.

A neural network architecture, named SegNet,

was proposed by (Badrinarayanan et al., 2015), which

uses a fully convolutive pixel-wise semantic segmen-

tation. The core of the segmentation mechanism con-

sists of an encoder network and a corresponding de-

coder network, that produces as output segmented

pixel information corresponding to each considered

class. The encoder network includes 13 convolutional

layers. Each layer of the encoder has a similar de-

coder layer, so the decoder network also has 13 layers.

The SegNet architecture performance was measured

in two different situations, the ﬁrst being the classiﬁ-

cation of roads, trees, walks, cars, etc. The second is

the indoor scene segmentation that is of immediate in-

terest to various augmented reality (AR) applications.

In the case of the second scene, because it is a prob-

lem that contains several classes, the result obtained

was not satisfactory. However, for the classiﬁcation

of roads, the SegNet was efﬁcient and achieved re-

sults with 90% of accuracy. SegNet was compared

to other architectures regarding training time, mem-

ory, and efﬁciency. Some of them had better results;

however, more memory was required by the amount

of data stored. SegNet is more efﬁcient because it

only saves the maximum pool of indexes of resource

maps and uses them in its decoder network to achieve

a good performance; then using less memory and still

obtaining satisfactory results.

Another work using the DL technique was pro-

posed by (Kussul et al., 2017) to classify satellite im-

ages about the soil coverage and their cultures. To re-

store missing data due to clouds and shadows present

in the images, a preprocessing phase was necessary.

For classiﬁcation purposes, a fully connected super-

vised neural network (MLP - Multilayer Perceptron)

and a Random Forest was applied, comparing such

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

608

techniques with a convolutional neural network. The

experiments were carried out using 19 multitemporal

images of the region of Ukraine acquired by RS satel-

lites of Landsat8 and Sentinel-1A. Using two archi-

tectures variations of a convolutional neural network,

called 1-D and 2-D to explore spectral and spatial fea-

tures respectively. An accuracy of 93.5% and 94.6%

respectively were obtained, while Random Forest and

MLP obtained 88.7% and 92.7% respectively. Conse-

quently, the proposed architecture proved more useful

for the described problem.

Another approach based on deep learning for fea-

ture detection in the scope of remote sensing was pro-

posed by (Zou et al., 2015) treating the problem of ex-

traction as a problem of features reconstruction. The

proposed method selects the most reconstructive char-

acteristics as the discriminative ones. In the experi-

ments, 2800 orbital images were divided into seven

categories (grass, farm, industry, river, forest, resi-

dential, parking) for performance evaluation. To ad-

dress the problem of feature reconstruction, a Deep

Belief Network (DBN) was used. An iterative algo-

rithm for learning features was developed to obtain re-

liable reconstruction weights and characteristics with

small reconstruction errors. On average, an accuracy

of 77% was reached, and the category with the most

misclassiﬁcation was the industry category with 65%,

and the least confused class being the forest with an

accuracy of 93.5%. Considering the complexity of the

classiﬁcation type, then the experiments validated the

efﬁciency of the proposed method.

Our proposal uses CNN architectures U-Net and

Auto Encoder, which have not yet been applied for ex-

traction of cartographic features using aerial images,

speciﬁcally on roads. Consequently, the proposal was

not used in the scientiﬁc literature for such problem.

3 PROPOSED METHODOLOGY

Our methodology presents two DL networks archi-

tectures: U-Net and Auto-Encoder. We have chosen

these two networks because they are representative of

the state-of-the-art in the proposed task and, more im-

portantly, are computationally efﬁcient and capable of

considering a large amounts of contextual informa-

tion, which is crucial in this case. The purpose of

this paper is showing comparison of both in the road

network detection using aerial images. Each DL ar-

chitecture is described in subsections below.

3.1 U-Net Architecture

The U-Net has two phases: contraction and expan-

sion. In the contraction path, the input image goes

through 2 convolutions 3 × 3, stride 1, generating 8

feature channels and the ReLU activation function in

the ﬁrst step followed by a 2 × 2 max pooling opera-

tion with stride 2. After each max pooling operation,

the number of feature channels is increased by a fac-

tor of two, and the input size is reduced by the same

factor due to the effects of the max pooling. In the

contraction path, a step is deﬁned by 2 convolutions

and a max-pooling operation.

After 4 steps, the resulting output is fed to an up-

convolution 2 × 2, stride 2 and 64 feature channels

in the ﬁrst step which is the beginning of the expan-

sion path. Every step of this phase consists of the

up-convolution with the parameters described above.

Moreover, the output of the same stage of the con-

traction path is concatenated, and convolutions are ap-

plied as in the contraction path. After each step, the

number of feature channels is reduced by a factor of

two. In the last layer, convolution with a single ker-

nel 1 × 1 is applied, and the resulting tensor passes

through the sigmoid function. Output is a single chan-

nel image with pixel values in the interval [0, 1], dur-

ing inference, thresholded at 0.5 and mapped to black

or white for visualization purposes. Black (zero) pix-

els are pixels classiﬁed as roads and white otherwise.

Figure 2 represents the U-Net used in our work.

3.2 Auto-Encoder Architecture

It is based on (Long et al., 2015) proposing fully con-

volutional networks for semantic segmentation. A

fully convolutional network does not have any fully

connected layer, such that all neurons from one layer

are connected to all neurons from the next layer. This

was common in most topologies up to this point, espe-

cially in deeper layers. However, there are beneﬁts in

eliminating full connections: 1) decrease in the num-

ber of trainable parameters; 2) preservation of spatial

correlation; 3) images of any size can be equally pro-

cessed using the same network.

Aiming for future online applications and onboard

processing, a relatively simple topology was used in

this work, composed of three convolutional layers,

with ﬁlter size 5 × 5 and three deconvolutional lay-

ers (the convolutional transpose), with the same ﬁlter

size. Each layer is followed by a ReLU (Xu et al.,

2015), to introduce non-linearities, and convolutional

layers receive a max-pooling of 2, to downsample

spatial dimensions, while deconvolutional layers up-

sample input images by the same factor, doubling

their spatial dimensions. Convolutional and decon-

volutional layers are connected by a fourth convolu-

tional layer, with ﬁlter size 3 × 3, no activation func-

Application of U-Net and Auto-Encoder to the Road/Non-road Classiﬁcation of Aerial Imagery in Urban Environments

609

Figure 2: U-Net architecture diagram used in our work.

tion and dropout (Srivastava et al., 2014) of 0.6 (60%

of nodes are randomly turned off during training, to

increase the network’s ability to generalize over in-

puts). The number of neurons in each convolutional

layer was respectively 64, 128 and 256, and these

numbers were inverted in the deconvolutional layers.

The ﬁnal layer produces 1-channel outputs, with

a sigmoid activation function to provide values be-

tween [0, 1] that serve as a probabilistic classiﬁcation

for each pixel. The crossentropy loss function was

optimized during training phase, based on ground-

truth labeled information, using an Adam optimizer

(Kingma and Ba, 2014) with a learning rate of 10

−4

Figure 3 shows Auto-Encoder applied in our work.

3.3 Differences between U-Net and

Auto-Encoder

The main difference between the architectures U-Net

and Auto-Encoder are the skip connections, which

connect each stage of the contraction path to the cor-

responding stage of the expansion. These skip con-

nections tend to avoid local minima during training.

Moreover, one can view them as simplifying or divi-

ding the training. For instance, during early iterations

of training, if the network has no skip connections,

it has to adjust all the weights of the network to ge-

nerate any meaningful result. Meanwhile, a network

with skip connections has to modify only the weights

of the layers that comprise the shortest path from in-

put to output to do the same. Although this difference

simpliﬁes the training (at least the initial iterations),

whether it is beneﬁcial to the overall accuracy can be

only accessed through experimental data.

4 EXPERIMENTAL RESULTS

To evaluate the performance of the proposed method-

ology, we implemented the described system in Ten-

sorﬂow (Abadi et al., 2015) and Keras. We tested its

performance on images test set. Python language was

selected, due to its widespread use in DL applications,

and a Titan X Pascal GPU card (12GB) was used to

increase computational performance for model train-

ing. Once training is complete, model can be stored

for later use, allowing fast inference in similar GPU

cards, standard CPU machines or even in onboard

computers, for online classiﬁcation.

The dataset is composed of 3814 images of 256 ×

256 divided into 80% for training, 10%, for testing

and 10% for validation. The architectures were im-

plemented in Python using the Keras (Chollet et al.,

2015) library with Tensorﬂow. The network is trained

for 300 epochs or until the validation accuracy stops

improving (for 30 epochs in a row).

To analyze the U-Net and Auto-Encoder perfor-

mance to predict images with different levels of difﬁ-

culty, then ﬁve images were selected from the testing

database. Figure 4 shows the original image, its label

and how the U-Net and Auto-Encoder performed. In

Image 1 and Image 2, the U-Net and Auto-Encoder

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

610

Figure 3: Auto-Encoder CNN diagram used in our work.

classiﬁcation was very accurate. However, in Im-

age 3, 4 and 5 the classiﬁcation was a little weak in

both architectures because these images contain ob-

jects that can be easily mixed with roads consider-

ing the color similarity. In Image 5, the U-Net ar-

chitecture could distinguish the roof from the roads

besides the color similarity however the classiﬁcation

is very different from the ground truth. On the other

hand, with the Auto-Encoder architecture the predic-

tion was wrong again because of the color that can be

easily mistaken even for humans.

Table 1 shows a comparison between the U-Net

and Auto-Encoder architectures. It is possible to no-

tice that the Auto-Encoder showed the best perfor-

mance. The metrics of the two architectures were very

similar, however the Auto-Encoder still obtained bet-

ter results except for the accuracy measurement where

U-Net acquired 89.1% and the Auto-Encoder 88.0%

Table 1: Results.

U-Net Auto-Encoder

Accuracy 0.891 0.880

Precision 0.800 0.899

Recall 0.856 0.919

F-Measure 0.820 0.907

To verify the viability of the methodology, a new

image with similar data to those used during training

was classiﬁed. New aerial images obtained by an air-

plane were classiﬁed allowing a qualitative analysis of

model capability in classifying roads and non-roads.

Figure 5 and 6, both architectures shows similar

results. However, Figure 6 demonstrates the complete

classiﬁed aerial image by Auto-Encoder showing that

this CNN predicted more roads compared to U-Net.

Even though in speciﬁc scenarios the results are

not satisfactory, techniques such as skeletonization

and mathematical morphology can be applied to re-

construct the part of the image that was not well clas-

siﬁed. This technique has the classiﬁed result as input

and a geometric network (vectors) as output.

5 CONCLUSIONS

The proposed methodology applied two deep learn-

ing architectures to the problem of road classiﬁcation

from an aerial image. The aim is to classify the ob-

jects in an aerial image as a road or non-road helping

in the mapping process of a region, saving resources

and time of the responsible operator.

The results showed the performance of the U-Net

to classify an aerial image and its objects in road and

non-road. On the other hand, for future works, it

is essential to improve the architecture so it is able

to analyze the context where every pixel is located

so objects with color similarity will not be mistaken

with roads as shown in Figure 4 - Image 5. One pos-

sible solution is to observe the connection between

the black pixels in the resulting image, in Image 5

for example using the Auto-Encoder architecture, the

prediction is completely disconnected, and therefore

it can be eliminated or improved with mathematical

morphology.

Application of U-Net and Auto-Encoder to the Road/Non-road Classiﬁcation of Aerial Imagery in Urban Environments

611

(a) Image 1 (b) Label (c) Prediction U-Net (d) Prediction Auto-Encoder

(e) Image 2 (f) Label (g) Prediction U-Net (h) Prediction Auto-Encoder

(i) Image 3 (j) Label (k) Prediction U-Net (l) Prediction Auto-Encoder

(m) Image 4 (n) Label (o) Prediction U-Net (p) Prediction Auto-Encoder

(q) Image 5 (r) Label (s) Prediction U-Net (t) Prediction Auto-Encoder

Figure 4: Several levels images along with its label and prediction applying the U-Net and Auto-Encoder architectures.

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

612

Figure 5: Image reconstruction with U-Net. The reconstructed image is the junction of the 256 × 256 classiﬁed ﬁgures in a

test set, black represents roads and white non-roads.

Figure 6: Image reconstruction with Auto-Encoder. The reconstructed image is the junction of the 256×256 classiﬁed ﬁgures

in a test set, black represents roads and white non-roads.

Also, a larger database containing different im-

ages perspectives would help the network to learn

and therefore better classify these images. Ultimately,

even with its mistakes, the Auto-Encoder architecture

can be advantageous as input in techniques such as

skeletonization mentioned before therefore assisting

in many applications such as route optimization, min-

imizing resources, time and operators effort. Lastly,

we can investigate the use of multitemporal images in

our problem.

ACKNOWLEDGEMENTS

The authors would like to acknowledge Federal Uni-

versity of Uberl

andia and University of Campinas,

Application of U-Net and Auto-Encoder to the Road/Non-road Classiﬁcation of Aerial Imagery in Urban Environments

613

CNPq (National Council for Scientiﬁc and Techno-

logical Development) under Grant #400699/2016-8

and #301715/2018-1. This study was ﬁnanced in part

by the Coordenac¸

ao de Aperfeic¸oamento de Pessoal

de N

ıvel Superior - Brazil (CAPES) - Finance Code

001.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,

Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,

M., Ghemawat, S., Goodfellow, I., Harp, A., Irving,

G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kud-

lur, M., Levenberg, J., Man

e, D., Monga, R., Moore,

S., Murray, D., Olah, C., Schuster, M., Shlens, J.,

Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Van-

houcke, V., Vasudevan, V., Vi

egas, F., Vinyals, O.,

Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and

Zheng, X. (2015). TensorFlow: Large-scale machine

learning on heterogeneous systems.

Badrinarayanan, V., Kendall, A., and Cipolla, R. (2015).

Segnet: A deep conv. encoder-decoder arch. for image

segmentation. arXiv preprint.

Chollet, F. et al. (2015). Keras. https://github.com/ fchol-

let/keras.

Galo, M. (2000). Aplicacao de redes neurais artiﬁciais e

sens. remoto na carac. ambiental do parque est. morro

do diabo. Sao Carlos.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep

Learning. MIT.

Jordan, T. and Ana-Maria, C. (2013). A supervised train-

ing and learning method for building ide. in remotely

sensed imaging. In SRSE, pages 73–78.

Kingma, D. and Ba, J. (2014). Adam: Method for stochastic

optimization. In ICLR.

Kussul, N., Lavreniuk, M., Skakun, S., and Shelestov, A.

(2017). Deep learning classiﬁcation of land cover and

crop types using remote sensing data. IEEE Geo-

science and Remote Sensing Letters, pages 778–782.

Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-

volutional networks for semantic segmentation. In

CVPR, pages 3431–3440.

Mendes, C. C. T., Fremont, V., and Wolf, D. F. (2016). Ex-

ploiting fully convolutional neural networks for fast

road detection. In IEEE Int. Conference on Robotics

and Automation (ICRA), pages 3174–3179.

Mendes, T. S. G. and Dal Poz, A. P. (2011). Classiﬁcacao

de imagens aereas de alta-resolucao utilizando redes

neurais artiﬁciais e dados de varredura a laser. Simp.

Brasileiro de Sensoriamento Remoto, pages 7792–

7799.

Ronneberger, O., Fischer, P., and Brox, T. (2015). U-

net: Convolutional networks for biomedical im-

age segmentation. In Medical Image Computing

and Computer-Assisted Intervention, pages 234–241.

Springer.

Silva, F., Minette, L. J., Souza, A. P. d., Moraes, A. C. d.,

and Schettino, S. (2016). Classiﬁcation of forest roads

and determination of route using geographic informa-

tion system. Revista Arvore, pages 329–335.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,

and Salakhutdinov, R. (2014). Dropout: A simple way

to prevent neural networks from overﬁtting. JMLR,

pages 1929–1958.

Xu, B., Wang, N., Chen, T., and Li, M. (2015). Empiri-

cal evaluation of rectiﬁed activations in convolutional

network.

Zou, Q., Ni, L., Zhang, T., and Wang, Q. (2015). Deep

learning based feature selection for remote sensing

scene class. GRSL, pages 2321–2325.

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

614