A Deep Learning Tool to Solve Localization in Mobile Autonomous
Robotics
Sergio Cebollada, Luis Payá, María Flores, Vicente Román, Adrián Peidró and Oscar Reinoso
Department of Systems Engineering and Automation, Miguel Hernández University, Elche, Spain
Keywords:
Mobile Robotics, Omnidirectional Images, Holistic Description, Deep Learning, Hierarchical Localization.
Abstract:
In this work, a deep learning tool is developed and evaluated to carry out the visual localization task for mobile
autonomous robotics. Through deep learning, a convolutional neural network (CNN) is trained with the aim
of estimating the room where an image has been captured, within an indoor environment. This CNN is not
only used as a tool to estimate the room, but also to obtain global-appearance descriptors of the input image
from its intermediate layers. The localization task is addressed in two different ways: globally, as an image
retrieval problem, and hierarchically. Regarding the global localization, the position of the robot is estimated
by means of a nearest neighbour search between the holistic descriptor obtained from a test image and the
training dataset (using the CNN to obtain the descriptors). Regarding the hierarchical localization method,
first, the CNN is used to solve the rough localization step and also to obtain a global-appearance descriptor;
second, the robot estimates its position within the selected room through a nearest neighbour search by
comparing the obtained holistic descriptor with the visual model contained in that room. Throughout this
work, the localization methods are tested with a visual dataset that provides omnidirectional images from
indoor environments under real-operation conditions. The results show that the proposed deep learning tool
is an efficient solution to carry out visual localization tasks.
1 INTRODUCTION
Over the past few years, omnidirectional imaging has been proposed by several authors to solve mobile
autonomous robotics tasks, since it has proved to be a robust option (Payá et al., 2017). This type of camera
is able to provide a large amount of information from the environment that surrounds it, with a field of view
of 360 deg., in a single snapshot. For instance, Abadi et al. propose an omnidirectional vision system to detect
obstacles through an algorithm to carry out autonomous navigation (Abadi et al., 2015). More recently, Liu et
al. propose an accurate estimation of the position and orientation of the robot within an outdoor environment
by means of omnidirectional images (Liu et al., 2018).
To carry out the localization and mapping tasks by using this visual information, the most relevant
information must be extracted from the images. Between the two most common approaches, this work
proposes the use of global-appearance (or holistic) description methods, since this methodology leads to more
direct localization algorithms based on a pairwise comparison between descriptors. For example, Korrapati
and Mezouar use global-appearance descriptors to create topological maps by means of omnidirectional
images (Korrapati and Mezouar, 2017), and Do et al. use global-appearance description together with Group
LASSO Regression to develop autonomous mobile navigation (Do et al., 2018).
As for hierarchical localization, the process conducted in previous works such as (Pronobis and Jensfelt,
2011) or (Payá et al., 2018) consists basically in (1) carrying out a rough but fast localization in a high-level
map composed of representative descriptors and, after that, (2) solving the fine localization in a low-level map
composed of the instances that are represented by the descriptors selected in the rough step. These previous
works have proved the effectiveness
of hierarchical maps to solve the localization prob-
lem departing from global-appearance descriptors ob-
tained from omnidirectional images. In particular, the
aim of the present work is to carry out the localization
task in indoor environments using omnidirectional vi-
sual information as a simple image retrieval problem
and also to solve the localization by means of hierar-
chical topological models.
Regarding the use of Artificial Intelligence (AI), these techniques have been proposed in many contributions
to improve the performance of mapping and localization algorithms in mobile robotics. For instance,
Dymczyk et al. propose the use of a classifier to classify landmark observations and conduct the localization
task more robustly (Dymczyk et al., 2018). Meattini et al. present a human-robot interface in which the robot
learns the optimal hand configuration for grasping through electromyography sensors, merging pattern
recognition and factorization techniques (Meattini et al., 2018). Within the AI approaches, the deep learning
branch has gained much popularity for solving these problems by means of computer vision. These methods
try to automatically construct high-level data models through architectures that allow linear, non-linear,
multiple and iterative transformations (Bengio et al., 2013) of the initial data matrices. The idea is to train the
architecture to reach a model that is capable of creating representations which best define the inputs.
Regarding robotics, a number of previous works propose the use of deep learning techniques. For example,
Lenz et al. use a deep learning approach to solve the problem of detecting robotic grasps (Lenz et al., 2015);
as for mobile robotics, Zhu et al. propose deep reinforcement learning to address target-driven visual
navigation (Zhu et al., 2017). The aim of the present work is to solve the visual localization task through
Convolutional Neural Networks (CNNs), since these networks have been successfully used in computer
vision applications such as face recognition or navigation in self-driving cars. The idea in this case is to create
a CNN that is able to distinguish between the different rooms of an indoor environment in order to correctly
estimate in which room the robot currently is. There are well-known CNN architectures, such as AlexNet,
which was introduced by Krizhevsky et al. (Krizhevsky et al., 2012). This network consists of eight layers
(five convolutional layers and three fully connected layers) with a final 1000-way softmax and three pooling
layers, and it is trained to classify images into 1000 object categories. GoogLeNet was proposed by Szegedy
et al. (Szegedy et al., 2015). It has 22 layers and it is also trained for object classification, but it uses 12 times
fewer parameters than AlexNet. A broad review of the most outstanding CNNs can be found in (Pak and
Kim, 2017).
These popular networks, together with many others that have produced successful results, have been used
in the present work as a starting point to develop new tools with different objectives; that is, these CNNs are
reused to carry out different tasks. We use the following methods to adapt these networks to our needs.
Reusing common CNN architectures. Transfer learning is a technique that consists in reusing the architecture
and parameters of a CNN as a starting point to build a new CNN with a different aim. The main idea is to
take advantage of most of the intermediate layers, because their parameters have been tuned from millions
of images and contain useful information. This technique can save a huge amount of training time and can
even obtain better results than creating a new network from scratch. This idea has already been used by
authors such as Wozniak et al., who use the transfer learning technique to retrain the VGG-F network to
classify places among 16 rooms acquired by a humanoid robot (Wozniak et al., 2018). Nevertheless, transfer
learning works only if no early layers need to be modified, because otherwise the downstream architecture
and parameters are no longer valid. Therefore, in these situations, transfer learning cannot be used and
training a network from scratch is necessary. Creating an entire network architecture is complex; hence,
rather than trying to build an architecture from scratch, the present work proposes to develop the CNN using
common architectures designed by experts. In this way, the approach is similar to transfer learning (starting
from pre-existing architectures), but the parameters are tuned from scratch.
Generation of global-appearance descriptors from the activation of the intermediate layers. The process is
basically the following. Once the CNN is properly trained to face the desired task, the hidden layers produce
a vector description which is originally used to solve the CNN task, but it can also be extracted as a global-
appearance descriptor of the input image and used for a different purpose. This idea has already been
proposed by some authors such as Mancini et al., who use this visual information to carry out place
categorization with a Naïve Bayes classifier (Mancini et al., 2017). Payá et al. propose CNN-based descriptors
to create hierarchical visual models for mobile robot localization (Payá et al., 2018). Moreover, Cebollada et
al. (Cebollada et al., 2019) tackle an evaluation of global-appearance descriptors obtained
from different layers of the pre-trained places
CNN (Zhou et al., 2014) for mobile localization.
Therefore, the objective of this work is to evalu-
ate the performance of convolutional neural networks
which have been adapted and used to carry out the
mapping and localization tasks for mobile robotics in
indoor environments. The proposed experiments will
measure the efficiency of this tool through its ability
to estimate the position of the robot and the comput-
ing time required for it. Additionally, only images ob-
tained by an omnidirectional vision system are used
as source of information to solve the mapping and
localization tasks. These images are obtained from
an indoor dataset captured under real-operation con-
ditions.
The remainder of the paper is structured as fol-
lows. Section 2 presents briefly the CNN devel-
oped for this work. Section 3 explains the localiza-
tion method proposed by means of the deep learn-
ing tool. After that, section 4 outlines the experi-
ments that were carried out to evaluate the validity of
the proposed method for localization. Last, section 5
presents the conclusions and future works.
2 THE CONVOLUTIONAL
NEURAL NETWORK
DEVELOPED
As section 1 outlines, the objective of this work is to develop and test a localization framework which
performs efficiently in mobile robotics through visual information. A CNN is proposed as a tool to carry out
this task. The aim is to solve the visual localization hierarchically. This paper presents the idea of developing
a CNN which is able to estimate the room in which the robot captured the image. Afterwards, a holistic
descriptor is obtained from an intermediate layer of the same CNN to estimate more accurately the position
of the image within the predicted room. This process will be explained in depth in section 3. Hence, a
classification CNN must first be developed to estimate the room within the environment.
The CNN basically consists in predicting the label of the given input data (in this case, images). The labels
(also known as targets) represent the possible categories within the environment. Before using this tool for
prediction, the model requires training with a large variety of input data (x_train) and their corresponding
labels (y_train). Then, the CNN is ready to receive new data (x_test) and estimate their categories
(y_estimated).
2.1 The Dataset
The dataset of images used to train the CNN is the Freiburg dataset, which has been obtained from the COLD
(COsy Localization Database) database (Pronobis and Caputo, 2009). The COLD database is composed of
images captured from different indoor environments through several sensors under three illumination
conditions (cloudy days, sunny days and at night); the images are also affected by dynamic changes such as
people walking or furniture being moved, and by the blur effect. These images were captured following a
trajectory along the whole environment. Among all the images provided, this work uses the omnidirectional
images captured from the Freiburg environment. This dataset is also used to evaluate the localization task.
Nevertheless, before training the CNN, a conversion from omnidirectional to panoramic images is carried
out with the aim of comparing the obtained results with other global-appearance description methods based
on panoramic or standard images. Additionally, the use of panoramic images constitutes an interesting
option, since CNNs traditionally work with conventional (non-panoramic) images.
Fig. 1 shows the bird's eye view of the Freiburg environment and the path that the robot traversed to obtain
the images. The images of the Freiburg dataset were captured in 9 different rooms. The cloudy dataset was
captured during cloudy days and it is the least affected by illumination conditions; hence, this dataset is used
as the training dataset. The sunny and night datasets provided by the Freiburg COLD database are used to
evaluate the localization task under changes of illumination. Additionally, in order to establish a trustworthy
comparison with previous works, the dataset is downsampled with the objective of obtaining visual
information with a distance of 20 cm between consecutive images. The resulting images compose the training
dataset and the rest of the images are used to create a test dataset, which is used to evaluate the CNN accuracy
and also the efficiency of the proposed hierarchical localization method. Table 1 shows the datasets used in
this work and the number of images that each of them contains.
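The 20 cm downsampling described above can be implemented as a simple distance-based selection along the trajectory. The following is a minimal sketch, assuming the ground-truth (x, y) coordinates of the images are available and ordered along the path; the function name and parameter values are illustrative, not taken from the original implementation.

import numpy as np

def downsample_by_distance(positions, min_dist=0.20):
    """Keep image indices so that consecutive kept images are at least
    `min_dist` metres apart along the trajectory.

    positions: (N, 2) array of ground-truth (x, y) coordinates of the N
    images, ordered along the robot path. Returns the kept indices.
    """
    kept = [0]                        # always keep the first image
    last = positions[0]
    for i in range(1, len(positions)):
        if np.linalg.norm(positions[i] - last) >= min_dist:
            kept.append(i)
            last = positions[i]
    return kept

# Illustrative usage: the images that are not kept form the test dataset.
# train_idx = downsample_by_distance(ground_truth_xy, 0.20)
# test_idx = [i for i in range(len(ground_truth_xy)) if i not in set(train_idx)]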
Due to the number of parameters which compose a CNN, a large image dataset is required to tune them.
Nevertheless, the datasets available to solve a specific task are not always as large as required to train a CNN
from scratch and, in that case, the trained deep model cannot reach enough accuracy. This issue has been
commonly solved through data augmentation. This technique basically consists in creating new data by
applying different effects over the original images.
Figure 1: Bird's eye view of the Freiburg environment. Extracted from (Ullah et al., 2007). The original figure shows the maps of the two parts of the laboratory, with the room labels and the approximate paths followed by the robot during data acquisition; the red dashed line is the path selected to obtain the images.
Table 1: Number of images of the training and test datasets in each room. Images obtained from the Freiburg environment.

Room | Training images | Test images
1. Printer area | 44 | 285
2. Corridor | 212 | 1182
3. Kitchen | 51 | 229
4. Large office | 34 | 132
5. 2-persons office 1 | 46 | 233
6. 2-persons office 2 | 26 | 158
7. 1-person office | 31 | 218
8. Bathroom | 49 | 190
9. Stairs area | 26 | 151
Total | 519 | 2778
To cite one example, Guo and Gould proposed the use of data augmentation to improve the training of a
CNN with the aim of solving an object detection task (Guo and Gould, 2015). The data augmentation
proposed in this work consists in applying, over the original images, visual effects that can actually occur
when images are captured under real-operation conditions: random rotation, reflection, darkness/brightness
changes, Gaussian noise, occlusions and blur. Fig. 2 shows examples of some of these effects applied over an
original image. Hence, in order to train the CNN, instead of using the 519 images of the original training
dataset, the network is trained with the augmented version (composed of 49824 images).

Figure 2: Example of data augmentation. (a) Original image captured within the Freiburg environment. An effect is applied over each image: (b) random rotation, (c) darkness, (d) Gaussian noise.
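As an illustration of this augmentation step, the sketch below applies three of the effects mentioned above (random rotation, darkening and Gaussian noise) with NumPy and Pillow. The parameter ranges are assumptions, since the paper does not specify the values used.

import numpy as np
from PIL import Image, ImageEnhance

def random_rotation(img, max_deg=15):
    # Rotate the image by a random angle (illustrative range).
    angle = np.random.uniform(-max_deg, max_deg)
    return img.rotate(angle)

def darken(img, factor=0.5):
    # Reduce the global brightness of the image.
    return ImageEnhance.Brightness(img).enhance(factor)

def gaussian_noise(img, sigma=10.0):
    # Add zero-mean Gaussian noise to every pixel.
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def augment(img):
    # Generate several altered copies of one original image.
    return [random_rotation(img), darken(img), gaussian_noise(img)]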
2.2 The Architecture and Training
In this work, we propose to use the AlexNet architecture as the base of the proposed CNN tool. The choice
of AlexNet as the starting architecture is due to the successful performance shown by other authors regarding
its use for transfer learning, such as (Han et al., 2018), and also to the simplicity of its architecture.
Therefore, some layers of the AlexNet architecture are replaced to adapt the output to the desired
classification task (estimation among the 9 rooms which belong to the Freiburg environment in this work)
and also to receive panoramic images as input. As for the replacement of layers to achieve the desired
classification, the last three layers which are replaced are the fully connected layer fc8, the softmax layer and
the classification layer. Additionally, regarding the input layer, since this layer was configured in AlexNet to
receive 227 × 227 × 3 images, it is replaced to receive 128 × 512 × 3 images. Through this last change, although
the parameters of the convolutional layers are reset, we avoid resizing the input images, which could affect
their resolution and, hence, the effectiveness of the created network. After these changes to the original CNN,
the network is ready to be trained with the new data in the training dataset. Fig. 3 shows the final architecture
used throughout this work. We trained the CNN off-line on an NVIDIA GeForce GTX 1080 Ti GPU. The
training time was around 4 hours. After every 30 iterations, the performance of the partially trained network
was evaluated by using the validation data.
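The paper does not state the framework used for training. As a rough equivalent, the following PyTorch sketch builds an AlexNet-style network with its parameters initialised from scratch, replaces the last fully connected layer with a 9-way output and trains it with a softmax cross-entropy loss. The optimiser and learning rate are assumptions, and torchvision's adaptive pooling is used to accommodate the 128 × 512 × 3 input rather than literally replacing the input layer as described above.

import torch
import torch.nn as nn
from torchvision import models

# AlexNet-style architecture with parameters initialised from scratch
# (weights=None, torchvision >= 0.13), mirroring the reset of the
# convolutional parameters described in the text.
net = models.alexnet(weights=None)

# The last fully connected layer is replaced by a 9-way output
# (one class per room of the Freiburg environment).
net.classifier[6] = nn.Linear(4096, 9)

criterion = nn.CrossEntropyLoss()          # softmax + classification loss
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, room_labels):
    """One training step on a batch of panoramic images.

    images: tensor of shape (B, 3, 128, 512); room_labels: tensor of shape (B,).
    """
    optimizer.zero_grad()
    loss = criterion(net(images), room_labels)
    loss.backward()
    optimizer.step()
    return loss.item()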
3 MAPPING AND
LOCALIZATION THROUGH
THE CNN
As explained previously in section 1, one of the aims of this work is to use the holistic descriptors generated
by the intermediate layers of the CNN to carry out the localization task. This description method basically
consists in introducing the image into the CNN and retaining the data stored in one of the layers. In the case
of the fully connected layers which compose the classification phase, they directly provide data arranged in
a vector; hence, these data can be directly used as a global-appearance descriptor. Apart from the descriptors
obtained from the fully connected layers, Cebollada et al. showed that the data from the 2D convolutional
layers conv4 and conv5 are also interesting to obtain characteristic information from the images (Cebollada
et al., 2019). For these layers, the data are arranged in N_ch matrices, where N_ch is the number of channels
in the convolutional layer. Hence, first, a channel is selected and, after that, the corresponding matrix is re-
arranged into a vector that is used as descriptor. The descriptors obtained from the conv4 and conv5 layers
led to better localization results than the descriptors obtained from fc6, fc7 and fc8. This is due to the fact that
CNNs learn to detect features like colour and edges in the first convolution stages and then, in deeper layers,
the network learns more complicated features related to the problem to solve (in the case of AlexNet, object
classification). Moreover, the size of the descriptors obtained from the convolutional layers is smaller; hence,
the localization algorithm requires a lower computing time.

Figure 3: The CNN architecture created by departing from the AlexNet architecture. The input layer is replaced to receive images with 128 × 512 × 3 size and the last three layers (fc8, softmax and the classification layer) are also replaced to adapt the network to the proposed classification task.
Therefore, this work evaluates the use of the layers conv4, conv5, fc6, fc7 and fc8 of the retrained CNN to
obtain holistic descriptors for solving the mapping and localization tasks. This paper also presents a
comparison between these global-appearance descriptors and classic descriptors based on analytic tools such
as HOG (Histogram of Oriented Gradients) (Dalal and Triggs, 2005) or gist (Oliva and Torralba, 2006) to solve
the mapping and localization task by means of panoramic images.
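Continuing the PyTorch sketch of section 2.2, a holistic descriptor can be read from an intermediate layer with a forward hook: fully connected activations are used directly as a vector, while for a convolutional layer one channel of the activation map is selected and flattened. The layer indices below follow torchvision's AlexNet definition and are assumptions with respect to the authors' network.

import torch

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# In torchvision's AlexNet: conv4 = features[8], conv5 = features[10],
# fc6 = classifier[1], fc7 = classifier[4], fc8 = classifier[6].
layer_modules = {"conv4": net.features[8], "conv5": net.features[10],
                 "fc6": net.classifier[1], "fc7": net.classifier[4],
                 "fc8": net.classifier[6]}
for name, module in layer_modules.items():
    module.register_forward_hook(save_activation(name))

def describe(image, layer="conv4", channel=0):
    # image: tensor of shape (1, 3, 128, 512). One forward pass fills `activations`.
    with torch.no_grad():
        net(image)
    act = activations[layer].squeeze(0)
    if act.dim() == 3:               # convolutional layer: pick one channel and flatten it
        return act[channel].flatten()
    return act.flatten()             # fully connected layer: already a vector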
This way, regarding the mapping task, the CNN
is trained for two purposes: (1) estimating the room
(used in the first step of the hierarchical localiza-
tion) and (2) obtaining holistic description informa-
tion from a layer (used to solve the conventional lo-
calization task as an image retrieval problem and also
in the second step of the hierarchical localization).
As for the conventional localization task, the whole process is as follows. A test image im_test is captured
from an unknown position within the environment. The holistic descriptor d_test is obtained from the CNN
and, after that, it is compared with all the descriptors contained in the training dataset D = {d_1, d_2, ...,
d_Ntrain} and the most similar descriptor d_k is retained. Last, the position of im_test is estimated as the
coordinates where im_k was captured.
Finally, if a hierarchical localization alternative is desired instead of the conventional method, the process
conducted in previous works such as (Payá et al., 2018) or (Cebollada et al., 2019) consists basically in a nearest
neighbour search with different levels of granularity. Nevertheless, the process proposed in this work is the
following. A test image im_test is introduced into the CNN and an estimation of the most likely room c_i in
which the image was captured is obtained (rough localization step). Apart from the estimated room, the CNN
also provides the holistic descriptor d_test from a selected layer. Afterwards, a nearest neighbour search is
carried out (fine localization step). That is, the obtained descriptor d_test is compared with the descriptors
D_ci = {d_ci,1, d_ci,2, ..., d_ci,Ni} from the training dataset which belong to the predicted room c_i, and then
the most similar descriptor d_ci,k is retained. Finally, the position of im_test is estimated as the coordinates
where im_ci,k was captured. Fig. 4 shows a diagram of the hierarchical localization method proposed in the
present work.
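Combining the previous sketches, the two-step process could look as follows; the per-room descriptor matrices and position arrays are assumed to have been built beforehand from the training dataset, and the network and hooks of the earlier sketches are reused.

import torch

def hierarchical_localize(image, room_descriptors, room_positions, layer="conv4", channel=0):
    # image: tensor of shape (1, 3, 128, 512). room_descriptors[c] is the (N_c, L)
    # tensor of training descriptors of room c; room_positions[c] the (N_c, 2)
    # capture coordinates of its images.
    with torch.no_grad():
        scores = net(image)                 # rough step: the same forward pass fills `activations`
    c_i = int(scores.argmax(dim=1))         # most likely room
    d_test = activations[layer].squeeze(0)  # holistic descriptor from the chosen layer
    if d_test.dim() == 3:
        d_test = d_test[channel].flatten()  # convolutional layer: one channel, flattened
    dists = torch.linalg.norm(room_descriptors[c_i] - d_test, dim=1)  # fine step: search only inside room c_i
    k = int(dists.argmin())
    return room_positions[c_i][k]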
4 EXPERIMENTS
The training of the CNN, as well as the experiments detailed in this section, have been carried out on a PC
with an Intel Core i7-7700 CPU at 3.6 GHz. Moreover, the training of the CNN was performed on an NVIDIA
GeForce GTX 1080 Ti GPU. This paper presents two experiments. Additionally, the datasets presented in
subsection 2.1 were used to carry out the training of the CNN, the mapping task and the later evaluation of
the proposed localization method.
Throughout the experiments tackled to evaluate
the goodness of the localization methods, two param-
eters are considered to check the accuracy and effi-
ciency: (1) the average localization error, which mea-
sures the Euclidean distance between the position es-
timated and the real position where the test image was
captured (obtained by the ground truth); and (2) the
average computing time required to estimate the po-
sition of the test image.
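For reference, the first of these parameters can be computed as a simple mean of Euclidean distances; this is a straightforward sketch, not the authors' evaluation code.

import numpy as np

def average_localization_error(estimated_xy, ground_truth_xy):
    # Mean Euclidean distance between the estimated and ground-truth positions,
    # in the same units as the coordinates (e.g. cm).
    diffs = np.asarray(estimated_xy) - np.asarray(ground_truth_xy)
    return np.linalg.norm(diffs, axis=1).mean()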
4.1 Experiment 1: Comparison between
Localization Methods
This subsection presents the results obtained with the proposed localization algorithm, which uses the global-
appearance descriptors obtained from different layers of the trained CNN. Moreover, these results are also
compared with other global-appearance description methods based on classical analytic methods, whose
configuration is selected from previous works (Cebollada et al., 2019). The results obtained through the use
of analytic descriptors (HOG and gist) and the descriptors based on deep learning are shown in table 2. This
table shows the size of the descriptor, the average localization error (cm) and the average computing time to
estimate the position of the test images (ms). Regarding the localization error, the descriptor obtained from
the layer conv4 presents the minimum value, followed by the descriptors from the layers conv5 and fc6. As
for the computing time, the fastest option is also achieved with the conv4 layer, since the data of this layer
are calculated at a very early stage of the CNN architecture and the holistic descriptor obtained from it has a
relatively small size. In general, the values obtained by using the CNN trained with the Freiburg training
dataset to obtain the holistic descriptors improve the localization task in comparison with the descriptors
calculated by analytic methods. Considering the localization error and the computing time measured, any of
the layers conv4, conv5, fc6 or fc7 can be considered to carry out this task. Nevertheless, although fc8 outputs
a good computing time, the localization error obtained is considerably worse than for the rest of the layers.
The descriptor obtained from this layer, with only 9 components, allows fast computations, but the
information it provides is not enough to characterize the main information of the images.
Figure 4: Hierarchical localization diagram. After capturing a test image im_test, it is introduced into the CNN and the most likely room c_i is estimated. At the same time, the holistic descriptor d_test is obtained from one of the layers and a nearest neighbour search is carried out with the descriptors of the training dataset included in the predicted room. The most similar descriptor (the one which produces the minimum distance with d_test) is retained. The position of im_test is estimated as the position where im_ci,k was captured.

Table 2: Conventional localization results obtained through the use of the holistic descriptors obtained from the Freiburg CNN and through the use of the gist and HOG description methods. The table shows the size of the descriptor, the average localization error and the average computing time.

Descriptor | Size | Avg. error (cm) | Avg. computing time (ms)
Freiburg CNN, conv4 | 180 | 5.07 ± 0.17 | 6.7
Freiburg CNN, conv5 | 180 | 5.09 ± 0.17 | 7.7
Freiburg CNN, fc6 | 4096 | 5.09 ± 0.17 | 44.55
Freiburg CNN, fc7 | 4096 | 5.14 ± 0.18 | 46.26
Freiburg CNN, fc8 | 9 | 16.60 ± 29.72 | 7.52
Analytical, gist | 128 | 5.19 ± 0.18 | 10.75
Analytical, HOG | 64 | 16.34 ± 0.78 | 45.02

4.2 Experiment 2: Hierarchical Localization

As it was explained in section 3, the hierarchical localization consists in solving the localization task in
several steps.
The hierarchical localization proposed in this work is based on two steps: a rough and a fine
localization step. As for the rough localization step, an evaluation of the trained CNN is carried out. The
network is trained by using the adapted CNN architecture together with the augmented dataset and the
selected training options. The obtained CNN is evaluated with the cloudy test dataset by introducing these
images into the network and obtaining the accuracy acc_% = (N_ok / N_test) × 100, where N_ok is the number
of images whose room is correctly predicted and N_test is the total number of images that compose the
cloudy test dataset. Through this evaluation, the accuracy obtained is 98.71%. Additionally, fig. 5 shows the
confusion matrix obtained. From it, we can observe that few wrong predictions are produced. Furthermore,
all the wrongly predicted rooms are adjacent to the correct one. For instance, the images belonging to the
stairs area that were wrongly classified were assigned to the bathroom or the corridor, which are contiguous
to the correct room. Therefore, the conclusion is that the trained CNN is ready to predict in which room the
input image was captured.
Figure 5: Confusion matrix obtained after solving the first step of the hierarchical localization (room classification) with all the test images, with the trained CNN. Rows correspond to the true class and columns to the predicted class (1. Printer area, 2. Corridor, 3. Kitchen, 4. Large office, 5. 2-persons office 1, 6. 2-persons office 2, 7. 1-person office, 8. Bathroom, 9. Stairs area).
Regarding the fine localization step, it consists in finding the nearest neighbour by comparing the holistic
descriptor obtained from a layer of the CNN with the descriptors of the training dataset included in the
predicted room. This experiment has evaluated the efficiency of the five holistic descriptors obtained from
the different CNN layers by measuring the average error and the average computing time, calculated
according to the process described in fig. 4, to carry out the localization task. Moreover, with the aim of
comparing the results obtained through the proposed method with other hierarchical localization methods,
this experiment also establishes a comparison between the proposed method (rough step with the CNN and
fine step with nearest neighbour) and the methods evaluated in previous works (Cebollada et al., 2019) (rough
and fine steps both solved by nearest neighbour). For these methods from previous works, the global-
appearance descriptors used are the gist descriptor and the descriptor obtained from the layer fc6 of AlexNet,
and the high-level map is composed of 10 representatives which were selected by using a spectral clustering
algorithm. Fig. 6 shows the results obtained by using the proposed method with the different descriptors
from the CNN layers and also the results obtained by the methods proposed in previous works.
Regarding the different methods evaluated to carry out the hierarchical localization, this experiment shows
that the proposed method performs substantially better than the alternative methods previously proposed.
Fig. 6 shows that the five description methods based on the Freiburg CNN present a lower localization error,
and also that the time required to solve this task is lower than with the methods based solely on the nearest
neighbour search. Among the five holistic descriptors obtained from the CNN, conv4 and conv5 output the
best solutions, since their localization error as well as their computing time are lower than those obtained
through the fully connected layers. These results match the conclusion previously reached in (Cebollada et
al., 2019) about the use of 2D convolutional layers to obtain holistic descriptors.
As for the results obtained by the conventional and hierarchical localization methods, the conclusion
obtained after comparing the results of table 2 and fig. 6 is that the hierarchical localization method introduces
a faster performance, but it also produces an increase of the localization error. This is due to the fact that the
CNN allows a faster rough localization step, but this network produces a small number of wrong room
predictions that have a negative influence on the average localization error.
5 CONCLUSIONS
In this work, a study is tackled regarding the use of deep learning to build hierarchical topological models for
localization. We also evaluate the ability of the proposed deep learning tool to create holistic descriptors to
solve the localization problem based on a nearest neighbour search. Regarding the proposed hierarchical
localization method, it consists in creating a convolutional neural network for classification. This classifier is
not only used in the rough localization step to predict the correct room where a test image was captured, but
it is also used to obtain a holistic descriptor which characterizes the image. In this work, five layers have been
evaluated: conv4, conv5, fc6, fc7 and fc8. The training and evaluation of all the localization and description
methods have been carried out with a panoramic image dataset which contains real-operation effects such
as changes in the position of furniture, people walking, blur, etc.
As for the use of this CNN to produce global-appearance descriptors for solving the conventional localization
by means of a nearest neighbour method, the five descriptors extracted from different layers of the CNN are
evaluated together with other analytic holistic methods commonly used for these purposes. The results
obtained show that the proposed methods are more robust, since they output a lower localization error and
computing time than the results obtained by the analytic methods (gist and HOG).

Figure 6: Hierarchical localization methods. Nearest neighbour with either the gist descriptor or the layer fc6 of AlexNet, and the proposed method based on retrieving the room from the Freiburg CNN and, after that, solving the fine localization by nearest neighbour with the descriptor obtained from the layers conv4, conv5, fc6, fc7 or fc8 of the CNN. The chart reports, for each method, the average localization error (cm) and the average computing time (ms).
Regarding the hierarchical localization proposed in this work, it has been compared with a method based on
obtaining the nearest neighbour through different levels of the model. Prior to this comparison, through fig.
5, we have shown the accuracy of the trained CNN to estimate the correct room within the evaluated
environment. As for the whole localization process, this work shows the evaluation of both methods by using
different global-appearance description methods. The method proposed in this paper has proved to be more
efficient, since its computing time and localization error are lower than those obtained by means of the
nearest neighbour method.
Among the five holistic descriptors obtained from the trained CNN, the descriptor from the layer fc8 can be
discarded, because it does not characterize the images properly enough for the proposed tasks. The
descriptors related to the conv4 and conv5 layers have produced the optimal localization solutions among all
the methods evaluated, since the size of the descriptor is relatively small and it leads to a low computing
time. Despite their size, their localization results are also the most accurate: they produce an average error of
around 5 cm departing from a training dataset whose average distance between adjacent images is around
20 cm.
In future works, we will extend the evaluation in order to assess the performance of the proposed methods
under changes of illumination. Furthermore, we will check whether this CNN is useful to obtain global-
appearance descriptors in similar environments. We will also consider other newer and more complex CNN
architectures such as ResNet or VGG Net. Last, we would also like to create and evaluate a CNN based
directly on omnidirectional images instead of panoramic ones.
ACKNOWLEDGEMENTS
This work has been supported by the Generalitat Valenciana and the FSE through the grants ACIF/2017/146
and ACIF/2018/224, by the Spanish government through the project DPI 2016-78361-R (AEI/FEDER, UE):
"Creación de mapas mediante métodos de apariencia visual para la navegación de robots", and by the
Generalitat Valenciana through the project AICO/2019/031: "Creación de modelos jerárquicos y localización
robusta de robots móviles en entornos sociales".
The authors declare that there are no competing inter-
ests regarding the publication of this paper.
REFERENCES
Abadi, M. H. B., Oskoei, M. A., and Fakharian, A. (2015).
Mobile robot navigation using sonar vision algorithm
applied to omnidirectional vision. In 2015 AI &
Robotics (IRANOPEN), pages 1–6. IEEE.
Bengio, Y., Courville, A., and Vincent, P. (2013). Represen-
tation learning: A review and new perspectives. IEEE
transactions on pattern analysis and machine intelli-
gence, 35(8):1798–1828.
Cebollada, S., Payá, L., Mayol, W., and Reinoso, O. (2019). Evaluation of clustering methods in compression
of topological models and visual place recognition using global appearance descriptors. Applied Sciences,
9(3):377.
Cebollada, S., Payá, L., Román, V., and Reinoso, O. (2019). Hierarchical localization in topological models
under varying illumination using holistic visual descriptors. IEEE Access, 7:49580–49595.
Cebollada, S., Payá, L., Valiente, D., Jiang, X., and Reinoso, O. (2019). An evaluation between global
appearance descriptors based on analytic methods and deep learning techniques for localization in
autonomous mobile robots. In ICINCO 2019, 16th International Conference on Informatics in Control,
Automation and Robotics (Prague, Czech Republic, 29-31 July, 2019), pages 284–291. Ed. INSTICC.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, San Diego, USA. Vol. II, pp. 886–893.
Do, H. N., Choi, J., Young Lim, C., and Maiti, T. (2018).
Appearance-based localization of mobile robots using
group lasso regression. Journal of Dynamic Systems,
Measurement, and Control, 140(9).
Dymczyk, M., Gilitschenski, I., Nieto, J., Lynen, S., Zeisl, B., and Siegwart, R. (2018). Landmarkboost: Efficient
visual context classifiers for robust localization. In 2018 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pages 677–684.
Guo, J. and Gould, S. (2015). Deep cnn ensemble with
data augmentation for object detection. arXiv preprint
arXiv:1506.07224.
Han, D., Liu, Q., and Fan, W. (2018). A new image clas-
sification method using cnn transfer learning and web
data augmentation. Expert Systems with Applications,
95:43–56.
Korrapati, H. and Mezouar, Y. (2017). Multi-resolution
map building and loop closure with omnidirectional
images. Autonomous Robots, 41(4):967–987.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Lenz, I., Lee, H., and Saxena, A. (2015). Deep learning for
detecting robotic grasps. The International Journal of
Robotics Research, 34(4-5):705–724.
Liu, R., Zhang, J., Yin, K., Pan, Z., Lin, R., and Chen, S.
(2018). Absolute orientation and localization estima-
tion from an omnidirectional image. In Pacific Rim
International Conference on Artificial Intelligence,
pages 309–316. Springer.
Mancini, M., Bulò, S. R., Ricci, E., and Caputo, B. (2017). Learning deep nbnn representations for robust place
categorization. IEEE Robotics and Automation Letters, 2(3):1794–1801.
Meattini, R., Benatti, S., Scarcia, U., De Gregorio, D., Benini, L., and Melchiorri, C. (2018). An sEMG-based
human-robot interface for robotic hands using machine learning and synergies. IEEE Transactions on
Components, Packaging and Manufacturing Technology, 8(7):1149–1158.
Oliva, A. and Torralba, A. (2006). Building the gist of a scene: the role of global image features in recognition.
In Progress in Brain Research: Special Issue on Visual Perception. Vol. 155.
Pak, M. and Kim, S. (2017). A review of deep learning in
image recognition. In 2017 4th international confer-
ence on computer applications and information pro-
cessing technology (CAIPT), pages 1–3. IEEE.
Payá, L., Gil, A., and Reinoso, O. (2017). A state-of-the-art review on mapping and localization of mobile
robots using omnidirectional vision sensors. Journal of Sensors, 2017.
Payá, L., Peidró, A., Amorós, F., Valiente, D., and Reinoso, O. (2018). Modeling environments hierarchically
with omnidirectional imaging and global-appearance descriptors. Remote Sensing, 10(4):522.
Pronobis, A. and Caputo, B. (2009). COLD: COsy Lo-
calization Database. The International Journal of
Robotics Research (IJRR), 28(5):588–594.
Pronobis, A. and Jensfelt, P. (2011). Hierarchical multi-
modal place categorization. In ECMR, pages 159–
164.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Ullah, M. M., Pronobis, A., Caputo, B., Luo, J., and Jens-
felt, P. (2007). The cold database. Technical report,
Idiap.
Wozniak, P., Afrisal, H., Esparza, R. G., and Kwolek, B.
(2018). Scene recognition for indoor localization of
mobile robots using deep cnn. In International Con-
ference on Computer Vision and Graphics, pages 137–
147. Springer.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva,
A. (2014). Learning deep features for scene recog-
nition using places database. In Advances in neural
information processing systems, pages 487–495.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-
Fei, L., and Farhadi, A. (2017). Target-driven visual
navigation in indoor scenes using deep reinforcement
learning. In 2017 IEEE International Conference on
Robotics and Automation (ICRA), pages 3357–3364.