Evaluation of Long-term Deep Visual Place Recognition
Farid Alijani 1,a, Jukka Peltomäki 1,b, Jussi Puura 2, Heikki Huttunen 3,c, Joni-Kristian Kämäräinen 1,d and Esa Rahtu 1,e
1 Tampere University, Finland
2 Sandvik Mining and Construction Ltd, Finland
3 Visy Oy, Finland
Keywords: Tracking and Visual Navigation, Content-Based Indexing, Search, and Retrieval, Deep Convolutional Neural Network, Deep Learning for Visual Understanding.
Abstract: In this paper, we provide a comprehensive study evaluating two state-of-the-art deep metric learning methods for visual place recognition. Visual place recognition is an essential component of visual localization and vision-based navigation, where it provides an initial coarse location. It is used in a variety of autonomous navigation technologies, including autonomous vehicles, drones and computer vision systems. We study recent visual place recognition and image retrieval methods and use them to conduct extensive and comprehensive experiments on two diverse and large long-term indoor and outdoor robot navigation datasets, COLD and Oxford Radar RobotCar, along with ablation studies on the crucial parameters of the deep architectures. Our comprehensive results indicate that the methods can achieve 5 m outdoor and 50 cm indoor place recognition accuracy with a high recall rate of 80%.
1 INTRODUCTION
The question of "where was this photo taken?" has been a widespread research interest in multiple fields, including computer vision and robotics, for many years (Zhang et al., 2020). Recently, researchers have employed advanced deep learning techniques to address this question (Masone and Caputo, 2021). The performance of a navigation algorithm depends heavily on accurate robot localization, which makes localization an important research topic in robotics.
Visual place recognition is the problem of recognizing a previously visited place using its visual content. Similar to the image retrieval problem, the visual content can be matched with the places already stored in a gallery database. It is the first step in hierarchical visual localization (Sarlin et al., 2019; Xu et al., 2002), which consists of two steps: (1) coarse localization and (2) pose refinement.
a https://orcid.org/0000-0003-3928-7291
b https://orcid.org/0000-0002-9779-6804
c https://orcid.org/0000-0002-6571-0797
d https://orcid.org/0000-0002-5801-4371
e https://orcid.org/0000-0001-8767-0864
Visual localization itself is a core component of vision-based mobile robot navigation (DeSouza and Kak, 2002; Bonin-Font et al., 2008). Our paper presents an extensive evaluation of two state-of-the-art deep learning methods for long-term visual place recognition in indoor and outdoor environments.
Two recent papers (Sattler et al., 2020; Pion et al., 2020) demonstrate the superior localization performance of approaches that rely solely on deep metric learning, compared with conventional engineered features or feature transforms. They measure the refined 6-Degree-of-Freedom localization accuracy in 3D, but do not factorize the contributions of (1) coarse localization and (2) pose refinement.
In this paper, we focus mainly on the coarse localization step. In particular, we define coarse localization as a place recognition problem and study the performance and settings of two state-of-the-art deep metric learning architectures on two diverse and large long-term indoor and outdoor robot navigation datasets. We utilize one of the best performing CNN architectures in deep visual place recognition, NetVLAD (Arandjelović et al., 2018), and another in deep image retrieval, by Radenović et al. (Radenović et al., 2019).
Both methods utilize deep metric learning to learn feature embeddings of images such that the distances between images captured in nearby places are clearly smaller than those between images from distant locations. We compare the two methods using two of the largest available outdoor and indoor datasets: Oxford Radar RobotCar (Barnes et al., 2020) and the CoSy Localization Database (COLD) (Pronobis and Caputo, 2009). We also provide an ablation study on the main performance factors of the deep learning methods: the selection of the backbone network and the amount of training data.
The rest of this paper is organized as follows. Section 2 briefly discusses the related work in visual place recognition. Section 3 describes the baseline methods used in our paper. Section 4 presents the two real-world indoor and outdoor datasets used to address the visual place recognition problem. Section 5 reports the experimental results and, finally, Section 6 concludes the paper.
2 RELATED WORK
Visual Place Recognition Methods. Given a query image, an image retrieval system aims to retrieve all images from a large database that contain features similar to the query image. Visual place recognition can also be interpreted as an image retrieval problem which recognizes a place by matching it against all places in a reference dataset. Prior research has thoroughly surveyed visual place recognition methods. Lowry et al. (Lowry et al., 2016) define the problem and survey the classical hand-crafted or shallow learned descriptors, while Zhang et al. (Zhang et al., 2020) and Masone and Caputo (Masone and Caputo, 2021) focus entirely on deep learning approaches to visual place recognition.
Conventional place recognition methods mainly rely on handcrafted local features and global descriptors at their core. Popular local feature descriptors are SIFT (Lowe, 2004), SURF (Bay et al., 2008) and HOG (Dalal and Triggs, 2005). Commonly used global descriptors, including DBoW (Gálvez-López and Tardós, 2012), FAB-MAP (Cummins and Newman, 2008; Cummins and Newman, 2011) and the landmark-based relocalization approach (Williams et al., 2011), are all based on handcrafted local image features and have been widely used for visual SLAM and localization tasks.
Today, handcrafted features are consistently outperformed by deep features that can be trained to be robust to geometric transformations and illumination changes. The deep architectures often employ pre-trained backbone networks that extract powerful semantic features. The backbone networks are trained with image classification datasets such as ImageNet (Russakovsky et al., 2015). By localization-specific fine-tuning, the deep features are optimized for image matching (Zhang et al., 2020; Radenovic et al., 2018).
The main objective of utilizing deep architectures for image retrieval and visual place recognition is to learn a powerful and meaningful feature mapping which allows comparing images using similarity measures such as Euclidean distance or cosine similarity.
In this context, a large number of architectures have been proposed: MAC (Azizpour et al., 2015), SPoC (Yandex and Lempitsky, 2015), CroW (Kalantidis et al., 2016), GeM (Radenović et al., 2019), R-MAC (Tolias et al., 2016b), modified R-MAC (Gordo et al., 2017) and NetVLAD (Arandjelović et al., 2018). For this paper, we selected two methods that perform well on public benchmarks and for which the original code is publicly available: GeM (Radenović et al., 2019) and NetVLAD (Arandjelović et al., 2018).
Visual Place Recognition Datasets. Over the past few years, researchers have collected and published several datasets for the problem of visual place recognition with various sensor modalities, including monocular and stereo cameras, LiDAR, IMU, radar and GNSS/INS sensors. For example, Cummins and Newman (Cummins and Newman, 2008) introduced the New College and City Centre dataset. Other popular datasets are the KITTI Odometry benchmark (Geiger et al., 2012), the Canadian Adverse Driving Conditions dataset (Pitropov et al., 2021), the Ford campus vision and LiDAR dataset (Pandey et al., 2011), the InLoc dataset (Taira et al., 2021), the extended CMU seasons dataset (Badino et al., 2011; Sattler et al., 2018), the Málaga Urban dataset (Blanco-Claraco et al., 2014), the Alderley dataset (Milford and Wyeth, 2012), the Aachen dataset (Sattler et al., 2012) and the Nordland dataset (Olid et al., 2018).
Our objective is to address both outdoor and indoor place recognition and to focus specifically on long-term visual place recognition, i.e., how robust the deep representations are to changes over time, including illumination and weather. Therefore, we selected two challenging datasets, COLD (Pronobis and Caputo, 2009) for indoor experiments and Oxford Radar RobotCar (Barnes et al., 2020) for outdoor experiments, that provide the same places captured in various different conditions.
3 METHODS
In this work, we concentrate on deep image representations obtained by deep convolutional neural network (CNN) architectures which, given an input image, produce a global descriptor (feature vector) describing the visual content of the image. In the following, we briefly explain the processing pipelines of the Generalized Mean (GeM) method by Radenović et al. (Radenović et al., 2019), which obtains good performance on image retrieval datasets, and NetVLAD (Arandjelović et al., 2018), which performs well on place recognition datasets.
3.1 Radenović et al.
For training, Radenović et al. (Radenović et al., 2016; Radenović et al., 2019) adopt the Siamese neural network architecture. The Siamese architecture is trained using positive and negative image pairs, and the loss function enforces large distances between negative pairs (images from two distant places) and small distances between positive pairs (images from the same place). Radenović et al. (Radenović et al., 2019) use the contrastive loss (Chopra et al., 2005), which acts on matching (positive) and non-matching (negative) pairs and is defined as follows:
$$L = \begin{cases} l(\vec{f}_a, \vec{f}_b), & \text{for matching images} \\ \max\big(0,\, M - l(\vec{f}_a, \vec{f}_b)\big), & \text{otherwise} \end{cases} \quad (1)$$
where l is the pair-wise distance term (Euclidean distance) and M is the enforced minimum margin between the negative pairs. f_a and f_b denote the deep feature vectors of images I_a and I_b computed using the convolutional head of a backbone network such as AlexNet, VGGNet or ResNet.
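As a concrete illustration, the following minimal PyTorch sketch implements the contrastive loss of Eq. (1) for a batch of descriptor pairs; the function name, batch layout and the margin value of 0.7 are illustrative assumptions and not taken from the authors' implementation.

```python
import torch

def contrastive_loss(f_a, f_b, is_match, margin=0.7):
    """Pairwise contrastive loss of Eq. (1) for a batch of descriptor pairs.

    f_a, f_b : descriptors of shape (B, K)
    is_match : boolean tensor of shape (B,); True for matching (positive) pairs
    margin   : the minimum margin M between negative pairs (illustrative value)
    """
    d = torch.norm(f_a - f_b, p=2, dim=1)        # pair-wise Euclidean distance l(f_a, f_b)
    loss_pos = d                                 # pull matching pairs together
    loss_neg = torch.clamp(margin - d, min=0.0)  # push non-matching pairs beyond the margin M
    return torch.where(is_match, loss_pos, loss_neg).mean()
```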
The typical feature vector lengths K are 256, 512 or 2048, depending on the backbone. The feature vectors are global descriptors of the input images, pooled over the spatial dimensions. In the MAC descriptor, the responses of the K feature maps X_k of the last convolutional layer are max-pooled, i.e., the maximum spatial response of each feature map forms one element of the MAC vector:

$$\vec{f} = [\,f_1 \;\, f_2 \;\, \ldots \;\, f_K\,], \qquad f_k = \max_{x \in X_k}(x). \quad (2)$$
Radenović et al. originally used the MAC vectors (Radenović et al., 2016), but in their more recent paper (Radenović et al., 2019) compared MAC vectors to the average pooling SPoC vector and the Generalized Mean pooling (GeM) vector and found that the GeM pooling layer provides the best average retrieval accuracy.
Radenović et al. (Radenović et al., 2019) propose the GeM pooling layer as a generalization of MAC (Azizpour et al., 2015; Tolias et al., 2016a) and SPoC (Yandex and Lempitsky, 2015). This pooling layer takes the feature maps χ as input and produces a vector f = [f_1, f_2, ..., f_i, ..., f_K]^T as the output of the pooling process:

$$f_i = \left( \frac{1}{|\chi_i|} \sum_{x \in \chi_i} x^{p_i} \right)^{\frac{1}{p_i}} \quad (3)$$
The MAC and SPoC pooling methods are special cases of GeM depending on the pooling parameter: p_i → ∞ and p_i = 1 correspond to max pooling and average pooling, respectively. The GeM feature vector contains a single value per feature map, and its dimension depends on the network, i.e., K ∈ {256, 512, 2048}. GeM also adopts a Siamese architecture to train the networks for image matching.
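A minimal PyTorch sketch of the GeM pooling layer of Eq. (3) could look as follows; using a single shared learnable exponent (instead of a per-channel p_i) and initializing it to p = 3 are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Sketch of Generalized Mean (GeM) pooling, Eq. (3).

    Input : feature maps of shape (B, K, H, W) from the backbone head.
    Output: a (B, K) global descriptor, one value per feature map.
    """
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        # a single learnable pooling exponent; a per-channel p_i is also possible
        self.p = nn.Parameter(torch.ones(1) * p)
        self.eps = eps

    def forward(self, x):
        x = x.clamp(min=self.eps).pow(self.p)   # x^p, clamped for numerical stability
        x = x.mean(dim=(-2, -1))                # average over the spatial locations
        return x.pow(1.0 / self.p)              # (.)^(1/p)
```

Increasing p towards infinity approaches the MAC descriptor of Eq. (2), while p = 1 reduces the layer to SPoC average pooling.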
The main pipeline of Radenović et al. (Radenović et al., 2019) is shared by most deep metric learning approaches for image retrieval, but its unique components are the proposed supervised whitening post-processing and the effective positive and negative sample mining. More details are described in (Radenović et al., 2016) and (Radenović et al., 2018) and available in the code provided by the original authors.
3.2 NetVLAD
The main advantage of the Radenović et al. (Radenović et al., 2019) architecture is its straightforward implementation, as it uses standard CNN layers available in the PyTorch and TensorFlow libraries: conv layers, softmax, L2-normalization and the final aggregation layers MAC, SPoC and GeM. The NetVLAD architecture, on the contrary, contains a special layer that produces a higher-dimensional feature vector.
NetVLAD (Arandjelović et al., 2018) implements a function f that maps a given image I_i to a global feature vector f(I_i). This function is used to extract the feature vectors of the entire database {I_i}, identified as the gallery set. Visual search is then performed between the query representation f(q) and the gallery representations f(I_i) using the Euclidean distance d(q, I_i) to obtain the top-N matches. NetVLAD is inspired by the conventional VLAD (Jégou et al., 2010), which uses handcrafted SIFT descriptors (Lowe, 2004) and VLAD encoding to form f(I). NetVLAD is a data-optimized version of VLAD for place recognition and image retrieval. NetVLAD is defined by a set of parameters θ and denoted f_θ(I), and the Euclidean distance d_θ(I_i, I_j) = ||f_θ(I_i) − f_θ(I_j)|| depends on the same parameters.
To learn the representation end-to-end, NetVLAD contains two main building blocks: (1) a CNN cropped at its last convolutional layer, acting as a dense descriptor with output size H × W × D, corresponding to a set of D-dimensional descriptors extracted at H × W spatial locations; and (2) a trainable generalized VLAD layer (NetVLAD), which pools the extracted descriptors into a fixed image representation whose parameters are trained via backpropagation.
The original VLAD image representation V is a D × K matrix, where D is the dimension of the local image descriptors x_i and K is the number of clusters. It is reshaped into a vector after L2-normalization, and the (j, k) element of V is calculated as follows:

$$V(j,k) = \sum_{i=1}^{N} a_k(\vec{x}_i)\,\big(x_i(j) - c_k(j)\big), \quad (4)$$
where x_i(j) and c_k(j) are the jth dimensions of the ith descriptor and the kth cluster centre, respectively. Note that a_k(x_i) ∈ {0, 1} indicates whether or not the descriptor x_i belongs to the kth visual word. Compared to the original VLAD, the NetVLAD layer is differentiable thanks to its soft assignment of descriptors to multiple clusters:
$$V(j,k) = \sum_{i=1}^{N} \frac{e^{\,w_k^{T} \vec{x}_i + b_k}}{\sum_{k'} e^{\,w_{k'}^{T} \vec{x}_i + b_{k'}}}\,\big(x_i(j) - c_k(j)\big) \quad (5)$$
where w_k, b_k and c_k are the parameters of the k-th cluster, optimized by gradient descent.
For our implementation, we use 64 clusters with the 512-dimensional VGG16 backbone and the 2048-dimensional ResNet50 backbone. Consequently, the NetVLAD feature vector dimension becomes 512 × 64 = 32,768 and 2048 × 64 = 131,072, respectively. Arandjelović et al. (Arandjelović et al., 2018) used PCA dimensionality reduction as a post-processing stage of their implementation; however, we use the full-size feature vector. Since the NetVLAD layer can be easily plugged into any other CNN architecture in an end-to-end manner, we investigate its performance with VGG16 and ResNet50 backbones and report the results in Section 5.
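Under these settings, a minimal PyTorch sketch of the NetVLAD aggregation layer of Eq. (5) could look as follows; the class layout is an illustrative assumption, and the k-means initialization of the cluster centres used by the original authors is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Sketch of the NetVLAD aggregation layer, Eq. (5).

    Assumes dense local descriptors of shape (B, D, H, W) from a cropped backbone
    (D = 512 for VGG16, D = 2048 for ResNet50) and K = 64 clusters, yielding a
    (B, K*D) global descriptor (32,768-D or 131,072-D).
    """
    def __init__(self, num_clusters=64, dim=512):
        super().__init__()
        # soft-assignment parameters w_k, b_k implemented as a 1x1 convolution
        self.conv = nn.Conv2d(dim, num_clusters, kernel_size=1, bias=True)
        # cluster centres c_k (randomly initialized in this sketch)
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):
        B, D, H, W = x.shape
        soft_assign = F.softmax(self.conv(x), dim=1)   # (B, K, H, W), Eq. (5) weights
        soft_assign = soft_assign.flatten(2)           # (B, K, N) with N = H*W
        x_flat = x.flatten(2)                          # (B, D, N)
        # residuals (x_i(j) - c_k(j)), weighted by the soft assignment and summed over i
        residual = x_flat.unsqueeze(1) - self.centroids.unsqueeze(0).unsqueeze(-1)  # (B, K, D, N)
        vlad = (residual * soft_assign.unsqueeze(2)).sum(dim=-1)                    # (B, K, D)
        vlad = F.normalize(vlad, p=2, dim=2)           # intra-normalization per cluster
        vlad = F.normalize(vlad.flatten(1), p=2, dim=1)
        return vlad                                    # (B, K*D)
```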
Apart from designing a CNN architecture as an image representation, obtaining enough annotated training data and designing an appropriate loss function for the place recognition task are crucial. Following (Arandjelović et al., 2018), we adopt a weakly supervised triplet ranking loss L_θ for training. The idea of the triplet ranking loss is twofold: (1) obtain a training dataset of tuples (q, {p_i^q}, {n_j^q}) in which, for every query image q, there exists a set of positives {p_i^q} with at least one image matching the query and a set of negatives {n_j^q}; and (2) learn an image representation f_θ so that d_θ(q, p_i^q) < d_θ(q, n_j^q) for all j. L_θ is defined as a sum of individual losses over all negatives n_j^q and computed as follows:
$$L_\theta = \sum_j \max\Big( \big\| f_\theta(q) - f_\theta(p_i^{q}) \big\|^2 - \big\| f_\theta(q) - f_\theta(n_j^{q}) \big\|^2 + \alpha,\ 0 \Big), \quad (6)$$
where α is the given margin. If the margin between the distance to the negative image and the distance to the best matching positive is violated, the loss is proportional to the amount of violation.
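A minimal PyTorch sketch of the weakly supervised ranking loss of Eq. (6) is given below; selecting the best matching positive by the smallest descriptor distance follows the description above, while the tensor shapes and the margin value of 0.1 are illustrative assumptions.

```python
import torch

def triplet_ranking_loss(f_q, f_pos, f_negs, margin=0.1):
    """Sketch of the weakly supervised triplet ranking loss, Eq. (6).

    f_q    : query descriptor, shape (K,)
    f_pos  : descriptors of the potential positives, shape (P, K);
             the best matching positive (smallest distance) is used
    f_negs : descriptors of the negatives, shape (J, K)
    margin : the margin alpha (illustrative value)
    """
    d_pos = torch.sum((f_q - f_pos) ** 2, dim=1).min()         # squared distance to best positive
    d_neg = torch.sum((f_q - f_negs) ** 2, dim=1)              # squared distances to all negatives
    return torch.clamp(d_pos - d_neg + margin, min=0.0).sum()  # hinge summed over all negatives
```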
4 DATASETS
For our experiments, we selected two publicly available and versatile outdoor and indoor datasets. We defined three different test (query) sequences according to their difficulty level: (1) Test 01 with a set of images similar to the gallery set, but acquired at a different time; (2) Test 02 with moderately changed conditions (e.g., time of day or illumination); and (3) Test 03 with a set of images clearly different from the gallery set. In the following, we briefly describe the datasets and the selection of the training, gallery and three query sequences. Training data is required to fine-tune the networks for the indoor and outdoor datasets by pairing positive and negative matches from the training and gallery sets.
4.1 Oxford Radar RobotCar
The Oxford Radar RobotCar dataset (Barnes et al.,
2020) is a radar extension to the Oxford RobotCar
dataset (Maddern et al., 2017). The dataset pro-
vides optimized ground-truth data using a Navtech
CTS350-X Millimetre-Wave FMCW radar. The data
acquisition was performed in January 2019 over 32
traversals in central Oxford with a total route of
280 km of urban driving.
This dataset addresses a wide variety of challenging conditions, including weather, traffic and lighting alterations (Figure 1). The combination of one Point Grey Bumblebee XB3 trinocular stereo camera and three Point Grey Grasshopper2 monocular cameras provides 360-degree visual coverage of the scene around the vehicle platform. The Bumblebee XB3 is a 3-sensor multi-baseline IEEE-1394b stereo camera designed for improved flexibility and accuracy. It features 1.3-megapixel sensors with a 66° horizontal FoV and 1280 × 960 image resolution logged at a maximum frame rate of 16 Hz.
The three monocular Grasshopper2 cameras with fish-eye lenses mounted on the back roof of the vehicle are synchronized and capture 1024 × 1024 images at an average frame rate of 11.1 Hz with a 180° horizontal FoV.

Figure 1: Examples from the Oxford Radar RobotCar outdoor dataset. Top: images from the same location in the three selected test sequences: a) Gallery: cloudy, b) Test 01: cloudy, c) Test 02: sunny, d) Test 03: rainy (Grasshopper2 left monocular camera). Bottom: 19 km route of the test sequences, e) satellite view, f) GNSS/INS.

To simplify our experiments, we selected
images from only one of the cameras: the left Point Grey Grasshopper2 monocular camera, despite the fact that using multiple cameras could potentially improve the results. The selected camera points toward the left side of the road and thus captures the stable urban environment, such as buildings, vehicles and traffic lights.
From the dataset, we selected sequences for a training set used for network fine-tuning, a gallery set against which the query images from the test sequences are matched, and three distinct test sets: (1) the same day but a different time, (2) a different day but approximately the same time, and (3) a different day and a different time with different weather conditions. Table 1 summarizes the sequences used for training, gallery and testing.
Table 1: The Oxford Radar RobotCar outdoor sequences
used in our experiments.
Sequence Size Date Start [GMT] Condition
Train 37,724 Jan. 10 2019 11:46 Sunny
Gallery 36,660 Jan. 10 2019 12:32 Cloudy
Test 01 29,406 Jan. 10 2019 14:50 Cloudy
Test 02 32,625 Jan. 11 2019 12:26 Sunny
Test 03 28,633 Jan. 16 2019 14:15 Rainy
4.2 COLD
The CoSy Localization Database (COLD) (Pronobis and Caputo, 2009) comprises annotated data sequences acquired using visual and laser range sensors on a mobile platform. The dataset provides a large-scale, flexible testing environment for evaluating mainly vision-based topological localization and semantic knowledge extraction methods aimed at mobile robots in realistic indoor scenarios.
It consists of several video sequences collected in three different indoor laboratory environments lo-
cated in three different European cities: the Visual Cognitive Systems Laboratory at the University of Ljubljana, Slovenia; the Autonomous Intelligent Systems Laboratory at the University of Freiburg, Germany; and the Language Technology Laboratory at the German Research Center for Artificial Intelligence in Saarbrücken, Germany.
The COLD data acquisition was performed using three different mobile robotic platforms (an ActivMedia PeopleBot, an ActivMedia Pioneer-3 and an iRobot ATRV-Mini) with two Videre Design MDCS2 digital cameras to obtain perspective and omnidirectional views. Each frame is registered with the associated absolute position recovered using laser and odometry data and annotated with a label representing the corresponding place.
The data was collected along a path visiting several rooms and office environments under different illumination conditions, including cloudy, night and sunny.
For our experiments, we selected the extended (long) path on Map B of the Saarbrücken laboratory. The training sequence is Sunny-seq3, the gallery sequence is Cloudy-seq1, and the three test sequences are (1) Sunny-seq1, (2) Cloudy-seq2 and (3) Night-seq3. See Figure 2 for examples. We used the images acquired with the monocular center camera of this setup. Table 2 summarizes the sequences used for training, gallery and testing.
Table 2: The COLD indoor sequences used in our experi-
ments.
Sequence Size Date Start [GMT] Condition
Train 1036 July 7 2006 14:59 Sunny
Gallery 1371 July 7 2006 17:05 Cloudy
Test 01 1104 July 7 2006 14:28 Sunny
Test 02 1021 July 7 2006 18:59 Cloudy
Test 03 970 July 7 2006 20:34 Night
5 EXPERIMENTS
We organize our experiments to address the following research questions: (1) how accurate localization can be achieved using image retrieval methods? (2) which of the two selected deep metric learning methods performs best (NetVLAD or Radenović et al. in Section 3)? and (3) how much dataset-specific training data is required?
Performance Metric. Similar to (Arandjelović et al., 2018), we measure place recognition performance as the fraction of correctly matched queries. Following (Chen et al., 2011), we denote the fraction of queries for which the top-N shortlist contains a correctly recognized candidate as recall@N. Recall@N depends on the available ground-truth annotations and on the distance thresholds chosen for the indoor and outdoor datasets. To evaluate the performance of the methods described in Section 3, we report only the top-1 matches, i.e., recall@1, for multiple thresholds τ.
The methods in Section 3 are used to compute a feature vector representation f(q) for a given query image. After obtaining the image representation, a similarity score that indicates how likely two images belong to the same location is needed to measure the performance. To this end, the query feature vector is matched against all gallery image representations f(G_i), i = 1, 2, ..., M, using the Euclidean distance d_{q,G_i} = ||f(q) − f(G_i)||_2, and the gallery image with the smallest distance is selected as the top-1 match. If the position of the top-1 match is within the given distance threshold τ of the query position, it is counted as a true positive; otherwise, it is counted as a false positive. We then compute the recall as the ratio of true positives to the total number of query images.
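The evaluation protocol above can be summarized with the following minimal PyTorch sketch; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def recall_at_1(query_feats, query_pos, gallery_feats, gallery_pos, tau):
    """Sketch of the recall@1 metric described above.

    query_feats   : (Nq, K) query descriptors f(q)
    gallery_feats : (M, K)  gallery descriptors f(G_i)
    query_pos     : (Nq, 2) ground-truth query positions in metres
    gallery_pos   : (M, 2)  ground-truth gallery positions in metres
    tau           : localization threshold in metres (e.g. 5 m outdoor, 0.5 m indoor)
    """
    d = torch.cdist(query_feats, gallery_feats)             # Euclidean descriptor distances (Nq, M)
    best = d.argmin(dim=1)                                  # top-1 gallery match per query
    err = torch.norm(gallery_pos[best] - query_pos, dim=1)  # positional error of the top-1 match
    return (err <= tau).float().mean().item()               # fraction of true positives
```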
To demonstrate the generality of our observations, we report results for two backbones, ResNet50 and VGG16. ResNet50 is the deeper architecture, containing more convolutional, pooling and fully connected layers, with 50 weight layers, over 25 million parameters and 3.8 billion FLOPs (He et al., 2016). The VGG16 backbone, with 16 weight layers, contains nearly 138 million parameters and 15.3 billion FLOPs (Simonyan and Zisserman, 2015).
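For reference, both backbones can be cropped at their last convolutional block with torchvision, which is how the 512- and 2048-dimensional feature maps used above arise; this is a hedged sketch, and the weight identifiers are assumptions that depend on the installed torchvision version.

```python
import torch.nn as nn
import torchvision.models as models

# Backbones cropped at the last convolutional layer, as used by both methods above.
# Older torchvision releases use `pretrained=True` instead of the `weights` argument.
vgg16_head = models.vgg16(weights="IMAGENET1K_V1").features     # outputs (B, 512, h, w)

resnet = models.resnet50(weights="IMAGENET1K_V1")
resnet50_head = nn.Sequential(*list(resnet.children())[:-2])    # outputs (B, 2048, h, w)
```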
Indoor Place Recognition. The results are presented in Table 3. From the indoor experiments, we obtain the following findings: (1) the Radenović et al. method systematically obtains better accuracy than NetVLAD; (2) the testing performance of the ResNet50 backbone is slightly better than VGG16, with a 4% improvement in average recall@1 on Test 01, 2% on Test 02, and 6.0% on Test 03; however, Radenović et al. with the VGG16 backbone also performs relatively well considering its lower computational expense; and (3) the indoor precision drops rapidly when the threshold falls below τ = 50.0 cm.
This indicates that a localization accuracy of ±50.0 cm can be achieved with an 80% recall@1 rate even in conditions where the query dataset is substantially different from the gallery dataset, for instance, in day vs. night samples.
Outdoor Place Recognition. The results are summarized in Table 4. Our findings from the outdoor experiments on the Oxford Radar RobotCar dataset are similar to those of the indoor COLD database: the Radenović et al. method outperforms NetVLAD.
Figure 2: Examples from the COLD indoor database (Pronobis and Caputo, 2009). Images of the same location in different sequences: a) Gallery: Cloudy-seq1, b) Test 01: Sunny-seq1, c) Test 02: Cloudy-seq2, d) Test 03: Night-seq3, e) map view of the lab (blue dashes: standard path consisting of rooms found in most typical office environments, red dashes: extended path containing rooms specific to this environment or its part, arrows: direction of driving the robot), and f) robot path of approximately 50 m.
ResNet50 provides better accuracy than the VGG16
backbone.
The higher recall@1 rates in Test 01 with both methods, compared to Test 02 and Test 03, are due to its high similarity to the gallery dataset (a different time during the same day). However, the Radenović et al. method performs relatively better at the small threshold, τ = 2 m, in Test 03, which indicates its robustness even in extreme conditions, including night and rain.
The testing performance of the ResNet50 backbone achieves a 0.2% improvement in average recall@1 on Test 01, 4.6% on Test 02, and 2.3% on Test 03 compared to the VGG16 architecture in the corresponding tests. This, however, comes with higher computation for the ResNet50 backbone.
The results in Table 4 also demonstrate that a localization accuracy of ±5.0 m can be achieved with a recall rate of 80% or greater for different illumination and weather conditions in an urban environment. This is reasonable given the more versatile features of the outdoor environments, compared to the indoor environments.
The results of both Table 3 and Table 4 demonstrate that the Radenović et al. method outperforms NetVLAD on both the indoor and outdoor datasets. The main explanation for the better performance is the selection procedure of the positive and negative training image pairs that, together with the queries, form the training tuples.
Table 3: Indoor place recognition results for the COLD Saarbrücken sequences, given various distance thresholds τ.
COLD recall@1
Method BB τ = 100 cm τ = 75 cm τ = 50 cm τ = 25 cm
Test 01 (sunny)
Radenović (Radenović et al., 2019) VGG16 93.03 91.49 78.26 43.12
ResNet50 97.19 95.74 82.43 43.39
NetVLAD (Arandjelović et al., 2018) VGG16 92.30 89.58 76.36 41.85
ResNet50 91.94 91.76 78.35 44.75
Test 02 (cloudy)
Radenović (Radenović et al., 2019) VGG16 94.32 92.75 84.62 46.13
ResNet50 95.20 91.67 80.12 46.62
NetVLAD (Arandjelović et al., 2018) VGG16 90.70 87.37 78.45 46.33
ResNet50 93.44 89.72 79.63 45.05
Test 03 (night)
Radenović (Radenović et al., 2019) VGG16 82.99 82.06 75.36 44.64
ResNet50 91.13 88.97 80.72 47.32
NetVLAD (Arandjelović et al., 2018) VGG16 81.03 78.66 70.52 45.05
ResNet50 82.06 79.90 71.34 44.43
In the NetVLAD method, the image with the lowest descriptor distance to the query is chosen as the positive pair. With this naive approach the network cannot learn sufficiently from the positive samples, since only GPS coordinates are given and the camera orientation is not available. In the Radenović et al. method, on the contrary, the positive samples are chosen at random from a pool of images with similar camera positions. This ensures selecting harder matching examples while increasing the variability of viewpoints. Negative samples are selected from clusters different from the cluster of the query image, as the clusters are non-overlapping. Non-matching images with similar descriptors are selected as hard negatives.
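A hedged sketch of this mining strategy is shown below; the 25 m positive radius, the number of hard negatives and the use of a single random positive per query are illustrative assumptions rather than the authors' exact settings.

```python
import torch

def mine_training_tuple(q_idx, feats, positions, cluster_ids, pos_radius=25.0, num_negs=5):
    """Sketch of positive/negative mining as described above.

    feats       : (N, K) current descriptors of the training images
    positions   : (N, 2) camera positions in metres
    cluster_ids : (N,)   cluster label of each image (non-overlapping clusters)
    pos_radius  : radius in metres defining the positive pool (illustrative value)
    """
    # random positive from the pool of images with a similar camera position
    d_pos = torch.norm(positions - positions[q_idx], dim=1)
    pool = torch.nonzero((d_pos <= pos_radius) & (d_pos > 0)).squeeze(1)  # assumes a non-empty pool
    positive = pool[torch.randint(len(pool), (1,))].item()

    # hard negatives: non-matching images (other clusters) with the most similar descriptors
    d_feat = torch.norm(feats - feats[q_idx], dim=1)
    candidates = torch.nonzero(cluster_ids != cluster_ids[q_idx]).squeeze(1)
    order = torch.argsort(d_feat[candidates])
    negatives = candidates[order[:num_negs]].tolist()
    return positive, negatives
```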
Amount of Training Data. Finally, in this experiment we investigate to what extent dataset-specific fine-tuning can further improve the visual place recognition performance. Based on the results in Table 3 and Table 4, we fix the indoor distance threshold to τ = 50 cm and the outdoor threshold to τ = 5 m. We evaluate the most challenging test sequence, Test 03, for both the indoor and outdoor datasets.
In our experiments, we compared the ResNet50 and VGG16 backbones in terms of training time using an NVIDIA V100 Tensor Core GPU with 32 GB of memory for the outdoor and indoor datasets. Our findings indicate that VGG16 is less computationally expensive and less prone to overfitting. The fine-tuning of the CNN with the ResNet50 backbone took approximately 9 hours and 45 minutes for the Oxford Radar RobotCar dataset (37k training images of 1024 × 1024 pixels) and roughly 85 minutes for the COLD dataset (1036 training images of 640 × 480 pixels), for 50 epochs and a mini-batch of 5 images. VGG16 was slightly lighter, taking approximately 8 hours and 25 minutes for Oxford Radar RobotCar and 73 minutes for the COLD dataset.
Given the similar performance of the two backbones, we used the VGG16 backbone. We then repeated the experiment with different amounts of training data to study how much, and to what extent, fine-tuning improves the performance. The results are presented in Table 5.
Table 5 shows that the Radenović et al. method clearly outperforms NetVLAD with the VGG16 backbone on both the indoor and outdoor datasets, regardless of how much training data is used for fine-tuning on the new datasets. Interestingly, there is a substantial improvement from no fine-tuning data at all to the full training dataset (37k images) for the outdoor Oxford Radar RobotCar dataset. However, there is no significant difference for the indoor COLD dataset. This can partly be due to the fact that there are not enough training samples in the indoor dataset (approximately 1000 samples per sequence).
According to Table 5, there is also a significant improvement of the results (approximately 50%) for the outdoor Oxford Radar RobotCar dataset when the Radenović et al. method is used, considering all the versatile characteristics of an urban environment. The trend is similar for the indoor COLD database, although there the Radenović et al. method shows an improvement of only approximately 10%, smaller than for the outdoor dataset.
Table 4: Outdoor place recognition results for the Oxford Radar RobotCar dataset, given various distance thresholds τ.
Oxford Radar RobotCar recall@1
Method BB τ = 25 m τ = 10 m τ = 5 m τ = 2 m
Test 01 (later time of day)
Radenović (Radenović et al., 2019) VGG16 98.39 97.42 95.82 59.86
ResNet50 98.01 97.23 95.96 59.94
NetVLAD (Arandjelović et al., 2018) VGG16 80.14 76.81 72.29 44.25
ResNet50 92.67 89.60 84.24 52.14
Test 02 (diff. day, same time)
Radenović (Radenović et al., 2019) VGG16 91.37 89.37 82.16 42.02
ResNet50 95.11 93.43 86.91 48.09
NetVLAD (Arandjelović et al., 2018) VGG16 36.11 29.32 22.40 9.65
ResNet50 70.54 63.35 52.42 23.83
Test 03 (diff. day and time)
Radenović (Radenović et al., 2019) VGG16 89.64 86.63 82.83 62.42
ResNet50 92.00 89.00 84.62 65.08
NetVLAD (Arandjelović et al., 2018) VGG16 33.58 28.05 23.12 13.36
ResNet50 49.68 44.46 38.07 22.23
Furthermore, we evaluated the pre-trained models of both the Radenović et al. and NetVLAD methods to investigate how much fine-tuning with our custom indoor and outdoor datasets could improve the visual place recognition performance. For both datasets, there is a significant improvement of the results. For instance, we found that using nearly 20-50% of the queries, randomly drawn per training epoch, in the indoor COLD database could enhance the performance up to approximately 80% during fine-tuning.
6 CONCLUSIONS
In this paper, we evaluated the performance of two state-of-the-art deep metric learning methods for the problem of visual place recognition. We used both indoor and outdoor datasets with diverse and large long-term variations, including time, illumination and weather, to investigate the performance of these methods. Our evaluation indicates that fine-tuning the Radenović et al. (Radenović et al., 2019) method with visual place recognition datasets achieves a recall rate of 80% or greater at localization accuracies of 50 cm and 5 m for the indoor and outdoor datasets, respectively.
As an alternative, we compared the results of the NetVLAD (Arandjelović et al., 2018) method, trained and fine-tuned on our custom indoor and outdoor datasets. Compared to the Radenović et al. (Radenović et al., 2019) method, it showed less robust performance under the challenging illumination and weather conditions of both the indoor and outdoor datasets. Our findings on the two state-of-the-art deep learning architectures confirm that ResNet50 performs slightly better than the VGG16 backbone, at a larger computational expense. Therefore, we adopt the VGG16 backbone since it is more computationally affordable.
Based on our findings from multiple experiments on both the indoor and outdoor datasets, the deep architecture by Radenović et al. (Radenović et al., 2019) outperforms the NetVLAD architecture by Arandjelović et al. (Arandjelović et al., 2018) with a clear margin. The reason for the better performance of the Radenović et al. method lies in the selection of the positive and negative training image pairs that, together with the queries, form the training tuples. The NetVLAD method suffers from insufficient learning of the positive and negative pairs, which are selected based on the lowest and highest descriptor distance to the query, respectively. Furthermore, our comprehensive study confirms that both deep learning architectures obtain their best results with the ResNet50 backbone and by fine-tuning the architecture with dataset-specific training data.
One possible direction for future work is indoor data acquisition on a larger scale with more challenging illumination conditions, suitable for the problem of visual place recognition, with more precise ground-truth and complementary sensors, e.g., LiDAR.
Table 5: Results for different numbers of queries randomly drawn per training epoch using the most challenging test sequence, Test 03, in both the indoor and outdoor datasets. The gallery and test sets contain all images. "0" indicates that the dataset-specific fine-tuning is skipped.
Oxford Radar RobotCar (τ = 5 m), recall@1
Method BB 37k (all) 10k 5k 2k 0
Test 03 (diff. day and time)
Radenović (Radenović et al., 2019) VGG16 95.82 86.10 87.53 88.91 72.62
NetVLAD (Arandjelović et al., 2018) VGG16 80.14 13.82 20.48 21.83 23.35
COLD (τ = 50 cm), recall@1
Method BB 1k (all) 500 250 100 0
Test 03 (diff. day and time)
Radenović (Radenović et al., 2019) VGG16 77.73 78.35 80.10 77.53 77.22
NetVLAD (Arandjelović et al., 2018) VGG16 66.80 70.52 71.86 71.13 71.55
It would be useful to (1) investigate to what extent and how the number of queries randomly drawn per training epoch influences the recall@1 rate within a certain localization accuracy and
(2) whether or not sensor fusion of RGB and LiDAR
could further improve the performance results in the
indoor environment.
REFERENCES
Arandjelović, R., Gronat, P., Torii, A., Pajdla, T., and Sivic, J. (2018). NetVLAD: CNN architecture for weakly supervised place recognition. TPAMI.
Azizpour, H., Razavian, A. S., Sullivan, J., Maki, A., and
Carlsson, S. (2015). From generic to specific deep
representations for visual recognition. In 2015 IEEE
Conference on Computer Vision and Pattern Recogni-
tion Workshops (CVPRW), pages 36–45.
Badino, H., Huber, D., and Kanade, T. (2011). Visual topo-
metric localization. In 2011 IEEE Intelligent Vehicles
Symposium (IV), pages 794–799.
Barnes, D., Gadd, M., Murcutt, P., Newman, P., and Posner,
I. (2020). The oxford radar robotcar dataset: A radar
extension to the oxford robotcar dataset. In 2020 IEEE
International Conference on Robotics and Automation
(ICRA), pages 6433–6438.
Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008).
Speeded-up robust features (surf). Computer Vision
and Image Understanding, 110(3):346–359. Similar-
ity Matching in Computer Vision and Multimedia.
Blanco-Claraco, J.-L., Ángel Moreno-Dueñas, F., and González-Jiménez, J. (2014). The Málaga urban dataset: High-rate stereo and lidar in a realistic urban scenario. The International Journal of Robotics Research, 33(2):207–214.
Bonin-Font, F., Ortiz, A., and Oliver, G. (2008). Visual
navigation for mobile robots: A survey. J Intell Robot
Syst.
Chen, D. M., Baatz, G., Köser, K., Tsai, S. S., Vedantham, R., Pylvänäinen, T., Roimela, K., Chen, X., Bach, J., Pollefeys, M., Girod, B., and Grzeszczuk, R. (2011). City-scale landmark identification on mobile devices. In CVPR 2011, pages 737–744.
Chopra, S., Hadsell, R., and LeCun, Y. (2005). Learning a
similarity metric discriminatively with application to
face verification. In CVPR.
Cummins, M. and Newman, P. (2008). Fab-map: Proba-
bilistic localization and mapping in the space of ap-
pearance. The International Journal of Robotics Re-
search, 27(6):647–665.
Cummins, M. and Newman, P. (2011). Appearance-only
slam at large scale with fab-map 2.0. The Inter-
national Journal of Robotics Research, 30(9):1100–
1123.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-
dients for human detection. In 2005 IEEE Computer
Society Conference on Computer Vision and Pattern
Recognition (CVPR’05), volume 1, pages 886–893
vol. 1.
DeSouza, G. and Kak, A. (2002). Vision for mobile robot
navigation: A survey. TPAMI.
Gálvez-López, D. and Tardós, J. D. (2012). Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics, 28(5):1188–1197.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? the kitti vision benchmark
suite. In 2012 IEEE Conference on Computer Vision
and Pattern Recognition, pages 3354–3361.
Gordo, A., Almazán, J., Revaud, J., and Larlus, D. (2017). End-to-End Learning of Deep Visual Representations for Image Retrieval. Int. J. Comput. Vision, 124(2):237–254.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Jégou, H., Douze, M., Schmid, C., and Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3304–3311.
Kalantidis, Y., Mellina, C., and Osindero, S. (2016). Cross-
Dimensional Weighting for Aggregated Deep Convo-
lutional Features. In Computer Vision ECCV 2016
Workshops, pages 685–701. Springer, Cham, Switzer-
land.
Lowe, D. G. (2004). Distinctive Image Features from Scale-
Invariant Keypoints. Int. J. Comput. Vision, 60(2):91–
110.
Lowry, S., Sünderhauf, N., Newman, P., Leonard, J. J., Cox, D., Corke, P., and Milford, M. J. (2016). Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19.
Maddern, W., Pascoe, G., Linegar, C., and Newman, P.
(2017). 1 year, 1000 km: The oxford robotcar
dataset. The International Journal of Robotics Re-
search, 36(1):3–15.
Masone, C. and Caputo, B. (2021). A survey on deep visual
place recognition. IEEE Access, 9:19516–19547.
Milford, M. J. and Wyeth, G. F. (2012). Seqslam: Vi-
sual route-based navigation for sunny summer days
and stormy winter nights. In 2012 IEEE International
Conference on Robotics and Automation, pages 1643–
1649.
Olid, D., Fácil, J. M., and Civera, J. (2018). Single-view place recognition under seasonal changes. CoRR, abs/1808.06516.
Pandey, G., McBride, J. R., and Eustice, R. M. (2011). Ford
campus vision and lidar data set. The International
Journal of Robotics Research, 30(13):1543–1552.
Pion, N., Humenberger, M., Csurka, G., Cabon, Y., and Sat-
tler, T. (2020). Benchmarking image retrieval for vi-
sual localization. In Int. Conf. on 3D Vision (3DV).
Pitropov, M., Garcia, D. E., Rebello, J., Smart, M., Wang,
C., Czarnecki, K., and Waslander, S. (2021). Canadian
adverse driving conditions dataset. The International
Journal of Robotics Research, 40(4-5):681–690.
Pronobis, A. and Caputo, B. (2009). Cold: The cosy
localization database. The International Journal of
Robotics Research, 28(5):588–594.
Radenović, F., Tolias, G., and Chum, O. (2019). Fine-tuning CNN image retrieval with no human annotation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(7):1655–1668.
Radenovic, F., Iscen, A., Tolias, G., Avrithis, Y., and Chum,
O. (2018). Revisiting oxford and paris: Large-scale
image retrieval benchmarking. In 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recog-
nition, pages 5706–5715.
Radenović, F., Tolias, G., and Chum, O. (2016). CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV.
Radenović, F., Tolias, G., and Chum, O. (2018). Deep shape matching. In ECCV.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh,
S., Ma, S., Huang, Z., Karpathy, A., Khosla, A.,
Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015).
ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision (IJCV),
115(3):211–252.
Sarlin, P., Cadena, C., Siegwart, R., and Dymczyk, M.
(2019). From coarse to fine: Robust hierarchical lo-
calization at large scale. In CVPR.
Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand,
L., Stenborg, E., Safari, D., Okutomi, M., Pollefeys,
M., Sivic, J., Kahl, F., and Pajdla, T. (2018). Bench-
marking 6dof outdoor visual localization in changing
conditions. In 2018 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 8601–
8610.
Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand,
L., Stenborg, E., Safari, D., Okutomi, M., Pollefeys,
M., Sivic, J., Kahl, F., and Pajdla, T. (2020). Bench-
marking 6DOF outdoor visual localization in chang-
ing conditions. In Int. Conf. on 3D Vision (3DV).
Sattler, T., Weyand, T., Leibe, B., and Kobbelt, L. (2012).
Image retrieval for image-based localization revisited.
In Proceedings of the British Machine Vision Confer-
ence, pages 76.1–76.12. BMVA Press.
Simonyan, K. and Zisserman, A. (2015). Very deep con-
volutional networks for large-scale image recognition.
In International Conference on Learning Representa-
tions.
Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys,
M., Sivic, J., Pajdla, T., and Torii, A. (2021). Inloc: In-
door visual localization with dense matching and view
synthesis. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 43(4):1293–1307.
Tolias, G., Sicre, R., and Jégou, H. (2016a). Particular object retrieval with integral max-pooling of CNN activations. In International Conference on Learning Representations (ICLR), pages 1–12, San Juan, Puerto Rico.
Tolias, G., Sicre, R., and Jégou, H. (2016b). Particular object retrieval with integral max-pooling of CNN activations.
Williams, B., Klein, G., and Reid, I. (2011). Automatic
relocalization and loop closing for real-time monocu-
lar slam. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 33(9):1699–1712.
Xu, M., Sünderhauf, N., and Milford, M. (2002). Vision for mobile robot navigation: A survey. TPAMI.
Yandex, A. B. and Lempitsky, V. (2015). Aggregating local
deep features for image retrieval. In 2015 IEEE In-
ternational Conference on Computer Vision (ICCV),
pages 1269–1277.
Zhang, X., Wang, L., and Su, Y. (2020). Visual place recog-
nition: A survey from deep learning perspective. Pat-
tern Recognition, page 107760.