Global Point Cloud Descriptor for Place Recognition in Indoor
Environments
Jacek Komorowski, Grzegorz Kurzejamski, Monika Wysoczańska and Tomasz Trzcinski
Warsaw University of Technology, Warsaw, Poland
Keywords:
Place Recognition, 3D Point Cloud, RGB-D, Deep Metric Learning.
Abstract:
This paper presents an approach for learning a discriminative 3D point cloud descriptor from RGB-D
images for place recognition in indoor environments. Existing methods, such as PointNetVLAD, PCAN or
LPD-Net, are aimed at outdoor environments and operate on 3D point clouds from LiDAR. They are based
on the PointNet architecture and are designed to process only the scene geometry, without considering
appearance (the RGB component). In this paper we present a place recognition method based on a sparse
volumetric representation that processes scene appearance in addition to geometry. We also investigate
whether using two modalities, appearance (RGB data) and geometry (3D structure), improves the
discriminativity of the resultant global descriptor.
1 INTRODUCTION
Depth-aware sensors, such as time-of-flight cameras
or solid state lidars, are becoming more and more af-
fordable and popular. Self-driving cars are frequently
equipped with a LiDAR scanner, which produces a map
of the observed environment in the form of a sparse
3D point cloud. In indoor environments, inexpensive
time-of-flight cameras, such as the latest generation
of Azure Kinect, can generate a representation of an
observed scene in the form of an RGB point cloud.
Applying deep learning methods to solve 3D com-
puter vision problems based on point cloud represen-
tation is an area of active development. A number of
methods for classification (Qi et al., 2017a; Qi et al.,
2017b), object detection (Qi et al., 2017a; Wang and
Jia, 2019), semantic segmentation (Qi et al., 2017a;
Choy et al., 2019a) and local (Zeng et al., 2017; Choy
et al., 2019b) or global (Angelina Uy and Hee Lee,
2018; Liu et al., 2019) feature extraction from 3D
point clouds have recently been proposed.
We focus our attention on finding a discriminative,
low-dimensional 3D point cloud descriptor for place
recognition purposes. Such global descriptors are
computed for each processed point cloud and stored
in the database. Localization is performed by an efficient search for descriptors closest (in the Euclidean distance sense) to the query point cloud descriptor. This allows efficient retrieval of the most similar point clouds from the database and reasoning about the location of the query point cloud.
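The retrieval step can be illustrated with a minimal sketch, assuming hypothetical 256-dimensional descriptors and a brute-force Euclidean search (an approximate nearest-neighbour index could be substituted; none is prescribed by the paper):

```python
import numpy as np

def retrieve_nearest(query_desc: np.ndarray, db_descs: np.ndarray, k: int = 5):
    """Return indices of the k database descriptors closest to the query
    (Euclidean distance). query_desc: (D,), db_descs: (N, D)."""
    dists = np.linalg.norm(db_descs - query_desc[None, :], axis=1)
    return np.argsort(dists)[:k]

# Usage with random placeholder descriptors (256-D, as in the paper):
db = np.random.randn(10000, 256).astype(np.float32)
q = np.random.randn(256).astype(np.float32)
print(retrieve_nearest(q, db, k=5))
```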
In this paper we investigate whether using two modalities, appearance (RGB data) and geometry (3D structure), can improve the discriminativity of a global point cloud descriptor for place recognition purposes.
State-of-the-art place recognition methods based on
3D point clouds, such as PointNetVLAD (An-
gelina Uy and Hee Lee, 2018), PCAN (Zhang and
Xiao, 2019) or LPD-Net (Liu et al., 2019), oper-
ate on data acquired in an outdoor environment by
a car-mounted LiDAR. They compute a discrimina-
tive global descriptor from a raw 3D point cloud,
which is then used to find and retrieve the most simi-
lar point clouds from the database. These methods are
based on a single modality only: geometry. Focusing solely on geometry and neglecting the appearance (RGB) component is justified for place recognition in outdoor environments, where the appearance of the observed scene can vary drastically due to lighting and seasonal changes, whereas LiDAR-acquired geometry remains relatively constant throughout different times of the day, seasons and weather conditions. In indoor environments there is less variability of the appearance component, hence it is reasonable to use both modalities for the indoor place recognition task.
Data acquired using LiDAR in an outdoor environment has different characteristics than data gathered indoors using RGB-D cameras with a time-of-flight
sensor. The former creates a sparser point cloud. Sur-
faces, such as building facades, are relatively far from
the observer and mapped with less detail. The lat-
ter creates denser point clouds. Observed surfaces are
closer to the camera and captured in greater detail. Fine-grained structures of objects, like furniture, are mapped in detail. Both PointNetVLAD and PCAN use a PointNet (Qi et al., 2017a) backbone as the first stage of the processing pipeline. While the PointNet architecture proved to be successful in many applications, it was originally used to process point clouds representing single objects, not large and complex scenes. The drawback of the PointNet architecture is that, for the most part, each point is processed in isolation. Local features computed separately for each point are aggregated in the last few fully connected layers. As such, it is not well suited to capturing local geometric structures of the observed scene. Such structures are more prevalent in indoor scans acquired with time-of-flight RGB-D cameras than in outdoor, LiDAR-based scene scans.
An alternative is to use voxelized representation
which can be processed using 3D convolutions. Con-
volutions proved to be very successful in processing
2D visual information as they can effectively capture
local structures in the image. However, the naive voxelized representation based on a dense grid of voxels is very inefficient: most of the voxels are empty, and processing the entire grid is computationally very expensive. Recently, an interesting alternative emerged. So-called Minkowski convolutional neural networks (Choy et al., 2019a; Choy et al., 2019b) are based on a sparse voxelized representation and a very efficient implementation of sparse 3D and higher-dimensional convolutions. The sparse representation scales linearly with the number of 3D points, without the need to store and process a dense 3D voxel grid. The approach proved to be very successful, achieving state-of-the-art results in different 3D vision tasks, such as semantic segmentation (Choy et al., 2019a) and local feature extraction (Choy et al., 2019b). In this paper we compare the performance of a sparse voxelized point cloud representation processed with sparse 3D convolutions against an unordered point cloud representation processed with the PointNet architecture for the place recognition task.
In summary, the main contributions of this work are as follows. First, we examine whether using two modalities, geometry (3D structure) and appearance (RGB data), can improve place recognition precision in an indoor environment. Are there any advantages from fusing the two modalities, or does one dominate the other, leaving no gain from using both? Second, we experimentally verify whether a sparse voxelized representation is advantageous over the popular PointNet architecture, based on an unordered set of points representation, for place recognition purposes.
2 RELATED WORK
Point Cloud Representation for Deep Learning.
Early deep learning methods operating on 3D point
clouds use volumetrically discretized representa-
tions (Maturana and Scherer, 2015) in the form of a
dense grid of voxels. It is a natural extension of the 2D image representation as a grid of pixels, and 3D convolutions can be applied to process such data. However, such a representation is very inefficient: the memory requirements grow cubically as the spatial resolution increases, making it inappropriate for processing larger point clouds, and most of the voxels are empty, so processing the entire grid is computationally expensive.
(Su et al., 2015) proposed the so-called multi-view approach: multiple 2D images of a 3D model are first rendered by virtual cameras placed around the object of interest. Each virtually rendered image is processed by a 2D convolutional network. Feature maps produced by the 2D networks are concatenated and fed into the final classification network.
PointNet (Qi et al., 2017a) was the first deep learn-
ing method operating directly on a raw 3D point
cloud. An input is organized as an unordered set
of points, where each point is described by its X,
Y, Z coordinates and optional features, such as nor-
mal or RGB. Each point is processed separately by
multi-layer perceptrons and point features are aggre-
gated using a symmetric function, such as max pool-
ing. This makes the architecture invariant to the ordering of the input points. PointNet learns a set of functions that select interesting and informative key points from a subset of input points, encoding this information in each layer's feature vector. The drawback of the architecture is that most of the processing is done separately for each point, so it is not well suited to capture local geometric structures. The advantage is its efficiency, as there is no need to build a costly voxelized representation or render multiple virtual images.
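The core idea, a shared per-point MLP followed by a symmetric aggregation, can be illustrated with a minimal PyTorch sketch; the layer widths below are illustrative and do not reproduce the original PointNet:

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """Minimal PointNet-style encoder: a shared per-point MLP followed by
    order-invariant max pooling. Layer widths are illustrative."""
    def __init__(self, in_dim: int = 3, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, in_dim)
        per_point = self.mlp(points)           # shared MLP applied to every point
        global_feat, _ = per_point.max(dim=1)  # symmetric (order-invariant) pooling
        return global_feat                     # (batch, feat_dim)

# Usage: a batch of 2 clouds with 1024 XYZ points each.
x = torch.randn(2, 1024, 3)
print(MiniPointNet()(x).shape)  # torch.Size([2, 256])
```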
An alternative approach was recently proposed
in (Choy et al., 2019a). So called Minkowski con-
volutional neural networks are based on a sparse
voxelized representation and an efficient implemen-
tation of sparse 3D and higher dimensional convolu-
tions. This representation combines the advantages of both the voxelized and the 'unordered set of points' representations. As with the voxelized representation, 3D convolu-
tions can be used to capture local structures, similarly to 2D convolutions on 2D images, while sparsity allows a compact representation and efficient computation.
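As a rough illustration of how a point cloud enters such a network, the sketch below quantizes XYZ+RGB points into a sparse tensor, assuming a MinkowskiEngine 0.5-style API; the voxel size is an arbitrary example value, not taken from the paper:

```python
import numpy as np
import torch
import MinkowskiEngine as ME

# Hypothetical RGB point cloud: N points with XYZ coordinates and RGB features.
xyz = np.random.rand(5000, 3).astype(np.float32) * 5.0   # coordinates in metres
rgb = np.random.rand(5000, 3).astype(np.float32)          # colour features in [0, 1]

# Quantize to a sparse voxel grid (voxel size chosen arbitrarily for illustration).
coords, feats = ME.utils.sparse_quantize(xyz, features=rgb, quantization_size=0.05)

# Batch the coordinates (single-element batch here) and build a sparse tensor.
batched_coords = ME.utils.batched_coordinates([coords])
sparse_input = ME.SparseTensor(
    features=torch.as_tensor(feats),
    coordinates=batched_coords,
)
print(sparse_input.F.shape)  # (num_occupied_voxels, 3)
```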
Place Recognition using Learning-based Global
Features. PointNetVLAD (Angelina Uy and
Hee Lee, 2018) was the first deep network for
large-scale 3D point cloud retrieval. It combines
PointNet (Qi et al., 2017a) architecture to extract
local features and NetVLAD (Arandjelovic et al.,
2016) to aggregate local features and produce a
discriminative global descriptor.
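Since NetVLAD aggregation recurs in all of these methods, a compact sketch of the layer is given below. The soft-assignment formulation follows the original NetVLAD idea, but the dimensions are illustrative and the final reduction to a compact (e.g. 256-D) descriptor is omitted; this is not the exact PointNetVLAD code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Minimal NetVLAD aggregation layer for per-point local features."""
    def __init__(self, feat_dim: int = 256, num_clusters: int = 64):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))
        self.assign = nn.Linear(feat_dim, num_clusters)  # soft-assignment logits

    def forward(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (batch, num_points, feat_dim)
        soft_assign = F.softmax(self.assign(local_feats), dim=-1)  # (B, N, K)
        residuals = local_feats.unsqueeze(2) - self.centroids      # (B, N, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)  # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)             # intra-normalization
        vlad = F.normalize(vlad.flatten(1), dim=-1)  # final L2 normalization
        return vlad                                  # (B, K * D)

# Usage: aggregate 1024 local 256-D features per cloud.
feats = torch.randn(2, 1024, 256)
print(NetVLAD()(feats).shape)  # torch.Size([2, 16384])
```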
PCAN (Zhang and Xiao, 2019) enhances Point-
NetVLAD architecture by adding an attention mech-
anism to predict significance of each local point fea-
ture based on a local context. Local features are ex-
tracted using PointNet architecture. Then, features
fed to NetVLAD aggregation layer are weighted by
their significance. More attention is paid to the local-
ization task-relevant features, while non-informative
features are ignored.
To mitigate limitations of PointNet-based archi-
tecture in local feature extraction, LPD-Net (Liu et al.,
2019) relies on handcrafted features and uses graph
neural networks to extract local contextual informa-
tion. Ten handcrafted features, such as point den-
sity or local curvature, are computed for each point.
Then, 3D points enhanced with handcrafted features are processed using the PointNet architecture and fed to a graph neural network to aggregate neighbourhood features. Finally, the global descriptor is computed using a NetVLAD (Arandjelovic et al., 2016) layer.
A recent MinkLoc3D (Komorowski, 2020) net-
work has a fully convolutional architecture based on a
sparse voxelized representation. The local feature ex-
traction part of the network is modelled after Feature
Pyramid Network (Lin et al., 2017) design pattern.
A Generalized Mean (GeM) pooling layer is used to aggregate local features into a global descriptor. Despite its simplicity, the method achieves state-of-the-art results on outdoor place recognition benchmarks.
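For contrast with NetVLAD, a sketch of Generalized Mean pooling over a set of local features is shown below; this is a simple formulation with a learnable exponent, illustrative rather than the exact MinkLoc3D code:

```python
import torch
import torch.nn as nn

class GeM(nn.Module):
    """Generalized Mean pooling over a set of local feature vectors."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # learnable pooling exponent
        self.eps = eps

    def forward(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (batch, num_points, feat_dim), assumed non-negative (e.g. after ReLU)
        x = local_feats.clamp(min=self.eps).pow(self.p)
        return x.mean(dim=1).pow(1.0 / self.p)  # (batch, feat_dim)

# Usage:
print(GeM()(torch.rand(2, 1024, 256)).shape)  # torch.Size([2, 256])
```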
Deep Metric Learning. Distance metric learning
aims at learning a distance function to measure se-
mantic similarity between data points (Lu et al.,
2017). This approach is widely used in many recogni-
tion tasks in computer vision domain, such as pedes-
trian re-identification (Hermans et al., 2017) and im-
age retrieval (Lee et al., 2008). Deep metric learning
uses deep neural networks to compute a non-linear
mapping from a high dimensional data point space to
a low-dimensional Euclidean space, known as a repre-
sentation or embedding space. The learned mapping
preserves semantic similarity between objects. Em-
beddings of similar data points are closer to each other
in a representation space than embeddings of dis-
similar objects. Early deep metric learning methods use a Siamese architecture trained with a contrastive loss (Bromley et al., 1994). Later methods propose more complex loss functions, such as the triplet (Hermans et al., 2017) or quadruplet (Chen et al., 2017) loss. Significant attention is devoted to the selection of an effective sampling scheme for choosing informative training samples (so-called hard negative mining) (Wu et al., 2017). One of the most popular schemes is batch hard negative mining, proposed in (Hermans et al., 2017).
3 GLOBAL POINT CLOUD
FEATURE DESCRIPTOR FOR
PLACE RECOGNITION
In this section we describe our approach for compu-
tation of a discriminative, global 3D point cloud de-
scriptor based on two modalities: appearance (RGB
data) and geometry (3D structure). We use a deep
metric learning approach, illustrated in Fig. 2. The embedding network $f_w$, parametrized by a weight vector $w$, is trained to produce a discriminative, low-dimensional descriptor (embedding) of the input point cloud. The network is trained using a triplet loss (Hermans et al., 2017). The aim is to push embeddings of dissimilar point clouds (representing different places) further apart, by a predefined margin, than embeddings of similar point clouds (representing the same place).
We evaluate two architectures of an embedding
network, each using a different point cloud repre-
sentation. One is PointNetVLAD (Angelina Uy and
Hee Lee, 2018) method using an unordered set of
points representation. It consists of a PointNet-
based (Qi et al., 2017a) backbone followed by
NetVLAD (Arandjelovic et al., 2016) feature aggre-
gation layer. For details, please refer to (Angelina Uy
and Hee Lee, 2018). We modified the original architecture to accept input point clouds with optional RGB features in addition to XYZ coordinates. The network produces a 256-dimensional global descriptor.
The other approach is based on a sparse voxelized representation. Inspired by (Komorowski, 2020), we designed a 3D convolutional network, called MinkNetVLAD, shown in Fig. 1. It consists of a fully convolutional local feature extraction block followed
by a NetVLAD (Arandjelovic et al., 2016) feature aggregation block.
Figure 1: MinkNetVLAD network architecture. The input point cloud is quantized into a sparse 3D tensor. Local features are extracted using a sparse 3D convolutional network. A NetVLAD pooling layer pools the resultant 3D feature map and produces a global point cloud descriptor. Numbers in the local feature extraction module (1/32, 2/32, ...) show the stride and the number of channels of the sparse feature map produced by each block.
The local feature extraction network is modelled after the Feature Pyramid Network (Lin et al., 2017) architecture. The bottom-up part consists of 3D convolutions producing 3D feature maps with decreasing spatial resolution and increasing receptive field. Downsampling of the feature map is achieved using stride-2 convolutions. The top-down part consists of transposed convolutions. Lateral connections (convolutions with a 1x1x1 filter) are used to merge features produced by the bottom-up part of the network with the corresponding features from the top-down pass. The rationale for using this design, instead of a simple convolutional network, is to increase the receptive field of each feature map element, allowing it to capture the high-level semantics of the input point cloud. Architecture details are given in Tab. 1. Local features are aggregated using a NetVLAD layer, producing a compact 256-dimensional global descriptor. The network is implemented using the MinkowskiEngine (Choy et al., 2019a) auto-differentiation library for sparse tensors.
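A minimal sketch of one such bottom-up / top-down level with a lateral connection, written with MinkowskiEngine sparse operators and following the general pattern of MinkLoc3D-style sparse FPNs, is shown below. Channel counts and the overall wiring are simplified and do not reproduce the exact MinkNetVLAD blocks from Tab. 1.

```python
import torch.nn as nn
import MinkowskiEngine as ME

class SparseFPNLevel(nn.Module):
    """One FPN-style level on sparse tensors: stride-2 downsampling,
    a lateral 1x1x1 convolution and transposed-convolution upsampling.
    Channel sizes are illustrative."""
    def __init__(self, in_ch: int = 32, out_ch: int = 64):
        super().__init__()
        self.down = nn.Sequential(
            ME.MinkowskiConvolution(in_ch, out_ch, kernel_size=2, stride=2, dimension=3),
            ME.MinkowskiBatchNorm(out_ch),
            ME.MinkowskiReLU(),
        )
        self.lateral = ME.MinkowskiConvolution(in_ch, out_ch, kernel_size=1, stride=1, dimension=3)
        self.up = ME.MinkowskiConvolutionTranspose(out_ch, out_ch, kernel_size=2, stride=2, dimension=3)

    def forward(self, x: ME.SparseTensor) -> ME.SparseTensor:
        bottom_up = self.down(x)           # lower resolution, larger receptive field
        top_down = self.up(bottom_up)      # upsample back towards the input resolution
        return top_down + self.lateral(x)  # merge with the lateral connection
```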
To assess the impact of each modality on the discriminative power of the resulting global point cloud descriptor, we train the networks using three types of input: geometry (XYZ coordinates) and appearance (RGB component); geometry only; and appearance only. When training using geometry only, each point is assigned, instead of RGB values, a dummy one-dimensional feature set to 1. When training using appearance only, the depth of all 3D points is set to the same dummy value.
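A small sketch of how such inputs could be assembled is shown below; the paper only states the dummy-value conventions qualitatively, so the exact constants and the choice of the depth axis here are illustrative assumptions:

```python
import numpy as np

def build_input(xyz: np.ndarray, rgb: np.ndarray, modality: str):
    """Assemble (coordinates, features) for one of the three training modes.
    xyz: (N, 3) point coordinates, rgb: (N, 3) colours in [0, 1]."""
    if modality == "geometry+rgb":
        return xyz, rgb
    if modality == "geometry":
        # Dummy one-dimensional feature set to 1 for every point.
        return xyz, np.ones((xyz.shape[0], 1), dtype=np.float32)
    if modality == "rgb":
        # Collapse depth: set the depth axis to a constant dummy value
        # (assumed here to be the z coordinate).
        flat_xyz = xyz.copy()
        flat_xyz[:, 2] = 1.0
        return flat_xyz, rgb
    raise ValueError(f"unknown modality: {modality}")

coords, feats = build_input(np.random.rand(1000, 3).astype(np.float32),
                            np.random.rand(1000, 3).astype(np.float32),
                            "geometry")
print(coords.shape, feats.shape)
```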
Table 1: Details of the local feature extraction block in the MinkNetVLAD network. All convolutional layers are followed by BatchNorm and ReLU non-linearity (not listed in the table for brevity).

Block       Layers
Conv0       32 filters 5x5x5
Conv1       32 filters 2x2x2 stride 2; 32 filters 3x3x3 stride 1; 32 filters 3x3x3 stride 1
Conv2       64 filters 2x2x2 stride 2; 64 filters 3x3x3 stride 1; 64 filters 3x3x3 stride 1
Conv3       64 filters 2x2x2 stride 2; 64 filters 3x3x3 stride 1; 64 filters 3x3x3 stride 1
Conv4       128 filters 2x2x2 stride 2; 128 filters 3x3x3 stride 1; 128 filters 3x3x3 stride 1
Conv5       128 filters 2x2x2 stride 2; 128 filters 3x3x3 stride 1; 128 filters 3x3x3 stride 1
Lateral connections (1x1x1 convolutions)
1x1x1Conv   128 filters 1x1x1 stride 1 (three such blocks)
Transposed convolutions
TConv4      128 filters 2x2x2 stride 2
TConv5      128 filters 2x2x2 stride 2
Figure 2: Learning a global point cloud descriptor using a
deep metric learning technique with a triplet loss.
Dataset. We conduct our experiments using Scan-
Net (Dai et al., 2017) dataset. ScanNet is a large,
richly-annotated dataset with 3D reconstructions of
indoor scenes. It contains 2.5 million views (RGB-D
images) in more than 1500 locations, annotated with
3D camera poses and surface reconstructions.
We split the dataset into three separate parts: a training set, a validation set used to choose training hyperparameters, and a test set for the final performance evaluation. The training set contains 993 thousand point clouds reconstructed from RGB-D images taken at 616 distinct locations. The validation set contains 65 thousand point clouds from 45 locations. The test set contains 253 thousand point clouds from 176 locations.
Figure 3: Exemplary RGB-D images from one location in
ScanNet dataset. RGB images on the top and corresponding
depth maps at the bottom.
Fig. 3 shows exemplary RGB-D images from one location, with the RGB images on the top and the corresponding depth maps at the bottom. Point clouds are constructed from the RGB-D images in the dataset by backprojecting each pixel of the RGB image using the known camera intrinsics and depth. An example of a reconstructed point cloud is shown in Fig. 4. The point clouds are fed into an embedding network to compute a global descriptor.
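The backprojection of a depth image into an RGB point cloud can be sketched as follows, using a pinhole camera model; the intrinsics in the usage example are made up, and the actual ScanNet loading code is not given in the paper:

```python
import numpy as np

def backproject(depth: np.ndarray, rgb: np.ndarray, fx, fy, cx, cy):
    """Backproject a depth map (H, W), in metres, into an RGB point cloud
    using pinhole camera intrinsics. Returns (N, 3) XYZ and (N, 3) RGB."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                      # skip pixels with no depth reading
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    xyz = np.stack([x, y, z], axis=1)
    colors = rgb[valid] / 255.0
    return xyz, colors

# Usage with synthetic data and made-up intrinsics:
depth = np.random.uniform(0.5, 5.0, (480, 640)).astype(np.float32)
rgb = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
xyz, colors = backproject(depth, rgb, fx=577.0, fy=577.0, cx=320.0, cy=240.0)
print(xyz.shape, colors.shape)
```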
Deep distance metric learning methods, such as
methods based on triplet networks, require infor-
mation on semantically similar and dissimilar data
points. In our case, similar elements are point clouds
representing largely overlapping parts of the scene,
and dissimilar elements are point clouds represent-
ing different places. Such information is not available in the ScanNet dataset and needs to be computed to prepare a sufficiently large training dataset. Using solely the camera pose ground truth to assess the visible scene overlap of two RGB-D images is problematic: spa-
tially distant cameras may show the same place from
different angles, and the corresponding point clouds should be considered similar. To solve this problem we developed an efficient method to compute the view overlap between two point clouds. It is based on calculating the percentage of points co-visible in the RGB-D images used to construct the point clouds.
Figure 4: An exemplary point cloud generated from one RGB-D image.
To find pairs of similar and dissimilar point clouds
for network training, we randomly sample 500 RGB-
D images from each location. For each sampled im-
age, we compute its view overlap with a different set
of 500 images sampled from the same location. This
generates view overlap information for 500 · 500 =
250 thousand pairs in each location. To construct the
validation and test set, we sample 100 RGB-D im-
ages from each scene and compute view overlap be-
tween each pair of sampled images. This produces
view overlap data for 100 · 100/2 = 5 thousand pairs.
View overlap between two RGB-D images is calcu-
lated as a percentage of points co-visible on both im-
ages. This is done as follows. We sample a set of 500 random points in the first image. Using the depth data, camera intrinsics and relative pose between the two images (given in the dataset ground truth), we re-project a point $p$ in the first image onto the second image, obtaining a point $p'$. If the point falls outside the second image area, it is not co-visible in both images. Otherwise, we re-project the point $p'$ back onto the first image, obtaining a point $p''$. If the Euclidean distance between the original point $p$ and its re-projection $p''$ is below a given threshold (4 pixels in our implementation), we consider the point $p$ co-visible in both images. To make our view overlap measure symmetric, we compute it in two directions: first by projecting points from the first image onto the second and back to the first, and then by projecting points from the second image onto the first and back to the second. The final overlap measure between two RGB-D images is taken as the minimum of these two results.
Such an overlap measure can be efficiently computed using a vectorized implementation operating on an array
of points sampled in one image. Results of the view overlap calculations can be seen in Fig. 6.
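A vectorized sketch of the one-directional co-visibility measure is given below. The pinhole model, shared intrinsics for both views, and the rounded nearest-pixel depth lookup in the second image are simplifying assumptions; pixels with missing depth simply fail the check.

```python
import numpy as np

def covisibility_ratio(uv1, depth1_vals, depth2, K, T_1to2, px_thresh=4.0):
    """One-directional co-visibility: re-project sampled pixels of image 1 into
    image 2 (using image 1 depth), look up depth in image 2 at the re-projected
    location, project back to image 1 and measure the pixel error.
    uv1: (N, 2) sampled pixel coords in image 1, depth1_vals: (N,) their depths,
    depth2: (H, W) depth map of image 2, K: 3x3 intrinsics shared by both views,
    T_1to2: 4x4 relative pose from camera 1 to camera 2."""
    K_inv = np.linalg.inv(K)
    T_2to1 = np.linalg.inv(T_1to2)
    h, w = depth2.shape

    def unproject(uv, z):
        homo = np.hstack([uv, np.ones((len(uv), 1))])
        return (K_inv @ homo.T).T * z[:, None]

    def transform(pts, T):
        return (T[:3, :3] @ pts.T).T + T[:3, 3]

    def project(pts):
        uvw = (K @ pts.T).T
        return uvw[:, :2] / uvw[:, 2:3]

    # Image 1 -> 3D -> image 2.
    uv2 = project(transform(unproject(uv1, depth1_vals), T_1to2))
    u2, v2 = np.round(uv2[:, 0]).astype(int), np.round(uv2[:, 1]).astype(int)
    inside = (u2 >= 0) & (u2 < w) & (v2 >= 0) & (v2 < h)

    # Image 2 -> 3D (using image 2 depth) -> back to image 1.
    err = np.full(len(uv1), np.inf)
    z2 = depth2[v2[inside], u2[inside]]
    uv1_back = project(transform(unproject(uv2[inside], z2), T_2to1))
    err[inside] = np.linalg.norm(uv1_back - uv1[inside], axis=1)
    return float(np.mean(err < px_thresh))
```

The symmetric overlap described above would then be the minimum of this ratio computed in both directions (image 1 to image 2 and image 2 to image 1).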
For network training and evaluation purposes we
consider two point clouds similar if they are con-
structed from two RGB-D images taken at the same
location and having a view overlap above the thresh-
old (30% in our implementation). Otherwise, two
point clouds are considered dissimilar.
Network Training. The embedding network is
trained using a stochastic gradient descent approach
with a triplet loss (Hermans et al., 2017). Mini-
batches contain triplets consisting of an anchor, a pos-
itive and a negative element. A positive element is a
point cloud similar to the anchor cloud, with the view
overlap above the threshold (30% in our implemen-
tation). A negative element is a point cloud showing
a different place than an anchor. A randomly cho-
sen negative element would often depict a scene that
is very different, both in appearance and geometry,
from an anchor element. In the presence of such easy
cases, the network will quickly learn how to produce
sufficiently different embeddings and the training will
stagnate. To improve the effectiveness of the training process we use the batch hard negative mining scheme to construct triplets, as proposed in (Hermans et al.,
2017). Each triplet is constructed using the hardest
negative example found within a batch. The hardest
negative example for each anchor is a dissimilar point
cloud that has the closest embedding to the anchor
embedding, computed using current network weights.
We use a popular triplet loss formulation, as defined in (Hermans et al., 2017):

L(a_i, p_i, n_i) = \max\{\, d(a_i, p_i) - d(a_i, n_i) + \mathrm{margin},\; 0 \,\},

where $d(x, y) = \lVert x - y \rVert_2$ is the Euclidean distance between embeddings $x$ and $y$; $a_i$, $p_i$ and $n_i$ are the embeddings of the anchor, positive and negative point cloud in the $i$-th triplet; and the margin is set to 0.4. The loss function is minimized using a stochastic gradient descent approach with the Adam optimizer. We train the network for 16 epochs with an initial learning rate of 0.001, which is decreased to 0.0001 after eight epochs.
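A minimal sketch of the batch-hard triplet loss computation is given below. It follows the general formulation of Hermans et al. (mining both the hardest positive and the hardest negative per anchor) and uses a simplified notion of similarity based on place labels rather than the view overlap threshold used in the paper; the batching details are illustrative, not the exact training code.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings: torch.Tensor,
                            place_ids: torch.Tensor,
                            margin: float = 0.4) -> torch.Tensor:
    """Triplet loss with batch-hard mining.
    embeddings: (B, D) descriptors, place_ids: (B,) location labels.
    For every anchor, uses its farthest positive and closest negative in the batch."""
    dist = torch.cdist(embeddings, embeddings)              # (B, B) Euclidean distances
    same_place = place_ids[:, None] == place_ids[None, :]   # (B, B) similarity mask

    pos_mask = same_place & ~torch.eye(len(place_ids), dtype=torch.bool)
    neg_mask = ~same_place

    hardest_pos = (dist * pos_mask.float()).max(dim=1).values   # farthest positive
    # For negatives, mask out positives with +inf before taking the minimum.
    hardest_neg = dist.masked_fill(~neg_mask, float("inf")).min(dim=1).values

    return F.relu(hardest_pos - hardest_neg + margin).mean()

# Usage with random embeddings and labels:
emb = torch.randn(8, 256)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(batch_hard_triplet_loss(emb, labels))
```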
To increase variability of the training data and de-
crease overfitting, we apply on-the-fly data augmen-
tation, which includes photometric distortions and random rotation, translation and resizing of the point cloud. Additionally, we adapted the random erasing augmentation (Zhong et al., 2017) to operate on 3D point clouds.
A fronto-parallel cuboid with a random size and posi-
tion is randomly generated, and all points lying within
the cuboid are removed.
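A sketch of this 3D adaptation of random erasing is shown below; for simplicity the cuboid is axis-aligned rather than strictly fronto-parallel, and the size range is an illustrative assumption:

```python
import numpy as np

def random_cuboid_erasing(points: np.ndarray, max_extent: float = 0.5) -> np.ndarray:
    """Remove all points falling inside a randomly placed axis-aligned cuboid.
    points: (N, 3) XYZ coordinates; max_extent limits the cuboid half-size."""
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    center = np.random.uniform(lo, hi)
    half_size = np.random.uniform(0.0, max_extent, size=3)
    inside = np.all(np.abs(points - center) <= half_size, axis=1)
    return points[~inside]

# Usage:
cloud = np.random.rand(10000, 3).astype(np.float32) * 4.0
print(cloud.shape, "->", random_cuboid_erasing(cloud).shape)
```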
4 EXPERIMENTAL RESULTS
This section describes the experimental evaluation of global point cloud descriptor performance for place recognition in an indoor environment. Evaluation is done on a subset of the ScanNet dataset containing 253 thousand point clouds gathered at 176 locations that are different from the locations used for training. The evaluation is done using the follow-
ing procedure. First, the test set is split randomly
into a query set, containing 10% of the elements, and a database containing the remaining 90%. Then, global descriptors of all point clouds are computed using the trained embedding network. Finally, for each point cloud in the query set, we search for the k = 20 most similar point clouds in the database. This is done by finding the point clouds in the database whose descriptors are closest, in the Euclidean distance sense, to the descriptor of the query point cloud. If the view overlap between the query point cloud and a retrieved point cloud, calculated using the procedure detailed in the previous section, is above the threshold (set to 30%), we declare a match (true positive). Otherwise, we declare a false positive. We use
Precision@k as the evaluation metric, averaged over
all query set elements. Precision@k is defined as the
percentage of correctly retrieved elements (true pos-
itives) within the first k elements. Fig. 6 visualizes point cloud retrieval results using descriptors calculated with the MinkNetVLAD network trained with the geometry and appearance modalities.
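The evaluation metric can be summarized in a short sketch; the overlap-based ground-truth check is abstracted into a hypothetical is_match function, and the data in the usage example is random:

```python
import numpy as np

def precision_at_k(query_descs, db_descs, is_match, k: int = 20) -> float:
    """Average Precision@k over all queries. is_match(q_idx, db_idx) should
    return True when the view overlap between the two clouds exceeds the
    threshold (30% in the paper)."""
    precisions = []
    for qi, q in enumerate(query_descs):
        dists = np.linalg.norm(db_descs - q[None, :], axis=1)
        top_k = np.argsort(dists)[:k]
        hits = sum(is_match(qi, di) for di in top_k)
        precisions.append(hits / k)
    return float(np.mean(precisions))

# Usage with random data and a dummy matching oracle:
queries = np.random.randn(10, 256).astype(np.float32)
database = np.random.randn(1000, 256).astype(np.float32)
print(precision_at_k(queries, database, is_match=lambda qi, di: di % 7 == 0))
```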
Fig. 5 shows performance of PointNetVLAD (An-
gelina Uy and Hee Lee, 2018) and MinkNetVLAD
network architectures trained using three different
modalities: both geometry (XYZ) and appearance
(RGB); only appearance; and only geometry. Numer-
ical results are shown in Tab. 2.
Table 2: Evaluation of the MinkNetVLAD and PointNetVLAD network architectures with different modalities on the point cloud retrieval task.

Base network    Modality     Precision@1   Precision@10
PointNetVLAD    geometry     0.855         0.383
PointNetVLAD    RGB          0.979         0.666
PointNetVLAD    RGB+geom.    0.986         0.681
MinkNetVLAD     geometry     0.939         0.542
MinkNetVLAD     RGB          0.976         0.662
MinkNetVLAD     RGB+geom.    0.992         0.670
When using only the geometry modality, MinkNetVLAD outperforms PointNetVLAD by a large margin: the former achieves 0.939 (0.542) and the latter 0.855 (0.383) precision@1 (precision@10).
Figure 5: Point cloud retrieval results using the PointNetVLAD and MinkNetVLAD architectures and different modalities.
As mentioned earlier, the PointNet-based architecture is not well suited to capture local geometric structures, which adversely affects the quality of the resultant descriptor. The MinkNetVLAD network can extract more discriminative features using 3D convolutions and a sparse voxelized representation. When using the RGB modality, both architectures show similar
performance: PointNetVLAD achieves 0.979 (0.666)
precision@1 (precision@10) and MinkNetVLAD
0.976 (0.662). The results using solely the scene appearance (RGB modality) are significantly better than those using geometry alone. This can be understood, as the appearance of scenes in our evaluation dataset exhibits limited variability: image acquisition conditions are not affected by environmental factors, lighting is usually constant, and the only differences are due to view-
point change. Fusing two modalities, appearance and
geometry, improves discriminability of the resultant
global descriptor. For MinkNetVLAD architecture,
precision@1 increases from 0.976 (RGB only) to
0.992 (RGB+geometry). For PointNetVLAD, it
improves from 0.979 to 0.986. However, it must be
noted that the improvement is moderate, 1.6 p.p. in
the first case, and 0.7 p.p. in the second case.
MinkNetVLAD architecture consistently outper-
forms PointNetVLAD (Angelina Uy and Hee Lee,
2018) method. It performs significantly better using
geometry only (0.939 vs 0.855 precision@1), comparable using scene appearance only (0.976 vs 0.979 precision@1), and slightly better using both modalities (0.992 vs 0.986).
5 CONCLUSIONS
Our experiments show that, in indoor environments, scene appearance is much more informative than scene geometry for place recognition purposes. Both evaluated architectures trained with RGB data yielded significantly better results compared to training using solely the scene geometry modality. Fusing the two modalities, scene appearance and geometry, improved the discriminativity of the resultant global descriptor by a small factor (0.7-1.6 p.p.). The RGB component dominates over geometry, and there is little gain from using both of them in an indoor environment. When using only scene geometry, the MinkNetVLAD architecture, based on a sparse voxelized representation and 3D convolutions, yields significantly better results than the PointNetVLAD (Angelina Uy and Hee Lee, 2018) method, based on the PointNet (Qi et al., 2017a) architecture.
For future work, we plan to examine more so-
phisticated approaches for fusing scene appearance
and geometry modalities. One idea is to use a pretrained 2D convolutional network to extract features from the RGB image and link them with 3D points before feeding them to the global descriptor extraction network.
ACKNOWLEDGEMENTS
The project was funded by POB Research Centre for
Artificial Intelligence and Robotics of Warsaw Uni-
versity of Technology within the Excellence Initiative
Program - Research University (ID-UB).
Figure 6: Visualization of point cloud retrieval results using embeddings calculated with MinkNetVLAD architecture and
RGB+geometry modality. Each row shows a query RGB point cloud (on the left) and its five nearest neighbours retrieved
from the database (on the right). The distance shown is the Euclidean distance between the query and the database point cloud embeddings.
Different scene names correspond to different locations.
REFERENCES
Angelina Uy, M. and Hee Lee, G. (2018). Pointnetvlad:
Deep point cloud based retrieval for large-scale place
recognition. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
4470–4479.
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., and Sivic,
J. (2016). Netvlad: Cnn architecture for weakly super-
vised place recognition. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 5297–5307.
Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., and Shah,
R. (1994). Signature verification using a "Siamese"
time delay neural network. In Advances in neural in-
formation processing systems, pages 737–744.
Chen, W., Chen, X., Zhang, J., and Huang, K. (2017). Be-
yond triplet loss: a deep quadruplet network for per-
son re-identification. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 403–412.
Choy, C., Gwak, J., and Savarese, S. (2019a). 4d spatio-
temporal convnets: Minkowski convolutional neural
networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
3075–3084.
Choy, C., Park, J., and Koltun, V. (2019b). Fully convolu-
tional geometric features. In Proceedings of the IEEE
International Conference on Computer Vision, pages
8958–8966.
Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser,
T., and Nießner, M. (2017). Scannet: Richly-
annotated 3d reconstructions of indoor scenes. In
Proc. Computer Vision and Pattern Recognition
(CVPR), IEEE.
Hermans, A., Beyer, L., and Leibe, B. (2017). In defense
of the triplet loss for person re-identification. arXiv
preprint arXiv:1703.07737.
Komorowski, J. (2020). Minkloc3d: Point cloud
based large-scale place recognition. arXiv preprint
arXiv:2011.04530.
Lee, J.-E., Jin, R., and Jain, A. K. (2008). Rank-based
distance metric learning: An application to image re-
trieval. In 2008 IEEE Conference on Computer Vision
and Pattern Recognition, pages 1–8. IEEE.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B.,
and Belongie, S. (2017). Feature pyramid networks
for object detection. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2117–2125.
Liu, Z., Zhou, S., Suo, C., Yin, P., Chen, W., Wang, H.,
Li, H., and Liu, Y.-H. (2019). Lpd-net: 3d point cloud
learning for large-scale place recognition and environ-
ment analysis. In Proceedings of the IEEE Interna-
tional Conference on Computer Vision, pages 2831–
2840.
Lu, J., Hu, J., and Zhou, J. (2017). Deep metric learning
for visual understanding: An overview of recent ad-
vances. IEEE Signal Processing Magazine, 34(6):76–
84.
Maturana, D. and Scherer, S. (2015). Voxnet: A 3d con-
volutional neural network for real-time object recog-
nition. In 2015 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), pages 922–
928. IEEE.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017a). Point-
net: Deep learning on point sets for 3d classification
and segmentation. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 652–660.
Qi, C. R., Yi, L., Su, H., and Guibas, L. J. (2017b). Point-
net++: Deep hierarchical feature learning on point sets
in a metric space. In Advances in neural information
processing systems, pages 5099–5108.
Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E.
(2015). Multi-view convolutional neural networks for
3d shape recognition. In Proceedings of the IEEE
international conference on computer vision, pages
945–953.
Wang, Z. and Jia, K. (2019). Frustum convnet: Sliding
frustums to aggregate local point-wise features for
amodal 3d object detection. In 2019 IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems (IROS), pages
1742–1749. IEEE.
Wu, C.-Y., Manmatha, R., Smola, A. J., and Krahenbuhl, P.
(2017). Sampling matters in deep embedding learn-
ing. In Proceedings of the IEEE International Confer-
ence on Computer Vision, pages 2840–2848.
Zeng, A., Song, S., Nießner, M., Fisher, M., Xiao, J., and
Funkhouser, T. (2017). 3dmatch: Learning the match-
ing of local 3d geometry in range scans. In CVPR,
volume 1.
Zhang, W. and Xiao, C. (2019). Pcan: 3d attention map
learning using contextual information for point cloud
based retrieval. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition,
pages 12436–12445.
Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. (2017).
Random erasing data augmentation. arXiv preprint
arXiv:1708.04896.