Multi-Branch Convolutional Descriptors for Content-based Remote
Sensing Image Retrieval
Raffaele Imbriaco, Tunc Alkanat, Egor Bondarev and Peter H. N. de With
Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven 5612AZ, The Netherlands
(Raffaele Imbriaco and Tunc Alkanat contributed equally to this work.)
Keywords:
Content-based Image Retrieval, Remote Sensing, Convolutional Neural Networks, Local Feature Extraction.
Abstract:
Content-based remote sensing image retrieval (CBRSIR) is an important problem in computer vision, with applications in domains such as the military, agriculture, and surveillance. In this study, inspired by recent developments
in person re-identification, we design and fine-tune a multi-branch deep learning architecture that combines
global and local features to obtain rich and discriminative image representations. Additionally, we propose a
new evaluation strategy that fully separates the test and training sets and where new unseen data is used for
querying, thereby emphasizing the generalization capability of retrieval systems. Extensive evaluations show
that our method significantly outperforms the existing approaches by up to 10.7% in mean precision@20
on popular CBRSIR datasets. Regarding the new evaluation strategy, our method attains excellent retrieval
performance, yielding more than 95% precision@20 score on the challenging PatternNet dataset.
1 INTRODUCTION
In recent years, Remote Sensing (RS) imagery has
become increasingly available. RS image collections
now contain large numbers of high-resolution images. Manual labeling and/or annotation of images is
a cumbersome and expensive task. Furthermore, cer-
tain labeling or indexing methods are not suited for
user-friendly retrieval (e.g. consider the case of geo-
graphic coordinates as labels). Therefore, new meth-
ods for managing RS image collections need to be
developed. A technique that has achieved significant
success in the RS community is Content-Based Re-
mote Sensing Image Retrieval (CBRSIR) (Manjunath
and Ma, 1996, Bai et al., 2014, Tang et al., 2018). In
CBRSIR, the goal is to generate compact and robust
representations of the visual content of images, to eas-
ily find similarities among them. Such systems can
generally be reduced to two principal phases. First,
compact image representations are generated in the
feature extraction phase. Second, image similarity is
computed based on special metrics, using the descrip-
tors produced in the first phase.
Research has commonly concentrated on the
feature-extraction process (Zhou et al., 2017, Roy
et al., 2018, Xiong et al., 2019). Different types of
descriptors have been used for CBRSIR, which are classified according to their semantic level. Examples of low-level descriptors are SIFT (Lowe et al., 1999) and texture descriptors such as Gabor filters and the features of Haralick et al. (1973), which describe shape, texture, color, etc. Mid-level descriptors are produced by aggregating low-level features, using methods like Bag-of-Words (Sivic and Zisserman, 2003) or the Vector of Locally Aggregated Descriptors (Jégou et al., 2010). High-level descrip-
tors encode information related to semantic concepts,
such as “airplane” and “vegetation”. These high-
level semantic descriptors are commonly extracted
from Convolutional Neural Networks (CNNs), which
are trained for tasks such as classification. Several
architectures and techniques are deployed in the literature to improve feature extraction. Among these are Deep Metric Learning (Roy et al., 2018, Xiong
et al., 2019) and local feature extraction & aggrega-
tion (Tang et al., 2018, Imbriaco et al., 2019).
In this study, we explore CBRSIR and present
a twofold contribution. First, inspired by recent de-
velopments in the field of person re-identification, we
design and deploy a part-based feature extractor and
obtain state-of-the-art retrieval results. To the best of
our knowledge, this is the only system to produce a
single global representation, using a part-based model
for CBRSIR without requiring aggregation after ex-
traction. Second, we propose an alternative, more
challenging evaluation protocol to study the general-
ization capabilities of CBRSIR systems. This simu-
lates a more realistic scenario, where not all of the
classes are available for training purposes, and where
images of known classes can be acquired under dif-
ferent conditions or by different sensors.
2 RELATED WORK
In this section, a summary of the related work is pre-
sented, focusing on methods that exploit CNN fea-
tures, metric learning, and local features.
2.1 Global Feature Extraction
As mentioned previously, one of the fundamental pro-
cessing steps of any CBRSIR system is the feature
extractor. Figure 1 depicts the basic architecture of an
RSIR system. Feature extractors are commonly classified according to the semantic complexity of the representations they generate from RS imagery. Early work (Haralick et al., 1973, Manjunath and Ma, 1996) utilized hand-
crafted textural features to match images with similar
visual content. Richer semantic features are obtained
by combining local descriptors, such as SIFT (Lowe
et al., 1999) with aggregation methods like Bag-of-
Words (BoW) (Sivic and Zisserman, 2003) or Vector
of Locally Aggregated Descriptors (VLAD) (Jégou
et al., 2010). However, most recent CBRSIR systems
employ rich semantic features extracted from CNNs.
Penatti et al. demonstrate in (Penatti et al., 2015)
that CNN features are generic enough for RS imagery
classification. In (Zhou et al., 2017), various meth-
ods are proposed for the extraction of descriptive rep-
resentations. These include a Network-in-Network
block (Lin et al., 2013) for dimensionality reduction.
The above-mentioned systems produce a single, high-
dimensional vector per image. These representations
are commonly referred to as global descriptors, as
they encode information about the whole image in-
stead of image regions. The approaches discussed
in the following subsection produce local represen-
tations and descriptors.
2.2 Local Feature Extraction
Alternatives to the global descriptors are presented
in (Tang et al., 2018) and (Imbriaco et al., 2019).
The first work presents an unsupervised framework
for CBRSIR. Images are divided into patches and then
fed through an auto-encoder that reconstructs the in-
puts. Two types of patches (uniform and superpixel)
are extracted per image. The descriptors generated
from each patch are aggregated using Bag-of-Words,
producing a histogram representation for each image.
The second work deploys attentive Deep Local Fea-
tures (Noh et al., 2017) for the extraction of local
descriptors at various scales. These descriptors are
aggregated using VLAD (Jégou et al., 2010) and the
network is trained for classification. A disadvantage
of local descriptors is that the direct estimation of
image similarity becomes computationally expensive.
A single image may contain a large number of local
descriptors, making brute-force search inefficient for
large databases. Systems that exploit local descriptors
generally aggregate them into a single global repre-
sentation (BoW, VLAD) for efficient database search.
Furthermore, all methods described above deploy net-
works trained for other tasks, e.g. classification. An
emerging trend in CBRSIR is to train networks using
metric learning for improved retrieval performance.
2.3 Metric Learning for RSIR
Features extracted from CNNs generalize well to
tasks the network was not trained for (Penatti et al.,
2015). However, networks trained using metric-
learning objective functions, such as contrastive
loss (Chopra et al., 2005) or triplet loss (Weinberger
and Saul, 2009), show excellent performance in vari-
ous retrieval tasks, like person re-identification (Her-
mans et al., 2017) and remote sensing image retrieval
(RSIR) (Chaudhuri et al., 2019, Cao et al., 2019).
Cao et al. present a novel method for RSIR us-
ing metric learning and study various dimensionality-
reduction techniques. Their system produces global
descriptors, learned with triplet loss. Dimensionality
reduction and whitening of the trained descriptors are
done using either Principal Component Analysis (Jol-
liffe, 2011), or learned using a fully-connected layer.
Meanwhile, Chaudhuri et al. construct a Region Adja-
cency Graph, which is fed into a Graph Convolution
Network to produce a global descriptor. This descrip-
tor encodes the relationship between adjacent objects
in the images. Training is done with the contrastive
loss. Metric learning produces descriptors with small
intra-class distances and large inter-class distances in
feature space. These types of descriptors provide ex-
cellent ranking performance, even when dealing with
the visual complexity occurring in RS imagery (scale,
lighting and position variations). Unlike the work dis-
cussed above, our approach enables us to extract lo-
cal image information without requiring additional post-processing of the convolutional descriptors or computation of visual dictionaries (as in BoW
and VLAD). Inspired by person re-identification, we
deploy an architecture based on the Multi-Granularity
Network (MGN) (Wang et al., 2018), and train it
for CBRSIR.

Figure 1: Diagram of the principal phases in a CBRSIR system. The blue arrow depicts the offline data flow, whereas the red arrows depict the online data flow.

MGN’s architecture enables the extraction of features from various regions and at differ-
ent granularities (parts of images), thereby produc-
ing a compact and highly descriptive representation.
The combined qualities of this approach enable excel-
lent retrieval performance. Furthermore, we consider
a more generic retrieval case, in which the network
is trained on a dataset and retrieval is performed on
different, unseen datasets. This evaluation procedure
more closely resembles a real-world scenario, where
the training data and retrieval database may have dif-
ferent domains. In conclusion, we aim at an archi-
tecture based on MGN which offers a compact and
highly descriptive representation, while remaining ro-
bust for retrieval with unseen data. A more detailed
description of the architecture and the design param-
eters are given in the following section.
3 METHOD
A. Overview and Feature Extractor: RS imagery is
significantly different from street-level imagery. Im-
ages are acquired from an orthographic view and at
a high altitude, thereby altering the image character-
istics. Moreover, the variety of locations is large, in-
troducing an additional demand on the generalization
ability of the feature extractor. Other sources of appearance variation relate to weather and lighting (e.g. illumination changes and occlusions), or to environmental and anthropogenic processes (e.g. agricultural activity). These difficulties inherent to CBRSIR motivate research towards the development of better feature-extraction approaches.
As discussed in Section 2.1, CNN-based feature
extraction approaches have been shown to be superior to their handcrafted counterparts. Thus, to ex-
ploit the remarkable potential of deep learning, we
adopt CNNs to extract rich features from RS imagery.
However, a single, universal CNN architecture that
performs well on every problem does not exist. Well-
performing architectures are explored for specific ap-
plications.
Recently, in other image retrieval tasks (person
and vehicle re-identification), simultaneous usage
of global and local features has significantly con-
tributed to the overall retrieval performance (Chen
et al., 2019, Zheng et al., 2019). Motivated by these findings, we propose to jointly use global and local features to achieve superior performance in
CBRSIR. Inspired by the person re-identification
approach based on MGN (Wang et al., 2018), we
propose the CNN architecture shown in Figure 2.
B. Part-based CNN Architecture: As shown
in Figure 2, our architecture is a five-branch CNN,
where four of the branches extract local features
and one extracts global features. We utilize four
local branches of increasing granularity, to better
adapt to variations in scale and position of the
depicted object in the image content. Many powerful
backbone architectures exist in the literature, such
as ResNet (He et al., 2016), Inception (Szegedy
et al., 2015) and VGG (Simonyan and Zisserman,
2014). In this study, the backbone architecture
is ResNet-50, which is pre-trained on ImageNet.
This decision is guided by ResNet-50’s desirable
computational-cost-to-performance ratio and prior
success in the re-identification literature. Other
backbone architectures can also be used with the
proposed approach.
The shared backbone consists of ResNet-50
blocks up to and including the conv4_1 layer. Each
branch uses the output of the shared backbone to
concurrently extract local and global features. The
features are generated with the remaining blocks of
the ResNet-50 architecture up to and including the
conv5_1 layer. After branching, the ResNet blocks
do not share parameters. During the extraction of lo-
cal branch features, we follow (Wang et al., 2018)
and reduce the last stride of ResNet-50 from 2 to 1, which produces better local features. This reduction enables the local branches to extract richer features and has been found helpful in other studies (Luo et al., 2019, Kalayeh et al., 2018).

Figure 2: Overview of our proposed model. Our architecture has five branches, one being the global branch while the rest are local branches. After the shared-weight backbone, the global branch pools the feature tensor along the spatial axes without partitioning. In contrast, the local branches partition the feature tensor in different configurations prior to pooling. Then, each local feature is trained using the softmax loss for the RS classification problem. Note that the FC layers after the convolutional reduction block of each local feature are omitted for clarity (best viewed in color).

Then, as depicted in Figure 2, max-pooling is applied. Each branch has a
different pooling strategy, to boost the feature extrac-
tion performance. The global branch utilizes max-
pooling over both height and width dimensions, re-
ducing the spatial size to unity. In addition to global
max-pooling, the local branches also partition the fea-
ture tensor into multiple local feature tensors and then
apply individual max-pooling operations. The result-
ing pooled features are then trained with different ap-
proaches. We train for global feature extraction using
the softmax cross-entropy (abbreviated as softmax)
and triplet losses, while we train for local feature ex-
traction using only the softmax loss.
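
As an illustration, a simplified PyTorch sketch of this five-branch layout is given below. It is not our exact implementation: torchvision layer names are assumed, the shared part spans the whole conv4_x stage (rather than only conv4_1) for brevity, and the retrieval descriptor is formed by simply concatenating all reduced features.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


def make_branch(base: nn.Module, stride_one: bool) -> nn.Module:
    """Per-branch copy of the conv5_x stage (torchvision 'layer4').
    For the local branches, the last stride is reduced from 2 to 1."""
    layer4 = copy.deepcopy(base.layer4)
    if stride_one:
        layer4[0].conv2.stride = (1, 1)
        layer4[0].downsample[0].stride = (1, 1)
    return layer4


class MultiBranchExtractor(nn.Module):
    """One global branch plus four local branches that partition the conv5_1
    feature tensor into 2x1, 4x1, 1x2 and 1x4 parts before max-pooling."""

    PARTITIONS = [(2, 1), (4, 1), (1, 2), (1, 4)]  # (parts along H, parts along W)

    def __init__(self, reduced_dim: int = 256):
        super().__init__()
        base = resnet50(pretrained=True)
        # Shared backbone; here it spans the whole conv4_x stage for simplicity.
        self.shared = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool,
                                    base.layer1, base.layer2, base.layer3)
        self.global_branch = make_branch(base, stride_one=False)
        self.local_branches = nn.ModuleList(
            make_branch(base, stride_one=True) for _ in self.PARTITIONS)
        # One 1x1 conv + BN + ReLU reduction block per pooled feature.
        n_feats = 1 + sum(h * w for h, w in self.PARTITIONS)
        self.reduce = nn.ModuleList(
            nn.Sequential(nn.Conv2d(2048, reduced_dim, 1),
                          nn.BatchNorm2d(reduced_dim), nn.ReLU())
            for _ in range(n_feats))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shared = self.shared(x)
        # Global branch: max-pool over the full spatial extent.
        pooled = [F.adaptive_max_pool2d(self.global_branch(shared), 1)]
        # Local branches: partition the feature tensor, then max-pool each part.
        for branch, (ph, pw) in zip(self.local_branches, self.PARTITIONS):
            parts = F.adaptive_max_pool2d(branch(shared), (ph, pw))
            pooled += [p.reshape(x.size(0), -1, 1, 1)
                       for p in parts.flatten(2).unbind(-1)]
        # Reduce every pooled feature to 256-D and concatenate into one descriptor.
        reduced = [block(p).flatten(1) for block, p in zip(self.reduce, pooled)]
        return torch.cat(reduced, dim=1)


# descriptor = MultiBranchExtractor()(torch.randn(2, 3, 256, 256))  # shape (2, 13 * 256)
```
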
C. Global Features - Softmax: Both the global and
local branches learn global features with the softmax
loss. To this end, global max-pooling is applied to
the feature tensors after the conv5_1 layer of each
branch. Afterwards, the tensor is passed through a
fully-connected (FC) layer, effectively reducing the
size of the pooled feature vectors to N_c, where N_c denotes the number of classes in the training dataset.
Lastly, the output of the FC layer is trained with
softmax loss. Our method utilizes softmax loss in
addition to triplet loss on global features, because it
has been shown to provide richer features.
D. Local Features - Softmax: The local branches di-
vide the output of conv5_1 into horizontal or vertical
partitions of increasing granularity. Horizontally, the
feature tensor is split into partitions of 1 × 2 and 1 × 4
sub-tensors. The vertical partitions are generated
with the same dimensions. This partitioning strategy
produces high-dimensional representations that take
advantage of the contextual information. Then, the
max-pooling operation is applied individually to
every sub-tensor. Afterwards, the resulting feature
vectors are reduced in size using a convolutional
reduction block. This block consists of 1 × 1 convo-
lution, batch normalization, and ReLU layers, and
reduces the feature size to 256, leading to relatively
compact descriptors. Finally, each reduced feature
is fed to an FC layer that has N_c elements, and
local features are learned using the softmax loss for
classification.
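
A minimal sketch of such a classification head is given below, assuming a 256-D reduced local feature and an illustrative class count of N_c = 21 (as for UC Merced); the global heads operate on the 2048-D pooled features instead.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 21  # e.g. N_c for UC Merced; value assumed for illustration

class SoftmaxHead(nn.Module):
    """FC layer mapping a pooled or reduced feature to N_c logits,
    trained with softmax cross-entropy."""
    def __init__(self, in_dim: int, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, feature: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return self.criterion(self.fc(feature), labels)

# One head per feature; the total classification loss is the sum over all heads.
local_head = SoftmaxHead(in_dim=256)
loss = local_head(torch.randn(8, 256), torch.randint(0, NUM_CLASSES, (8,)))
```
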
E. Global Features - Triplet Loss: In addition to
the global softmax loss of each branch, training is
performed with the triplet loss on the reduced max-
pooled feature vectors. The dimensionality-reduction
strategy is identical to that of the local softmax loss
training. The resulting feature vectors are trained us-
ing the triplet loss given as:
\[
L_{\mathrm{triplet}} = \sum_{\substack{a,p,n \\ y_a = y_p \neq y_n}} \max\left(0,\, m + D_{a,p} - D_{a,n}\right), \tag{1}
\]
where a, p, n are anchor, positive and negative samples, y_i is the class of sample i, parameter m is the margin, and D_{a,p} and D_{a,n} are the distances between the anchor-positive and anchor-negative samples, respec-
tively. Note that we follow (Hermans et al., 2017) and
apply hard-triplet mining to enhance the discrimina-
tion ability of the triplet loss. This hard-triplet min-
ing strategy first picks P classes and then K images
randomly from each class to construct a mini-batch.
During training, each anchor sample contributes to the weight update only once, using the hardest
positive and negative images within the mini-batch.
At inference time, the final feature vector used for re-
trieval is constructed by concatenating the triplet-loss
trained features. Per branch, feature vectors are ex-
tracted from each partition, prior to their last FC layer
and concatenated. This leads to a fixed-size feature
vector for every image, regardless of the number of
classes in a given training dataset. Then, to math-
ematically represent the distance between any arbi-
trary pair of samples, we calculate a distance metric
between their descriptors. We use the L2 distance to
compute the similarity between two feature vectors.
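
As an illustration, the batch-hard variant of Equation (1) can be sketched in PyTorch as follows; the margin and the P, K values are taken from Section 4.4, while the PK mini-batch sampler itself is assumed to be part of the data loader.

```python
import torch

def batch_hard_triplet_loss(feats: torch.Tensor, labels: torch.Tensor,
                            margin: float = 1.2) -> torch.Tensor:
    """For every anchor in the mini-batch, use the hardest positive (largest
    distance, same class) and hardest negative (smallest distance, other class)."""
    dist = torch.cdist(feats, feats)                    # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    hardest_pos = dist.masked_fill(~same | eye, float('-inf')).amax(dim=1)
    hardest_neg = dist.masked_fill(same, float('inf')).amin(dim=1)
    return torch.relu(margin + hardest_pos - hardest_neg).mean()

# Example with a PK-sampled batch of P = 4 classes and K = 5 images per class:
labels = torch.arange(4).repeat_interleave(5)
loss = batch_hard_triplet_loss(torch.randn(20, 256), labels)
```
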
4 EXPERIMENTS
4.1 Datasets
To demonstrate the effectiveness of our approach,
we evaluate on three widely used public datasets:
UC Merced Land Use (Yang and Newsam, 2010),
NWPU-RESISC45 (Cheng et al., 2017) and Pattern-
Net (Zhou et al., 2018).
UC Merced. Published in 2010, this dataset
includes 2,100 images equally distributed over
21 classes. The image size is 256 × 256 pixels, where
pixel resolution is approximately 30 centimeters.
NWPU-RESISC45. This large-scale dataset in-
cludes 45 classes, each including 700 images. The
image size is 256 × 256 pixels, where the pixel reso-
lution varies between 30 and 0.2 meters. Published
in 2016, this dataset is specifically challenging due to
its large number of classes.
PatternNet. The PatternNet dataset has been pub-
lished in 2017 and forms one of the most recent
additions to the RSIR literature. This dataset includes
38 classes, each with 800 images. The 256 × 256 pixel
images have been collected from Google Earth and
Google Map API. As is the case for the NWPU-
RESISC45 dataset, the pixel resolution is not fixed
and varies between 4.693 and 0.062 meters.
4.2 Metrics
We adopt two popular metrics to evaluate our ap-
proach. Per experiment, the Mean Precision@k
(mP@k) and the average normalized modified re-
trieval rank (ANMRR) (Manjunath et al., 2001, Ap-
toula, 2013) are computed. In the case of mP@k, we
compute the ratio of correctly retrieved images in the
top k positions. A higher number denotes better per-
formance. The ANMRR evaluates the retrieval per-
formance taking both the number and rank of the re-
trieved results into account. In this case, a smaller
number indicates better performance.
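
For reference, a simplified sketch of both metrics follows; it assumes every query has at least one relevant image in the gallery and follows the standard formulation of (Manjunath et al., 2001).

```python
import numpy as np

def mean_precision_at_k(ranked_labels, query_labels, k=20):
    """mP@k: fraction of correct matches in the top-k results, averaged over queries.
    ranked_labels[i] holds the gallery labels sorted by distance to query i."""
    return float(np.mean([(np.asarray(r)[:k] == q).mean()
                          for r, q in zip(ranked_labels, query_labels)]))

def anmrr(ranked_labels, query_labels):
    """Average Normalized Modified Retrieval Rank; lower is better."""
    ng_all = [int((np.asarray(r) == q).sum()) for r, q in zip(ranked_labels, query_labels)]
    gtm = max(ng_all)                                   # largest ground-truth set size
    nmrr = []
    for r, q, ng in zip(ranked_labels, query_labels, ng_all):
        k_q = min(4 * ng, 2 * gtm)
        ranks = np.where(np.asarray(r) == q)[0] + 1.0   # 1-based ranks of relevant images
        ranks = np.where(ranks <= k_q, ranks, 1.25 * k_q)
        avr = ranks.mean()
        nmrr.append((avr - 0.5 * (1 + ng)) / (1.25 * k_q - 0.5 * (1 + ng)))
    return float(np.mean(nmrr))
```
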
4.3 Splits and Evaluation Protocol
To evaluate the performance of our approach and to
achieve a fair comparison with other methods, we use
two evaluation protocols. In the first evaluation proto-
col, which is also employed in (Tang et al., 2018, Im-
briaco et al., 2019,Cao et al., 2019), we train and eval-
uate on the same dataset. To split the datasets, we
first randomly sample 20% of all images in a class-
balanced manner. Then, we use the remaining part
as the training set. For the evaluation, we consider
each of the smaller subset images as the query and the
rest of the dataset as the gallery. Retrieved matches
of the same class are considered true positives. This
split and evaluation protocol is commonly used for
CBRSIR. However, this evaluation protocol has a sig-
nificant shortcoming. In this strategy, the training
and evaluation sets are not completely disjoint for the
gallery images. In other words, some of the gallery
images for each query are also used for training pur-
poses.
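
A sketch of this class-balanced split is given below; the function name, random seed, and rounding policy are illustrative choices rather than our exact implementation.

```python
import numpy as np

def class_balanced_split(labels, query_fraction=0.2, seed=0):
    """Reserve `query_fraction` of each class as queries; the remainder is used
    for training. At evaluation time, the gallery for a query is the whole
    dataset except the query image itself."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    query_idx, train_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_query = int(round(query_fraction * len(idx)))
        query_idx.extend(idx[:n_query])
        train_idx.extend(idx[n_query:])
    return np.array(query_idx), np.array(train_idx)
```
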
To compensate for the shortcoming of the existing
evaluation protocol and to obtain a better idea about
the generalization capabilities of CBRSIR systems,
we also report results on the transfer learning setting.
In this evaluation approach, we train on a dataset and
evaluate on another. In this way, the training and eval-
uation sets are completely disjoint. Moreover, this
evaluation protocol reveals how successful an algo-
rithm is on discriminating different RS structures in
an image, even if it is not trained to recognize specific
cues. For example, unlike the NWPU45 dataset, the UC
Merced dataset does not contain the “swimming pool”
class. A system with superior generalization ability is
Table 1: Comparison of image retrieval performances on the UC Merced, NWPU45 and PatternNet datasets.
UC Merced NWPU45 PatternNet
Method mP@10 mP@20 ANMRR mP@10 mP@20 ANMRR mP@10 mP@20 ANMRR
ResNet50 (Imbriaco et al., 2019) - 0.816 - - 0.798 - - - -
DBOW (Tang et al., 2018) - 0.830 - - 0.821 - - - -
V-DELF (MA) (Imbriaco et al., 2019) - 0.896 - - 0.840 - - - -
SGCN (Chaudhuri et al., 2019) 0.936 - 0.300 - - - 0.971 - 0.210
DML (Cao et al., 2019) 0.976 - 0.023 - - - 0.996 - 0.003
Ours - Global branch only 0.979 0.979 0.019 0.944 0.941 0.074 0.994 0.994 0.012
Ours 0.990 0.990 0.013 0.951 0.947 0.089 0.996 0.995 0.013
expected to be able to discriminate the images belong-
ing to this class as a separate structure, even if it was
not explicitly trained with this class.
4.4 Hyperparameters and Settings
Throughout our experiments, we have used an ini-
tial learning rate of 0.0002, reduced by a factor of 10
at epochs 60, 100 and 125. All models have been
trained for a total of 150 epochs using the Adam algo-
rithm (Kingma and Ba, 2014) and the PyTorch frame-
work (Paszke et al., 2017). The values for weight de-
cay and the triplet loss margin were set to 0.0005 and
1.2, respectively. Lastly, hard-triplet mining parame-
ters P and K were set to P = 4 and K = 5 for Pattern-
Net and NWPU45 datasets, whereas P = 3 and K = 7 were used for the UC Merced dataset.
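
These settings correspond roughly to the following PyTorch configuration, where `model` is a placeholder for the multi-branch network and the loop body is omitted.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the multi-branch network (illustration only)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, weight_decay=5e-4)
# Learning rate divided by 10 at epochs 60, 100 and 125; 150 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 100, 125], gamma=0.1)
TRIPLET_MARGIN = 1.2
P, K = 4, 5              # PatternNet / NWPU45 mini-batch sampling; (3, 7) for UC Merced

for epoch in range(150):
    # ... iterate over PK-sampled mini-batches, summing softmax and triplet losses ...
    scheduler.step()
```
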
4.5 Results
We present the CBRSIR results organized as follows.
In Table 1, we show the comparative performance
evaluation of our approach against the results of ex-
isting methods. Table 2 depicts the per-class retrieval
performance evaluated on the UC Merced Land Use
dataset. Lastly, we present in Table 3 the retrieval
performance results for transfer learning evaluation
for all dataset combinations. As it can be observed
from Table 1, our method outperforms the previous
state-of-the-art. The multi-branch convolutional de-
scriptors increase the retrieval performance on the UC Merced and NWPU45 datasets, while on PatternNet our results are comparable to those of the state of the art.
Moreover, for UC Merced and PatternNet datasets,
our results reach 99% or higher, which may be interpreted
as an indication of a saturated performance.
5 DISCUSSION
Observing Table 2, we conclude that our performance
in various classes is balanced. In UC Merced, the
lowest retrieval results are associated with the “dense
Table 2: Retrieval performances of our system on the UC
Merced dataset for each class and the average.
Class mP@1 mP@5 mP@10 mP@20 ANMRR
agricultural 1.000 1.000 1.000 1.000 0.008
airplane 1.000 1.000 1.000 1.000 0.008
baseball d. 1.000 1.000 1.000 1.000 0.008
beach 1.000 1.000 1.000 1.000 0.008
buildings 1.000 0.950 0.965 0.975 0.017
chaparral 1.000 1.000 1.000 1.000 0.008
d. residential 0.900 0.930 0.925 0.915 0.056
forest 1.000 1.000 1.000 1.000 0.008
freeway 1.000 1.000 1.000 1.000 0.008
golf course 1.000 1.000 1.000 1.000 0.008
harbor 1.000 1.000 1.000 1.000 0.008
intersection 1.000 1.000 1.000 1.000 0.008
m. residential 1.000 1.000 0.990 0.992 0.008
m. homepark 1.000 1.000 1.000 1.000 0.008
overpass 0.950 0.950 0.950 0.950 0.032
parkinglot 0.950 0.950 0.950 0.950 0.032
river 1.000 1.000 1.000 1.000 0.008
runway 1.000 1.000 1.000 1.000 0.008
s. residential 1.000 1.000 1.000 1.000 0.008
storagetanks 1.000 1.000 1.000 1.000 0.008
tenniscourt 1.000 1.000 1.000 1.000 0.008
Average 0.990 0.990 0.990 0.990 0.013
Table 3: Retrieval results in the transfer learning setting, ex-
plained in Section 4.3. Here, the model is trained on
the “Training” dataset and evaluated directly on the “Test”
dataset.
Training Test mP@1 mP@5 mP@10 mP@20 ANMRR
UCM PNet 0.941 0.916 0.899 0.880 0.331
UCM NWPU45 0.717 0.658 0.627 0.591 0.620
PNet UCM 0.926 0.851 0.798 0.715 0.483
PNet NWPU45 0.769 0.708 0.670 0.627 0.633
NWPU45 UCM 0.967 0.937 0.911 0.880 0.199
NWPU45 PNet 0.977 0.970 0.964 0.956 0.181
residential” and “overpass” classes. We conjecture
that the lowered performance in those classes occurs
due to the existence of visually similar categories
“medium residential” and “intersection”. In Pattern-
Net, our overall result is only slightly higher than
that of UC Merced. However, considering that Pat-
ternNet is nearly 15 times larger than UC Merced,
this indicates that our method produces sufficiently
discriminative representations even for small datasets
with limited training images. As labeling is a labor-
intensive task, we consider this an advantageous property. Perhaps the most interesting results are those of the NWPU-RESISC45 dataset. This dataset
is comparable in size to PatternNet and it has the high-
est number of classes among all three datasets. As
is the case for existing approaches, our method ex-
hibits decreased retrieval performance for this dataset.
The classes “railway”, “railway station”, “terrace”
and “palace” show the lowest overall performance,
scoring 0.844, 0.876, 0.868, 0.853 in mP@20, re-
spectively. Among those, the “railway” and “railway
station” classes are visually similar. Our descriptor
is not capable of producing discriminative represen-
tations for classes with such small semantic differ-
ences, resulting in reduced retrieval performance for
both. Overall, we conclude that, according to both
the mean precision and ANMRR metrics, our method
offers class-balanced retrieval performance. As ex-
plained in Section 4.3, we conjecture that the trans-
fer learning results in Table 3 are far more informa-
tive, since they provide better insight into the gen-
eralization abilities of our system. Investigating the
mean precision@20, we conclude that the best re-
trieval results are obtained when NWPU45 is used as
the source dataset for training. Although the Pattern-
Net and NWPU45 datasets are of similar scale, evalu-
ation on UC Merced reveals that training on NWPU45
offers 16.5% better results than training on Pattern-
Net. Thus, we conclude that thanks to the higher
image diversity of NWPU45, training on this dataset
yields better generalization. Closer inspection of the
results in Table 3 also indicates that transfer learning yields only moderate results when the training dataset is UC Merced. This reinforces the idea
that the training dataset should be sufficiently rich in
classes and number of images.
6 CONCLUSIONS
In this study, we have presented our deep learning-
based approach to CBRSIR. By combining deep global and local features, we
have achieved state-of-the-art results on three pub-
licly available and popular datasets. Moreover, our
approach offers near-perfect retrieval performance for
the widely-used UC Merced and PatternNet datasets,
while providing balanced retrieval performance for all
classes of the considered datasets.
As an additional contribution, we have also argued
that the existing evaluation protocol for the CBRSIR
problem has shortcomings and that it is not infor-
mative about the generalization ability of CBRSIR
systems. Thus, we have proposed to utilize transfer
learning evaluation to alleviate the problems of the
existing evaluation approach. Furthermore, we have
presented the results of our method on the transfer
learning evaluation setting. We presume that this new
evaluation protocol will be beneficial for the CBRSIR
literature and will motivate researchers to concentrate
on methods with superior generalization capability.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the donation of a Titan Xp GPU by NVIDIA for this research. This work was sup-
ported by the Interreg project PASSAnT and the Eu-
ropean ITEA project PS-CRIMSON.
REFERENCES
Aptoula, E. (2013). Remote sensing image retrieval
with global morphological texture descriptors. IEEE
transactions on geoscience and remote sensing,
52(5):3023–3034.
Bai, Y., Yu, W., Xiao, T., Xu, C., Yang, K., Ma, W.-Y., and
Zhao, T. (2014). Bag-of-words based deep neural net-
work for image retrieval. In Proceedings of the 22nd
ACM international conference on Multimedia, pages
229–232. ACM.
Cao, R., Zhang, Q., Zhu, J., Li, Q., Li, Q., Liu, B., and Qiu,
G. (2019). Enhancing remote sensing image retrieval
using a triplet deep metric learning network. Interna-
tional Journal of Remote Sensing, pages 1–12.
Chaudhuri, U., Banerjee, B., and Bhattacharya, A. (2019).
Siamese graph convolutional network for content
based remote sensing image retrieval. Computer Vi-
sion and Image Understanding, 184:22–30.
Chen, H., Lagadec, B., and Bremond, F. (2019). Partition
and reunion: A two-branch neural network for vehi-
cle re-identification. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition
Workshops, pages 184–192.
Cheng, G., Han, J., and Lu, X. (2017). Remote sensing
image scene classification: Benchmark and state of
the art. Proceedings of the IEEE, 105(10):1865–1883.
Chopra, S., Hadsell, R., LeCun, Y., et al. (2005). Learning a
similarity metric discriminatively, with application to
face verification. In CVPR (1), pages 539–546.
Haralick, R. M., Shanmugam, K., et al. (1973). Textural
features for image classification. IEEE Transactions
on systems, man, and cybernetics, (6):610–621.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hermans, A., Beyer, L., and Leibe, B. (2017). In defense
of the triplet loss for person re-identification. ArXiv,
abs/1703.07737.
Imbriaco, R., Sebastian, C., Bondarev, E., et al. (2019). Ag-
gregated deep local features for remote sensing image
retrieval. Remote Sensing, 11(5):493.
Jégou, H., Douze, M., Schmid, C., and Pérez, P. (2010). Ag-
gregating local descriptors into a compact image rep-
resentation. In CVPR 2010-23rd IEEE Conference on
Computer Vision & Pattern Recognition, pages 3304–
3311. IEEE Computer Society.
Jolliffe, I. (2011). Principal component analysis. Springer.
Kalayeh, M. M., Basaran, E., Gökmen, M., Kamasak,
M. E., and Shah, M. (2018). Human semantic pars-
ing for person re-identification. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 1062–1071.
Kingma, D. P. and Ba, J. (2014). Adam: A
method for stochastic optimization. arXiv preprint
arXiv:1412.6980.
Lin, M., Chen, Q., and Yan, S. (2013). Network in network.
arXiv preprint arXiv:1312.4400.
Lowe, D. G. et al. (1999). Object recognition from local
scale-invariant features. In ICCV, volume 99, pages
1150–1157.
Luo, H., Gu, Y., Liao, X., Lai, S., and Jiang, W. (2019).
Bag of tricks and a strong baseline for deep person re-
identification. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Work-
shops, pages 0–0.
Manjunath, B. S. and Ma, W.-Y. (1996). Texture features
for browsing and retrieval of image data. IEEE Trans-
actions on pattern analysis and machine intelligence,
18(8):837–842.
Manjunath, B. S., Ohm, J.-R., Vasudevan, V. V., and Ya-
mada, A. (2001). Color and texture descriptors. IEEE
Transactions on circuits and systems for video tech-
nology, 11(6):703–715.
Noh, H., Araujo, A., Sim, J., Weyand, T., and Han, B.
(2017). Large-scale image retrieval with attentive
deep local features. In Proceedings of the IEEE In-
ternational Conference on Computer Vision, pages
3456–3465.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and
Lerer, A. (2017). Automatic differentiation in pytorch.
In NIPS-W.
Penatti, O. A. B., Nogueira, K., and dos Santos, J. A.
(2015). Do deep features generalize from everyday
objects to remote sensing and aerial scenes domains?
In The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR) Workshops.
Roy, S., Sangineto, E., Demir, B., and Sebe, N. (2018).
Deep metric and hash-code learning for content-based
retrieval of remote sensing images. In IGARSS 2018-
2018 IEEE International Geoscience and Remote
Sensing Symposium, pages 4539–4542. IEEE.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Sivic, J. and Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), page 1470. IEEE.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 1–9.
Tang, X., Zhang, X., Liu, F., and Jiao, L. (2018). Unsuper-
vised deep feature learning for remote sensing image
retrieval. Remote Sensing, 10(8).
Wang, G., Yuan, Y., Chen, X., Li, J., and Zhou, X.
(2018). Learning discriminative features with multi-
ple granularities for person re-identification. CoRR,
abs/1804.01438.
Weinberger, K. Q. and Saul, L. K. (2009). Distance met-
ric learning for large margin nearest neighbor clas-
sification. Journal of Machine Learning Research,
10(Feb):207–244.
Xiong, W., Lv, Y., Cui, Y., Zhang, X., and Gu, X. (2019).
A discriminative feature learning approach for remote
sensing image retrieval. Remote Sensing, 11:281.
Yang, Y. and Newsam, S. (2010). Bag-of-visual-words and
spatial extensions for land-use classification. In Pro-
ceedings of the 18th SIGSPATIAL international con-
ference on advances in geographic information sys-
tems, pages 270–279. ACM.
Zheng, F., Deng, C., Sun, X., Jiang, X., Guo, X., Yu, Z.,
Huang, F., and Ji, R. (2019). Pyramidal person re-
identification via multi-loss dynamic training. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 8514–8522.
Zhou, W., Newsam, S., Li, C., and Shao, Z. (2017). Learn-
ing low dimensional convolutional neural networks
for high-resolution remote sensing image retrieval.
Remote Sensing, 9(5).
Zhou, W., Newsam, S., Li, C., and Shao, Z. (2018). Pattern-
net: A benchmark dataset for performance evaluation
of remote sensing image retrieval. ISPRS journal of
photogrammetry and remote sensing, 145:197–209.