Place Recognition Using Bag of Semantic and Visual Words from Equirectangular Images

María Flores, Marc Fabregat-Jaén, Juan José Cabrera, Adrián Peidró, David Valiente and Luis Payá
Engineering Research Institute of Elche (I3E), Miguel Hernández University, Avda. de la Universidad s/n, 03202, Elche, Alicante, Spain
Keywords: Place Recognition, Equirectangular Images, Semantic Information, Bag of Visual Words.

Abstract: Place recognition has a crucial relevance in some tasks of mobile robot navigation. For example, it is used for the detection of loop-closure or for estimating the position of a mobile robot along a route in a known environment. If place recognition is based on visual information, it can be approached as an image retrieval problem, and the Bag of Visual Words technique can be used for this purpose. Image retrieval relies on an image representation (for example, a vector) that contains relevant visual information. In this paper, two image signatures are proposed, both based on semantic and visual information. A bag of visual words is created for each semantic class, and local feature descriptors are classified according to the projection of their associated points on a semantic segmentation map. In the first proposal, the image signature is composed of a set of histograms where each cell encodes the frequency with which a visual word appears in the image. In the second, the image signature is composed of a set of vectors where each cell encodes the sum of the cosine distances between the visual word and the nearest extracted features.
1 INTRODUCTION
A mobile robot can navigate autonomously in a pri-
ori unknown environment or, by contrast, in a known
environment. In the first situation, the mobile robot
must solve the Simultaneous Localization And Map-
ping (SLAM) problem. This means that the mobile
robot builds a map of the environment and simul-
taneously estimates its position within the map dur-
ing navigation. In visual SLAM, the accuracy of the
map and localization can be improved by identifying
a previously visited location (Loop Closure Detection
module). In the second situation, the map is already
available in advance, and the mobile robot must esti-
mate its actual position within this map. In this con-
text, the mobile robot can locate itself if it is able to
identify its current surroundings on the stored topo-
logical map.
Place recognition is a computer vision task in
which, given an image, its location is identified by
querying the locations of images which belong to
the same place in a large geotagged database (Zeng
et al., 2018). It is commonly posed as an image
retrieval task. Image retrieval techniques can be
grouped into the following: Text-Based Image Re-
trieval (TBIR) and Content-Based Image Retrieval
(CBIR). The main difference between the two is that the search for images similar to a given query in a large database is based either on their visual content (CBIR) or on the textual data (metadata) associated with the images (TBIR). CBIR is based on three key components: selection, extraction, and representation of features. A
comprehensive survey of this is presented by Srivas-
tava et al. (2023). Similarly, Li et al. (2021) provide
another survey of the fast advances and applications
of theories and algorithms, focusing on those within
the period from 2009 to 2019.
In place recognition, an important issue is the sensors on board the mobile robot, since they provide information about its surroundings, and its location is identified by analysing this captured information.
Some well-recognized types of sensors used in place
recognition are vision (Wang et al., 2018; Xu et al.,
2019; Alfaro et al., 2024) and LiDAR systems (Cabr-
era et al., 2024; Vilella-Cantos et al., 2025).
This work focuses on solving place recognition
using visual information captured in images. An im-
portant feature of images is the rich information they
provide about the environment in which they were
taken. In mobile robotics, the field of view of the vi-
sion system is relevant, since, for example, the wider
the field of view, the fewer images are needed to cre-
ate a map.
The present work is based on (Ouni et al., 2022). The authors propose three image signatures to solve the image retrieval task. Focusing specifically on the type of signature that combines visual features and semantic information in (Ouni et al., 2022), this paper evaluates its behaviour for place recognition and with images with a wider field of view, such as equirectangular images. This signature is an N×M matrix, where N is the number of classes and M is the size of the visual descriptor. The procedure consists in extracting local feature points and a semantic segmentation map; the points are then classified according to their projection on this map, so that a semantic label is assigned to each of them. In this way, there are different clusters (one per semantic class) composed of visual descriptors, and each row of the image signature is the centroid of one of these clusters. However, the regions belonging to the same class may have very different visual appearances. For this reason, we propose to create a bag of visual words for each semantic class instead of representing a class by a single visual descriptor.
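To make this baseline signature concrete, the following is a minimal sketch (our reading of the BoSW idea described above, not the authors' implementation) that builds the N×M matrix of per-class descriptor centroids; it assumes that a semantic class index has already been assigned to each local descriptor.

```python
import numpy as np

def bosw_signature(descriptors: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Sketch of the BoSW-style signature: one row per semantic class, each row
    being the centroid (mean) of the local descriptors assigned to that class.
    `descriptors` has shape (num_points, descriptor_size); `labels` holds the
    semantic class index of each descriptor."""
    desc_size = descriptors.shape[1]
    signature = np.zeros((num_classes, desc_size), dtype=np.float32)
    for c in range(num_classes):
        class_desc = descriptors[labels == c]
        if len(class_desc) > 0:  # classes absent from the image keep a zero row
            signature[c] = class_desc.mean(axis=0)
    return signature
```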
The contributions of this work are as follows:
1. An image signature in which semantic and visual
information is merged. Each semantic category
will have a bag of visual words that will be used
to obtain a frequency histogram. The signature
will be the concatenation of all these histograms.
2. The frequency histograms of the previous sig-
nature are replaced by one-dimensional vectors,
where each bin represents the sum of the cosine
distances to each visual word. In other words, im-
age encoding is based on distance instead of fre-
quency.
3. The above image signatures and one of the proposals in (Ouni et al., 2022) (BoSW) are compared for place recognition in an outdoor environment. We clarify that we have only evaluated the image signature that the authors describe in Section 3.1 of their paper (denominated BoSW), not their full global framework for CBIR.
4. These image signatures are evaluated in image
retrieval when images are distorted, such as in
equirectangular images.
5. The influence of the type of distance on the Near-
est Neighbour search in the image encoding step
is studied.
The remainder of this paper is organized as fol-
lows. In Section 2, some image retrieval techniques
proposed in the literature are presented. Section 3 de-
scribes the different parts of the algorithm employed
to solve image retrieval in this paper. Section 4 is
focused on the experimental part, the database used,
and the analytical metrics are described. The results
obtained from the experiments are presented and dis-
cussed in Section 5. Finally, Section 6 presents the
conclusions.
2 RELATED WORKS
2.1 Image Retrieval
In image retrieval, there are approaches based on bag
of visual words. Mansoori et al. (2013) focus on the
feature extraction stage and propose to incorporate
colour information (hue descriptor) in descriptor fea-
tures (SIFT) of the images.
In terms of works that propose combining se-
mantic information and local visual features, Ouni
et al. (2022) present three procedures in order to con-
struct the image signatures using semantic informa-
tion. Amongst the three signatures, only two combine these two types of information; the other is based on semantic information only. One year later, Ouni et al. (2023) proposed two additional types of signatures. The first of these simultaneously integrates the semantic proportions of objects and their spatial positions, while the second one builds a semantic bag of visual phrases (i.e. a set of words linked to-
gether) by combining the visual vocabulary with se-
mantic information. In this case, the image signature
is an upper triangular matrix whose height and width
are equal to the number of visual words in the code-
book.
As with other computer vision tasks, the use of
convolutional neural network-based architectures has
increased in popularity over the last decade. Rani
et al. (2025) propose a separable convolutional neu-
ral networks-based framework. This contribution
reduces the computational complexity in terms of the number of convolutional operations and hyperparameters. Forcen et al. (2020) present a new rep-
resentation of co-occurrences from deep convolu-
tional features which is combined with the feature
map in order to improve the image representation.
Dubey (2022) presents a comprehensive survey of
deep learning-based progress over the last decade.
Figure 1: The bag of visual words framework: given a set of images (database), feature extraction is performed for all these images (Section 3.1). From all the features, the visual vocabulary is created (Section 3.2). Then, the signature of each image is built using the visual words that compose the vocabulary (Section 3.3).
3 SEMANTIC AND VISUAL BAG
OF WORDS
In this work, a hierarchical bag of words is used to
solve the localization problem in the navigation of a
mobile robot. This hierarchical bag of words is com-
posed of two levels. The higher level is based on se-
mantic information, while the lower level is supported
by visual information.
As mentioned above, the bag of visual words method involves creating a vocabulary composed of representative visual words, which are the result of clustering the visual descriptors extracted from the images.
The bag of visual words technique consists of the
extraction of visual features which are then clustered
in order to create a set of visual words (vocabulary or
codebook). After that, each image is represented by a
signature that encodes the frequency with which each
visual word appears in the image. These components
are shown in Figure 1, and described in detail within
the following sections.
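As an illustration of this classical pipeline (before adding the semantic level described next), the following sketch clusters ORB descriptors from a set of grayscale images with k-means and encodes one image as a visual word frequency histogram. It assumes OpenCV and scikit-learn are available; the vocabulary size k = 100 is an arbitrary example value.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(images, k=100):
    """Cluster ORB descriptors from all database images into k visual words."""
    orb = cv2.ORB_create()
    all_desc = []
    for img in images:
        _, desc = orb.detectAndCompute(img, None)
        if desc is not None:
            all_desc.append(desc.astype(np.float32))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.vstack(all_desc))

def bovw_histogram(image, kmeans):
    """Frequency histogram of the visual words appearing in one image."""
    orb = cv2.ORB_create()
    _, desc = orb.detectAndCompute(image, None)
    hist = np.zeros(kmeans.n_clusters, dtype=np.float32)
    if desc is not None:
        for w in kmeans.predict(desc.astype(np.float32)):
            hist[w] += 1
    return hist
```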
3.1 Feature Extraction
In this work, the features used are the result of com-
bining local feature descriptors (see Section 3.1.1)
and semantic information (see Section 3.1.2). Both
processes are carried out in parallel as shown in Fig-
ure 2.
Figure 2: The upper part of the figure corresponds to the extraction of local features (points and their feature descriptors, Section 3.1.1). The bottom part of the figure corresponds to the extraction of the semantic segmentation map using a semantic segmentation model (Section 3.1.2).
3.1.1 Local Features
This stage is divided into two steps. Firstly, the dis-
tinctive local points are identified for each image in
the database. This step is known as local feature ex-
traction. These points can be corners, edges or blobs.
Secondly, each extracted local point is represented by a feature descriptor that encodes the visual features of its neighbourhood.
There are several techniques for this purpose, such
as SIFT (Scale-Invariant Feature Transform) (Lowe,
2004), SURF (Speeded-Up Robust Features) (Bay
et al., 2006) and ORB (Oriented FAST and Rotated
BRIEF) (Rublee et al., 2011).
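As a minimal example of this two-step process, the following sketch uses OpenCV's ORB (any of the cited detectors could be used instead). The pixel coordinates of each keypoint are kept because they are needed later to project the point onto the segmentation map; the value of nfeatures is an arbitrary example.

```python
import cv2
import numpy as np

def extract_local_features(gray_image: np.ndarray):
    """Detect ORB keypoints and compute their descriptors.
    Returns the integer pixel coordinates (u, v) of each keypoint and
    the corresponding 32-byte binary descriptors."""
    orb = cv2.ORB_create(nfeatures=2000)  # nfeatures is an example value
    keypoints, descriptors = orb.detectAndCompute(gray_image, None)
    if descriptors is None:  # no features found in this image
        return np.empty((0, 2), dtype=int), np.empty((0, 32), dtype=np.uint8)
    coords = np.array([kp.pt for kp in keypoints]).round().astype(int)  # (u, v) = (col, row)
    return coords, descriptors
```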
3.1.2 Semantic Segmentation
Semantic segmentation is a technique that aims to as-
sign a semantic class to each pixel in the image.
Although the diagram presented in Figure 2 shows
a block corresponding to the extraction of the seman-
tic segmentation map using a semantic segmentation
model, it is important to note that this step is not car-
ried out in this work, as the semantic segmentation
maps have been previously generated and are part of
the dataset, as will be explained in Section 4.1.
3.1.3 Fusion of Semantic and Visual Information
A semantic segmentation map and a set of local points
with their corresponding feature descriptors are avail-
able for each image. The goal is to have a semantic label associated with each local feature descriptor. To that end, given a local point, its coordinates are employed to extract the semantic label encoded in the map at that pixel. This label is then assigned to the feature descriptor of the local point.
At the end of this step, the visual descriptors have
been classified into different semantic categories.
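A minimal sketch of this fusion step, assuming the keypoint coordinates and descriptors from the previous sketch and a segmentation map stored as an array of class indices:

```python
import numpy as np

def label_descriptors(coords: np.ndarray, descriptors: np.ndarray,
                      segmentation_map: np.ndarray):
    """Assign to each local descriptor the semantic class stored in the
    segmentation map at its keypoint position. `coords` holds (u, v) pixel
    coordinates (column, row); `segmentation_map` is an (H, W) array of class ids."""
    u = np.clip(coords[:, 0], 0, segmentation_map.shape[1] - 1)
    v = np.clip(coords[:, 1], 0, segmentation_map.shape[0] - 1)
    labels = segmentation_map[v, u]  # note the row/column indexing order
    return labels, descriptors       # descriptors are now grouped by `labels`
```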
3.2 Visual Vocabulary Creation
In a bag of visual words algorithm, visual feature de-
scriptors extracted from all images in the database are
grouped into k clusters based on their similarity. This
stage requires the use of a clustering algorithm such as k-means. This is followed by the creation of the visual vocabulary. The visual words are the centroids of the clusters, and the size of the vocabulary is equal to the number of clusters (k).

Figure 3: The local points detected are projected on the semantic segmentation map to obtain their semantic labels.
In this work, a single visual vocabulary is not created from all the extracted feature descriptors; instead, one vocabulary is generated for each semantic category.
Figure 4: A bag of k visual words is created for each semantic class. Given a set of local feature descriptors classified as class i (e.g. car), a k-means algorithm is used to assemble these descriptors into k clusters (e.g. two) and to extract the centroid of each one. These centroids are the visual words of this class (w_j^{class_i}, j = 1, ..., k).
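A sketch of this per-class vocabulary creation with k-means is shown below; k = 10 follows the configuration described in Section 4, and classes with fewer descriptors than k are given a correspondingly smaller vocabulary.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_class_vocabularies(descriptors: np.ndarray, labels: np.ndarray, k: int = 10):
    """Create one bag of visual words (k-means centroids) per semantic class.
    Returns a dict mapping class id -> array of visual words (n_words, desc_size)."""
    vocabularies = {}
    for c in np.unique(labels):
        class_desc = descriptors[labels == c].astype(np.float32)
        n_words = min(k, len(class_desc))  # shrink k when a class has few descriptors (Section 4)
        km = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(class_desc)
        vocabularies[c] = km.cluster_centers_
    return vocabularies
```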
3.3 Signature Construction
In a bag of visual words algorithm, each image is rep-
resented by a one-dimensional vector with a length
equal to the number of visual words (vocabulary size)
where each element encodes the number of times each
visual word appears in the image. In other words, the
signature is a histogram of the frequencies of visual
words.
In this work, there is not a unique bag of visual
words, but a bag of visual words for each semantic
category. Then, if the number of semantic categories
to be considered is N, the signature is a set of N fre-
quency histograms. It will be identified in the experi-
mental section (Section 5) as BoSVW.
In addition, we propose replacing the frequency
histogram with a vector in which each cell encodes
the sum of the cosine distance between the local fea-
tures and the visual word. It will be identified in the
experimental section (Section 5) as BoSVW*.
Figure 5: Once the feature descriptors have been divided
into different groups based on the semantic information, a K
nearest neighbour algorithm is used to find the most similar
visual word with the same semantic class. After obtaining
a histogram for each semantic group, the signature is con-
structed.
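The two encodings described in this section can be sketched as follows (one plausible reading of the descriptions above, not a reference implementation): for each semantic class, BoSVW counts how many descriptors have each visual word as their nearest neighbour under the chosen distance, whereas BoSVW* accumulates, in the bin of the nearest word, the cosine distance between the descriptors and that word. SciPy's cdist is used here for brevity.

```python
import numpy as np
from scipy.spatial.distance import cdist

def bosvw_signatures(descriptors, labels, vocabularies, metric="euclidean"):
    """Build the BoSVW (frequency) and BoSVW* (cosine-distance sum) signatures.
    Each semantic class contributes one row; classes with no descriptors in the
    image keep a zero row. `metric` selects the distance used in the
    nearest-word search for the frequency histogram."""
    classes = sorted(vocabularies.keys())
    k = max(len(v) for v in vocabularies.values())
    hist_sig = np.zeros((len(classes), k), dtype=np.float32)   # BoSVW
    dist_sig = np.zeros((len(classes), k), dtype=np.float32)   # BoSVW*
    for row, c in enumerate(classes):
        feats = descriptors[labels == c].astype(np.float32)
        if len(feats) == 0:
            continue
        words = vocabularies[c]
        nearest_hist = cdist(feats, words, metric=metric).argmin(axis=1)
        cos_d = cdist(feats, words, metric="cosine")
        nearest_cos = cos_d.argmin(axis=1)
        np.add.at(hist_sig[row], nearest_hist, 1)                                   # word frequencies
        np.add.at(dist_sig[row], nearest_cos,
                  cos_d[np.arange(len(feats)), nearest_cos])                        # summed cosine distances
    return hist_sig, dist_sig
```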
4 EXPERIMENTAL SETUP
In this section, the visual localization problem is
solved using several configurations of the bag of visual words method. The main purpose is to evaluate these
configurations and select the one that provides the
best results. It is important to note that the full
content-based image retrieval algorithm proposed by
Ouni et al. (2022) has not been implemented. Only
the signature construction module explained in Sec-
tion 3.1 of (Ouni et al., 2022) was implemented in our
framework. It will be identified in the experimental
section (Section 5) as BoSW. Table 1 shows a brief
definition of each of these configurations, their IDs
used in Section 5 and a brief description of the image
signature.
The images used are equirectangular (more infor-
mation on the dataset is given in Section 4.1) in or-
der to evaluate these methods also for this type of im-
ages, since spherical images present some benefits in
mobile robot navigation (such as their wide field of
view).
The algorithm executed in all four cases is the same except for the image encoding step, where the procedure differs for each type of signature (or for each distance, in the case of BoSVW). In all four cases, the measure chosen to compare the signature of the query image with those in the database is the cosine distance. Since the image signatures are two-dimensional in all four cases, their rows are concatenated to obtain a one-dimensional vector. After that, the vector is normalized using the L1 norm. The cosine distance is then calculated, as indicated, in order to obtain the most similar image (retrieved image) in the database. All signatures have in common that they use visual information. In this work, ORB (Rublee et al., 2011) has been chosen to obtain the local points and their corresponding feature descriptors.
Table 1: The different configurations of the bag of words method that are evaluated. The first column shows the ID used for each configuration when displaying the results, while the third column describes the signature of each.

ID | Method | Signature
BoSW | Bag of Semantic Words (Ouni et al., 2022) | M×N matrix where the width N corresponds to the size of the visual descriptor and the height M corresponds to the number of semantic classes.
BoSVW | Bag of Semantic Visual Words | A set of M (number of semantic classes) one-dimensional frequency histograms, each with a length equal to K.
BoSVW* | Bag of Semantic Visual Words (no frequency) | A set of M (number of semantic classes) one-dimensional vectors, each with a length equal to K, where each bin is the sum of the cosine distances instead of the count.
In the cases of BoSVW and BoSVW*, the vocab-
ulary size (k) is initially fixed for all semantic classes
and its value is 10. If any semantic class has fewer than 10 feature descriptors, then the vocabulary size for that class is set to its number of feature descriptors.
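A sketch of the comparison step described above: each two-dimensional signature is flattened by concatenating its rows, L1-normalized, and compared against the database signatures with the cosine distance; the image with the smallest distance is returned.

```python
import numpy as np

def retrieve(query_signature: np.ndarray, db_signatures: list) -> int:
    """Return the index of the database image whose signature is closest
    (smallest cosine distance) to the query signature."""
    def encode(sig):
        v = sig.ravel().astype(np.float64)        # concatenate the rows
        return v / (np.abs(v).sum() + 1e-12)      # L1 normalization
    q = encode(query_signature)
    best, best_dist = -1, np.inf
    for i, sig in enumerate(db_signatures):
        d = encode(sig)
        cos_dist = 1.0 - (q @ d) / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12)
        if cos_dist < best_dist:
            best, best_dist = i, cos_dist
    return best
```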
4.1 Dataset
The dataset employed in this paper is KITTI-360
(Liao et al., 2023). For the image collection, the authors equipped a station wagon with two fisheye cameras, one positioned on each side. Both fisheye cameras have a 180-degree field of view, so that together they capture a full view of the scene.
Before carrying out the experiments, each pair of
fisheye images was converted into a single equirect-
angular image. For that, the calibration provided
in the dataset is used to convert each fisheye image
into equirectangular (fisheye image projection to unit
sphere). Then, a polynomial transformation proposed
by Flores et al. (2024) is used to align both equirect-
angular images. Figure 6 shows an example (equirect-
angular image) after performing this process.
Figure 6: An equirectangular image generated from a pair
of fisheye images of the KITTI-360 dataset (Liao et al.,
2023).
The dataset of images has been divided into two
subsets: database and query. For the first subset, an
image was selected every 10 meters of the trajectory,
taking the first captured image as the starting image.
Thus, the database consists of a total of 791 images.
The images of the dataset not selected have been con-
sidered as query images. This means that 9723 im-
ages constitute the query set.
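One simple way to approximate this split, assuming the image poses are available as an (n, 3) array of XYZ positions ordered along the trajectory, is sketched below: a database frame is selected whenever roughly 10 meters have been travelled since the last selected frame.

```python
import numpy as np

def split_database_query(positions: np.ndarray, spacing: float = 10.0):
    """Select one database image roughly every `spacing` meters along the
    trajectory, starting with the first captured image; every image that is
    not selected becomes a query image."""
    db_idx = [0]
    last_pos = positions[0]
    for i in range(1, len(positions)):
        if np.linalg.norm(positions[i] - last_pos) >= spacing:
            db_idx.append(i)
            last_pos = positions[i]
    db_set = set(db_idx)
    query_idx = [i for i in range(len(positions)) if i not in db_set]
    return db_idx, query_idx
```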
As mentioned above, the semantic segmentation maps were not obtained during the running process; they were generated beforehand. The semantic segmentation
model employed for this purpose is SegFormer (Xie
et al., 2021). A semantic segmented map can be vi-
sualized in Figure 7. Since it is a 360-degree vision system, part of the station wagon appears in the image. However, it is not part of the scene, so it has
been labelled as unlabelled (black pixels in Figure 7),
and this class has not been taken into account for the
image signature construction.
Figure 7: The semantic segmentation map of the image
shown in Figure 6. It was generated using SegFormer (Xie
et al., 2021).
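The segmentation maps are precomputed in this work; as an illustration of how such maps could be generated with SegFormer, the following is a sketch using the Hugging Face transformers implementation and a Cityscapes-finetuned checkpoint (the exact checkpoint used to generate the dataset maps is an assumption).

```python
import numpy as np
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

# Assumed checkpoint: a SegFormer model finetuned on Cityscapes classes
# (vegetation, building, car, sky, ...), which match the labels discussed above.
CHECKPOINT = "nvidia/segformer-b0-finetuned-cityscapes-1024-1024"

def segment(image_path: str) -> np.ndarray:
    """Return an (H, W) array of per-pixel class indices for one image."""
    image = Image.open(image_path).convert("RGB")
    processor = SegformerImageProcessor.from_pretrained(CHECKPOINT)
    model = SegformerForSemanticSegmentation.from_pretrained(CHECKPOINT).eval()
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, num_classes, h/4, w/4)
    # Upsample the low-resolution logits back to the input image size
    logits = torch.nn.functional.interpolate(
        logits, size=image.size[::-1], mode="bilinear", align_corners=False)
    return logits.argmax(dim=1)[0].cpu().numpy()
```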
4.2 Evaluation Protocol
4.2.1 Distance Difference
The distances between the query image (q) and the
retrieved image (r) will be analysed. Then, given
a query image captured at position XYZ_q, the image retrieval algorithm is executed, which returns the database image most similar to the query image (i.e. the retrieved image). The retrieved image was acquired at XYZ_r. Both positions (XYZ_q and XYZ_r) are extracted from the pose file provided by the dataset. The dis-
tance units are meters and are calculated as follows:
\[
\mathrm{dist}_{qr} = \sqrt{(X_q - X_r)^2 + (Y_q - Y_r)^2 + (Z_q - Z_r)^2} \tag{1}
\]
4.2.2 Average Recall (AR) at 1
For each query image, a retrieved image is recovered
from the database after applying the image retrieval
method. Since the dataset provides the poses in which
all images were captured, after acquiring the retrieved
image, the distance between the pose of the retrieved
image and the query image (dist_qr) is calculated using equation (1).
If the distance is lower than 20 meters, the recall for this query image (I_{q_i}) is one. Otherwise, the recall will be zero.
\[
R@1_{q_i} =
\begin{cases}
1 & \text{if } \mathrm{dist}_{qr} < 20 \text{ meters} \\
0 & \text{if } \mathrm{dist}_{qr} \geq 20 \text{ meters}
\end{cases} \tag{2}
\]
The evaluation measure is the average value of all
recall values after executing the method for all query
images (n images):
\[
AR@1(\%) = \frac{\sum_{i=1}^{n} R@1_{q_i}}{n} \cdot 100 \tag{3}
\]
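The whole evaluation protocol can be condensed into a few lines, assuming the query and retrieved positions are given as (n, 3) arrays of XYZ coordinates in meters:

```python
import numpy as np

def ar_at_1(query_positions: np.ndarray, retrieved_positions: np.ndarray,
            threshold: float = 20.0) -> float:
    """Average Recall at 1 (equations (1)-(3)): the percentage of queries whose
    retrieved image was captured less than `threshold` meters away."""
    dist_qr = np.linalg.norm(query_positions - retrieved_positions, axis=1)  # eq. (1)
    recall_at_1 = (dist_qr < threshold).astype(float)                        # eq. (2)
    return float(recall_at_1.mean() * 100.0)                                 # eq. (3)
```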
5 RESULTS AND ANALYSIS
5.1 BoSVW: Feature Extraction in the
Database
To create the vocabulary, features must first be extracted. In this section, we analyse the visual features extracted from the images in the database, specifically how many are associated with each of the semantic classes. This can be observed in the bar chart of Figure 8, where the height of each bar represents the number of visual features classified into each semantic class.
As can be seen, the semantic class with the highest number of features is "vegetation", with a total of 374,321. In contrast, "train" is the class with the lowest number, with a total of 4. Other semantic classes with
a high number of associated local features are ”build-
ing”, ”car” and ”sky”, in that order.
5.2 Comparison and Evaluation
This section compares the different image signatures
presented in Table 1 using the evaluation protocols de-
scribed in Section 4.2. For the case of BoSVW, two
experiments have been carried out. The first experiment uses the Euclidean distance to create the histogram (i.e. during the K Nearest Neighbour process), while the second uses the cosine distance. The aim is to determine whether the choice of distance has an influence on the results.

Figure 8: Number of ORB local features (y-axis) for each semantic class (x-axis).
5.2.1 Evaluation in Terms of Average Recall at 1
First, the different bag of words methods mentioned
above (see Table 1) are evaluated in terms of Average
Recall at 1 (AR@1). The results are shown by means
of bar graphs which can be observed in Figure 9. Each
bar indicates the AR@1 achieved for each signature
type.
Figure 9: Results of Average Recall at 1 (AR@1) calculated
after using the four image signatures evaluated.
With regard to the results shown in this figure,
the value of AR@1 for the BoSW method is equal
to 20.806, for BoSVW using the Euclidean distance
it is equal to 50.417, meanwhile using the cosine dis-
tance the value is 49.069, and for BoSVW* it is equal
to 46.529. As can be seen, a higher value of re-
call in place recognition is achieved for BoSVW vari-
ations than for BoSW. The results show that using
frequency (BoSVW) instead of distance (BoSVW*)
achieves better results. In terms of the distance to find
the visual words, the Euclidean distance provides bet-
ter results than the cosine distance.
In summary, we propose two image signatures (BoSVW and BoSVW*) in which each semantic category has its own associated bag of visual words from which its descriptor is obtained, rather than the class being represented by a single visual descriptor (BoSW).
The main objective of this paper is to solve the local-
ization problem by means of place recognition. Ac-
cording to this initial evaluation, the image signatures
we propose increase the AR@1 value when they are
used to address image retrieval in a set of images from
the same environment (a trajectory followed by a mo-
bile robot). In other words, the proposed image signatures improve the capability of the algorithm to return, more often, a database image located less than 20 meters from the query image.
5.2.2 Evaluation in Terms of Distance Difference
Second, the different methods are also evaluated in
terms of distance difference in meters. For this pur-
pose, the distance in meters between each query im-
age in the query set (9723 images) and its retrieved
image is calculated using (1). For each type of sig-
nature, all of these 9723 distances are displayed us-
ing rain cloud plots, which help to analyse the dis-
tribution of distances as probability density, and at
the same time, key summary statistics (such as me-
dian and quartiles) can be visualized. This graph can
be seen in Figure 10.

Figure 10: The distance between the position of the query image and the retrieved image using the different signatures.

It is important to note that the
database set is made up of an image every 10 meters
on the trajectory (see Section 4.1). Then, the range of
Euclidean position distances between the query im-
ages and the nearest database images (ground truth) is
between 0 and 5 meters. Therefore, the committed lo-
calization error is considerable in the cases (points in
the graph) where the distance between the query im-
age and the retrieved image (dist_qr) is much greater than 5 meters.
At first glance, we can see that, for all four types of signatures, the highest concentration of points is found at distances below approximately 100 meters. In addition, it is clear that the three BoSVW variants have a higher peak than BoSW in this interval. As for the box plots, BoSW has a higher median value (around 265 meters) than the other three image signatures. Also, the lower whisker of the BoSW boxplot extends over a longer distance than the others. Focusing on the BoSVW variants, the results are better when using the frequency histogram: the median value is around 36 meters when the distance is Euclidean and around 55 meters when it is cosine. However, the median value is around 84 meters when using the vector representing the sum of the cosine distances to the visual words.
In this section, the image signatures have not been evaluated only on the condition of finding the most similar image within a given radius, as in the previous evaluation; rather, all the distances obtained after running the image retrieval algorithm for all query images are shown for each signature. Taking these results into
account, the proposed signatures more frequently return, as the most similar, a database image whose position is closer to the query image, achieving a more refined localization.
6 CONCLUSIONS
The main objective of this work is to solve the task
of place recognition for a mobile robot navigating in
a known outdoor environment. The method used for
this is image retrieval using equirectangular images
as input. Image retrieval relies on the fact that images
are represented in a way that their significant features
are described (image signature).
In relation to this, three types of image signatures
are evaluated and compared in this work for place
recognition, when a mobile robot navigates a trajec-
tory in a previously visited environment. All the im-
plemented signatures combine semantic and visual in-
formation. The first one (i.e. BoSW) was proposed by
Ouni et al. (2022) whereas the other two variations are
proposed in this paper (i.e. BoSVW and BoSVW*).
The BoSW image signature is a matrix in which each
row is the centroid of a set of visual feature descrip-
tors belonging to the same semantic class. The num-
ber of rows is equal to the number of semantic classes
and the number of columns is equal to the size of the
local visual feature. In the case of BoSVW, the rows are histograms of visual word frequencies.
After the experiments, the results in terms of recall at one show that BoSVW using the Euclidean distance during the image encoding step provides the highest value. In contrast, the lowest recall value
is achieved using BoSW. Apart from this evaluation measure, the distances between the position of the query image and the position of the image retrieved by the method using each image signature are also analysed. For BoSVW, the use of the Euclidean distance achieves a lower distance more often than the cosine distance.
Therefore, it can be concluded that creating a bag of visual words for each semantic category, as proposed in this paper, rather than using a single visual descriptor per category, improves the results on the problem of place
recognition. Additionally, if each category is repre-
sented by a frequency histogram, the localization is
more accurate than using a vector that encodes dis-
tances.
In summary, the evaluations show that the imple-
mentation of the proposed signatures in an image re-
trieval algorithm for place recognition provides better
results.
In this work, only image signatures that merge semantic and visual information have been evaluated and compared to solve place recognition. Taking this into account, we propose as future work to extend this comparative evaluation to other algorithms (such as those that use only visual information). Along the same lines, another possible line of future work is to study these signatures using other local features, extracted both with traditional methods and with Deep Learning methods. Finally, further future work could research whether the proposed signatures can be improved by finding the optimal number of clusters (vocabulary size) for each category, rather than fixing this parameter for all semantic categories as in this work.
ACKNOWLEDGEMENTS
This research work is part of a project funded by "AYUDAS A LA INVESTIGACIÓN 2025 DEL VICERRECTORADO DE INVESTIGACIÓN Y TRANSFERENCIA" of the Miguel Hernández University and part of the project PID2023-149575OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by FEDER, UE. It is also part of the project CIPROM/2024/8 funded by Generalitat Valenciana and part of the project CIAICO/2023/193 funded by Generalitat Valenciana.
REFERENCES
Alfaro, M., Cabrera, J., Jiménez, L., Reinoso, O., and Payá, L. (2024). Triplet Neural Networks for the Visual Localization of Mobile Robots. In Proceedings of the 21st International Conference on Informatics in Control, Automation and Robotics, pages 125–132, Porto, Portugal. SCITEPRESS - Science and Technology Publications.

Bay, H., Tuytelaars, T., and Van Gool, L. (2006). SURF: Speeded Up Robust Features. In Leonardis, A., Bischof, H., and Pinz, A., editors, Computer Vision – ECCV 2006, pages 404–417, Berlin, Heidelberg. Springer.

Cabrera, J. J., Santo, A., Gil, A., Viegas, C., and Payá, L. (2024). MinkUNeXt: Point Cloud-based Large-scale Place Recognition using 3D Sparse Convolutions. arXiv:2403.07593.

Dubey, S. R. (2022). A Decade Survey of Content Based Image Retrieval Using Deep Learning. IEEE Transactions on Circuits and Systems for Video Technology, 32(5):2687–2704.

Flores, M., Valiente, D., Peidró, A., Reinoso, O., and Payá, L. (2024). Generating a full spherical view by modeling the relation between two fisheye images. The Visual Computer, 40(10):7107–7132.

Forcen, J. I., Pagola, M., Barrenechea, E., and Bustince, H. (2020). Co-occurrence of deep convolutional features for image search. Image and Vision Computing, 97:103909.

Li, X., Yang, J., and Ma, J. (2021). Recent developments of content-based image retrieval (CBIR). Neurocomputing, 452:675–689.

Liao, Y., Xie, J., and Geiger, A. (2023). KITTI-360: A Novel Dataset and Benchmarks for Urban Scene Understanding in 2D and 3D. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3292–3310.

Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110.

Mansoori, N. S., Nejati, M., Razzaghi, P., and Samavi, S. (2013). Bag of visual words approach for image retrieval using color information. In 2013 21st Iranian Conference on Electrical Engineering (ICEE), pages 1–6. ISSN: 2164-7054.

Ouni, A., Chateau, T., Royer, E., Chevaldonné, M., and Dhome, M. (2023). An efficient ir approach based semantic segmentation. Multimedia Tools and Applications, 82(7):10145–10163.

Ouni, A., Royer, E., Chevaldonné, M., and Dhome, M. (2022). Leveraging semantic segmentation for hybrid image retrieval methods. Neural Computing and Applications, 34(24):21519–21537.

Rani, S., Kasana, G., and Batra, S. (2025). An efficient content based image retrieval framework using separable CNNs. Cluster Computing, 28(1):56.

Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. (2011). ORB: An efficient alternative to SIFT or SURF. In 2011 International Conference on Computer Vision, pages 2564–2571. ISSN: 2380-7504.

Srivastava, D., Singh, S. S., Rajitha, B., Verma, M., Kaur, M., and Lee, H.-N. (2023). Content-Based Image Retrieval: A Survey on Local and Global Features Selection, Extraction, Representation, and Evaluation Parameters. IEEE Access, 11:95410–95431.

Vilella-Cantos, J., Cabrera, J. J., Payá, L., Ballesta, M., and Valiente, D. (2025). MinkUNeXt-SI: Improving point cloud-based place recognition including spherical coordinates and LiDAR intensity. arXiv:2505.17591.

Wang, T.-H., Huang, H.-J., Lin, J.-T., Hu, C.-W., Zeng, K.-H., and Sun, M. (2018). Omnidirectional CNN for Visual Place Recognition and Navigation. arXiv:1803.04228.

Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J. M., and Luo, P. (2021). SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090.

Xu, S., Chou, W., and Dong, H. (2019). A Robust Indoor Localization System Integrating Visual Localization Aided by CNN-Based Image Retrieval with Monte Carlo Localization. Sensors, 19(2):249.

Zeng, Z., Zhang, J., Wang, X., Chen, Y., and Zhu, C. (2018). Place Recognition: An Overview of Vision Perspective. Applied Sciences, 8(11):2257.