
account, the proposed signatures most frequently return, as the most similar image from the database, one whose position is closer to the query image, thus achieving a more refined localization.
6 CONCLUSIONS
The main objective of this work is to solve the task of place recognition for a mobile robot navigating in a known outdoor environment. The method used for this is image retrieval with equirectangular images as input. Image retrieval relies on representing each image by a descriptor that captures its significant features (the image signature).
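To make this pipeline concrete, the following minimal Python sketch (illustrative only; the function and variable names are not taken from the authors' implementation) retrieves the database image whose signature is closest to that of the query:

    import numpy as np

    def retrieve(query_sig, db_sigs):
        # query_sig: 1-D query signature (matrix-shaped signatures are
        # assumed flattened before comparison); db_sigs: one signature
        # per row, for the database images.
        dists = np.linalg.norm(db_sigs - query_sig, axis=1)
        # The index of the closest signature identifies the retrieved place.
        return int(np.argmin(dists))

The position associated with the retrieved image is then taken as the estimate of the robot's location.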
To this end, three types of image signatures are evaluated and compared for place recognition when a mobile robot follows a trajectory through a previously visited environment. All the implemented signatures combine semantic and visual information. The first one (BoSW) was proposed by Ouni et al. (2022), whereas the other two variants (BoSVW and BoSVW*) are proposed in this paper. The BoSW image signature is a matrix in which each row is the centroid of the visual feature descriptors belonging to the same semantic class; the number of rows equals the number of semantic classes and the number of columns equals the dimension of the local visual descriptor. In the case of BoSVW, each row is instead a frequency histogram of visual words.
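The difference between the two signatures can be sketched in Python as follows. This is a minimal illustration under the assumptions stated in the comments, not the authors' implementation, and all names are hypothetical:

    import numpy as np

    def bosw_signature(descs, labels, num_classes):
        # BoSW: row c is the centroid (mean) of the local descriptors
        # assigned to semantic class c.
        sig = np.zeros((num_classes, descs.shape[1]))
        for c in range(num_classes):
            mask = labels == c
            if mask.any():
                sig[c] = descs[mask].mean(axis=0)
        return sig

    def bosvw_signature(descs, labels, vocabularies):
        # BoSVW: row c is the frequency histogram of the visual words
        # of class c. vocabularies[c] is a (k, dim) array of visual
        # words; k is assumed equal for all classes, as in this work.
        rows = []
        for c, vocab in enumerate(vocabularies):
            hist = np.zeros(len(vocab))
            for d in descs[labels == c]:
                # Assign the descriptor to its nearest visual word
                # (Euclidean distance, the best-performing encoding).
                hist[np.argmin(np.linalg.norm(vocab - d, axis=1))] += 1
            if hist.sum() > 0:
                hist /= hist.sum()  # counts to frequencies (assumed)
            rows.append(hist)
        return np.stack(rows)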
The experiments show that, in terms of recall at one, BoSVW using the Euclidean distance during the image encoding step provides the highest value, whereas BoSW yields the lowest. Apart from this evaluation measure, the distance between the position of the query image and the position of the image retrieved with each signature is also analysed: for BoSVW, the Euclidean distance yields a smaller position error more often than the cosine distance.
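As an illustration of the first measure, a hedged Python sketch of recall at one is given below; a retrieval is counted as correct when the retrieved image lies within a distance threshold of the query position (the 5 m threshold is an assumed value, not taken from this work):

    import numpy as np

    def recall_at_1(query_pos, retrieved_pos, threshold=5.0):
        # query_pos, retrieved_pos: (N, 2) or (N, 3) ground-truth
        # positions of the queries and of their top-1 retrieved images.
        errors = np.linalg.norm(query_pos - retrieved_pos, axis=1)
        # Fraction of queries retrieved within `threshold` metres.
        return float((errors < threshold).mean())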
Therefore, it can be concluded that creating a bag of visual words for each semantic category, as proposed in this paper, rather than a single visual descriptor per category, improves the results on the place recognition problem. Additionally, representing each category by a frequency histogram yields a more accurate localization than using a vector that encodes distances.
In summary, the evaluations show that implementing the proposed signatures in an image retrieval algorithm for place recognition provides better results than the baseline BoSW signature.
In this work, only image signatures that merge semantic and visual information have been evaluated and compared to solve the place recognition task. Taking this into account, we propose as future work to extend this comparative evaluation to other algorithms, such as those that use only visual information. Along the same line, another possible direction is to study these signatures with other local features, extracted both with traditional methods and with Deep Learning methods. Finally, future work could also investigate whether the proposed signatures can be improved by finding the optimal number of clusters (vocabulary size) for each semantic category, rather than keeping this parameter fixed for all categories, as is done in this work.
ACKNOWLEDGEMENTS
This research work is part of a project funded by "AYUDAS A LA INVESTIGACIÓN 2025 DEL VICERRECTORADO DE INVESTIGACIÓN Y TRANSFERENCIA" of the Miguel Hernández University and part of the project PID2023-149575OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by FEDER, UE. It is also part of the projects CIPROM/2024/8 and CIAICO/2023/193, both funded by Generalitat Valenciana.
REFERENCES
Alfaro, M., Cabrera, J., Jiménez, L., Reinoso, O., and Payá, L. (2024). Triplet Neural Networks for the Visual Localization of Mobile Robots. In Proceedings of the 21st International Conference on Informatics in Control, Automation and Robotics, pages 125–132, Porto, Portugal. SCITEPRESS - Science and Technology Publications.
Bay, H., Tuytelaars, T., and Van Gool, L. (2006). SURF:
Speeded Up Robust Features. In Leonardis, A.,
Bischof, H., and Pinz, A., editors, Computer Vision
– ECCV 2006, pages 404–417, Berlin, Heidelberg.
Springer.
Cabrera, J. J., Santo, A., Gil, A., Viegas, C., and Payá, L. (2024). MinkUNeXt: Point Cloud-based Large-scale Place Recognition using 3D Sparse Convolutions. arXiv:2403.07593.
Dubey, S. R. (2022). A Decade Survey of Content Based
Image Retrieval Using Deep Learning. IEEE Trans-
actions on Circuits and Systems for Video Technology,
32(5):2687–2704.
Flores, M., Valiente, D., Peidró, A., Reinoso, O., and Payá, L. (2024). Generating a full spherical view by modeling the relation between two fisheye images. The Visual Computer, 40(10):7107–7132.