thermal images, these challenges still exist.
• Nature of Dynamic Landmarks: Due to the dynamic
nature of humans, the reliability of loop
closure might be affected if HOIs are transient.
While our focus is on interactions with static objects,
the duration and stability of the interaction
itself can influence the overall robustness of the
system.
• Sparsity of Features: While our method is effective
in environments with few static features,
there might be situations where HOIs themselves
are very rare, leading to an insufficient number of
landmarks. This becomes particularly evident in
environments with minimal human presence or infrequent
interactions with specific objects.
• Computational Cost: Detecting and tracking
HOIs, and subsequently performing feature
matching based on them, can be computationally
more intensive compared to traditional static
feature-point-based methods. Efficient algorithms
and optimization will be crucial to maintain
real-time performance.
• Dataset Diversity: Current evaluations might be
dependent on specific datasets. A comprehensive
quantitative evaluation across a wider range
of real-world scenarios, especially those involving
diverse human activities, object categories, and
complex environmental changes (e.g., severe occlusions,
extreme temperature variations), is necessary.
REFERENCES
Adlakha, D. et al. (2020). Deeptio: A deep thermal-inertial
odometry with visual hallucination. arXiv.
Ali, I., Peltonen, S., and Gotchev, A. (2022). Bi-directional
loop closure for visual slam. arXiv.
Antoun, M. and Asmar, D. (2023). Human object interac-
tion detection: Design and survey. Image and Vision
Computing, 130:104617.
Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., and Sivic,
J. (2018). Netvlad: Cnn architecture for weakly supervised
place recognition. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 40(4):1168–1181.
Chen, Z., Song, Z., Pang, Z.-J., Liu, Y., Chen, Z., Han,
X.-F., Zuo, Y.-W., and Shen, S.-J. (2024). Glc-slam:
Gaussian splatting slam with efficient loop closure.
Chum, O., Matas, J., and Kittler, J. (2003). Locally optimized
ransac. In Michaelis, B. and Krell, G., editors,
Pattern Recognition, pages 236–243, Berlin, Heidelberg.
Springer Berlin Heidelberg.
Cormack, G. V., Clarke, C. L. A., and Buettcher, S. (2009).
Reciprocal rank fusion. In Proceedings of the 32nd
international ACM SIGIR conference on Research and
development in information retrieval, pages 528–529.
ACM.
Cummins, M. and Newman, P. (2008). Fab-map: Proba-
bilistic localization and mapping in the space of ap-
pearance. The International Journal of Robotics Re-
search, 27(6):647–665.
Cummins, M. and Newman, P. (2010). Appearance-only
slam at large scale with fab-map 2.0. International
Journal of Robotics Research, 29(8):943–959.
DeTone, D., Malisiewicz, T., and Rabinovich, A. (2018).
Superpoint: Self-supervised interest point detection
and description. In 2018 IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops
(CVPRW), pages 337–33712.
Engel, J., Schöps, T., and Cremers, D. (2014). Lsd-slam:
Large-scale direct monocular slam. In European con-
ference on computer vision, pages 834–849. Springer.
Klein, G. and Murray, D. (2007). Parallel tracking and map-
ping for small ar workspaces. In Proceedings of IS-
MAR 2007. IEEE.
Lahiany, A. and Gal, O. (2025). Autoloop: Fast visual slam
fine-tuning through agentic curriculum learning.
Li, S., Ma, X., He, R., Shen, Y., Guan, H., Liu, H., and
Li, F. (2025). Wti-slam: a novel thermal infrared vi-
sual slam algorithm for weak texture thermal infrared
images. The Journal of Engineering.
Liang, J. T. Y. and Tanaka, K. (2024). Robot traversability
prediction: Towards third-person-view extension of
walk2map with photometric and physical constraints.
In IEEE/RSJ International Conference on Intelligent
Robots and Systems, IROS 2024, pages 11602–11609.
IEEE.
Lim, T. Y., Sun, B., Pollefeys, M., and Blum, H. (2025).
2go: Loop closure from two views.
Liso, L., Sandstrom, E., Yugay, V., Gool, L. V., and Oswald,
M. R. (2024). Loopy-slam: Dense neural slam with
loop closures.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. Int. J. Comput. Vis., 60(2):91–
110.
Montemerlo, M., Thrun, S., Koller, D., Wegbreit, B., et al.
(2002). Fastslam: A factored solution to the simulta-
neous localization and mapping problem. Aaai/iaai,
593598:593–598.
Mur-Artal, R., Montiel, J. M. M., and Tardos, J. D. (2015a).
Orb-slam: A versatile and accurate monocular slam
system. IEEE Transactions on Robotics, 31(5):1147–
1163.
Mur-Artal, R., Montiel, J. M. M., and Tardos, J. D. (2015b).
Orb-slam: A versatile and accurate monocular slam
system. IEEE transactions on robotics, 31(5):1147–
1163.
Qu, H., Tang, X., Liu, C.-T., Chen, X.-Y., Li, Y.-Z., Zhang,
W.-K., Zhang, H.-T., and Li, J.-H. (2024). Dk-slam:
Monocular visual slam with deep keypoint learning,
tracking and loop-closing.
Rublee, E., Rabaud, V., Konolige, K., and Bradski, G.
(2011). Orb: An efficient alternative to sift or surf. In