
tion that such a complex fusion is viable for real-time applications, achieving approximately 10 FPS on consumer-grade hardware. Qualitative results confirmed the effectiveness of the proposed fusion logic, showing that the system robustly identifies navigable areas and obstacles in varied indoor environments. Furthermore, preliminary outdoor tests indicated promising generalization capabilities, with a sidewalk correctly interpreted as navigable terrain. The system also proved resilient in challenging scenarios, such as those involving significant partial object occlusion. These qualitative findings indicate that the proposed approach provides a comprehensive scene understanding.
As future work, the next crucial step is a rigorous quantitative validation of the perception system. To this end, a custom and diverse dataset is currently being developed. This dataset will consist of video sequences captured in multiple indoor scenarios under various lighting conditions, and will include a range of static and dynamic obstacles to test the system's limits. It will contain annotated ground truth for each of the system's outputs (navigable space, obstacles, and people), enabling an objective evaluation of performance. The evaluation metrics will be selected based on standard practices in the literature for each perception task; a minimal sketch of such metrics is given below. The goal of this quantitative validation is therefore to objectively assess the effectiveness of the proposed fusion architecture and to test the hypothesis that it contributes to more accurate scene understanding. Beyond this validation, future work aims to evaluate the system in more complex scenarios and to integrate the proposed perception layer into a navigation pipeline for autonomous robot operation. We also plan to improve system performance by leveraging multiple processor cores and NPUs (e.g., Intel Core Ultra 9 hardware), as well as GPUs (Nvidia Jetson and Nvidia Digits).
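As an illustration of the standard metrics mentioned above, the Python sketch below computes the intersection-over-union (IoU) of the navigable-space mask and precision/recall for obstacle and person detections via greedy box matching. The data layout, thresholds, and function names are assumptions made for this example only; they are not part of the implemented system or of a finalized evaluation protocol.

# Hedged sketch: standard evaluation metrics for the planned validation.
# Assumes binary masks as NumPy arrays and boxes as [x1, y1, x2, y2];
# names and thresholds are illustrative, not the system's actual API.
import numpy as np


def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union for the navigable-space segmentation mask."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)


def box_iou(box_a, box_b) -> float:
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def detection_precision_recall(pred_boxes, gt_boxes, iou_thr=0.5):
    """Greedy one-to-one matching of predicted and ground-truth boxes."""
    matched_gt = set()
    tp = 0
    for pb in pred_boxes:
        best_iou, best_j = 0.0, -1
        for j, gb in enumerate(gt_boxes):
            if j in matched_gt:
                continue
            iou = box_iou(pb, gb)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thr:
            tp += 1
            matched_gt.add(best_j)
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


if __name__ == "__main__":
    # Toy example: a 4x4 navigable-space mask and one obstacle box per frame.
    pred_mask = np.array([[1, 1, 0, 0]] * 4)
    gt_mask = np.array([[1, 0, 0, 0]] * 4)
    print("navigable-space IoU:", mask_iou(pred_mask, gt_mask))
    p, r = detection_precision_recall([[0, 0, 10, 10]], [[1, 1, 9, 9]])
    print("obstacle precision/recall:", p, r)

In practice, such per-frame scores would be aggregated over the annotated sequences and complemented by task-specific measures (e.g., mAP for detection), following the conventions of the respective benchmarks.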
DATA AVAILABILITY STATEMENT
The source code developed for this research is not yet
publicly available, but can be provided by the corre-
sponding author upon reasonable request. The code
is currently being prepared for public release and will
be made available in a dedicated repository.
ACKNOWLEDGEMENTS
This work was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior -
Brazil (CAPES) - Finance Code 001. The authors
also wish to thank the Institute of Mathematical and
Computer Sciences (ICMC/USP) and the Center of
Excellence in Artificial Intelligence (CEIA) for their
support.