Robust Scene Understanding for Mobile Robots Based on Vision and
Deep Learning Models

Leticia C. Pereira (https://orcid.org/0009-0000-1209-5690) and Fernando S. Osório (https://orcid.org/0000-0002-6620-2794)

Institute of Mathematical and Computer Sciences (ICMC), University of São Paulo (USP), São Carlos, SP, Brazil
Keywords:
Deep Learning, Computer Vision, Mobile Robots, Autonomous Robots.
Abstract:
This paper presents the architecture and results of AIVFusion, a real-time perception system designed to gen-
erate a rich, multi-layered understanding of an environment from a single monocular camera for autonomous
mobile robots. The system is designed to fuse information from different deep learning models to achieve
a comprehensive scene understanding. Our architecture integrates three open-source models to perform dis-
tinct perception tasks: object detection (YOLOv8), semantic segmentation (FastSAM), and monocular depth
estimation (Depth Anything V2). By fusing these outputs, the system generates a unified representation that
identifies the navigable area, detects nearby obstacles based on depth information, and semantically labels
those identified as “person”. The resulting perceptual information can then be leveraged by higher-level sys-
tems for tasks such as decision-making and safer navigation. The system’s viability is demonstrated through
qualitative tests in indoor environments. These results confirm its ability to operate in real-time (approximately
10 FPS) and to effectively fuse the perception layers, even in challenging scenarios involving partial object
occlusion.
1 INTRODUCTION
Autonomous mobile robots (AMRs) have advanced
rapidly, driven by progress in artificial intelligence,
robotics, and increasingly accessible hardware. Their
adoption has expanded across sectors due to rising
interest from individuals and companies. AMRs are
now widely used in applications such as domestic as-
sistance, delivery, warehousing, and logistics, proving
to be efficient and innovative solutions. These robots
navigate autonomously in both indoor and outdoor
environments, adapting to static and dynamic obsta-
cles like moving people (Liu et al., 2024; Niloy et al.,
2021). However, this work focuses specifically on in-
door scenarios.
Indoor environments present a significant chal-
lenge due to their unpredictability, often character-
ized by high flows of people (dynamic obstacles), and
static obstacles that can be moved around the envi-
ronment. In addition, narrow corridors and varying
lighting conditions make the environment even more
complex for robot navigation (Zhang et al., 2024). In
particular, varying lighting conditions and the physical
vibrations that arise when operating on different floor types are known to
influence the performance of deep learning systems
(Maruschak et al., 2025). Given this scenario and the
growth of AMR applications, safe navigation has be-
come an increasingly important topic.
It is important to highlight that, for a robot of
this type to navigate autonomously in an environment,
there are several development stages involved, such
as perception, mapping, localization, control, naviga-
tion, and decision-making, among others. One of the
most important stages is the perception layer, which
is responsible for understanding the surrounding envi-
ronment through sensors and making decisions based
on the collected information, thereby ensuring safe
navigation (Liu et al., 2024).
Robotic perception can be achieved through var-
ious sensors, including sonar, LiDAR, and cameras.
While sonar systems are often limited by their low
spatial resolution, LiDAR provides precise 3D mea-
surements, but at a prohibitive cost for many mobile
robot applications. In contrast, camera-based sys-
tems offer a compelling trade-off between cost and
the richness of the data they provide. Recent advances
in deep learning, particularly in monocular depth es-
timation (Masoumian et al., 2022), have further en-
hanced the viability of using a single camera to infer
three-dimensional scene geometry, making it a cost-
effective and powerful sensor for modern perception
systems.
Recent advances in deep learning have signifi-
cantly enhanced computer vision, offering powerful
tools for robotic perception. However, individual
tasks such as object detection, depth estimation, or
semantic segmentation, while powerful, provide an
incomplete view of the environment. Achieving ro-
bust autonomous navigation with low-cost sensors re-
quires the cohesive fusion of these modalities. To ad-
dress this challenge, this paper introduces AIVFusion,
a real-time perception system that integrates these ca-
pabilities to generate a rich, contextual understanding
of the scene from a single monocular camera.
The main contribution of this work is a hierar-
chical fusion architecture. Our system first identifies
the navigable area through segmentation to establish
a baseline for safety. It then augments this with a map
of general obstacles based on their physical proxim-
ity, derived from depth estimation. Finally, it applies
semantic labels (“person”) to detected obstacles, en-
abling more sophisticated decision-making, such as
human-aware navigation.
2 RELATED WORKS
Scene perception for autonomous robots has been
transformed by advances in deep neural networks,
which now form the basis of many vision systems. In
the field of object detection, for instance, the YOLO
models have established real-time performance as a
viable baseline in robotics (Vijayakumar and Vairava-
sundaram, 2024). Simultaneously, in monocular
depth estimation, large-scale architectures like Depth
Anything (Yang et al., 2024a) and lightweight models
such as FastDepth (Wofk et al., 2019) have proven
effective in inferring scene geometry. In seman-
tic scene understanding, segmentation models, from
classic Fully Convolutional Networks (FCNs) (Long
et al., 2015) to more recent approaches like Segment
Anything (SAM) (Kirillov et al., 2023) and its vari-
ants, such as FastSAM (Zhao et al., 2023), enable
the semantic classification of pixels into various cate-
gories within a scene. While these tools perform well
in their individual tasks, integrating information ex-
tracted by different modules still represents a chal-
lenge for perception. In this section, we review ex-
isting fusion strategies to contextualize our proposed
approach.
A line of research for collision avoidance focuses
on the fusion of object detection with depth estima-
tion. A representative example is the work of (Urban
and Caplier, 2021), who propose a system to predict
the Time to Collision (TTC). Their architecture first
employs an object detector (YOLOv3) to locate pre-
defined classes, such as “Person” and “Chair”, fol-
lowed by a depth estimation network (FastDepth) to
estimate the distance to those specific targets. How-
ever, this strategy presents two significant limitations.
First, its effectiveness is limited to a predefined set of
object classes, which may reduce the system’s ability
to respond to previously unseen or unlabeled obsta-
cles. Second, if an obstacle is missed, for instance,
due to partial occlusion, the entire risk assessment
process for that object may be compromised.
Another line of research in perception focuses on
the direct segmentation of the navigable space. In this
context, Dang and Bui (2023) demonstrate a tech-
nique that uses a binary segmentation network, specif-
ically trained to classify the environment into naviga-
ble and non-navigable areas, generating a Bird’s-Eye-
View (BEV) map for an A* path planner. Although
effective in scenarios similar to its training dataset,
training a segmentation network on a visually spe-
cific dataset presents significant generalization chal-
lenges. The approach becomes prone to errors when
exposed to new or altered environments, where, for
example, changes in lighting conditions or the pres-
ence of dynamic obstacles not seen during training
can lead to incorrect space segmentation. This, in
turn, may cause the planner to generate an erroneous
trajectory over an obstacle that was mistakenly classi-
fied as a navigable area.
Given these limitations, we argue that a robust
perception system capable of supporting autonomous
navigation and decision-making layers should inte-
grate multiple sources of environmental information
to enable a more holistic perception. Our approach
is based on a monocular camera and proposes the
fusion of three complementary perception modali-
ties: (1) navigable space segmentation, performed
by a fundamental segmentation model (FastSAM),
whose large-scale pretraining favors better general-
ization across different scenarios; (2) explicit depth
estimation, through a deep learning model, which en-
ables more robust and class-independent obstacle de-
tection; and (3) semantic object detection, which en-
ables differentiated behaviors, such as social naviga-
tion around people.
3 PROPOSED APPROACH
The proposed system AIVFusion is designed to gen-
erate a real-time, multi-layered scene understanding
from a single monocular camera. Its architecture is
based on the parallel execution of three specialized
perception modules, whose outputs are subsequently
fused in a hierarchical process. The selection of each
module was guided by a balance between state-of-the-
art performance and real-time processing capability,
as detailed below.
3.1 Architectural Components of the
Perception System
The proposed AIVFusion system aims to extract and
combine complementary information from the images of
a monocular camera. The proposed approach consid-
ers the fusion of three extracted information layers:
semantic segmentation of the image, depth estima-
tion of scene elements, and object detection
and classification.
Navigable Space Segmentation: The foundation
of our perception stack is the identification of the
navigable area. For this task, the primary require-
ment was a model capable of real-time inference.
We initially considered a state-of-the-art model like
Segment Anything 2 (SAM 2); however, our perfor-
mance analysis (detailed in Section 4.2) revealed a
low framerate of 2 FPS, making it unfeasible for
our application. This performance difference stems
from their distinct architectures: SAM 2 relies on a
computationally intensive Vision Transformer (ViT)
as its image encoder (Ravi et al., 2024). In con-
trast, FastSAM achieves its high speed by leverag-
ing a lightweight CNN-based detector (a YOLOv8-
seg model) to generate segmentation proposals (Zhao
et al., 2023). Consequently, we adopted FastSAM for
its optimal balance of segmentation quality and real-
time performance. Both models adopt the prompt-
able segmentation paradigm, meaning they can segment
images using different types of inputs, such as points,
bounding boxes, text, or masks. This flexibility al-
lows the model to segment specific objects in an im-
age based on the provided prompt type. We utilize a
point prompt, strategically placed in the lower portion
of the camera’s view, to robustly isolate the ground
plane.
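
To make this step concrete, the sketch below shows how a fixed point prompt can be issued with the FastSAM reference implementation (FastSAM and FastSAMPrompt classes). It is a minimal illustration rather than the production code: the weight file name, device string, prompt coordinates, and confidence/IoU values are assumptions, and input handling may differ slightly between FastSAM releases.

```python
import cv2
from PIL import Image
from fastsam import FastSAM, FastSAMPrompt  # reference FastSAM repository

DEVICE = "cuda"
model = FastSAM("FastSAM-s.pt")      # lightweight weights, as listed in Table 1

frame = cv2.imread("frame.png")      # one 640x480 camera frame (BGR)
h, w = frame.shape[:2]
rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# Run the YOLOv8-seg based proposal stage at a reduced input size (imgsz=256).
results = model(rgb, device=DEVICE, retina_masks=True,
                imgsz=256, conf=0.4, iou=0.9)

# Point prompt placed in the lower central region of the view, assumed to lie
# on the ground plane directly in front of the robot.
prompt = FastSAMPrompt(rgb, results, device=DEVICE)
ground_ann = prompt.point_prompt(points=[[w // 2, int(h * 0.95)]],
                                 pointlabel=[1])   # 1 = foreground point
# The returned annotation is used downstream as the binary ground (navigable) mask.
```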
Depth Estimation: To understand the scene’s
geometry and the proximity of objects, we employ
Depth Anything V2. This state-of-the-art monocular
depth estimation model can be easily deployed to
estimate depth in both images and videos. It is based
on a Vision Transformer (ViT) backbone and, like the
SAM model, performs zero-shot depth estimation, which
allows it to handle data and scenes not included in its
training set (improved generalization and better
performance even on unseen scenes) (Yang et al., 2024b). This com-
ponent is crucial for our class-agnostic obstacle de-
tection, as it allows the system to perceive any phys-
ical barrier based solely on its distance, without prior
knowledge of its category.
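
For reference, a minimal sketch of how the depth layer can be obtained with the official Depth Anything V2 code is given below, assuming the ViT-S ("vits") checkpoint and the reduced input size used in this work; the configuration values mirror the reference repository and the file names are illustrative.

```python
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2  # official repository module

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT-S configuration as documented in the reference repository.
model = DepthAnythingV2(encoder="vits", features=64,
                        out_channels=[48, 96, 192, 384])
model.load_state_dict(torch.load("depth_anything_v2_vits.pth",
                                 map_location="cpu"))
model = model.to(device).eval()

frame = cv2.imread("frame.png")                   # BGR camera frame
depth = model.infer_image(frame, input_size=256)  # relative depth map (H x W)

# Higher values correspond to closer points; normalize for grayscale display.
depth_vis = cv2.normalize(depth, None, 0, 255,
                          cv2.NORM_MINMAX).astype("uint8")
```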
Semantic Object Detection: To add a semantic
layer, we selected a model from the You Only Look
Once (YOLO) family, a series of real-time object de-
tectors based on deep learning convolutional neural
networks (CNNs), known for their efficiency in time-
critical applications. Our objective was to use the lat-
est version available. However, this presented a criti-
cal dependency conflict: YOLO11 (Khanam and Hus-
sain, 2024) requires a version of the ultralytics library
(v8.3.0+) that proved incompatible with the version
required by FastSAM (v8.0.120). To ensure the sta-
bility and integration of the complete pipeline, we se-
lected YOLOv8 (Jocher et al., 2023), as it represents
the most capable model within the compatible depen-
dency ecosystem. In our application, it is specifically
configured to identify the “person” class.
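
As an illustration of this configuration, the sketch below restricts a COCO-pretrained YOLOv8 model to the "person" class (COCO class id 0) using the Ultralytics API; the confidence threshold is an illustrative choice.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # COCO-pretrained nano model, as listed in Table 1

def detect_people(frame, conf=0.5):
    """Return person bounding boxes as (x1, y1, x2, y2) pixel tuples."""
    results = model(frame, classes=[0], conf=conf, verbose=False)  # 0 = "person"
    boxes = []
    for r in results:
        for box in r.boxes:
            x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
            boxes.append((x1, y1, x2, y2))
    return boxes
```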
3.2 The AIVFusion Proposed System
AIVFusion operates by analyzing each video frame
from a monocular camera to build a rich understand-
ing of the scene. For each captured frame, the
three perception models described above - object de-
tection (YOLOv8), ground segmentation (FastSAM),
and depth estimation (Depth Anything V2) - are ex-
ecuted to extract the base information. From this ex-
ecution, we obtain three primary data layers: (1) the
scene’s depth map, (2) the navigable area mask, and
(3) the location of people (if present). The core of
our contribution, detailed below, lies in how these
three layers of information are combined to generate a
single representation of the environment, highlighting
safe areas and potential risks.
As illustrated in the system architecture diagram
(Figure 1), the pipeline starts with the capture of the
frame by the camera (original input), which in turn is
processed in parallel by three deep neural networks,
each responsible for extracting a fundamental layer
of information from the scene:
Depth Estimation: The first layer generates a
dense depth map, a 2D matrix in which each pixel
encodes the relative distance of a point in the
scene from the camera. In our implementa-
tion, higher values indicate closer proximity. For
visualization purposes, this map is converted into
a grayscale image, where brighter pixels corre-
spond to nearer objects.
Ground Segmentation: The second layer uses the
FastSAM model to isolate the navigable area. Al-
though the model supports text prompts (such as
“ground”), it relies on additional language–image
models like CLIP (Radford et al., 2021), which
increases processing latency and reduces FPS. For
this reason, the point prompt was chosen, as it of-
fers a significantly lower computational cost. A
fixed point in the lower central region of the image
was defined as a reference, generating a binary
mask (ground mask) that defines the traversable
space, visually represented in blue.
Person Detection: The third layer employs the
YOLO model for a specific task: finding instances
of the “person” class. The output of this module
is a set of bounding boxes that define the positions
of the detected people in the image.
With the extracted information, the fusion process
is initiated. The first stage focuses on the geometry
of the environment to identify obstacles in a class-
independent manner.
First, a distance threshold is applied to the gener-
ated depth map to create the Initial Collision Mask
Calculation, which filters pixels considered to repre-
sent nearby obstacles. Since the depth values are rel-
ative, this threshold was empirically defined through
controlled experiments: objects were placed at differ-
ent distances from the camera to identify the approxi-
mate pixel values in the depth matrix corresponding to
up to 1 meter. Thus, any obstacle detected within this
range is considered relevant for immediate risk. The
result of this step, visualized in green, is a risk map
that, by considering only proximity, initially includes
the ground itself.
Next, this information is refined to generate the
Final Collision Mask. The ground mask, obtained
from segmentation, is logically subtracted from the
Initial Collision Mask Calculation. The result is an
accurate map that represents only the nearby obstacles
that are not part of the navigable area.
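
A minimal sketch of this two-step mask fusion is shown below. The threshold value is illustrative and stands in for the empirically calibrated cutoff of roughly 1 meter described above; the function assumes a relative depth map where larger values mean closer points and a binary ground mask from the segmentation module.

```python
import numpy as np

DEPTH_NEAR_THRESHOLD = 180.0  # illustrative value standing in for "closer than ~1 m"

def build_collision_masks(depth_map: np.ndarray, ground_mask: np.ndarray):
    """Return (initial_mask, final_mask) as boolean arrays of the frame size."""
    # Initial Collision Mask Calculation: every pixel closer than the threshold;
    # at this stage the nearby ground plane is still included.
    initial_mask = depth_map >= DEPTH_NEAR_THRESHOLD

    # Final Collision Mask: logically subtract the navigable area so that only
    # nearby pixels that are not part of the ground remain.
    final_mask = np.logical_and(initial_mask,
                                np.logical_not(ground_mask.astype(bool)))
    return initial_mask, final_mask
```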
The last stage of the pipeline is the fusion with
semantic data, where meaning is assigned to the de-
tected obstacles. The bounding boxes of people, de-
tected by YOLO, are overlaid with the Final Collision
Mask. When a person is identified within a risk area,
the system recognizes not only the presence of an ob-
stacle but also that it is a human, enabling higher-level
modules to make more appropriate decisions, such
as reducing speed, performing cautious deviations, or
even initiating audio interactions.
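
This semantic fusion step can be sketched as a simple overlap test between the detector's bounding boxes and the Final Collision Mask; the overlap fraction used here is an illustrative parameter.

```python
import numpy as np

def people_in_risk_zone(person_boxes, final_mask: np.ndarray, min_overlap=0.05):
    """Flag detected people whose bounding box overlaps the Final Collision Mask.

    person_boxes: list of (x1, y1, x2, y2) boxes from the object detector.
    final_mask:   boolean Final Collision Mask (same size as the frame).
    min_overlap:  fraction of box pixels that must be marked as nearby obstacle.
    """
    flagged = []
    for (x1, y1, x2, y2) in person_boxes:
        region = final_mask[y1:y2, x1:x2]
        if region.size > 0 and region.mean() >= min_overlap:
            flagged.append((x1, y1, x2, y2))  # a person lies inside a risk area
    return flagged
```

When the returned list is non-empty, higher-level modules can trigger the human-aware behaviors mentioned above, such as reducing speed or performing a cautious deviation.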
The output of our system is a rich and contextual-
ized scene understanding, composed of three crucial
layers of information: a map of the navigable area, a
real-time assessment of nearby obstacles, and the se-
mantic identification of people within risk zones. This
unified perception serves as a crucial input for any
application requiring spatial and semantic awareness.
While particularly relevant for the decision-making
and navigation stacks of mobile robots, it also extends
to domains where rich contextual scene understand-
ing is required.
Since the objective of our work was to evaluate
our fusion architecture’s ability to generate a rich,
multi-layered understanding of the environment in
real-time, rather than to establish a new state-of-
the-art in accuracy, we used the original pre-trained
weights for all models. The selected models are
well-established, with YOLOv8 trained on the COCO
dataset, FastSAM efficiently trained on a 2% subset
of the large SA-1B dataset, and Depth Anything V2
on a combination of high-quality synthetic data and
millions of pseudo-labeled real-world images.
4 RESULTS
4.1 Experimental Setup
The perception system was implemented in Python
3.8.10 using the PyTorch and Ultralytics libraries,
among others. All processing was performed on a
notebook with the following specifications: an Intel®
Core™ i7-7700HQ processor, 16 GB of DDR4 mem-
ory, and an NVIDIA GeForce GTX 1050Ti GPU with
4 GB of VRAM, running the Ubuntu 20.04.6 LTS op-
erating system. Images were captured in real time us-
ing a Microsoft LifeCam HD-3000 webcam at a res-
olution of 640×480 pixels. To simulate a real-world
application, the camera was mounted on top of the
chassis of a mobile robot (Pioneer 3-AT), positioned
approximately 30 cm above the ground.
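
For completeness, the capture configuration described above can be reproduced with OpenCV as in the following sketch (the device index and error handling are illustrative).

```python
import cv2

cap = cv2.VideoCapture(0)               # Microsoft LifeCam HD-3000 (USB index assumed 0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)  # requested capture resolution
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)

ok, frame = cap.read()                  # one 640x480 BGR frame fed to the pipeline
if not ok:
    raise RuntimeError("Camera frame could not be captured")
cap.release()
```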
4.2 Real-Time Performance Analysis
The viability of a perception system for mobile
robotics is directly related to its ability to operate in
real time. To select the models that compose our
architecture, a performance analysis was conducted
based on the frames per second (FPS) of the main
components.
To ensure real-time applicability, it was necessary
to optimize the inference speed of the most compu-
tationally intensive models. Given the limitations of
the available hardware, the strategy adopted was to re-
duce the input image resolution for the segmentation
and depth estimation modules. During inference, the
parameters were set to imgsz=256 for FastSAM and
input_size=256 for Depth Anything V2, as required by
each model. This process resizes the original frame so
that its shorter side measures 256 pixels while main-
taining the aspect ratio, before being fed into the mod-
els. Table 1 summarizes the approximate FPS and la-
tency results for each individual model and, finally,
for their integration.

Figure 1: AIVFusion Architecture Diagram (Source: Authors).
Table 1: Performance Comparison (FPS and Latency) of the Evaluated Models.

Model              | Pre-trained Weights | Input Resolution | FPS   | Latency (ms)
SAM 2              | sam2_hiera_tiny     | 640 × 480        | 2     | 500
FastSAM            | FastSAM-s           | 640 × 480        | 15–20 | 50–66
Depth Anything V2  | vits                | 640 × 480        | 20–24 | 41.7–50
YOLOv8             | yolov8n             | 640 × 480        | 25–30 | 33–40
Integrated System  | -                   | 640 × 480        | 10    | 100
The analysis in Table 1 shows that, although SAM
2 is a powerful model, its performance of approx-
imately 2 FPS makes it unfeasible for our applica-
tion. In contrast, the models YOLOv8, FastSAM, and
Depth Anything V2, running with optimized input
resolutions and smaller pre-trained weights, achieved
suitable frame rates individually. When integrated
into our fusion pipeline, the system reached 10 FPS
on average. It is important to highlight that this per-
formance was only possible after optimizing the vi-
sualization routines, particularly by excluding Fast-
SAM’s plot_to_result() function, which had signifi-
cantly reduced the FPS. However, this function is in-
tended solely for visualization purposes and is not re-
quired in the actual application, making its removal
feasible.
This resulting performance of 10 FPS is a key out-
come for the practical application of our system. It
translates to a total perception pipeline latency of ap-
proximately 100 ms (Table 1). This response time is
well within the soft real-time requirements for safe
navigation of a mobile robot at low to medium speeds,
as it allows the sense-plan-act cycle to update the
world model and react to obstacles effectively. There-
fore, this result validates the viability of our fusion
architecture, demonstrating that it is possible to com-
bine the rich, multi-modal capabilities of these mod-
els while maintaining the real-time performance es-
sential for autonomous robotic tasks.
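
The FPS and latency figures reported here can be obtained with a simple timing loop around the full per-frame pipeline, as sketched below; this is an illustrative measurement routine, not the exact benchmarking code used in the experiments.

```python
import time

def measure_fps(pipeline_step, frames, warmup=10):
    """Estimate mean latency (ms) and FPS of a per-frame perception pipeline.

    pipeline_step: callable that processes a single frame (full fusion step).
    frames:        iterable of captured frames.
    warmup:        iterations to discard (model loading, GPU initialization).
    """
    latencies = []
    for i, frame in enumerate(frames):
        start = time.perf_counter()
        pipeline_step(frame)
        if i >= warmup:
            latencies.append(time.perf_counter() - start)
    mean_latency = sum(latencies) / len(latencies)
    return 1.0 / mean_latency, mean_latency * 1000.0  # (FPS, latency in ms)
```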
4.3 Qualitative Analysis of Perception
and Fusion
To validate the effectiveness and generalization ca-
pabilities of the AIVFusion architecture, qualitative
tests were conducted across three distinct scenarios:
a controlled laboratory environment, a complex in-
door common area, and a challenging outdoor envi-
ronment.
The initial validation was performed in our labo-
ratory setting to confirm the core functionality of the
fusion pipeline (Figure 2). Figure 2(a) presents the original
scene captured by the camera. From this input, the
system first identifies the traversable space, as shown
in Figure 2(b), where the ground segmentation mask
(in blue) defines the navigable area. Next, the fusion
logic is applied to generate the final collision mask,
shown in Figure 2(c). At this stage, the system iso-
lates only the objects that represent a real proximity
risk, with the ground plane already excluded. The
unified output of the system is shown in Figure 2(d),
where all information layers are overlaid: the obsta-
cle (in green) is contextualized with its semantic la-
bel (“person”). It is worth noting that the person was
successfully detected despite significant partial occlu-
sion, demonstrating the system’s capability to gener-
ate coherent scene understanding in challenging situ-
ations.
To further assess the system’s performance in a
new indoor environment, tests were conducted in a
corporate common area (Figure 3). While the fusion
of ground segmentation and depth-based obstacles re-
mained robust, this scenario highlighted the limita-
tions of the underlying perception models. In one
case (Figure 3, top row), the object detector produced
a false positive, identifying three people where only
two were present. In another instance (Figure 3, bot-
tom row), the collision mask only partially covered
the person, suggesting that the empirically set depth
threshold may require fine-tuning or dynamic adapta-
tion for different scenes. These observations are valu-
able as they demonstrate the overall architecture’s re-
silience while pinpointing areas for future improve-
ments in the individual perception components.
Figure 2: The Visual Perception Pipeline for Scene Under-
standing (Source: Authors).
Finally, to test the system’s generalization limits,
a preliminary evaluation was conducted in an outdoor
residential environment (Figure 4). The segmenta-
tion module successfully recognized the sidewalk as a
navigable surface and, importantly, identified the ad-
jacent lawn as an unsafe area (obstacle), a correct and
crucial inference for ensuring robot safety. The full
pipeline also detected a person on the sidewalk as a
potential risk, showing promising potential for gener-
alization beyond structured indoor environments.
Figure 3: Qualitative Results in a Novel Indoor Environment (Source: Authors).

Figure 4: Preliminary Evaluation of System Generalization in an Outdoor Scenario (Source: Authors).

While our qualitative results are strong, our architectural design also considered the previously mentioned real-world challenges, such as varying lighting and vibrations from different floor types. Our system
addresses this by balancing the capabilities of its com-
ponent models: we balance the strong domain gener-
alization of the ViT-based Depth Anything V2, which
is crucial for adapting to visual variations caused by
environmental changes (Alijani et al., 2024), with the
high-speed performance of our selected CNN-based
models, FastSAM and YOLOv8. This combination
allows the system to produce a rich perceptual output
while maintaining real-time performance.
5 CONCLUSIONS AND FUTURE
WORKS
This paper introduced and qualitatively validated
AIVFusion, a hierarchical perception architecture that
fuses object detection, semantic segmentation, and
monocular depth estimation from a single camera.
This work’s primary contribution is the demonstra-
tion that such a complex fusion is viable for real-time
applications, achieving approximately 10 FPS on
consumer-grade hardware. Qualitative results confirmed
the effectiveness of the proposed fusion logic, showing
that the system robustly identifies navigable areas and
obstacles in varied indoor environ-
ments. Furthermore, preliminary outdoor tests in-
dicated promising generalization capabilities, where
a sidewalk was correctly interpreted as navigable ter-
rain. The system also proved resilient in challenging
scenarios, such as those involving significant partial
object occlusion. These findings validate that the pro-
posed approach provides a comprehensive scene un-
derstanding.
As future work, the next crucial step is rigorous
quantitative validation of the perception system. To
this end, a custom and diverse dataset is currently be-
ing developed. This dataset will consist of video se-
quences captured in multiple indoor scenarios, under
various lighting conditions, and will include a range
of static and dynamic obstacles to test the system’s
limits. It will contain annotated ground truth for each
of the system’s outputs - navigable space, obstacles,
and people - which will allow an objective evaluation
of performance. The evaluation metrics will be se-
lected based on standard practices in the literature for
each perception task. The goal of this quantitative val-
idation is therefore to objectively assess the effective-
ness of the proposed fusion architecture and to test the
hypothesis that it contributes to more accurate scene
understanding. Beyond this validation, future works
aim to evaluate the system in more complex scenarios
and to integrate the proposed perception layer into a
navigation pipeline for autonomous robot operation.
We also intend to improve system performance by
using multiple processor cores, NPUs (e.g., Intel
Core Ultra 9 hardware), and GPUs (Nvidia Jetson
and Nvidia Digits).
DATA AVAILABILITY STATEMENT
The source code developed for this research is not yet
publicly available, but can be provided by the corre-
sponding author upon reasonable request. The code
is currently being prepared for public release and will
be made available in a dedicated repository.
ACKNOWLEDGEMENTS
This work was financed in part by the Coordenação
de Aperfeiçoamento de Pessoal de Nível Superior -
Brazil (CAPES) - Finance Code 001. The authors
also wish to thank the Institute of Mathematical and
Computer Sciences (ICMC/USP) and the Center of
Excellence in Artificial Intelligence (CEIA) for their
support.
REFERENCES
Alijani, S., Fayyad, J., and Najjaran, H. (2024). Vision
transformers in domain adaptation and domain gen-
eralization: a study of robustness. Neural Computing
and Applications, 36(29):17979–18007.
Dang, T.-V. and Bui, N.-T. (2023). Obstacle avoidance
strategy for mobile robot based on monocular camera.
Electronics, 12(8):1932.
Jocher, G., Chaurasia, A., and Qiu, J. (2023). Ultralytics
yolov8.
Khanam, R. and Hussain, M. (2024). Yolov11: An
overview of the key architectural enhancements.
https://arxiv.org/abs/2410.17725.
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C.,
Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C.,
Lo, W.-Y., Dollár, P., and Girshick, R. (2023). Seg-
ment anything. https://arxiv.org/abs/2304.02643.
Liu, Y., Wang, S., Xie, Y., Xiong, T., and Wu, M. (2024). A
review of sensing technologies for indoor autonomous
mobile robots. Sensors, 24(4).
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-
volutional networks for semantic segmentation. In
Proceedings of the IEEE conference on computer vi-
sion and pattern recognition, pages 3431–3440.
Maruschak, P., Konovalenko, I., Osadtsa, Y., Medvid, V.,
Shovkun, O., and Baran, D. (2025). Surface defects
of rolled metal products recognised by a deep neural
network under different illuminance levels and low-
amplitude vibration. The International Journal of Ad-
vanced Manufacturing Technology, pages 1–16.
Masoumian, A., Rashwan, H. A., Cristiano, J., Asif, M. S.,
and Puig, D. (2022). Monocular depth estimation us-
ing deep learning: A review. Sensors, 22(14):5353.
Niloy, M. A. K., Shama, A., Chakrabortty, R. K., Ryan,
M. J., Badal, F. R., Tasneem, Z., Ahamed, M. H.,
Moyeen, S. I., Das, S. K., Ali, M. F., Islam, M. R.,
and Saha, D. K. (2021). Critical design and control
issues of indoor autonomous mobile robots: A review.
IEEE Access, 9:35338–35370.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G.,
Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark,
J., et al. (2021). Learning transferable visual models
from natural language supervision. In International
conference on machine learning, pages 8748–8763.
PMLR.
Ravi, N., Gabeur, V., Hu, Y.-T., Hu, R., Ryali, C., Ma,
T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L.,
Mintun, E., Pan, J., Alwala, K. V., Carion, N., Wu,
C.-Y., Girshick, R., Dollár, P., and Feichtenhofer, C.
(2024). Sam 2: Segment anything in images and
videos. arXiv preprint arXiv:2408.00714.
Urban, D. and Caplier, A. (2021). Time-and resource-
efficient time-to-collision forecasting for indoor
pedestrian obstacles avoidance. Journal of imaging,
7(4):61.
Vijayakumar, A. and Vairavasundaram, S. (2024). Yolo-
based object detection models: A review and its
applications. Multimedia Tools and Applications,
83(35):83535–83574.
Wofk, D., Ma, F., Yang, T.-J., Karaman, S., and Sze, V.
(2019). Fastdepth: Fast monocular depth estimation
on embedded systems. In 2019 International Con-
ference on Robotics and Automation (ICRA), pages
6101–6108. IEEE.
Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., and Zhao,
H. (2024a). Depth anything: Unleashing the power of
large-scale unlabeled data. In CVPR.
Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng,
J., and Zhao, H. (2024b). Depth anything v2.
https://arxiv.org/abs/2406.09414.
Zhang, Y., Liu, Y., Liu, S., Liang, W., Wang, C., and Wang,
K. (2024). Multimodal perception for indoor mo-
bile robotics navigation and safe manipulation. IEEE
Transactions on Cognitive and Developmental Sys-
tems.
Zhao, X., Ding, W., An, Y., Du, Y., Yu, T., Li, M., Tang, M.,
and Wang, J. (2023). Fast segment anything. arXiv
preprint arXiv:2306.12156.