
(Mur-Artal, Montiel and Tardos, 2015). It is a visual
SLAM method based on a monocular camera; it greatly
improves the efficiency of mobile robots and lays a
solid foundation for the subsequent development and
application of visual SLAM.
Cadena et al., in ‘Past, Present, and Future of
Simultaneous Localization and Mapping: Toward
the Robust-Perception Age’, state that SLAM
technology has made significant progress over the last
30 years, enabling robots to navigate autonomously
in unknown environments, and highlight the
robustness and scalability of SLAM in long-term
operation (Cadena, Carlone, Carrillo, Latif,
Scaramuzza, Neira, Reid and Leonard, 2016).
In addition, Zaffar et al., in ‘Sensors, SLAM and Long-
term Autonomy: A Review’, discuss the various
sensors used in SLAM systems and evaluate their
performance in long-term autonomous operation,
further illustrating the advantages of SLAM
technology in dynamic environments (Zaffar, Ehsan,
Stolkin and McDonald-Maier, 2018). In ‘Hybrid
Navigation Method for Multiple Robots Facing
Dynamic Obstacles’, the authors propose improved
reinforcement learning algorithms to enhance the
obstacle avoidance ability of robots in dynamic
environments, aiming to solve the navigation
problems caused by the accumulation of errors in
traditional methods (Wang, Liu and Li, 2021).
Although visual SLAM has been widely used for
mobile robot navigation, it still faces several challenges
in practical applications. First, the robustness
problem is particularly prominent. Under complex
conditions such as illumination changes, dynamic
environments, and occlusions, traditional visual
SLAM systems tend to lose tracking or generate
inaccurate maps. For example, low-light conditions
or significant illumination changes can make it
difficult to extract visual features, which in turn
degrades positioning accuracy. Optimizing the
robustness of visual SLAM is therefore a top priority.
In addition, the scalability of visual SLAM needs to
be optimized, because the scalability of existing
visual SLAM systems is limited as the size of the
environment increases. Building maps of large-scale
environments increases the computational load,
especially on resource-limited devices, making it
difficult to process or store large-scale environmental
data in real time.
The aim of this study is to propose a new framework
for optimized visual SLAM that improves the
robustness and scalability of SLAM systems. Deep
learning and image processing are combined, and
MATLAB and Python are used to implement the
framework.
2 METHOD
2.1 Data Sources
The data sources are mainly image-based and are
used for Convolutional Neural Network (CNN)
feature extraction as well as ORB feature matching.
Both tasks use the same set of gate images as
samples; the difference is that CNN feature extraction
uses ResNet50 as the feature extraction network.
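
As an illustrative sketch only (the pretrained weights, the preprocessing pipeline, and the file name gate_001.jpg are assumptions, not the exact setup of this study), ResNet50 can be turned into a feature extractor in Python as follows:

import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load a ResNet50 pretrained on ImageNet and drop the final
# classification layer so the network outputs feature vectors.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
extractor.eval()

# Standard ImageNet preprocessing for the input gate images.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# 'gate_001.jpg' is a placeholder name for one sample image.
img = preprocess(Image.open('gate_001.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    feature = extractor(img).flatten(1)  # shape: (1, 2048)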
In Proximal Policy Optimization (PPO), instead
of using an external real dataset, a simulation
environment is constructed to generate the ‘data’ for
training, i.e., the environmental data are generated.
Firstly, a simple SLAM robot navigation environment
is defined. Secondly, an observation space is defined
as a greyscale image of shape (64, 64, 1), i.e., a
single-channel image of 64×64 pixels. Finally, an
action space is defined as 3 discrete actions
representing moving forward, turning left, and
turning right, respectively.
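
The following Python sketch, written against the Gymnasium API, illustrates how such an environment could be declared; the class name and the stubbed dynamics are hypothetical, and a real environment would move the robot and render its camera view:

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class SimpleSlamNavEnv(gym.Env):
    """Hypothetical sketch of the simulated navigation environment."""

    def __init__(self):
        super().__init__()
        # Observation: single-channel 64x64 greyscale image.
        self.observation_space = spaces.Box(
            low=0, high=255, shape=(64, 64, 1), dtype=np.uint8)
        # Actions: 0 = forward, 1 = turn left, 2 = turn right.
        self.action_space = spaces.Discrete(3)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros((64, 64, 1), dtype=np.uint8)
        return obs, {}

    def step(self, action):
        # The dynamics are stubbed out here for illustration only.
        obs = np.zeros((64, 64, 1), dtype=np.uint8)
        reward, terminated, truncated = 0.0, False, False
        return obs, reward, terminated, truncated, {}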
Next is the dataset used for ORB loop-closure
detection. Since loop-closure detection checks
whether the robot has returned to a previously visited
position, it needs to process many image frames.
However, it is difficult to obtain a large number of
images of the same scene quickly, so this study
acquires a large number of images of a fixed place by
time-lapse photography. Time-lapse photography
refers to using a camera to record changes in the
same scene over time, such as recording the
environment of a street from sunrise to sunset. The
recorded video is then sampled evenly into 50 frames.
In this way the study obtains a large number of
images for loop-closure detection.
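
Such even sampling can be done, for example, with OpenCV in Python; the video file name below is a placeholder:

import cv2

def sample_frames(video_path, n_frames=50):
    """Evenly sample n_frames still images from a time-lapse video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n_frames):
        # Jump to evenly spaced positions across the whole video.
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# 'street_timelapse.mp4' is a placeholder file name.
loop_images = sample_frames('street_timelapse.mp4', n_frames=50)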
2.2 Method
2.2.1 CNN
CNN is a deep learning model specifically designed
for processing grid-structured data (e.g., images). It
is one of the core technologies for computer vision
tasks, performing feature extraction and
classification through convolutional layers, pooling
layers, and fully connected layers (LeCun, Boser,
Denker, Henderson, Howard and Jackel, 1989).
Firstly, CNN performs local connectivity and weight
sharing: it uses convolutional operations to extract
features from local regions, and significantly reduces
the number of model parameters by sharing weights
across spatial locations.
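
As a concrete illustration of why weight sharing reduces parameters, the following sketch (with illustrative layer sizes, not those of this study) compares a small convolutional layer with a fully connected layer over the same 64×64 input:

import torch.nn as nn

# A 3x3 convolution over a 1-channel 64x64 image: the same 3x3 kernel
# is reused at every spatial position, so the parameter count does not
# grow with the image size.
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)
conv_params = sum(p.numel() for p in conv.parameters())  # 16*(3*3*1)+16 = 160

# A fully connected layer mapping the flattened image to 16 units
# needs a separate weight for every input pixel.
fc = nn.Linear(64 * 64, 16)
fc_params = sum(p.numel() for p in fc.parameters())      # 16*4096+16 = 65552

print(conv_params, fc_params)  # 160 vs 65552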