
significantly reduces ID switching through its ReID
module, enhancing tracking accuracy.
This study's findings provide crucial insights for
the development of more efficient traffic monitoring
systems. By leveraging YOLOv11's powerful
detection capabilities and advanced tracking
algorithms, traffic monitoring systems can better
achieve traffic flow analysis, accident prevention, and
intelligent traffic management.
2 DATASET AND MODELS
2.1 Dataset
To assess the performance of YOLOv11 and the tracking algorithms, this paper collects data from various highway and urban road scenes. The dataset used in this paper contains 10,870 images of the target objects, covering four object categories: "car," "bus," "van," and "others." Each image includes at least two categories, which helps improve the model's accuracy.
This paper uses 80% of all images as the training set and 20% as the test set. For model training, the batch size is set to 4, the number of training epochs to 100, and the input image size to 640; multi-process data loading is disabled and image caching is enabled.
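The split and training settings above can be sketched as follows. This is an illustrative sketch, not the paper's actual training script: the dataset config name ("traffic.yaml") and weights file ("yolo11n.pt") are assumptions, and the Ultralytics call is shown only as a comment.

```python
n_images = 10_870              # total images in the dataset
n_train = int(n_images * 0.8)  # 80% training split -> 8,696 images
n_test = n_images - n_train    # 20% test split -> 2,174 images

# Training hyperparameters as described in the text: batch size 4,
# 100 epochs, 640px input, single-process loading, image caching.
train_cfg = {
    "data": "traffic.yaml",  # assumed dataset config name
    "epochs": 100,
    "imgsz": 640,
    "batch": 4,
    "workers": 0,   # disables multi-process data loading
    "cache": True,  # caches decoded images for faster epochs
}

# With the Ultralytics package installed, training would look like:
# from ultralytics import YOLO
# YOLO("yolo11n.pt").train(**train_cfg)
print(n_train, n_test)
```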
2.2 Model
The network architecture of YOLO11 (as shown
in Figure 1) fully reflects the balance of efficiency
and accuracy. Its core components are the backbone network (Backbone), the neck network (Neck), and the detection head (Head). First, the input image is processed by the Backbone through a series of convolutional layers (Conv) and C3k2 modules to extract image features (Alif, 2024). The C3k2 module,
as an efficient convolutional block, can effectively
extract multi-scale features while reducing
computational redundancy. This differs from the backbone designs of YOLOv5 (Zhang et al., 2022) and YOLOv8 (Talaat & ZainEldin, 2023): YOLOv5 uses CSPDarknet53 (Mahasin & Dewi, 2022) as its backbone, with the C3 module at its core, while YOLOv8 introduces the C2f module to make the network structure more lightweight. YOLO11 optimizes on this basis, replacing C2f with the C3k2 module to further improve computational efficiency.
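The progressive downsampling performed by the Backbone's convolutional stages can be illustrated with the standard convolution output-size formula. The specific layer configuration (3×3 kernels, stride 2, padding 1) is a typical YOLO-style assumption, not taken from the paper:

```python
def conv_out(size: int, k: int = 3, s: int = 2, p: int = 1) -> int:
    """Spatial size after a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1

# A 640x640 input is halved at each stride-2 stage, producing the
# multi-scale feature maps that modules like C3k2 operate on.
sizes = [640]
for _ in range(5):
    sizes.append(conv_out(sizes[-1]))
print(sizes)  # [640, 320, 160, 80, 40, 20]
```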
Next, the feature map enters the Convolutional Block with Spatial Attention (CBSA) module,
which integrates the spatial attention mechanism and
can dynamically adjust the importance of different
regions in the feature map, thereby enhancing the
feature representation capability. This design does not appear explicitly in YOLOv5 or YOLOv8;
YOLOv5 mainly relies on the Focus module for
feature extraction, while YOLOv8 introduces
depthwise separable convolution and dilated
convolution to optimize feature extraction efficiency.
YOLO11 further strengthens feature representation
capabilities through the CBSA module, giving it an
advantage in complex scenarios.
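The paper does not detail the CBSA module's internals, so the sketch below follows the common CBAM-style formulation of spatial attention: pool across channels, derive a per-pixel gate, and rescale the feature map. The fixed equal-weight mix stands in for the learned convolution of a real implementation:

```python
import numpy as np

def spatial_attention(x: np.ndarray) -> np.ndarray:
    """CBAM-style spatial attention sketch for a (C, H, W) feature map.

    Pools across the channel axis, mixes the two pooled maps with fixed
    equal weights (a stand-in for the learned conv), and rescales every
    spatial location by a sigmoid gate in (0, 1), so informative regions
    are emphasized and others suppressed.
    """
    avg = x.mean(axis=0)                  # (H, W) channel-average map
    mx = x.max(axis=0)                    # (H, W) channel-max map
    logits = 0.5 * avg + 0.5 * mx        # fixed mix; learned in practice
    gate = 1.0 / (1.0 + np.exp(-logits))  # per-pixel importance weight
    return x * gate                       # broadcast over channels

feat = np.random.randn(8, 40, 40)
out = spatial_attention(feat)
print(out.shape)  # (8, 40, 40)
```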
Then, the feature map enters the Neck, whose main task is to further process and fuse the features extracted by the Backbone so that targets of different scales can be detected more effectively. In the Neck stage, the feature
map is processed through multiple C3k2 modules and
convolutional layers, and the resolution is increased
through Upsample operations to restore detailed
information. In addition, the feature maps are fused
between different levels through Concat operations,
which can effectively combine low-level detail
information and high-level semantic information.
In comparison, YOLOv5 uses PANet (Hussain, 2024) for feature fusion, while YOLOv8 optimizes the PANet structure, removes the convolutional structure in the upsampling stage, and introduces the SPPF module for multi-scale feature fusion. YOLO11 adds the
C2PSA module behind the SPPF module to further
enhance feature extraction capabilities, making it
perform better in multi-scale target detection (Jooshin
et al., 2024).
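The Upsample-then-Concat fusion step described above can be sketched with nearest-neighbor upsampling; the channel counts and map sizes here are illustrative, not taken from the YOLO11 architecture:

```python
import numpy as np

def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Deep map: high-level semantics at low resolution.
# Shallow map: low-level detail at high resolution.
deep = np.random.randn(256, 20, 20)
shallow = np.random.randn(128, 40, 40)

# Upsample the deep map to match the shallow one, then Concat along
# the channel axis so subsequent modules see both levels of information.
fused = np.concatenate([upsample2x(deep), shallow], axis=0)
print(fused.shape)  # (384, 40, 40)
```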
Finally, the feature map processed by the Neck is sent to the Head (detection module), which is responsible for outputting the final detection results. The detection
module includes multiple parallel detection layers,
each responsible for detecting targets of different
scales to adapt to diverse targets in complex scenarios.
Each detection layer contains a C3k2 module to
process the feature map further and then outputs the
detection results through the Detect layer, including
the category and location of the target. By comparison, YOLOv5 adopts an anchor-based design, while YOLOv8 introduces an anchor-free design and uses a decoupled head to handle classification and regression tasks separately.
YOLO11 further optimizes the detection head,
introduces depthwise separable convolution to reduce
redundant calculations, and significantly improves
accuracy.
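The parallel multi-scale detection layers can be illustrated by counting predictions. Assuming the usual YOLO detection strides of 8, 16, and 32 (an assumption; the paper does not state them) and the dataset's four classes, an anchor-free head emits one prediction per feature-map cell:

```python
strides = (8, 16, 32)  # assumed per-layer downsampling factors
img = 640              # input resolution used in this paper
num_classes = 4        # "car", "bus", "van", "others"

# One detection layer per scale; each cell yields one prediction.
cells = [(img // s) ** 2 for s in strides]  # per-scale cell counts
total = sum(cells)                          # predictions per image

# Anchor-free output per cell: 4 box coordinates + class scores.
pred_width = 4 + num_classes
print(cells, total, pred_width)  # [6400, 1600, 400] 8400 8
```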
The entire network architecture is designed to
extract rich features through the Backbone, fuse
multi-scale features through the Neck, and perform final target detection through the Head.
ICDSE 2025 - The International Conference on Data Science and Engineering