dimensional performance comparison and analysis of
the different-scale models within the YOLOv8 series
itself—specifically for detecting the critical
"pedestrian" target on a challenging, large-scale
public autonomous driving dataset—remains an area
worthy of investigation. Such an analysis is crucial
for understanding the specific trade-offs among
models of different scales in terms of accuracy, speed,
resource consumption, and sensitivity to complex
driving environment factors.
Therefore, the core work of this study is to select
three representative models from the YOLOv8
series—small, medium, and large—and conduct a
unified training, comprehensive evaluation, and in-
depth comparative study focused on the "pedestrian"
class within the BDD100K dataset. This research
aims to reveal the specific differences and inherent
trade-offs in accuracy, efficiency, robustness, and
resource consumption among these models, thereby
providing insights for future targeted optimizations.
2 DATASET AND METHODS
2.1 BDD100K Dataset and Processing
This study utilizes the large-scale driving scene object
detection dataset, Berkeley DeepDrive 100K
(BDD100K) (Yu et al., 2020). This dataset is
renowned for its vast data volume and high degree of
scene diversity, making it an ideal choice for
evaluating the comprehensive performance of
autonomous driving perception algorithms. The
annotation information in BDD100K includes 10
main categories of traffic participants, such as
"pedestrian" and "car".
The experiment strictly adheres to the official data
partitioning of BDD100K, using its training set of
approximately 70,000 images for model training and
its validation set of about 10,000 images for
performance evaluation. To align with the study's core objective of pedestrian detection, the dataset was processed in two steps. First, the original annotations were converted to the standard YOLO format. Second, all images containing the "person" class, together with their corresponding pedestrian annotations, were extracted to form a single-class data subset for the pedestrian detection task.
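A minimal sketch of this conversion is given below. It assumes the BDD100K detection labels in their official per-split JSON format (e.g., det_train.json) and the dataset's native 1280×720 image resolution; the file paths are illustrative, and the category string should match the one actually used in the label files ("pedestrian" in the official release, "person" in some exports).

import json
from pathlib import Path

# BDD100K images are 1280x720; YOLO boxes are normalized to [0, 1].
IMG_W, IMG_H = 1280, 720
PEDESTRIAN = "pedestrian"  # adjust to match the category name in the label files

def convert_split(det_json: Path, out_dir: Path) -> None:
    """Write one YOLO-format .txt label file per image that contains
    at least one pedestrian box; every box maps to class id 0."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for frame in json.loads(det_json.read_text()):
        lines = []
        for label in frame.get("labels") or []:
            if label.get("category") != PEDESTRIAN or "box2d" not in label:
                continue
            b = label["box2d"]
            # Corner coordinates -> normalized center / width / height.
            xc = (b["x1"] + b["x2"]) / 2 / IMG_W
            yc = (b["y1"] + b["y2"]) / 2 / IMG_H
            w = (b["x2"] - b["x1"]) / IMG_W
            h = (b["y2"] - b["y1"]) / IMG_H
            lines.append(f"0 {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
        if lines:  # keep only images that contain pedestrians
            (out_dir / f"{Path(frame['name']).stem}.txt").write_text("\n".join(lines) + "\n")

convert_split(Path("labels/det_train.json"), Path("labels/yolo/train"))
convert_split(Path("labels/det_val.json"), Path("labels/yolo/val"))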
2.2 The YOLOv8 Model Family
This study takes the YOLOv8 model series (Jocher et al., 2023) as its research subject.
YOLOv8 inherits the single-stage, end-to-end detection philosophy of the YOLO algorithm family (Redmon et al., 2016), which fundamentally changed real-time detection, and the family has since continued to evolve through versions such as YOLOv7 and YOLOv9 (Wang, Bochkovskiy, & Liao, 2023; Wang, Yeh, & Liao, 2024). By introducing more efficient structural components and adopting an Anchor-Free detection head, a trend also seen in other modern detectors (Carion et al., 2020), YOLOv8 achieves clear gains in both accuracy and efficiency over its predecessors. Its typical three-part architecture of Backbone, Neck, and Head provides strong feature extraction, fusion, and prediction capabilities, a paradigm shared by other efficient detectors (Tan, Pang, & Le, 2020).
To comprehensively assess the impact of model
scale on pedestrian detection performance, this study
selected three representative, pre-defined standard
models from the YOLOv8 series: YOLOv8s (small),
YOLOv8m (medium), and YOLOv8l (large). All three models were obtained directly from the official repository, and their corresponding COCO pre-trained weights were used for transfer learning. No modifications were made to their standard network architectures during the experiments.
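With the Ultralytics API, loading these pre-trained checkpoints is a one-line operation per model; the sketch below illustrates this (checkpoint names follow the official repository, and weights are downloaded automatically on first use).

from ultralytics import YOLO

# Load the official COCO-pre-trained checkpoints for the three scales.
models = {scale: YOLO(f"yolov8{scale}.pt") for scale in ("s", "m", "l")}
for scale, model in models.items():
    n_params = sum(p.numel() for p in model.model.parameters())
    print(f"yolov8{scale}: {n_params / 1e6:.1f} M parameters")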
2.3 Experimental Setup
All model training and evaluation were conducted in
a unified experimental environment. The platform was an Apple M3 Max processor with its integrated GPU, accelerated via Metal Performance Shaders (MPS) under macOS. The key software environment included Python 3.10, PyTorch 2.8.0.dev, and Ultralytics YOLO 8.3.139.
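As a sanity check for reproducibility, the availability of the MPS backend can be verified with the standard PyTorch API before training:

import torch

# Confirm that the Metal Performance Shaders backend is usable on this machine.
assert torch.backends.mps.is_available(), "MPS backend not available"
device = torch.device("mps")
print(torch.__version__, device)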
During the training phase, all models were
initialized with COCO pre-trained weights and
trained for 10 epochs on the BDD100K pedestrian
subset. The optimizer used was SGD. The initial
learning rate and batch size were adapted for the
different model scales: YOLOv8s used an initial
learning rate of 0.001 and a batch size of 8; YOLOv8l
used an initial learning rate of 0.0005 and a batch size
of 4. The input image size for both training and validation was uniformly set to 640×640 pixels. For data augmentation, the standard methods provided by the YOLOv8 framework were applied, primarily Mosaic, random horizontal flipping, random scaling, random translation, and HSV color-space augmentation.
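A training invocation consistent with this configuration is sketched below for YOLOv8s (for YOLOv8l, batch and lr0 change to 4 and 0.0005). The dataset YAML path is hypothetical, and the augmentation arguments shown are the Ultralytics defaults named explicitly.

from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # COCO pre-trained weights
model.train(
    data="bdd100k_pedestrian.yaml",  # hypothetical single-class dataset config
    epochs=10,
    imgsz=640,
    batch=8,            # 4 for YOLOv8l
    optimizer="SGD",
    lr0=0.001,          # 0.0005 for YOLOv8l
    device="mps",       # Apple-silicon GPU via Metal Performance Shaders
    mosaic=1.0,         # Mosaic augmentation
    fliplr=0.5,         # random horizontal flip
    scale=0.5,          # random scaling
    translate=0.1,      # random translation
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # HSV color-space augmentation
)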