
et al., 2019).
The remainder of this paper is organized as follows. Section 2 reviews related work on pedestrian and vehicle detection, highlighting the challenges posed by urban environments. Section 3 presents our approach, explaining the YOLO11 architecture and the real-time detection methodology. Section 4 discusses experimental results, including key metrics such as mAP, precision, and recall. Finally, Section 5 summarizes our findings and suggests directions for future research, including low-light detection improvements and predictive analytics for traffic management.
2 LITERATURE REVIEW
The detection of pedestrians and vehicles has be-
come a critical area of research in urban traffic man-
agement, safety systems, and smart city frameworks.
Early approaches to object detection were primar-
ily based on manually crafted features and machine
learning models. Techniques such as the Viola-Jones
cascade classifier (Viola and Jones, 2001) and Histograms of Oriented Gradients (HOG) (Dalal and Triggs, 2005) formed the foundation for identifying objects in constrained settings. While effective
in controlled environments, these methods struggled
with challenges inherent to urban landscapes, includ-
ing occlusions, fluctuating lighting, and the diverse
appearance of pedestrians and vehicles.
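For illustration, a minimal sketch of this classical pipeline using OpenCV's built-in HOG pedestrian detector (the image path is a placeholder):

import cv2

# Load a street scene (placeholder path).
image = cv2.imread("street_scene.jpg")

# OpenCV ships a HOG descriptor with a pre-trained linear SVM
# for pedestrian detection, in the spirit of Dalal and Triggs.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# Sliding-window, multi-scale detection over the whole image.
boxes, weights = hog.detectMultiScale(image, winStride=(8, 8),
                                      padding=(8, 8), scale=1.05)

# Draw the detections.
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

Such detectors work best on upright, unoccluded pedestrians at moderate scales, which is precisely where cluttered urban scenes break the underlying assumptions.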
With the emergence of deep learning, the field
of object detection witnessed a revolutionary shift.
Convolutional Neural Networks (CNNs) automated
feature extraction and greatly enhanced the robust-
ness and accuracy of detection models. Region-based
CNNs, such as Faster R-CNN (Ren et al., 2015),
combined region proposal mechanisms with CNNs to
achieve highly accurate detections. However, these
models relied on computationally expensive multi-
stage pipelines, making them unsuitable for real-time
applications like live traffic monitoring and pedestrian
detection (Girshick, 2015).
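For comparison, a brief sketch of running such a two-stage detector through torchvision's pre-trained Faster R-CNN (the input tensor stands in for a real camera frame):

import torch
import torchvision

# Pre-trained two-stage detector: a region proposal network (RPN)
# feeds candidate boxes to classification and regression heads.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy 3-channel frame; a real deployment would feed video frames.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])[0]  # dict with boxes, labels, scores

# Each forward pass runs proposal generation plus per-region heads,
# which is the multi-stage cost that limits real-time use.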
To address the limitations of traditional region-
based models, single-shot detection frameworks, in-
cluding the Single Shot MultiBox Detector (SSD)
(Liu et al., 2016) and the early iterations of unified
detection systems like You Only Look Once (YOLO)
(Redmon et al., 2016), reframed object detection as a
single regression problem. This streamlined approach
processed the entire image in a single pass, achiev-
ing real-time performance while maintaining compet-
itive accuracy. Over successive iterations, advance-
ments such as anchor boxes, multi-scale detection,
and improved backbone networks allowed these sys-
tems to handle challenges like small-object detection
and scale variation more effectively (Bochkovskiy
et al., 2020).
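To make the single-regression formulation concrete, the sketch below decodes a YOLOv1-style output grid; the shapes follow the original formulation, and the random tensor stands in for a network forward pass:

import numpy as np

# An S x S grid where each cell regresses B boxes
# (x, y, w, h, confidence) plus C shared class scores.
S, B, C = 7, 2, 20
output = np.random.rand(S, S, B * 5 + C)  # placeholder for a forward pass

conf_threshold = 0.5
for row in range(S):
    for col in range(S):
        cell = output[row, col]
        class_probs = cell[B * 5:]
        for b in range(B):
            x, y, w, h, conf = cell[b * 5:b * 5 + 5]
            score = conf * class_probs.max()
            if score > conf_threshold:
                # (x, y) are offsets within the cell; (w, h) are
                # relative to the full image in the original design.
                print(row, col, b, round(float(score), 3))

Because every box emerges from one forward pass, inference cost is essentially constant per frame, which is what enables real-time operation.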
The latest evolution in this family of models, YOLO11, incorporates key innovations
including Cross-Stage Partial Networks (CSP), Spa-
tial Pyramid Pooling (SPP), and Soft Non-Maximum
Suppression (Soft-NMS). These enhancements sig-
nificantly improve detection capabilities, particularly
in dense and cluttered urban settings. By addressing
challenges such as occlusions, overlapping objects,
and fluctuating lighting conditions, this architecture
achieves an optimal balance of speed and accuracy.
Its ability to process live video feeds in real time po-
sitions it as a pivotal tool for smart city applications,
enhancing pedestrian safety, traffic management, and
urban planning (Liang et al., 2023).
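Of these, Soft-NMS is straightforward to sketch in isolation: rather than discarding boxes that overlap a higher-scoring detection, it decays their scores, which preserves true positives in crowded scenes. The following is a minimal Gaussian-decay variant, not the exact implementation inside the detector:

import numpy as np

def iou(a, b):
    # Boxes are (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    # Gaussian Soft-NMS: decay overlapping scores by exp(-iou^2 / sigma)
    # instead of suppressing them outright.
    scores = scores.astype(float).copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        best = max(idxs, key=lambda i: scores[i])
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            scores[i] *= np.exp(-(iou(boxes[best], boxes[i]) ** 2) / sigma)
        idxs = [i for i in idxs if scores[i] > score_thresh]
    return keep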
This progression from handcrafted features to deep detection architectures highlights the transformative impact of deep learning on pedestrian and vehicle detection. While recent advancements have mitigated many challenges, the in-
tegration of these systems into scalable and resource-
efficient frameworks remains a focus for future re-
search, aligning with the overarching goals of build-
ing smarter and safer urban environments.
3 PROPOSED WORK
Building on these advancements, this work proposes a real-time pedestrian and vehicle detec-
tion system optimized for deployment in urban en-
vironments. The model, based on YOLO11, ad-
dresses challenges such as occlusion, small-object de-
tection, and varying lighting conditions. Designed for
edge devices, the system ensures efficient operation
on resource-constrained hardware while maintaining
high accuracy (Li et al., 2024).
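As a minimal sketch of such a deployment, assuming the Ultralytics Python API and placeholder file names:

from ultralytics import YOLO

# Nano-scale weights suit resource-constrained edge hardware.
model = YOLO("yolo11n.pt")

# stream=True yields results frame by frame instead of buffering,
# keeping memory bounded on a live feed (placeholder source).
for result in model.predict(source="traffic_feed.mp4", stream=True,
                            conf=0.25,
                            classes=[0, 2, 3, 5, 7]):  # COCO: person, car,
                                                       # motorcycle, bus, truck
    for box in result.boxes:
        label = result.names[int(box.cls)]
        print(label, float(box.conf), box.xyxy.tolist())

Restricting inference to the pedestrian and vehicle classes avoids wasted post-processing on irrelevant categories.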
3.1 Architecture
The architecture of YOLO11 (Figure 1) is divided
into three key components: Backbone, Neck, and
Head, each designed to enhance the efficiency and ac-
curacy of pedestrian detection (Zheng et al., 2024).
The Backbone is responsible for feature extraction
from the input image, starting with an input size of
640 × 640 × 3. It uses successive convolutional layers
and C3K2 blocks with shortcut connections to capture
both local and global contextual information while
reducing the spatial dimensions of the feature maps.
These residual modules improve feature propagation
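A simplified PyTorch sketch of this shortcut pattern (an illustrative residual bottleneck, not the exact C3K2 implementation):

import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Conv -> BatchNorm -> SiLU, the basic backbone unit.
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ResidualBottleneck(nn.Module):
    # Two stacked convolutions with a shortcut connection, the
    # pattern the C3K2 blocks repeat to ease feature propagation.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(ConvBlock(channels, channels),
                                  ConvBlock(channels, channels))

    def forward(self, x):
        return x + self.body(x)

# A 640 x 640 x 3 input, as in the backbone described above.
x = torch.rand(1, 3, 640, 640)
stem = ConvBlock(3, 64, s=2)    # downsamples to 320 x 320
block = ResidualBottleneck(64)
print(block(stem(x)).shape)     # torch.Size([1, 64, 320, 320])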