High Precision Single Shot Object Detection in Automotive Scenarios
Soumya A¹, C Krishna Mohan¹ and Linga Reddy Cenkeramaddi²
¹Department of Computer Science and Engineering, Indian Institute of Technology, Hyderabad, India
²Department of Information and Communication Technology, University of Agder, Grimstad, 4879, Norway
Keywords:
Deep Learning, Convolutional Neural Network, Object Detection, Multi-Class Classification, Computer
Vision.
Abstract:
Object detection in low-light scenarios is a challenging task with numerous real-world applications, ranging
from surveillance and autonomous vehicles to augmented reality. However, due to reduced visibility and
limited information in the image data, carrying out object detection in low-lighting settings brings distinct
challenges. This paper introduces a novel object detection model designed to excel in low-light imaging con-
ditions, prioritizing inference speed and accuracy. The model leverages advanced deep-learning techniques
and is optimized for efficient inference on resource-constrained devices. The inclusion of cross-stage partial (CSP) connections is key to its effectiveness: they maintain low computational complexity, resulting in minimal training time. This model adapts seamlessly to low-light conditions through specialized feature
extraction modules, making it a valuable resource in challenging visual environments.
1 INTRODUCTION
Object detection using deep learning is a fundamen-
tal task in the realm of computer vision that involves
identifying and localizing objects of interest within
an image or video. Object detection holds a cru-
cial role in computer vision systems, finding appli-
cations across various fields such as video surveil-
lance (Gajjar et al., 2017), medical imaging (Adel
et al., 2010), (Li et al., 2019b), autonomous driving
(Li et al., 2019a), and robot navigation (Truong et al.,
2015), (Karaoguz and Jensfelt, 2019). The advent of
deep learning, particularly convolutional neural net-
works (CNNs), has led to significant advancements in
the accuracy and efficiency of object detection. This
literature review explores the key contributions and
trends in object detection using deep learning tech-
niques.
Two-stage detectors and one-stage detectors rep-
resent distinct methods for object detection. Two-
stage detectors, such as Faster R-CNN, employ a two-
step process for object detection. In the first stage,
they generate a set of region proposals using a re-
gion proposal network (RPN). Region proposals are
refined and classified in the second stage to obtain
the final detections. This two-stage architecture pro-
vides more accurate object localization and is well-
suited for complex scenes and small objects, but it
has a very high inference time and is computation-
ally expensive. In comparison, single-stage object
detectors perform region proposal and object detec-
tion in a single pass through the network. You Only Look Once (YOLO) (Redmon et al., 2016) introduced a single-stage, end-to-end object detection ap-
proach. It makes predictions for bounding boxes and
class probabilities in a single pass by analyzing the
entire image once. YOLO achieved real-time infer-
ence speed and demonstrated competitive accuracy.
Subsequent versions, such as YOLO v3 (Redmon and
Farhadi, 2018), YOLO v4 (Bochkovskiy et al., 2020),
and YOLO v6 (Li et al., 2022), further improved ac-
curacy and extended the model’s capabilities. One-
stage detectors are faster than two-stage detectors but
relatively less accurate.
The proposed model is designed to improve upon other advanced single-shot detectors, leveraging state-of-the-
art methods to enhance precision while reducing com-
putational complexity. Its architecture, comprising
a backbone, neck, and head, is designed for effi-
ciency and effectiveness. The backbone, with its
lower computational demands and cross-stage partial
(CSP) connections, ensures smoother gradient flow.
The neck excels at integrating features across diverse
scales, facilitating semantic and spatial information
sharing. Meanwhile, the head streamlines the predic-
tion of classifications and bounding box coordinates.
A key advantage lies in adopting a state-of-the-art
loss function from the literature, which accounts for
bounding box overlap and size similarity, resulting in
faster convergence and superior accuracy.
Our research paper introduces a novel architecture
for object detection, meticulously designed to opti-
mize efficiency and effectiveness, with the following
key contributions:
• A carefully crafted backbone network, drawing inspiration from Inception ResNetV2 and incorporating CSP connections for superior performance in image-related tasks.
• A multi-block approach featuring three distinct block types (A, B, and C), each tailored to extract features at different resolutions, enabling effective object detection across various scales and complexities.
• The inclusion of cross-stage partial (CSP) connections in all block types, ensuring smooth gradient flow during training and improving convergence and model performance.
• Multi-scale object detection capability, allowing our architecture to adapt dynamically to objects of varying sizes and spatial distributions.
2 RELATED WORKS
This section summarizes the recent advancements in
object detection. Several works have been proposed for object detection, and convolutional neural network (CNN) classifiers have been shown in (Coman et al., 2018) to outperform traditional machine learning techniques focused on feature extraction. In the context of object detection,
the faster region-based convolutional neural network
(Faster-RCNN) with InceptionV2 architecture is uti-
lized in (Galvez et al., 2018a) to identify five indi-
viduals and one quadrotor within the given image. In
(Galvez et al., 2018b), the authors have presented a
low-shot transfer detector using a deep architecture
and a controlled transfer learning framework to ad-
dress the challenges of limited training data in ob-
ject detection. An object detection approach was in-
troduced in (Xu et al., 2018) with a region selection
network for selecting regions from which to consider
features and a gating network to transform the fea-
ture maps. A novel object detection model in (Zeng
et al., 2013) is designed to train multi-stage classi-
fiers. In (Liu et al., 2016), a single shot detector (SSD) has been designed, incorporating convolutional
outputs for bounding boxes connected to several fea-
ture maps within the network. Enhancing small ob-
ject detection through contextual information fusion
within the faster R-CNN framework is presented in
(Fang and Shi, 2018). In (Beery et al., 2020), Context R-CNN is presented with an attention mechanism that accesses a camera-specific memory bank and improves object detection by incorporating contextual information from previous frames.
3 PROPOSED METHOD
The proposed architecture, comprising a backbone,
neck, and head, is carefully designed to optimize the
efficiency and effectiveness of object detection.
3.1 Backbone Network
The proposed architecture's backbone draws inspiration from the highly effective Inception ResNetV2 model, renowned for its exceptional performance in image-related tasks, while incorporating cross-stage partial (CSP) connections. The architectural de-
sign in Figure 1 consists of a meticulously designed
backbone that is crucial to the entire model. At its
core, the backbone comprises several important ele-
ments, each with a specific purpose. It all begins with
the stem, which serves as the initial feature extrac-
tor. Strategically, it reduces the spatial dimensions
of the input image by a factor of 8. This dimension
reduction proves instrumental in capturing essential
features while efficiently processing the input data.
Moving forward, Block A takes center stage, featur-
ing an inception module with shortcut connections.
What sets Block A apart is the incorporation of CSP
connections, which involve the deliberate splitting of
feature maps.
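To make the CSP split concrete, the following PyTorch sketch shows one possible formulation of a Block A unit. The branch structure, channel split ratio, and activation choice are assumptions made for illustration (the channel count is assumed divisible by four); the paper does not specify the exact layer configuration.

```python
import torch
import torch.nn as nn

class CSPBlockA(nn.Module):
    """Illustrative CSP-wrapped inception-style block (assumed configuration).

    Half of the channels bypass the block; the other half pass through an
    inception-style residual unit. The two halves are concatenated and fused
    with a 1x1 convolution, which keeps computation low while preserving
    gradient flow across the stage.
    """

    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        # Inception-style branches operating on the processed half.
        self.branch1 = nn.Conv2d(half, half // 2, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(half, half // 2, kernel_size=1),
            nn.Conv2d(half // 2, half // 2, kernel_size=3, padding=1),
        )
        # Fuse the branches back to `half` channels for the shortcut connection.
        self.merge = nn.Conv2d(half, half, kernel_size=1)
        # Transition after concatenating the processed and bypassed halves.
        self.transition = nn.Conv2d(2 * half, channels, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        part1, part2 = torch.chunk(x, 2, dim=1)       # CSP split
        branches = torch.cat([self.branch1(part2), self.branch3(part2)], dim=1)
        processed = part2 + self.merge(branches)       # residual shortcut
        return self.act(self.transition(torch.cat([part1, processed], dim=1)))
```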
This architecture includes ten Block A units, ex-
celling at extracting high-resolution features from the
input, which is crucial for subsequent stages. Follow-
ing Block A, the reduction Block A comes into play,
effectively reducing the spatial resolution by a stride
of 16. This strategic reduction enhances the receptive
field, enabling more comprehensive feature analysis
in subsequent stages.
Block B follows the same design as Block A but operates on mid-resolution feature maps. A total of 20 Block B
units contribute to extracting vital mid-level features.
Subsequently, reduction Block B follows suit, reduc-
ing spatial resolution with a larger stride of 32. This
strategic choice enables the model to detect objects of
varying sizes and scales efficiently. Block C emerges
as a critical component, specializing in refining low-
resolution features, ultimately optimizing the model
for detecting objects with fine details and spatial com-
plexity.
All three block types (A, B, and C) feature CSP
connections, ensuring smooth gradient flow during
training and facilitating improved convergence. The
architecture captures outputs at three distinct scales
after blocks A, B, and C to enable multi-scale object
detection. This enables the model to adapt dynami-
cally to diverse object sizes and spatial distributions.
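A minimal sketch of how these stages could be stacked to produce the three output scales (strides 8, 16, and 32) is shown below. Only the block counts for stages A and B (10 and 20) and the stride schedule follow the description above; the stem layout, channel widths, the Block C count, and the `block_*`/`reduction` constructors are placeholders.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Illustrative stride-8/16/32 backbone layout (assumed channel widths)."""

    def __init__(self, block_a, block_b, block_c, reduction):
        super().__init__()
        # Stem: initial feature extractor reducing spatial size by a factor of 8.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.stage_a = nn.Sequential(*[block_a(256) for _ in range(10)])
        self.reduce_a = reduction(256, 512)    # stride 8 -> 16
        self.stage_b = nn.Sequential(*[block_b(512) for _ in range(20)])
        self.reduce_b = reduction(512, 1024)   # stride 16 -> 32
        self.stage_c = nn.Sequential(*[block_c(1024) for _ in range(5)])  # count assumed

    def forward(self, x):
        p3 = self.stage_a(self.stem(x))        # high-resolution features (stride 8)
        p4 = self.stage_b(self.reduce_a(p3))   # mid-resolution features (stride 16)
        p5 = self.stage_c(self.reduce_b(p4))   # low-resolution features (stride 32)
        return p3, p4, p5                      # three scales passed to the neck
```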
3.2 Neck and Decoupled Head
The neck component in Figure 2 processes feature
maps from different scales, effectively combining
spatial richness and semantic enrichment. The fusion
of these features, through up-sampling and down-
sampling, ensures the sharing of crucial semantic
information and spatial resolution across all three
scales, enhancing the model’s comprehensiveness and
robustness in object detection. Following the neck
component, the decoupled head, featuring dedicated
convolutional layers, takes center stage. It is designed
explicitly to predict classification scores and bound-
ing box coordinates for each of the three scales. This
architecture employs three separate decoupled heads,
one for each scale, ensuring precise object detection
and accurate localization. This holistic design show-
cases a well-coordinated flow that optimizes the en-
tire model’s ability to detect and identify objects ef-
fectively across multiple scales and complexities.
3.3 Training
The proposed object detection model was trained using the stochastic gradient descent (SGD) optimizer with a learning rate of 0.01. The model was trained for 11 epochs with a batch size of 32. To aid convergence, the SGD optimizer was configured with a momentum of 0.9 and a weight decay of 0.0005, which helps control the magnitude of weight updates during training.
Mixed precision training was adopted to enhance
the training process further and accelerate compu-
tations. This method optimizes memory usage and
speeds up training by reducing computational costs
while maintaining adequate numerical precision.
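A condensed training step consistent with the settings above (SGD, learning rate 0.01, momentum 0.9, weight decay 0.0005, 11 epochs, batch size 32) might look as follows; it uses PyTorch's automatic mixed precision utilities as one possible realization, and `model`, `train_loader`, and `compute_loss` are placeholders.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train(model, train_loader, compute_loss, epochs=11, device="cuda"):
    """Sketch of the training loop: SGD with mixed precision.

    `model`, `train_loader`, and `compute_loss` stand in for the detector,
    the batch-size-32 data pipeline, and the combined VFL + CIoU loss.
    """
    model.to(device)
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0005
    )
    scaler = GradScaler()  # keeps fp16 gradients numerically stable

    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad(set_to_none=True)
            with autocast():                       # forward pass in mixed precision
                loss = compute_loss(model(images.to(device)), targets)
            scaler.scale(loss).backward()          # backpropagate the scaled loss
            scaler.step(optimizer)
            scaler.update()
```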
Using mixed precision training allowed for faster
convergence while maintaining the model’s perfor-
mance quality. The chosen hyperparameters, includ-
ing the learning rate and batch size, were selected
to balance the trade-off between model convergence
and optimizing computational efficiency. The model
follows an anchor point-based approach and incor-
porates task-assignment learning. Anchor points are
predefined reference points used during the object de-
tection process to facilitate the efficient localization
of objects. Task assignment learning optimizes the
assignment of bounding boxes to anchor points, fur-
ther enhancing the model’s overall performance.
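To make the anchor-point idea concrete, the sketch below generates one reference point per feature-map cell at each stride, which is the usual anchor-free formulation. The exact assignment rule used for task-assignment learning is not detailed in the text and is therefore omitted.

```python
import torch

def make_anchor_points(feature_sizes, strides):
    """Return (N, 2) anchor-point coordinates in input-image pixels.

    feature_sizes: [(h, w), ...] spatial sizes of the prediction maps.
    strides:       matching down-sampling factors, e.g. (8, 16, 32).
    """
    points = []
    for (h, w), stride in zip(feature_sizes, strides):
        ys = (torch.arange(h) + 0.5) * stride        # cell centres, not corners
        xs = (torch.arange(w) + 0.5) * stride
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        points.append(torch.stack([grid_x, grid_y], dim=-1).reshape(-1, 2))
    return torch.cat(points, dim=0)

# For a 416x416 input the three prediction maps are 52x52, 26x26, and 13x13.
anchors = make_anchor_points([(52, 52), (26, 26), (13, 13)], (8, 16, 32))
```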
4 DATASET AND IMAGE
COLLECTION
The image collection dataset available in IEEE Data-
port (Gao et al., 2022) is considered for the automo-
tive object detection scenario. This dataset comprises
camera images corresponding to six classes with var-
ied dimensions. The dataset contains 19,740 images
and labels. We randomly selected 15,777 images for
training and 1800 images for validation and testing.
The camera image of size 1440 × 1080 × 3 pixels is
resized to 416 × 416 × 3. There may be one or more
objects in one image, so the location of each object is
pre-annotated. All the objects in the dataset are cat-
egorized into six distinct groups: person, car, cyclist,
bus, truck, and motorbike. Although the dataset’s au-
thor mentions six classes, there appear to be only four classes (pedestrian, bicycle, car, and truck) with a meaningful number of samples. As a result, the dataset is highly imbalanced.
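A minimal preprocessing sketch for the resizing and random split described above is shown below; the file-handling details are assumptions, and bounding box coordinates would have to be rescaled by the same factors as the image.

```python
import random
import cv2  # OpenCV is one common choice for image I/O and resizing

def load_and_resize(path, size=(416, 416)):
    """Read a 1440x1080x3 camera image and resize it to 416x416x3.

    Bounding boxes must be rescaled by the same factors (416/1440, 416/1080).
    """
    image = cv2.imread(path)                       # BGR array of shape (1080, 1440, 3)
    return cv2.resize(image, size, interpolation=cv2.INTER_LINEAR)

def split_dataset(image_paths, n_train=15777, n_val=1800, n_test=1800, seed=0):
    """Randomly select train / validation / test subsets from the 19,740 images."""
    random.Random(seed).shuffle(image_paths)
    train = image_paths[:n_train]
    val = image_paths[n_train:n_train + n_val]
    test = image_paths[n_train + n_val:n_train + n_val + n_test]
    return train, val, test
```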
Annotating Dataset: Since the dataset consists of
a few inconsistent labels, the necessity for annota-
tion arises. Instead of manual annotation, we em-
ployed a pre-trained Faster R-CNN for object detec-
tion, and subsequently, we recorded the associated
bounding box coordinates, storing them in a CSV file.
This method allowed us to annotate the entire dataset
seamlessly.
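As a sketch of this annotation step, the snippet below runs a pre-trained Faster R-CNN from torchvision over the images and writes the detections to a CSV file. The score threshold and CSV column layout are assumptions; the paper does not state which Faster R-CNN implementation or settings were used.

```python
import csv
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pre-trained detector used only to generate pseudo-annotations (assumed choice;
# requires torchvision >= 0.13 for the `weights` argument).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def annotate(image_paths, out_csv, score_threshold=0.5):
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "x_min", "y_min", "x_max", "y_max", "label", "score"])
        with torch.no_grad():
            for path in image_paths:
                image = to_tensor(Image.open(path).convert("RGB"))
                output = model([image])[0]          # dict with boxes, labels, scores
                for box, label, score in zip(
                    output["boxes"], output["labels"], output["scores"]
                ):
                    if score >= score_threshold:    # keep only confident detections
                        writer.writerow([path, *box.tolist(), int(label), float(score)])
```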
5 EVALUATION OF THE
STATE-OF-THE-ART CNNs
In this section, we perform a comprehensive assess-
ment of various deep-learning benchmark models us-
ing the Automotive dataset (Gao et al., 2022). We
evaluate the YOLOv5n, YOLOv6n, YOLOv8n, and
RT-DETR models, all of which are designed for
single-stage object detection. The models are trained
to predict bounding boxes and class probabilities di-
rectly from the entire image, enabling real-time de-
tection. It is worth noting that our evaluation identi-
fied specific dataset-related challenges. Notably, the
dataset contains four unique classes, but the mAP cal-
culation was conducted for six classes, which can
potentially lead to inaccuracies. To provide a more comprehensive understanding of our model's effectiveness, especially for individual classes, we present the evaluation results in Table 1, including precision, recall, mAP50, and mAP50-95 metrics.

Figure 1: Proposed Architecture Backbone.
5.1 Evaluation Metrics
The performance of the proposed object detection model is typically assessed by measuring its accuracy using metrics such as Average Precision (AP) or Mean Average Precision (mAP), where mAP is obtained by averaging the AP scores across all object classes.
We employed the mean average precision (mAP)
as the evaluation metric to assess the performance of
the proposed object detection model. Mean average
precision (mAP) extends AP for multi-class or multi-
label scenarios commonly found in object detection.
AP is computed for each class or label, and then mAP
is calculated by averaging these AP values. In object
detection, for instance, we calculate AP for each ob-
ject class to measure how well the model identifies
objects of that class. mAP then offers an overall per-
formance score, considering the precision-recall per-
formance across all object classes. Like AP, higher
mAP signifies better detection accuracy across dif-
ferent classes. A high mAP indicates that a model has both a low false negative rate and a low false positive rate. We
provide a detailed breakdown of our model’s perfor-
mance shown in Table 2, which is an essential refer-
ence point critical for a deeper understanding of its
effectiveness across various object classes and detec-
tion scenarios.
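As an illustration of how mAP aggregates per-class AP, the snippet below uses the torchmetrics detection metric on a toy example; this sketches the metric definition rather than the exact evaluation script used in the paper.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# Predictions and ground truth follow the torchmetrics format: one dict per image.
preds = [{
    "boxes": torch.tensor([[50.0, 30.0, 200.0, 180.0]]),
    "scores": torch.tensor([0.88]),
    "labels": torch.tensor([1]),          # e.g. class index for "car"
}]
targets = [{
    "boxes": torch.tensor([[55.0, 28.0, 195.0, 175.0]]),
    "labels": torch.tensor([1]),
}]

metric = MeanAveragePrecision(class_metrics=True)  # also report per-class AP
metric.update(preds, targets)
results = metric.compute()
print(results["map_50"], results["map"])           # mAP50 and mAP50-95
```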
To train the model, we utilize two distinct loss
functions:
Classification Loss- Varifocal loss (Zhang et al.,
2021): The varifocal loss shown in Eq. 1 is utilized as
the classification loss function. This loss function ef-
fectively tackles the issue of class imbalance between
positive and negative samples, thereby enhancing the
classification performance.
$$
VFL(p, q) =
\begin{cases}
-q\,\bigl(q \log(p) + (1 - q)\log(1 - p)\bigr) & \text{if } q > 0 \\
-\alpha\, p^{\gamma} \log(1 - p) & \text{if } q = 0
\end{cases}
\qquad (1)
$$
where p is the predicted IoU-aware classification
score, and q is the target score. For a foreground
point belonging to its respective ground-truth class,
the target score q is determined as the intersection
over union (IoU) between the generated bounding box
and its associated ground truth. If the point does not
belong to its ground-truth class, the target score is set
to 0.
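A direct PyTorch transcription of Eq. 1 could look as follows. The values α = 0.75 and γ = 2.0 are the defaults from the VarifocalNet paper, and the sketch assumes p and q are element-wise tensors of the same shape.

```python
import torch

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-8):
    """Element-wise varifocal loss (Eq. 1).

    p: predicted IoU-aware classification scores in (0, 1).
    q: target scores (IoU with the ground truth for positives, 0 for negatives).
    """
    positive = q > 0
    # Positive samples: binary cross-entropy weighted by the target IoU score q.
    pos_loss = -q * (q * torch.log(p + eps) + (1 - q) * torch.log(1 - p + eps))
    # Negative samples: down-weighted by alpha * p**gamma to handle class imbalance.
    neg_loss = -alpha * p.pow(gamma) * torch.log(1 - p + eps)
    return torch.where(positive, pos_loss, neg_loss).sum()
```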
Bounding box loss - Complete IoU loss (CIOU)
(Zheng et al., 2020): The complete IoU (CIOU) loss
is employed for the bounding box regression task.
CIOU considers box overlap, the distance between box centers, and aspect-ratio consistency, leading to more accurate bounding box predictions that are essential for precise object localization.
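The CIOU loss can be written compactly as below for boxes in (x1, y1, x2, y2) format; recent torchvision releases also provide torchvision.ops.complete_box_iou_loss as a ready-made alternative. This is a sketch of the published formulation, not the authors' exact implementation.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Complete IoU loss for boxes of shape (N, 4) in (x1, y1, x2, y2) format."""
    # Intersection and union for the plain IoU term.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Normalised squared distance between box centres (penalises misplaced boxes).
    centre_p = (pred[:, :2] + pred[:, 2:]) / 2
    centre_t = (target[:, :2] + target[:, 2:]) / 2
    rho2 = (centre_p - centre_t).pow(2).sum(dim=1)
    enclose_lt = torch.min(pred[:, :2], target[:, :2])
    enclose_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = (enclose_rb - enclose_lt).pow(2).sum(dim=1) + eps  # enclosing-box diagonal^2

    # Aspect-ratio consistency term.
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (
        torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))
    ).pow(2)
    alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()
```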
5.2 Evaluation Results
Our model’s performance evaluation was conducted
on the Automotive dataset, comprising 15,777 images
for training, 1,800 for validation, and 1,800 for testing, providing a diverse representation of real-world scenarios.

Figure 2: Proposed Architecture Neck and Decoupled head.
Table 1: Performance for all the state-of-the-art models.
Model Precision Recall mAP50 mAP50-95
YOLO v8n (yolov8, 2023) 0.811 0.831 0.83 0.68
YOLO v6n (Li et al., 2022) 0.792 0.732 0.773 0.575
YOLO v5n (yolov5, 2023) 0.811 0.831 0.83 0.68
RT-DETR (Lv et al., 2023) 0.863 0.897 0.915 0.781
Proposed SSD Model 0.859 0.408 0.406 0.257
Table 2: Class-wise performance evaluation table for the proposed single shot detection model.
Class Images Precision Recall F1-Score mAP50 mAP50-95
Person 2378 0.94 0.45 0.6 0.45 0.30
Car 2378 0.85 0.4 0.544 0.4 0.28
Bicycle 2378 0.827 0.38 0.53 0.39 0.26
Truck 2378 0.85 0.44 0.56 0.4 0.25
Object detection metrics were employed to gauge our model's efficacy, including the
mean average precision (mAP), precision, recall, and
F1 score. It is worth noting that the dataset presents
certain challenges, primarily stemming from an im-
balanced class distribution and limited samples avail-
able for classes such as motorbike and bus, which
can introduce bias into the model’s performance eval-
uation. As a result, the model might exhibit rela-
tively strong performance for classes with larger sam-
ple sizes while encountering challenges in accurately
detecting and classifying instances of motorbikes and
buses. Upon testing, it was observed that the dataset
contains only four unique classes, but the mAP calcu-
lation was conducted for six classes. This discrep-
ancy in class count could lead to inaccuracies and
misleading results during evaluation. Consequently,
the mAP, though a widely used metric, might not ac-
curately depict the full extent of our model’s accuracy.
To address this limitation, we offer a comprehensive
breakdown of performance metrics for each class. By
highlighting precision, recall, and F1 scores for ev-
ery category, we shed light on our model’s specific
strengths and weaknesses. The resultant confusion
matrix is shown in Figure 3, and the precision-recall
(PR) curve is shown in Figure 4. Sample inferences
obtained from distant objects are shown in Figure 5,
from crowded areas are shown in Figure 6, and from
shadows and low light conditions are shown in Figure
7. These sample inferences provide visual insights, showcasing the model's robust performance in complex real-world scenarios. Despite the
challenges posed by the dataset, our model’s adapt-
ability and resilience, particularly in low-light condi-
tions, make it a promising solution for a wide range
of practical object detection tasks. It is important to
mention that the proposed method is trained on an unbalanced dataset, which is also an influential factor in the performance of real-world object detection tasks.
Figure 5: Sample Inferences: Predictions of distant objects.

A confusion matrix is depicted with a total of seven classes. We identified that the dataset is imbalanced: it does not contain any samples of motorbikes, and the bus class is not included in our annotations generated with Faster R-CNN because it has very few samples in the dataset. This lack of samples causes the variation in the resulting confusion matrix presented in Figure 3 and the results presented in Table 2. Also, the test set we used does not include the truck class, resulting in the PR curve in Figure 4 being plotted with only three classes. The proposed model
was tailored to improve inference speed on resource-
constrained devices by incorporating CSP connec-
tions. This strategic integration significantly enhances
the model’s ability to perform rapid inferences.
Figure 3: Confusion matrix of the proposed model with 7x7
class accuracies.
Figure 4: Precision-Recall Curve.
6 CONCLUSION
Leveraging convolutional neural networks (CNNs), the proposed high-precision single-shot object detection model balances detection precision and computational efficiency.
The proposed model demonstrates adaptability
by detecting objects across various scales and sizes,
paving the way for practical implementation.

Figure 6: Sample Inferences: Predictions at crowded areas and groups of objects.

Figure 7: Sample Inferences: Predictions in shadows and low light conditions.

We
evaluated prominent benchmark models, including
YOLOv5n, YOLOv6n, YOLOv8n, and RT-DETR
models. Notably, the proposed approach effectively
addresses challenges inherent to the dataset, such as
class discrepancies, imbalanced data distribution, and
the impact of low lighting conditions, ensuring robust
object detection even in less-than-ideal visibility sce-
narios.
The core strength of our object detection model
lies in its sophisticated architecture. The seamless co-
ordination among the backbone, neck, and decoupled
head components enables the detection of objects in
diverse and complex scenarios. The proposed model
was optimized for efficient inference on resource-constrained devices and achieves shorter training times by incorporating cross-stage partial (CSP) connections. In-
tegrating advanced loss functions like varifocal loss
and complete IoU loss for classification and bounding
box regression further enhances the model’s accuracy
and robustness.
ACKNOWLEDGMENT
This work was supported by the Indo-Norwegian
Collaboration in Autonomous Cyber-Physical Sys-
tems (INCAPS) project: 287918 of the Interna-
tional Partnerships for Excellent Education, Research
and Innovation (INTPART) program and the Low-
Altitude UAV Communication and Tracking (LU-
CAT) project: 280835 of the IKTPLUSS program
from the Research Council of Norway.
REFERENCES
Adel, M., Moussaoui, A., Rasigni, M., Bourennane, S., and
Hamami, L. (2010). Statistical-based tracking tech-
nique for linear structures detection: Application to
vessel segmentation in medical images. IEEE Signal
Processing Letters, 17(6):555–558.
Beery, S., Wu, G., Rathod, V., Votel, R., and Huang, J.
(2020). Context r-cnn: Long term temporal context
for per-camera object detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020).
Yolov4: Optimal speed and accuracy of object detec-
tion. arXiv preprint arXiv:2004.10934.
Coman, C. et al. (2018). A deep learning sar target clas-
sification experiment on mstar dataset. In 2018 19th
International Radar Symposium (IRS). IEEE.
Fang, P. and Shi, Y. (2018). Small object detection using
context information fusion in faster r-cnn. In 2018
IEEE 4th International Conference on Computer and
Communications (ICCC), pages 1537–1540. IEEE.
Gajjar, V., Gurnani, A., and Khandhediya, Y. (2017). Hu-
man detection and tracking for video surveillance: A
cognitive science approach. In Proceedings of the
IEEE International Conference on Computer Vision
(ICCV) Workshops.
Galvez, R. L., Bandala, A. A., Dadios, E. P., Vicerra, R.
R. P., and Maningo, J. M. Z. (2018a). Object detec-
tion using convolutional neural networks. In TENCON
2018 - 2018 IEEE Region 10 Conference.
Galvez, R. L., Bandala, A. A., Dadios, E. P., Vicerra, R.
R. P., and Maningo, J. M. Z. (2018b). Object detec-
tion using convolutional neural networks. In TENCON
2018-2018 IEEE Region 10 Conference, pages 2023–
2027. IEEE.
Gao, X., Luo, Y., Xing, G., Roy, S., and Liu, H. (2022).
Raw adc data of 77ghz mmwave radar for automotive
object detection.
Karaoguz, H. and Jensfelt, P. (2019). Object detection
approach for robot grasp detection. In 2019 In-
ternational Conference on Robotics and Automation
(ICRA). IEEE.
Li, B., Ouyang, W., Sheng, L., Zeng, X., and Wang, X.
(2019a). Gs3d: An efficient 3d object detection frame-
work for autonomous driving. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition.
Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z.,
Li, Q., Cheng, M., Nie, W., et al. (2022). Yolov6: A
single-stage object detection framework for industrial
applications. arXiv preprint arXiv:2209.02976.
Li, Z., Dong, M., Wen, S., Hu, X., Zhou, P., and Zeng,
Z. (2019b). Clu-cnns: Object detection for medical
images. Neurocomputing, 350.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. In Computer Vision–ECCV 2016:
14th European Conference, Amsterdam, The Nether-
lands, October 11–14, 2016, Proceedings, Part I 14,
pages 21–37. Springer.
Lv, W., Xu, S., Zhao, Y., Wang, G., Wei, J., Cui, C.,
Du, Y., Dang, Q., and Liu, Y. (2023). Detrs beat
yolos on real-time object detection. arXiv preprint
arXiv:2304.08069.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 779–
788.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. arXiv preprint arXiv:1804.02767.
Truong, X.-T., Yoong, V. N., and Ngo, T.-D. (2015). Rgb-
d and laser data fusion-based human detection and
tracking for socially aware robot navigation frame-
work. In 2015 IEEE International Conference on
Robotics and Biomimetics (ROBIO), pages 608–613.
IEEE.
Xu, H., Lv, X., Wang, X., Ren, Z., Bodla, N., and Chel-
lappa, R. (2018). Deep regionlets for object detection.
In Proceedings of the European conference on com-
puter vision (ECCV), pages 798–814.
yolov5, u. (2023). Ultralytics comprehensive guide. https://docs.ultralytics.com/yolov5/. Accessed: 2023-9-1.
yolov8, u. (2023). Comprehensive guide Ultralytics. https://docs.ultralytics.com/. Accessed: 2023-9-1.
Zeng, X., Ouyang, W., and Wang, X. (2013). Multi-stage
contextual deep learning for pedestrian detection. In
Proceedings of the IEEE International Conference on
Computer Vision (ICCV).
Zhang, H., Wang, Y., Dayoub, F., and Sunderhauf, N.
(2021). Varifocalnet: An iou-aware dense object de-
tector. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 8514–
8523.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D.
(2020). Distance-iou loss: Faster and better learning
for bounding box regression. In Proceedings of the
AAAI conference on artificial intelligence, volume 34.