The Evolution of Object Detection Algorithms Based on Deep

Learning

Qiyan Guo

Glasgow College, UESTC, Chengdu, Sichuan, 611731, China

Keywords: Object Detection, Deep Learning, Contrastive Analysis.

Abstract: Object detection technology is an important research direction in the field of computer vision, which is widely

used in automatic driving, security monitoring, medical image analysis and other fields. With the rapid

development of deep learning technology, object detection algorithms based on deep learning have made

significant breakthroughs in accuracy and efficiency, gradually replacing traditional object detection methods.

The object detection algorithm based on deep learning can automatically learn features in images and train

and optimize them in an end-to-end manner, significantly improving detection accuracy and robustness. The

progress of object detection technology is initially reviewed in this study, with particular attention paid to

deep learning-based object identification algorithms and conventional object detection techniques. Next, two

common deep learning object identification frameworks— YOLO-v3 and Faster R-CNN— are thoroughly

examined and contrasted. Experimental results show that YOLO-v3 has a slightly higher average accuracy

(mAP) on the COCO dataset than Faster R-CNN, but performs better in small target detection and dense

scenarios. Nevertheless, the Faster R-CNN still has some advantages in terms of overall accuracy.

1 INTRODUCTION

As one of the core tasks in the field of computer

vision, object detection aims to identify and locate

specific categories of objects from images or videos,

and its application scope covers many fields such as

automatic driving, security monitoring, medical

image analysis and so on. Deep learning-based object

recognition algorithms have progressively supplanted

conventional techniques and gained popularity in

recent years due to the quick advancement of deep

learning technology. Traditional object detection

methods, such as HOG and Haar, rely on hand-

designed feature extraction methods and classical

classifiers. Although the computational resource

requirements are low and the speed is fast, the

detection accuracy and robustness in complex scenes

are limited.

Deep learning, and convolutional neural networks

(CNN) in particular, has significantly accelerated the

advancement of object detection technologies. The

deep learning-based object identification algorithm

can automatically learn the feature representation

from a vast amount of labeled data, greatly increasing

https://orcid.org/0009-0005-5972-6222

the efficiency and accuracy of detection. Two-stage

and single-stage algorithms are the two primary

groups into which these algorithms fall based on the

various detection frameworks. Two-stage detection

frameworks, such as Faster R-CNN, which first

generates candidate regions, then performs

classification and bounding box regression, have

higher detection accuracy, but are slower. Single-

stage detection frameworks, such as the YOLO

family, predict target categories and bounding boxes

directly on the feature map, faster but with relatively

low accuracy.

Despite significant progress in object detection

algorithms based on deep learning, many challenges

remain. For example, in extreme weather and light

conditions, the performance of the algorithm may be

affected; For some specific types of small targets, the

detection effect still needs to be improved. How to

balance detection speed and accuracy, and how to

deal with scale change and target occlusion, still need

to be further studied.

The evolution of object detection technology will

be reviewed in this paper, along with the analysis of

deep learning-based object detection algorithms and

Guo, Q.

The Evolution of Object Detection Algorithms Based on Deep Learning.

DOI: 10.5220/0013681000004670

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 2nd International Conference on Data Science and Engineering (ICDSE 2025), pages 199-203

ISBN: 978-989-758-765-8

199

conventional object detection techniques. Two

common deep learning object detection frameworks,

Faster R-CNN and YOLO-v3, will be thoroughly

discussed and compared. Finally, the challenges and

future development direction of target detection

technology will be discussed in order to provide

reference and inspiration for related research.

2 TRADITIONAL OBJECT

DETECTION TECHNOLOGY

In contrast to deep learning model-based object

detection frameworks, traditional object detection

technologies rely more on hand-designed feature

extraction methods and classical classifiers (Guo,

2024). These techniques are more effective at

capturing information in an image, but are limited to

simple image features such as color, texture, and

edges to identify the category and location of the

object. These manual features, combined with

powerful machine learning classifiers, make it

possible to distinguish target and non-target regions

from feature vectors. According to the method of

extracting image features from the model, traditional

object detection algorithms can be divided into the

following categories: sliding method, region-based

method, and model-fitting-based method.

The Histogram of Oriented Gradients (HOG)

object detection algorithm, which was put forth by

Navneet Dalal and Bill Triggs in 2005, is the most

representative of these techniques. This algorithm's

core idea is to create a histogram based on gradient

characteristics, vote on the image's local gradient

amplitude and direction, and then concatenate the

local features to create the overall features (Dalal and

Triggs, 2005). It is robust to changes such as lighting

and local occlusion, and can capture shape and texture

information in images well, so it performs well in

tasks such as pedestrian detection. However, it is very

sensitive to the rotation and scale change of the target,

and because of its high computational complexity, the

real-time detection of the target is poor.

Haar target detection algorithm is also a very

representative traditional target detection algorithm.

It is a classical object detection algorithm based on

Haar features and a cascade classifier. It represents

the texture and edge information of the image by

calculating the gray difference of pixels in adjacent

rectangular areas in the image. Then, the AdaBoost

algorithm is used to select the best subset of Haar

features, and a cascade classifier is constructed,

which slides on the image with a fixed size window.

The cascade classifier is used to classify different

regions of the image. This algorithm has certain

robustness to illumination changes and target attitude,

and the detection speed is fast, which is suitable for

real-time applications, so it is widely used in face

detection, pedestrian detection and other fields (Viola

and Jones, 2011). However, it is sensitive to the scale

and attitude changes of the target, so the application

scenario is relatively limited.

These traditional target detection methods have

low requirements on computing resources and

relatively fast computing speed, so they still have

application value in scenarios with limited conditions

or high real-time requirements. However, they

require a lot of manual adjustment and optimization

in complex or changeable environments. With the

progress of technology, traditional target detection

algorithms are gradually being eliminated.

3 TYPICAL CONVOLUTIONAL

NEURAL NETWORK

ARCHITECTURE

In computer vision, object detection is a crucial

activity that seeks to locate or identify a particular

class of objects in a picture or video. By directly

learning feature representations from a large number

of tags using deep learning neural networks, object

detection algorithms based on deep learning have

increased the accuracy and efficiency of object

detection in recent years. The advancement of object

detection technology has benefited considerably from

the use of convolutional neural networks (CNN).

These algorithms can be divided into two broad

categories based on how the detection framework

operates and recognizes them. The first type is a two-

stage detection framework, which identifies potential

target areas, and then performs classification and

bounding box regression. The second type is a single-

stage detection framework, which recognizes both the

image category and the bounding box in one

operation (Liu, Ouyang, and Wang, 2020) (Liu,

Ouyang, and Wang, 2018).

3.1 Faster R-CNN

Faster R-CNN is a very representative two-stage

target detection algorithm that represents a substantial

improvement in the field of target identification.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian

Su proposed the third-generation model in the R-

CNN series in 2015(Ren, He, and Girshick, 2017),

ICDSE 2025 - The International Conference on Data Science and Engineering

200

This system is composed of a classifier, a region

proposal network (rpn), a convolutional layer, and a

roi pooling layer. The convolution and pooling layers

are utilized to extract the features of the image. The

pooling layer uses the size of the image to extract

features, and the convolution layer uses the

translation invariance of the image. In comparison to

earlier extraction networks, the Regional Proposal

Network (RPN) significantly increases the speed of

regional proposal by using a full convolutional

network that can immediately produce candidate

regions from feature maps. Its main concept is to

separate the object identification work into two

phases: the extraction of candidate regions and the

categorization of candidate regions. Firstly, selective

search is used to select some candidate areas in the

input image that may contain the target object. These

candidate regions are rectangular regions that are

used for subsequent feature extraction and

classification. Then feature extraction is carried out

for each candidate region, and a classifier is used to

classify them to determine whether they contain the

target object and the category of the target object.

Because RPN replaces the traditional Selective

Search in the Faster R-CNN, the time of regional

proposal is greatly shortened, making the operation of

regional proposal more efficient. At the same time,

Faster R-CNN combines RPN and Fast R-CNN into

one network. In actual training, the two can share

convolution features, which realizes end-to-end

training and reasoning, simplifies the training

process, and shortens the training time. However,

although the speed of regional proposal is

accelerated, its detection speed is still slow in real life

and can not meet the requirements of real-time

detection. At the same time, because it integrates

RPN and Fast R-CNN modules, the number of

parameters and calculation amount of the model are

large. At present, A wide range of target detection

tasks, including lesion analysis identification in

medical pictures and defect detection in industrial

production, also make extensive use of faster R-CNN.

3.2 YOLO - v3

The third iteration of the YOLO series, known as

Yolo-v3, was put forth by Joseph Redmon and Ali

Farhadi in 2018. It keeps the YOLO series' simplicity

and effectiveness while enhancing target recognition

speed and accuracy. The YOLO series is a common

single-stage object detection framework as its main

principle is to turn the object identification problem

into a regression problem. Its primary capabilities

include the ability to recognize targets of varying

sizes using multi-scale feature maps and to anticipate

targets directly on the feature map without region

proposal (Redmon, Divvala, Girshick, 2015).

Darknet-53, an effective convolutional neural

network that integrates multi-scale feature fusion and

residual linking for enhanced feature extraction

capabilities, forms the foundation of YOLO-v3. At

the same time, it will predict at 3 different scales, thus

improving the detection ability of small targets and

making full use of shallow and deep feature

information. After that, K-means clustering is used to

generate 9 Anchor Boxes for predicting target

boundary boxes. Then, the boundary Box

coordinates, object confidence and category

probability of each Anchor Box are predicted on the

feature map of each scale (Redmon, Farhadi, 2018).

As a typical single-stage object detection framework,

YOLO-v3 has a relatively fast detection speed and

can achieve real-time detection. However, compared

with the two-stage detection framework, its detection

accuracy will be significantly lower due to the defects

of the detection steps. At the same time, its multi-

scale prediction ability also improves the

performance of small target detection, but its

performance is still not satisfactory in the scenario of

small target density. At present, YOLO-v3 is widely

used in a variety of real-time target detection tasks,

such as automatic driving, security monitoring and

UAV target detection systems have YOLO-v3

framework applications.

3.3 Comparison and analysis of the two

frameworks

Table 1: Coco Dataset Validation mAP of two

Frameworks.

Framewor

Faste

R-CNN YOLO-v3

mAP 0.42 0.45

mAP of the two frameworks verified using the coco

dataset is as shown in table 1, that is, the average test

accuracy, which represents the average value of the

detection accuracy under different IoU thresholds

(Neha, Bhati, Shukla, 2024). According to the

experimental results, YOLOv3 achieves the best

performance on the COCO validation set, which may

be because it adopts more advanced network structure

and optimization methods, which makes the model

perform better in feature extraction and classification.

In this test, the YOLO-v3 was slightly more accurate

than the Faster R-CNN, but there are several versions

of YOLO-v3, such as the YOLO-v3 tiny, which is

typical of models that ensure sufficiently high test

speeds at the expense of some accuracy. So overall,

the accuracy of Faster R-CNN is still slightly higher.

The Evolution of Object Detection Algorithms Based on Deep Learning

201

Furthermore, the RPN, or regional proposal network,

is not very sensitive to the identification of small and

dense objects because the candidate region it

generates is typically huge. In contrast, Yolov3 works

much better. At the same time, the two stages of the

Faster R-CNN network require a large amount of

data, so the training of Faster R-CNN is relatively

difficult.

4 CHALLENGES AND FUTURE

DEVELOPMENT

The performance of the algorithm may be affected

when processing images under extreme weather and

illumination conditions, and the detection effect of

some specific types of small targets still needs to be

improved. Further research can be carried out to

improve and optimize the above limiting factors to

further improve the practicality and robustness of the

big data target detection algorithm. When working

with a high number of candidate regions, the R-CNN

series technique requires a lot of processing, which

leads to inadequate real-time performance (Zou,

Chen, Shi, 2023). The YOLO algorithm's accuracy

has to be increased while working with small, dense

targets. Therefore, how to balance the detection speed

and accuracy is still a problem to be solved. In

addition, how to deal with scale change and target

occlusion still to be further studied.

From conventional manual feature approaches to

deep learning-based CNN and Transformer

architectures, object identification technology has

advanced tremendously in the field of computer

vision. Accuracy and real-time detection have also

increased significantly. Nevertheless, challenges such

as small target detection, domain adaptation, and

adversarial attacks remain. In the future, multi-modal

fusion, self-supervised learning and open-world

object detection will provide new impetus for the

development of object detection technology.

5 CONCLUSIONS

This study compares the performance of two well-

known deep learning object detection frameworks,

Faster R-CNN and YOLO-v3, by examining the

evolution of object recognition technology from the

manual, feature-based approach to the deep learning-

based convolutional neural network (CNN)

architecture. To assist readers in understanding the

advancement of technology in this area, a detailed

summary of the object detection technology's

evolution from the old method to the deep learning

method is provided. By reviewing the traditional

methods such as HOG and Haar, we reveal their

advantages in computational efficiency and real-time

performance, and point out their limitations in

complex scenarios. Next, two common deep learning

object identification frameworks — YOLO-v3 and

Faster R-CNN — are thoroughly examined and

contrasted. The experimental results show that

YOLO-v3 has a slightly higher average accuracy

(mAP) on the COCO dataset than Faster R-CNN, and

performs better in small target detection and dense

scenes, while Faster R-CNN still has some

advantages in overall accuracy. This comparative

analysis provides a valuable reference for researchers

and helps to choose a suitable detection framework

for practical applications. This paper not only

summarizes the current status of object detection

technology, but also points out the future research

direction and challenges. For example, how to

improve the robustness of the algorithm under

extreme weather and lighting conditions, how to

optimize the detection accuracy of small target

detection and dense scenes, and how to balance the

detection speed and accuracy. These research

directions provide new ideas and impetus for the

further development of object detection technology.

In conclusion, this paper not only provides valuable

reference and inspiration for researchers in this field,

but also provides an important basis for technology

selection and optimization in practical application.

The findings of this study will support the

advancement of target detection technology and

enhance its viability and resilience across a range of

application domains.

REFERENCES

Dalal, N., Triggs, B., 2005. Histograms of oriented

gradients for human detection. In 2005 IEEE Computer

Society Conference on Computer Vision and Pattern

Recognition (CVPR'05), San Diego, CA, USA, pp.

886-893 vol. 1

GUO, H., 2024. Object detection: From traditional methods

to deep learning. Emerging Science and Technology,

3(2), pp.128-145.

Liu, L., Ouyang, W., Wang, X. et al., 2020. Deep Learning

for Generic Object Detection: A Survey. International

Journal of Computer Vision, 128, pp.261–318.

https://doi.org/10.1007/s11263-019-01247-4

Liu, L., Ouyang, W., Wang, X., Fieguth, P.W., Chen, J.,

Liu, X., & Pietikäinen, M., 2018. Deep Learning for

ICDSE 2025 - The International Conference on Data Science and Engineering

202

Generic Object Detection: A Survey. International

Journal of Computer Vision, 128, pp.261-318.

Neha, F., Bhati, D., Shukla, D.K., & Amiruzzaman, M.,

2024. From classical techniques to convolution-based

models: A review of object detection algorithms. ArXiv

Redmon, J., Divvala, S.K., Girshick, R.B., & Farhadi, A.,

2015. You Only Look Once: Unified, Real-Time Object

Detection. In 2016 IEEE Conference on Computer

Vision and Pattern Recognition (CVPR), pp.779-788.

Redmon, J., & Farhadi, A., 2018. YOLOv3: An

Incremental Improvement.

Ren, S., He, K., Girshick, R., Sun, J., 2017. Faster R-CNN:

Towards Real-Time Object Detection with Region

Proposal Networks. IEEE Transactions on Pattern

Analysis and Machine Intelligence, 39(6), pp.1137-

1149. doi: 10.1109/TPAMI.2016.2577031.

Viola, P., Jones, M., 2001. Rapid object detection using a

boosted cascade of simple features. In Proceedings of

the 2001 IEEE Computer Society Conference on

Computer Vision and Pattern Recognition (CVPR

2001)

Zou, Z., Chen, K., Shi, Z., Guo, Y., Ye, J., 2023. Object

Detection in 20 Years: A Survey. Proceedings of the

IEEE, 111(3), pp.257-276.

The Evolution of Object Detection Algorithms Based on Deep Learning

203