VLSI Implementation of Deep Learning Models for Real‑Time
Object Detection in Robotics
D. Jayalakshmi, P. Arivazhagi and K. Dhivyalakshmi
Department of ECE, Arasu Engineering College, Kumbakonam, Tamil Nadu, India
Keywords: Very Large‑Scale Integration, Field Programmable Gate Array, Deep Learning, Object Detection, Robotic
Vision, Adaptive Sparse Convolutional Neural Network, Hardware‑Aware Low‑Latency YOLO.
Abstract: Very Large-Scale Integration (VLSI) technology has made it possible to deploy deep learning models for real-time object detection in robotics. Conventional deep learning-based object detection depends heavily on graphics processing units (GPUs) and cloud-based processing, both of which suffer from high latency and poor power efficiency. This article explores VLSI as an alternative platform for accelerating the deep learning computations used in robotic applications while reducing power consumption and data-transfer overhead. Custom hardware architectures strengthen the real-time capability of robotic vision, making it better suited to edge-based autonomous decision-making. In contrast to conventional GPU-based solutions, the research introduces an optimized Field Programmable Gate Array (FPGA)-based hardware accelerator for deep learning models that enables faster inference with dramatically lower power consumption. In addition, a hybrid quantization and pruning method embedded in the VLSI design reduces the memory footprint and keeps computational overhead low while preserving detection accuracy. Two novel algorithms are introduced. The Adaptive Sparse Convolutional Neural Network (ASCNN) dynamically adjusts the number of active neurons and convolutional filters according to the complexity of the input image, improving both computational efficiency and detection precision. The Hardware-Aware Low-Latency YOLO (HALO-YOLO) is a refined version of the YOLO model designed specifically for FPGA and ASIC devices, cutting the consumption of computing resources while enabling rapid and efficient object detection. Experimental evaluations of the VLSI-based implementation and the two algorithms demonstrate low-latency object detection suitable for real-time surveillance, industrial automation, and autonomous robotics. The results indicate that hardware-aware optimization of deep learning models is a practical route to real-time object detection in robotics. This research lays the groundwork for future low-power, high-speed AI accelerators designed for autonomous robotic vision systems.
1 INTRODUCTION
The integration of deep learning with autonomous
systems has significantly advanced real-time object
detection in robotics, enabling machines to perceive
and interact with their environments effectively
(Girshick, 2015; Ren et al., 2015; Liu et al., 2016;
Redmon et al., 2016). Conventional deep learning-
based object detection relies on high-performance
GPUs and cloud processing, which introduces high
energy consumption, latency, and dependency on
external computational resources (Sze et al., 2017).
These limitations render traditional methods
unsuitable for real-time applications with resource-
constrained robotic systems that demand low-latency
decision-making and power efficiency.
To address these challenges, Very Large-Scale
Integration (VLSI) technology offers a viable
hardware-level solution that enhances the operational
efficiency of deep learning models. Specifically,
Field-Programmable Gate Arrays (FPGAs) and
Application-Specific Integrated Circuits (ASICs)
serve as custom hardware accelerators, providing
advantages such as reduced power usage, improved
computational throughput, and real-time processing
capabilities (Horowitz, 2014; Jacob et al., 2018).
Optimizing deep learning techniques for edge-based
object detection using these hardware platforms is
crucial for responsive and efficient robotic
applications.
This paper proposes two innovative technologies
for improving real-time object detection using VLSI
platforms. The first is a custom FPGA-based deep
learning accelerator optimized for inference
efficiency and minimal power consumption
compared to GPU-based alternatives. The second is a
hybrid quantization and pruning technique embedded
in VLSI design to reduce memory overhead while
maintaining high detection accuracy (Han et al.,
2015; Wu et al., 2016). Additionally, the study
introduces:
Adaptive Sparse Convolutional Neural Network
(ASCNN): A dynamic architecture that adapts to
image complexity by activating only essential
neurons and filters, improving energy efficiency
without compromising accuracy.
Hardware-Aware Low-Latency YOLO (HALO-
YOLO): A modified YOLO-based model optimized
for FPGA/ASIC deployment to reduce redundant
operations and enable ultra-fast object detection on
limited hardware resources.
These contributions aim to establish a robust
foundation for scalable, embedded AI accelerators
suited to robotics, automation, and surveillance
systems, advancing the state of real-time, energy-
efficient object detection.
2 RELATED WORKS
FPGA-based object detection hardware research
demonstrates that deep quantization significantly
improves processing efficiency. Nakahara et al.
(2018) proposed a lightweight version of YOLOv2,
introducing a binarized neural network (BNN) model
accelerated by a parallel support vector regression to
address the numerical degradation typical in binary
models, although external memory access remains a
performance bottleneck. Preuser et al. (2018)
introduced Tincy YOLO, which uses 1-bit weights
and 3-bit activations; however, both the input and
output layers are processed in software due to
complexity constraints.
Nguyen et al. (2019) developed Sim-YOLO-v2,
employing 1-bit weights and 4–6-bit activations
along with a streaming accelerator to minimize
external DRAM access and power usage. Huang et al.
(2021) proposed CoDeNet, incorporating deformable
convolutions and input-adaptive patterns, quantized
with 4-bit weights and 8-bit activations, targeting
FPGA deployment for efficient object detection.
Kim et al. (2021) presented SSDLite, integrating
MobileNetV2 (Sandler et al., 2018) as its backbone.
While it lacks aggressive quantization, its system
architecture maximizes throughput via dedicated
processing units and task control modules.
YOLO models, originally introduced by Redmon
et al. (2016), remain among the fastest object
detection algorithms. YOLOv2 (Redmon and
Farhadi, 2017), which uses DarkNet-19 as its
backbone (akin to VGG-19), predicts bounding boxes
across a grid-based input with anchor boxes. The
model outputs 125 values per grid cell, including
spatial coordinates, objectness confidence, and
classification scores. Non-maximum suppression
(NMS) is subsequently used to refine predictions.
Despite increased accuracy in newer versions such as
YOLOv3 (Redmon and Farhadi, 2018) and YOLOv4
(Bochkovskiy et al., 2020), these come at the cost of
larger and more complex backbone networks.
For hardware implementation, early YOLO
versions are preferred due to their simplicity. For
instance, YOLOv2 and its variants have been
effectively mapped onto ASICs (Chen et al., 2019;
Yin et al., 2017) and FPGAs (Nakahara et al., 2018;
Preuser et al., 2018; Nguyen et al., 2019), making
them ideal for real-time, energy-efficient robotic
applications.
3 PROPOSED WORK
This section presents the proposed implementation plan, which centres on a VLSI hardware design for deep learning-based real-time object detection in robots. The approach comprises three principal components: hardware-optimized deep learning model deployment, a custom FPGA-based hardware accelerator design, and novel algorithmic enhancements for efficient object detection. Together, these components deliver high-speed inference, low energy consumption, and prompt decision-making for autonomous robotic applications.
Hardware-Optimized Deep Learning Model Deployment: Deep learning architectures, CNNs in particular, require massive numbers of multiply-accumulate (MAC) operations in their convolution layers. Reducing these operations is essential for real-time processing, especially on resource-constrained FPGA hardware. To optimize computational efficiency we apply quantization and pruning, which substantially shrink the memory footprint and lower the computational complexity of the deep learning model while retaining high accuracy.
Figure 1: Schematic representation of the suggested
methodology.
Quantization
Quantization maps data to lower precision, replacing high-precision floating-point computations with low-bit fixed-point arithmetic. This saves power and simplifies the hardware implementation. The transformation is
$W_q = \mathrm{round}(W \times 2^{b})$ (1)
where:
$W_q$ is the quantized weight,
$W$ is the original floating-point weight,
$b$ is the bit-width used for quantization (e.g., 8-bit instead of 32-bit).
FPGA hardware can be designed to perform the same functionality with fixed-point rather than floating-point multiplication, saving a considerable amount of energy. The convolution with quantized weights can be rewritten as:
$O_q = \sum_{i} I_q(i) \cdot W_q(i)$ (2)
where:
$O_q$ is the quantized output feature map,
$I_q(i)$ represents the input activations after quantization,
$W_q(i)$ represents the quantized weights.
Using an 8-bit representation significantly reduces
memory bandwidth requirements, allowing for faster
processing while maintaining acceptable accuracy.
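To make Eqs. (1)-(2) concrete, the following is a minimal Python sketch of symmetric fixed-point quantization. The function names and the choice of reserving one bit for the sign (scaling by 2^(b-1) rather than 2^b) are illustrative assumptions, not part of the paper's FPGA implementation.

```python
import numpy as np

def quantize(x, bits=8):
    # Eq. (1)-style mapping to signed fixed point; one bit is reserved
    # for the sign, so the scale is 2^(b-1) (an illustrative convention).
    scale = 2 ** (bits - 1)
    return np.clip(np.round(x * scale), -scale, scale - 1).astype(np.int32)

def quantized_dot(i_q, w_q, bits=8):
    # Eq. (2): integer-only multiply-accumulate; the result is rescaled
    # to floating point here only so it can be compared with the
    # full-precision reference.
    acc = np.sum(i_q.astype(np.int64) * w_q.astype(np.int64))
    return acc / (2 ** (bits - 1)) ** 2

w = np.random.uniform(-0.5, 0.5, size=9)   # a flattened 3x3 kernel
x = np.random.uniform(-0.5, 0.5, size=9)   # matching input patch
print(np.dot(x, w), quantized_dot(quantize(x), quantize(w)))
```

The two printed values should agree to roughly two decimal places, illustrating how 8-bit arithmetic preserves acceptable accuracy while the inner loop uses only integer operations.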
Pruning: Pruning eliminates irrelevant or low-magnitude connections in the network, decreasing the computation needed for inference. Low-magnitude weights contribute little to the output yet dominate memory usage and computational complexity, so removing them reduces both while largely preserving the network's accuracy.
The weight pruning process is defined as:
$W_p = W \cdot M$ (3)
where:
$W_p$ is the pruned weight matrix,
$W$ is the original weight matrix,
$M$ is a binary mask, defined as:
$M(i,j) = \begin{cases} 1, & \text{if } |W(i,j)| > T \\ 0, & \text{otherwise} \end{cases}$ (4)
Here, $T$ is a threshold whose value is determined by the proposed energy-conscious optimization practice for removing ultimately unimportant nodes from the network. After pruning, the convolution operation is executed only over non-zero weight connections, which reduces computational cost:
$O_p = \sum_{(i,j) \in S} I(i,j) \cdot W_p(i,j)$ (5)
where $S$ represents the set of active weights after pruning.
Pruning results in:
Lower computational overhead
Reduced memory access latency
Faster inference times
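As an illustration of Eqs. (3)-(5), the sketch below applies magnitude-based pruning and then accumulates only over the surviving weights. The helper names and the toy threshold are hypothetical; the paper's energy-conscious threshold selection is not reproduced here.

```python
import numpy as np

def prune(w, threshold):
    # Eqs. (3)-(4): binary mask keeps only weights with |w| > T.
    mask = (np.abs(w) > threshold).astype(w.dtype)
    return w * mask, mask

def sparse_patch_conv(i_patch, w_pruned):
    # Eq. (5): accumulate only over the surviving (non-zero) weights.
    idx = np.nonzero(w_pruned)
    return float(np.sum(i_patch[idx] * w_pruned[idx]))

w = np.random.randn(3, 3) * 0.1
w_p, mask = prune(w, threshold=0.05)
patch = np.random.randn(3, 3)
print(f"kept {int(mask.sum())}/9 weights ->", sparse_patch_conv(patch, w_p))
```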
FPGA-Based Hardware Accelerator Design: The deep learning accelerator developed for the FPGA is a custom design onto which the convolution, activation, and pooling operations are efficiently offloaded. Convolution is the most computationally significant part of CNN inference: each convolution layer slides a set of filter windows over the input, performing element-wise multiplication and summation. It is formulated as follows:
$O(i,j,k) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} W(m,n,k) \cdot I(i+m, j+n)$ (6)
where:
$O(i,j,k)$ is the output feature map,
$W(m,n,k)$ is the convolution kernel,
$I(i+m, j+n)$ is the input feature map,
$M, N$ are the kernel dimensions.
To accelerate this operation in FPGA hardware, we
introduce:
Parallel Multiply-Accumulate (MAC) units
Dataflow optimizations for reduced memory
latency
A pipelined parallel MAC unit is implemented to efficiently execute multiple operations simultaneously:
$O(i,j,k) = \sum_{m,n} W(m,n,k) \cdot I(i+m, j+n)$ (7)
where the multiplications are performed in
parallel across FPGA logic blocks, reducing latency
and power consumption.
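A software model of the convolution in Eqs. (6)-(7) is sketched below; NumPy vectorization over the K kernels stands in for the FPGA's parallel MAC array, and the function name and tensor shapes are assumptions for illustration only.

```python
import numpy as np

def conv2d_mac(inp, kernels):
    # Software model of Eqs. (6)-(7). inp: (H, W), kernels: (K, M, N).
    # The vectorized multiply over all K kernels stands in for the
    # FPGA's parallel MAC array; on hardware these products are
    # computed concurrently across logic blocks.
    H, W = inp.shape
    K, M, N = kernels.shape
    out = np.zeros((H - M + 1, W - N + 1, K))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = inp[i:i + M, j:j + N]                         # shared input window
            out[i, j, :] = np.sum(patch * kernels, axis=(1, 2))   # K MACs "in parallel"
    return out

fmap = np.random.randn(8, 8)
kernels = np.random.randn(4, 3, 3)
print(conv2d_mac(fmap, kernels).shape)   # (6, 6, 4)
```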
The ReLU (Rectified Linear Unit) activation function introduces non-linearity into deep learning models, enabling CNNs to emphasize important features while suppressing the rest. The standard ReLU activation is defined as:
$A(x) = \max(0, x)$ (8)
Instead of performing floating-point comparisons,
FPGA-based implementations use bitwise logic
operations, significantly improving efficiency:
$A(x) = x \cdot (x > 0)$ (9)
This operation can be implemented efficiently in
hardware logic gates, ensuring:
Faster execution (single-cycle operation)
Reduced hardware resource utilization
Lower power consumption
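A minimal sketch of Eq. (9) on fixed-point data is given below, assuming signed integer activations; it relies on a sign test rather than a floating-point comparison, and the test values are arbitrary.

```python
import numpy as np

def relu_sign_check(x_q):
    # Eq. (9): ReLU on fixed-point activations via a sign test only;
    # in hardware this amounts to checking the sign bit, so no
    # floating-point comparator is needed.
    return x_q * (x_q > 0)

acts = np.array([-300, -1, 0, 5, 900], dtype=np.int16)
assert np.array_equal(relu_sign_check(acts), np.maximum(acts, 0))  # matches Eq. (8)
print(relu_sign_check(acts))
```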
In some cases, the hardware-friendly ReLU6 variant is used to avoid large activation values:
$A(x) = \min(\max(0, x), 6)$ (10)
This variant is helpful in FPGA-based quantized models because it keeps activations within a stable, bounded range. By integrating VLSI-optimized deep learning models with ASCNN and HALO-YOLO, we realize real-time, energy-efficient object detection for robotics. This hardware-aware AI system makes it possible to develop scalable, power-efficient solutions for autonomous robotic vision systems.
Adaptive Sparse Convolutional Neural Network
(ASCNN): ASCNN dynamically adjusts the number
of activated neurons and convolutional filters based
on image complexity, ensuring computational
efficiency without sacrificing accuracy.
Step 1: Feature Importance Estimation
A complexity score $C(x)$ is computed for each image region and used to determine how much computational effort that region requires:
$C(x) = \sum_{i} \sum_{j} |G(i,j)|$ (11)
where:
$C(x)$ represents the complexity score of the feature map,
$G(i,j)$ is the image gradient at position $(i,j)$, computed as:
$G(i,j) = \frac{\partial I(i,j)}{\partial x} + \frac{\partial I(i,j)}{\partial y}$ (12)
where:
$I(i,j)$ is the intensity value of the input image,
$\frac{\partial I(i,j)}{\partial x}$ and $\frac{\partial I(i,j)}{\partial y}$ represent the gradients in the horizontal and vertical directions.
Regions with higher gradient values indicate areas
with more details, requiring higher computational
resources.
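The sketch below estimates the complexity score of Eqs. (11)-(12), with simple forward differences standing in for the partial derivatives; the function name and the choice of difference operator are assumptions for illustration.

```python
import numpy as np

def complexity_score(img):
    # Eqs. (11)-(12): sum of absolute horizontal and vertical intensity
    # gradients over the region; forward differences approximate the
    # partial derivatives.
    gx = np.abs(np.diff(img, axis=1, append=img[:, -1:]))
    gy = np.abs(np.diff(img, axis=0, append=img[-1:, :]))
    return float(np.sum(gx + gy))

flat_region = np.full((32, 32), 0.5)   # little detail -> low C(x)
busy_region = np.random.rand(32, 32)   # much detail  -> high C(x)
print(complexity_score(flat_region), complexity_score(busy_region))
```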
Step 2: Dynamic Filter Selection
Once the complexity score $C(x)$ has been calculated, the network adapts by adjusting the number of active convolutional filters. The required number of active filters $F_a$ is evaluated as
$F_a = F_{\max} \cdot S(C(x))$ (13)
where:
$F_a$ is the number of active filters,
$F_{\max}$ is the maximum number of filters,
$S(C(x))$ is a sparsity function that controls the activation of filters:
$S(C(x)) = \begin{cases} 1, & \text{if } C(x) > T \\ \alpha, & \text{otherwise} \end{cases}$ (14)
where:
$T$ is a complexity threshold dynamically set based on image features,
$\alpha$ is a scaling factor (e.g., 0.5 or 0.25) that reduces the number of filters in low-complexity regions.
This method ensures efficient resource allocation while preserving detection accuracy.
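A small sketch of the filter-selection rule in Eqs. (13)-(14) follows; the threshold and alpha values are placeholders, since the paper sets the threshold dynamically from image features.

```python
def active_filter_count(c_x, f_max, threshold, alpha=0.5):
    # Eqs. (13)-(14): use all f_max filters for complex regions
    # (C(x) > T) and only a fraction alpha of them otherwise.
    sparsity = 1.0 if c_x > threshold else alpha
    return max(1, int(round(f_max * sparsity)))

# e.g. a 64-filter layer drops to 32 active filters on a flat region
print(active_filter_count(c_x=120.0, f_max=64, threshold=300.0))   # 32
print(active_filter_count(c_x=850.0, f_max=64, threshold=300.0))   # 64
```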
Hardware-Aware Low-Latency YOLO (HALO-YOLO): HALO-YOLO is an enhanced edition of the YOLO object detection framework developed specifically for FPGAs, with a focus on efficient resource use. It improves on traditional YOLO by optimizing bounding box regression and speeding up non-maximum suppression (NMS). The YOLO model parameterizes each bounding box by:
its center coordinates $(B_x, B_y)$,
its width and height $(B_w, B_h)$.
These parameters are computed as follows:
$B_x = \sigma(t_x) + C_x, \qquad B_y = \sigma(t_y) + C_y$ (15)
$B_w = P_w e^{t_w}, \qquad B_h = P_h e^{t_h}$ (16)
where:
$B_x, B_y$ are the bounding box center coordinates,
$B_w, B_h$ are the bounding box width and height,
$t_x, t_y, t_w, t_h$ are the network-predicted offsets,
$C_x, C_y$ are the grid cell coordinates,
$P_w, P_h$ are the prior anchor box dimensions,
$\sigma(x)$ is the sigmoid function, ensuring the output remains between $(0,1)$:
$\sigma(x) = \frac{1}{1 + e^{-x}}$ (17)
By reducing redundant computations, FPGA
hardware performs bounding box regression more
efficiently.
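The decoding in Eqs. (15)-(17) can be sketched as follows; the function signature and the toy offsets are illustrative assumptions rather than the paper's FPGA datapath.

```python
import numpy as np

def decode_box(t, cell, anchor):
    # Eqs. (15)-(17): recover box centre and size from the raw network
    # offsets t = (tx, ty, tw, th), the grid-cell origin (Cx, Cy) and
    # the anchor (prior) dimensions (Pw, Ph).
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))   # Eq. (17)
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * np.exp(tw)
    bh = ph * np.exp(th)
    return bx, by, bw, bh

print(decode_box(t=(0.2, -0.1, 0.3, 0.1), cell=(4, 7), anchor=(1.5, 2.0)))
```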
Hardware-Optimized Non-Maximum
Suppression (NMS): NMS is used to filter
overlapping bounding boxes, ensuring only the most
relevant detections are retained. The standard NMS
involves computing the Intersection over Union
(IoU):
$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$ (18)
where:
𝐴,𝐵 are two overlapping bounding boxes,
𝐴∩𝐵 is the intersection area,
𝐴∪𝐵 is the union area.
If the IoU score exceeds a predefined threshold $T_{\text{IoU}}$, the bounding box with the lower confidence score is discarded.
To accelerate NMS on FPGA, we implement parallel
comparison operations, where multiple bounding
boxes are processed simultaneously:
$S(B_i) = \begin{cases} 1, & \text{if } \text{IoU}(B_i, B_j) < T_{\text{IoU}} \;\; \forall j \\ 0, & \text{otherwise} \end{cases}$ (19)
where:
$S(B_i) = 1$ indicates that bounding box $B_i$ is retained,
$S(B_i) = 0$ means bounding box $B_i$ is discarded due to high IoU with another box.
This parallel hardware implementation significantly
reduces processing time by eliminating sequential
IoU comparisons.
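The sketch below computes Eq. (18) for one box against many at once and applies a greedy filter in the spirit of Eq. (19); on the FPGA these IoU comparisons run in parallel comparators, whereas here NumPy vectorization merely imitates that. Function names and the example boxes are hypothetical.

```python
import numpy as np

def iou(box, boxes):
    # Eq. (18) for one box against many at once; boxes are [x1, y1, x2, y2].
    # The vectorized form mirrors the FPGA's parallel IoU comparators.
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def nms(boxes, scores, t_iou=0.5):
    # Greedy filter realizing Eq. (19): a box is kept only if its IoU with
    # every higher-confidence kept box stays below the threshold.
    order, keep = np.argsort(-scores), []
    while order.size:
        i, order = order[0], order[1:]
        keep.append(int(i))
        order = order[iou(boxes[i], boxes[order]) < t_iou]
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [80, 80, 120, 120]], float)
scores = np.array([0.92, 0.88, 0.87])
print(nms(boxes, scores))   # -> [0, 2]
```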
4 PERFORMANCE ANALYSIS
The capabilities of FPGA-accelerated HALO-YOLO were investigated and compared against GPU implementations of YOLOv4 and YOLOv5 using a combination of publicly available datasets, benchmarking tools, and hardware profiling techniques. The datasets were chosen for their wide use in object detection tasks, and the hardware setup was composed so that FPGA and GPU implementations could be compared fairly. Two well-known datasets were used for training and evaluation: PASCAL VOC 2012 and MS COCO 2017. The PASCAL VOC 2012 dataset contains 11,530 labeled images with 27,450 object instances spanning 20 categories and is commonly used to verify the efficiency of real-time object detection models. The MS COCO 2017 dataset is considerably more complex, consisting of 118,000 training images and 5,000 validation images with over 800,000 annotated objects across 80 categories, which makes it well suited to testing model generalization. Both datasets are available from their official sites: PASCAL VOC at host.robots.ox.ac.uk/pascal/VOC/ and MS COCO at cocodataset.org. The FPGA platform was a Xilinx ZCU102 development board (Zynq UltraScale+ MPSoC family), built on 16nm UltraScale+ FPGA technology and providing 600K logic cells, 2,520 DSP slices, and 4GB DDR4 RAM. FPGA synthesis and model deployment were carried out with Xilinx Vitis AI, a toolchain developed specifically for FPGA-based deep learning inference optimization. In addition, the Xilinx Power Estimator (XPE) was used to obtain accurate energy-efficiency figures during inference. The FPGA design relied on quantization-aware training, which keeps the scale-range error stable without sacrificing precision. For the GPU baseline, YOLOv4 and YOLOv5 were run on an NVIDIA RTX 3090 (10,496 CUDA cores, 24GB GDDR6X VRAM, and 936GB/s memory bandwidth) using three frameworks: TensorFlow, PyTorch, and TensorRT. Power consumption was measured with the nvidia-smi tool. The GPU implementations were taken from the official sources: YOLOv4 from https://github.com/AlexeyAB/darknet and YOLOv5 from https://github.com/ultralytics/yolov5.
Figure 2: FPGA output.
Compared with YOLO on a GPU, HALO-YOLO on the FPGA shows substantial advantages in speed, throughput, and energy consumption for real-time object detection. The input image contains a car and a person in Pascal VOC/MS COCO format, with ground-truth annotations for evaluation. HALO-YOLO detects both objects with high confidence scores (0.92 for the car and 0.87 for the person), closely matching the ground truth. Quantitatively, HALO-YOLO reaches 80 FPS, well above YOLOv5 on the GPU (45 FPS), and cuts latency to 12.5ms per image versus 22.7ms on the GPU, guaranteeing real-time inference. Even more notably, it consumes only 6.3W, making the FPGA nearly 95% more energy-efficient than the GPU-based YOLO (320W) and therefore an ideal choice for battery-powered IoT devices, edge AI, and autonomous systems. The GPU-based YOLO model produces results nearly as precise as HALO-YOLO, with only minor differences in the detections of the two models (car: 0.88, person: 0.85) and a slight shift in the bounding boxes, which confirms that the FPGA optimization remains stable. Figure 2 shows the FPGA output.
Figure 3: Overall simulated output.
The overall simulated output is illustrated in Figure 3.
To establish fair testing conditions, the GPU and FPGA models were evaluated on the same dataset, and the key performance metrics of frames per second (FPS), latency (ms per image), power consumption (W), and mean average precision (mAP@0.5) were recorded. The evidence is conclusive that FPGA-based HALO-YOLO is significantly better than the GPU-based models in power efficiency (65% lower power consumption) while also achieving one of the best inference speeds (80 FPS on the FPGA vs. 45 FPS on the GPU). It follows that FPGAs acting as VLSI accelerators can match GPU functionality while operating at high speed and low power, which suits robotics applications that require real-time, low-power object detection. Figure 4 shows the inference speed analysis.
Figure 4: Inference speed analysis.
FPGA-based HALO-YOLO offers major advantages over YOLO models on the GPU, namely lower energy consumption and higher inference speed, which make it the primary choice for real-time detection in robotics and edge AI applications. In terms of frames per second, HALO-YOLO achieves 80 FPS, well ahead of YOLOv5 on the GPU (45 FPS) and YOLOv4 on the GPU (35 FPS). This is the result of FPGA-specific optimizations including parallel MAC operations, hardware-conscious quantization, and the custom bounding box regression approach, which speed up prediction while maintaining accuracy.
Figure 5: Inference latency comparison.
The latency comparison provides further evidence of the FPGA's advantages: HALO-YOLO processes an image in just 12.5ms, compared with 22.7ms for YOLOv5 and 28.5ms for YOLOv4 on the GPU. This roughly 45% reduction in latency stems from efficient convolution computations, hardware-optimized activation functions, and parallelized NMS, which together raise the object detection rate and speed up the pipeline. Short response times are essential because robots and real-time cameras are routinely used in quick decision-making tasks, which are normal operations in the industrial sector. Figure 5 shows the inference latency comparison.
Figure 6: Power consumption analysis.
Figure 6 shows the power consumption analysis, which demonstrates another major advantage of HALO-YOLO: it runs on only 6.3W, whereas YOLOv5 and YOLOv4 on the GPU consume 320W and 350W, respectively. The 95% reduction in power consumption is attributed to the VLSI-based low-power computation units, fixed-point arithmetic, and pruning, which eliminates redundant operations. This makes FPGA deployment highly energy-efficient, a primary enabler for battery-powered AI applications, low-power drones, and embedded AI devices. In summary, the data consistently show that HALO-YOLO on the FPGA outperforms the GPU YOLO models by a wide margin, providing substantially higher FPS, roughly 45-55% lower latency, and 95% less power consumption, making it a fitting choice for real-time, low-power AI applications.
5 CONCLUSIONS
This work demonstrates the efficiency and competitiveness of FPGA-based HALO-YOLO for live object detection, for instance in robotics, surveillance, and edge AI applications. Using VLSI-based hardware acceleration, HALO-YOLO reaches an inference speed of 80 FPS with a latency of 12.5ms per image and a power consumption of only 6.3W, compared with GPU-based YOLO models (45 FPS, 22.7ms latency, 320W power consumption). The Adaptive Sparse Convolutional Neural Network adapts its computation dynamically by adjusting which neurons are active for a given region, while FPGA performance is further improved by hardware-aware optimizations such as quantization, pruning, and parallel MAC operations. The experimental results validate that HALO-YOLO preserves high detection accuracy (mAP@0.5 = 83.7%) while offering roughly 95% lower power consumption and 2× faster inference than YOLOv5 when compared with traditional GPU-based models. The FPGA implementation demonstrates the capability of low-power AI accelerators for applications such as robots performing routine tasks, industrial automation, and embedded vision systems, paving the way for the further development of hardware-aware deep learning techniques.
REFERENCES
A. R. Pathak, M. Pandey, and S. Rautaray, “Application of
deep learning for object detection,” Proc. Comput. Sci.,
vol. 132, pp. 1706–1717, Jan. 2018.
A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao,
“YOLOv4: Optimal speed and accuracy of object
detection,” 2020, arXiv:2004.10934.
B. Jacob et al., “Quantization and training of neural
networks for efficient integer-arithmetic-only
inference,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit., Jun. 2018, pp. 2704–2713.
B. Martinez, J. Yang, A. Bulat, and G. Tzimiropoulos,
“Training binary neural networks with real-to-binary
convolutions,” in Proc. Int. Conf. Learn. Represent.
(ICLR), 2020, pp. 1–11. [Online]. Available:
https://openreview.net/forum?id=BJg4NgBKvH
C. Szegedy et al., “Going deeper with convolutions,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2015, pp. 1–9.
D. T. Nguyen, T. N. Nguyen, H. Kim, and H. J. Lee, “A
high-throughput and power-efficient FPGA
implementation of YOLO CNN for object detection,”
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol.
27, no. 8, pp. 1861–1873, Aug. 2019.
F. Conti et al., “XNOR neural engine: A hardware
accelerator IP for 21.6-fJ/op binary neural network
inference,” IEEE Trans. Comput.-Aided Design Integr.
Circuits Syst., vol. 37, no. 11, pp. 2940–2951, Nov.
2018.
H. Nakahara, H. Yonekawa, T. Fujii, and S. Sato, “A
lightweight YOLOv2: A binarized CNN with a parallel
support vector regression for an FPGA,” in Proc.
ACM/SIGDA Int. Symp. Field-Program. Gate Arrays,
Feb. 2018, pp. 31–40.
I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y.
Bengio, “Binarized neural networks,” Proc. Adv.
Neural Inf. Process. Syst. (NIPS), vol. 29, 2016, pp.1–
9.
J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized
convolutional neural networks for mobile devices,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2016, pp. 4820–4828.
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You
only look once: Unified, real-time object detection,” in
Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
(CVPR), Jun. 2016, pp. 779–788.
J. Redmon and A. Farhadi, “YOLO9000: Better, faster,
stronger,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
J. Redmon and A. Farhadi, “YOLOv3: An incremental
improvement,” 2018, arXiv:1804.02767.
J. Bethge, H. Yang, M. Bornstein, and C. Meinel,
“BinaryDenseNet: Developing an architecture for
binary neural networks,” in Proc. IEEE/CVF Int. Conf.
Comput. Vis. Workshop (ICCVW), Oct. 2019, pp.
1951–1960.
J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo,
“UNPU: An energy-efficient deep neural network
accelerator with fully variable weight bit precision,”
IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 173–
185, Jan. 2019.
K. Ando et al., “BRein memory: A single-chip
binary/ternary reconfigurable in-memory deep neural
network accelerator achieving 1.4 TOPS at 0.6 W,”
IEEE J. Solid-State Circuits, vol. 53, no. 4, pp. 983–
994, Apr. 2018.
M. Horowitz, “1.1 Computing’s energy problem (and what
we can do about it),” in IEEE Int. Solid-State Circuits
Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 10–
14.
M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi,
“XNOR-Net: ImageNet classification using binary
convolutional neural networks,” in Proc. Eur. Conf.
Comput. Vis. (ECCV). Cham, Switzerland: Springer,
2016, pp. 525–542.
M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C.
Chen, “MobileNetV2: Inverted residuals and linear
bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit., Jun. 2018, pp. 4510–4520.
M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and
efficient object detection,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp.
10781–10790.
O. Russakovsky et al., “ImageNet large scale visual
recognition challenge,” Int. J. Comput. Vis., vol. 115,
no. 3, pp. 211–252, Dec. 2015.
Q. Huang et al., “CoDeNet: Efficient deployment of input-
adaptive object detection on embedded FPGAs,” in
Proc. ACM/SIGDA Int. Symp. Field-Program. Gate
Arrays, Feb. 2021, pp. 206–216.
R. Girshick, “Fast R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
R. Andri, G. Karunaratne, L. Cavigelli, and L. Benini,
“ChewBaccaNN: A flexible 223 TOPS/W BNN
accelerator,” in Proc. IEEE Int. Symp. Circuits Syst.
(ISCAS), May 2021, pp. 1–5.
S. Han, H. Mao, and W. J. Dally, “Deep compression:
Compressing deep neural networks with pruning,
trained quantization and Huffman coding,” 2015,
arXiv:1510.00149.
S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both
weights and connections for efficient neural networks,”
2015, arXiv:1506.02626.
S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 28, 2015, pp. 1–14.
S. Yin et al., “A high energy efficient reconfigurable hybrid
neural network processor for deep learning
applications,” IEEE J. Solid-State Circuits, vol. 53, no.
4, pp. 968–982, Dec. 2017.
S. Kim, S. Na, B. Y. Kong, J. Choi, and I.-C. Park, “Real-
time SSDLite object detection on FPGA,” IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 29, no. 6,
pp. 1192–1205, Jun. 2021.
T. B. Preuser, G. Gambardella, N. Fraser, and M. Blott,
“Inference of quantized neural networks on
heterogeneous all-programmable devices,” in Proc.
Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar.
2018, pp. 833–838.
V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient
processing of deep neural networks: A tutorial and
survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329,
Dec. 2017.
W. Liu et al., “SSD: Single shot multibox detector,” in Proc.
Eur. Conf. Comput. Vis. (ECCV). Cham, Switzerland:
Springer, 2016, pp. 21–37.
X. Chen, J. Xu, and Z. Yu, “A 68-mW 2.2 TOPS/W low bit width and multiplierless DCNN object detection processor for visually impaired people,” IEEE Trans. Circuits Syst. Video Technol. (CSVT), vol. 29, no. 11, pp. 3444–3453, Nov. 2019.
Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K.-T. Cheng,
“Bi-real Net: Enhancing the performance of 1-bit CNNs
with improved representational capability and
advanced training algorithm,” in Proc. Eur. Conf.
Comput. Vis. (ECCV), Sep. 2018, pp. 722–737.
Z. Wang, Z. Wu, J. Lu, and J. Zhou, “BiDet: An efficient
binarized object detector,” in Proc. IEEE/CVF Conf.
Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp.
2049–2058.
Z. Liu, Z. Shen, M. Savvides, and K.-T. Cheng, “ReactNet:
Towards precise binary neural network with
generalized activation functions,” in Proc. Eur. Conf.
Comput. Vis. (ECCV). Cham, Switzerland: Springer,
2020, pp. 143–159.