Micro-YOLO: Exploring Efficient Methods to Compress CNN based
Object Detection Model
Lining Hu (https://orcid.org/0000-0003-3506-7873) and Yongfu Li (https://orcid.org/0000-0002-6322-8614)
Department of Micro-Nano Electronics, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, China
Keywords:
Object Detection, YOLO, MobileNets, Depthwise Separable Convolution, Model Compression, Pruning.
Abstract:
Deep learning models have made significant breakthroughs in the performance of object detection. However, traditional models such as Faster R-CNN and YOLO are too large to be deployed on embedded mobile devices due to limited computation resources and tight power budgets. Hence, we propose a new lightweight CNN-based object detection model, Micro-YOLO, based on YOLOv3-tiny, which achieves a significant reduction in the number of parameters and computation cost while maintaining detection performance. We propose to replace the convolutional layers in the YOLOv3-tiny network with the depth-wise separable convolution (DSConv) and the mobile inverted bottleneck convolution with squeeze and excitation block (MBConv), and we design a progressive channel-level pruning algorithm to minimize the number of parameters while maximizing detection performance. The proposed Micro-YOLO network reduces the number of parameters by 3.46× and the multiply-accumulate operations (MACs) by 2.55× while only slightly decreasing the mAP evaluated on the COCO dataset, by 0.7%, compared to the original YOLOv3-tiny network.
1 INTRODUCTION
The accelerated growth of the deep learning field has greatly promoted the development of object detection, with widespread applications in face detection, autonomous driving, robot vision and video surveillance (Borji et al., 2019; Pan et al., 2020). With the vigorous development of object detection, several deep convolutional neural network models have been proposed in recent years, e.g., R-CNN, SSD, and YOLO (Girshick et al., 2014; Liu et al., 2016; Redmon and Farhadi, 2018). However, as these networks become more complicated, their size continues to increase, which makes them increasingly difficult to deploy on embedded devices (Cheng et al., 2017). Therefore, it is of vital importance to develop an efficient and fast object detection model that reduces the parameter size without affecting detection quality.
The goal of object detection is to detect objects of
a certain class (such as humans, animals, or cars) in
digital images (Borji et al., 2019). One of the most famous object detection networks is the “You Only Look Once” (YOLO) architecture. After years of improvement, YOLO has evolved into its fourth generation, the YOLOv4 architecture (Bochkovskiy et al., 2020).
It achieves an average precision (AP) of 43.5% (65.7% AP50) on the MS COCO dataset at a real-time speed of 65 frames per second (FPS) on a Tesla V100 (Bochkovskiy et al., 2020). However, it contains more than 60 million parameters and requires more than 107 billion floating-point multiplications to process an image. YOLOv3-tiny, a faster variant of YOLOv3 (the predecessor of YOLOv4), was proposed to reduce the parameter count and multiplication requirements by 7.5× and 13×, respectively (Redmon and Farhadi, 2018); it achieves 33.1% mAP at 220 FPS on a Titan X. Even so, deploying this model on many embedded devices remains challenging.
In this work, we propose a lightweight version of the object detection model, Micro-YOLO, which is based on the YOLOv3-tiny architecture (Redmon and Farhadi, 2018). We propose three effective methods to optimize the Micro-YOLO architecture. The key contributions of our work are as follows:
1) We propose to replace the standard convolutional layers (Conv) in the YOLOv3-tiny network with depth-wise separable convolutions (DSConv) and mobile inverted bottleneck convolutions with squeeze and excitation block (MBConv), reducing the weight parameters while only slightly degrading the detection accuracy.
Figure 1: The proposed optimization techniques adopted in Micro-YOLO.
2) We explore and identify the optimal kernel sizes in
MBConv to achieve the best trade-off between the
weight parameters and detection accuracy on the
Micro-YOLO architecture.
3) We propose a progressive pruning algorithm to perform coarse-grained pruning on the DSConv and MBConv layers, which further reduces the weight parameters while only slightly degrading the detection accuracy. After pruning, we further decrease the model size to 1.92M parameters and the computation cost to 0.87 GMAC with a 3.1% mAP degradation.
The rest of this paper is organized as follows. Section 2 provides an overview of state-of-the-art model compression techniques, our evaluation methods, and the problem statement. Section 3 provides details of our proposed Micro-YOLO network and its model compression methods. Section 4 discusses the experimental setup and results, followed by a comparison with state-of-the-art works. We conclude our work in Section 5.
2 PRELIMINARIES
2.1 Model Compression Techniques for
Object Detection Networks
As the family of object detection networks continues to become more complicated, it is important to reduce their weight parameters and computational cost. Model compression methods are categorized into low-rank factorization, knowledge distillation, pruning, and quantization (Fernandez-Marques et al., 2020), where pruning has been shown to be an effective method for reducing network complexity by removing redundant parameters (Cheng et al., 2017).
To address this problem for object detection networks, several state-of-the-art techniques reduce the number of parameters in YOLO architectures. (Huang et al., 2018) developed the YOLO-lite network, where the batch normalization layers are removed from YOLOv2-tiny to speed up object detection. This network achieves a mAP of 33.81% and 12.26% on the PASCAL VOC 2007 and COCO datasets, respectively. (Wong et al., 2019) created a highly compact network, YOLO-nano, an 8-bit quantized model based on the YOLO network and optimized on the PASCAL VOC 2007 dataset. This network achieves a model size of 3.18M parameters and 69.1% mAP on the PASCAL VOC 2007 dataset.
2.2 Evaluation Methods
We evaluate the effectiveness of an object detection network based on three aspects: model size, computation cost, and accuracy performance on the COCO dataset (Lin et al., 2014).
Definition 1 (Model Size). Model size is defined as
the number of parameters in a neural network, which
is the sum of trainable elements in each layer. It is
formulated as follows:
$$\text{Model Size} = \sum_{i=1}^{N} l_i, \qquad (1)$$
where $l_i$ denotes the number of trainable elements in the $i$-th layer and $N$ represents the total number of layers in the neural network.
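For instance, with PyTorch (the framework used in Section 4), Eq. (1) can be evaluated directly by summing the trainable elements of a network; the snippet below is a minimal sketch.

```python
import torch.nn as nn

def model_size(net: nn.Module) -> int:
    """Model Size of Eq. (1): the sum of trainable elements over all layers."""
    return sum(p.numel() for p in net.parameters() if p.requires_grad)

# Example: a standard 3x3 convolution with 16 input and 32 output channels
conv = nn.Conv2d(16, 32, kernel_size=3, bias=False)
print(model_size(conv))  # 3*3*16*32 = 4608, matching Eq. (2) below
```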
Definition 2 (Computation Cost). We define Computation Cost as the number of multiply-accumulate operations (MACs), i.e., the count of operations in which the product of two numbers is computed and added to an accumulator.
Definition 3 (Mean Average Precision (mAP)). The most common evaluation metric for object detection is “Average Precision” (AP), defined as the average detection precision over different recall levels. Precision measures how accurate the model’s predictions are; recall measures how well the model finds all positives. mAP (mean average precision) is the average of AP over all classes. For the COCO dataset, we evaluate the mAP over 80 categories.
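To make the AP definition concrete, the following sketch computes the all-point interpolated AP for a single class from a sorted precision-recall curve; note that the official COCO evaluation additionally averages AP over multiple IoU thresholds, which is omitted here.

```python
import numpy as np

def average_precision(recalls, precisions):
    """All-point interpolated AP: area under the precision-recall curve,
    after making precision monotonically non-increasing in recall."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])          # precision envelope
    idx = np.where(r[1:] != r[:-1])[0]      # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP is then the mean of AP over all classes (80 categories for COCO).
```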
2.3 Problem Formulation
With the above definitions, the problem is formulated
as follows:
[Model Compression for Object Detection Problem] Given an object detection neural network model, the objective is to apply efficient compression schemes to the model so as to achieve a small Model Size and Computation Cost while maintaining the network’s mAP.
3 OUR PROPOSED METHOD
As shown in Fig. 1, we apply three methods to the YOLOv3-tiny network to obtain a lightweight version of the network, named Micro-YOLO: (1) to reduce the convolutional network blocks in the YOLO network, we propose to replace the standard convolution (Conv) layers with two types of convolutional blocks, (a) the depth-wise separable convolution (DSConv) used in MobileNetv1 (Howard et al., 2017) and (b) the mobile inverted bottleneck convolution with squeeze and excitation block (MBConv) used in MobileNetv3 (Howard et al., 2019); (2) we explore and identify the optimal kernel sizes in MBConv to achieve the best trade-off between weight parameters and detection accuracy on the network; (3) we propose a progressive structured pruning method to further shrink the Model Size.
3.1 MobileNets Block-based Network
To reduce the size of the network, we have explored alternative lightweight convolutional layers to replace the Conv layers in the YOLO network. The MobileNet networks (Howard et al., 2017; Sandler et al., 2018; Howard et al., 2019) adopt two lightweight convolutional layers: (a) the depth-wise separable convolution (DSConv) layer and (b) the mobile inverted bottleneck convolution with squeeze and excitation block (MBConv) layer. As shown in Fig. 2(a), the DSConv layer performs two types of convolutions, (i) a depthwise convolution and (ii) a pointwise convolution, which significantly reduces the Model Size and Computation Cost of the network (Howard et al., 2017). As shown in Fig. 2(b), the MBConv structure is a 1×1 channel expansion convolution followed by a depthwise convolution and a 1×1 channel reduction layer. Between the depthwise convolution and the channel reduction layer, it utilizes a squeeze and excitation block, a branch consisting of a global average pooling operation in the squeeze phase and two small FC layers in the excitation phase (Hu et al., 2019). Since the number of output channels is not equal to the number of input channels, we remove the residual connection in MBConv. The MBConv layer provides a compact representation at the input and output while expanding the input to a higher-dimensional feature space internally, increasing the expressiveness of its nonlinear transformations. Hence, the MBConv layer yields a better compressed network without degrading the detection accuracy, as compared to the DSConv layer.
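To make the two block types concrete, a minimal PyTorch sketch is given below. The expansion factor α = 3 follows the channel×3 expansion shown in Fig. 1; the SE reduction factor β = 4 and the omission of batch normalization and activation layers are simplifying assumptions of ours, not the exact Micro-YOLO configuration.

```python
import torch.nn as nn

class DSConv(nn.Module):
    """Depth-wise separable convolution (Fig. 2(a)): a k x k depth-wise
    convolution followed by a 1 x 1 point-wise convolution."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2,
                                   groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class SEBlock(nn.Module):
    """Squeeze-and-excitation branch: global pooling plus two small FC layers."""
    def __init__(self, c, beta=4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // beta), nn.ReLU(),
                                nn.Linear(c // beta, c), nn.Hardsigmoid())

    def forward(self, x):
        s = self.fc(x.mean(dim=(2, 3)))         # squeeze: global average pool
        return x * s.view(x.size(0), -1, 1, 1)  # excite: channel-wise rescale

class MBConv(nn.Module):
    """Mobile inverted bottleneck with SE (Fig. 2(b)): 1 x 1 expansion,
    k x k depth-wise convolution, SE branch, 1 x 1 reduction. No residual
    connection, since c_in != c_out in Micro-YOLO."""
    def __init__(self, c_in, c_out, k=3, alpha=3, beta=4):
        super().__init__()
        c_mid = alpha * c_in
        self.expand = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.depthwise = nn.Conv2d(c_mid, c_mid, k, padding=k // 2,
                                   groups=c_mid, bias=False)
        self.se = SEBlock(c_mid, beta)
        self.reduce = nn.Conv2d(c_mid, c_out, 1, bias=False)

    def forward(self, x):
        return self.reduce(self.se(self.depthwise(self.expand(x))))
```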
To evaluate the Model Size of these layers, the number of parameters in the Conv ($N_s$), DSConv ($N_{ds}$), and MBConv ($N_{mb}$) layers can be computed with (2), (3) and (4), respectively:
$$N_s = k^2 \times C_{in} \times C_{out}, \qquad (2)$$
$$N_{ds} = k^2 \times C_{in} + 1 \times 1 \times C_{in} \times C_{out}, \qquad (3)$$
$$N_{mb} = C_{in} \times \alpha C_{in} \times 1 \times 1 + k^2 \times \alpha C_{in} + 2 \times \alpha C_{in} \times \alpha C_{in}/\beta + \alpha C_{in} \times C_{out}, \qquad (4)$$
where $k$ denotes the kernel size, $C_{in}$ the number of input channels, $C_{out}$ the number of output channels, and $\alpha$ and $\beta$ the expansion and reduction factors in MBConv, respectively.
The Computation Cost of these layers, i.e., the Conv layer ($C_s$), the DSConv layer ($C_{ds}$), and the MBConv layer ($C_{mb}$), can be expressed with (5), (6), and (7), respectively:
$$C_s = \frac{1}{2} k^2 \times W \times H \times C_{in} \times C_{out}, \qquad (5)$$
$$C_{ds} = \frac{1}{2} (k^2 \times W \times H \times C_{in} + W \times H \times C_{in} \times C_{out}), \qquad (6)$$
$$C_{mb} = \frac{1}{2} (W \times H \times C_{in} \times \alpha C_{in} + k^2 \times W \times H \times \alpha C_{in} + 2 \times \alpha C_{in} \times \alpha C_{in}/\beta + W \times H \times \alpha C_{in} \times C_{out}), \qquad (7)$$
where $k$ denotes the kernel size, $C_{in}$ the number of input channels, $C_{out}$ the number of output channels, $W$ and $H$ the width and height of the feature maps, and $\alpha$ and $\beta$ the expansion and reduction factors in MBConv, respectively.
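As a sanity check, equations (2)-(4) can be evaluated numerically. The sketch below reproduces the entries of Table 1, assuming α = 3 (the channel×3 expansion of Fig. 1) and β = 4, which is consistent with the MBConv counts reported there.

```python
def n_conv(k, c_in, c_out):
    """Standard convolution, Eq. (2)."""
    return k**2 * c_in * c_out

def n_dsconv(k, c_in, c_out):
    """Depth-wise separable convolution, Eq. (3)."""
    return k**2 * c_in + 1 * 1 * c_in * c_out

def n_mbconv(k, c_in, c_out, alpha=3, beta=4):
    """Mobile inverted bottleneck with SE, Eq. (4)."""
    c_mid = alpha * c_in
    return (c_in * c_mid * 1 * 1           # 1x1 channel expansion
            + k**2 * c_mid                 # k x k depth-wise convolution
            + 2 * c_mid * (c_mid // beta)  # two FC layers of the SE branch
            + c_mid * c_out)               # 1x1 channel reduction

for c_in in (16, 32, 64, 128, 256, 512):
    c_out = 2 * c_in  # output channels are twice the input channels (Table 1)
    print(c_in, n_conv(3, c_in, c_out), n_dsconv(3, c_in, c_out),
          n_mbconv(3, c_in, c_out))
# First row: 16 4608 656 3888, matching Table 1.
```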
Figure 2: Two types of convolutions used in our work: (a) DSConv, the depth-wise separable convolution (a depth-wise convolution followed by a point-wise convolution); (b) MBConv, the inverted linear bottleneck with squeeze and excitation layer (1×1 channel expansion, depth-wise convolution, SE branch of global pooling and two FC layers with ReLU and hard-sigmoid, and 1×1 channel reduction).
3.2 Kernel Size Exploration
To further reduce the weight parameters in the convolutional layers without compromising accuracy, we propose a kernel size optimization technique. Most traditional convolutional neural network designs use 3×3 convolutional kernels (Howard et al., 2017; Sandler et al., 2018); similarly, the YOLOv3-tiny network uses a kernel size of 3×3 in its Conv layers. However, the emergence of network architecture search algorithms has changed this situation. For example, (Cai et al., 2019) highlighted that the first few convolutional layers prefer smaller kernel sizes while the deeper convolutional layers prefer larger kernel sizes. Furthermore, recent work on network exploration (Tan and Le, 2019) has shown a similar result: a combination of multiple kernel sizes leads to better detection accuracy. Hence, it is necessary to explore the optimization space between the use of different convolutional kernel sizes and the mAP of our proposed Micro-YOLO network. The details of our experiment are discussed in Section 4.
3.3 Progressive Channel Pruning
After finalizing the architecture of our proposed Micro-YOLO network, we can further reduce the weight parameters by pruning. In our work, we adopt coarse-grained pruning because the DSConv and MBConv layers are mostly composed of 1×1 kernels, which leaves minimal room for fine-grained pruning. (Liu et al., 2019) indicates that the pruned architecture itself, rather than a set of inherited “important” weights, is more crucial to the efficiency of the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Hence, we propose a progressive pruning method to search for a “thinner” architecture in the modified network.
Algorithm 1: Progressive Channel Pruning Algorithm.
Input: The original network structure $Net(C_1 \ldots C_N)$.
Output: The pruned network structure $\hat{Net}$.
1: for $i = 1$ to $N$ do
2:   Train $Net$ for 20 epochs;
3:   Evaluate $mAP_{origin}$ of $Net$;
4:   $OldC_i = C_i$, $mAP_{old} = mAP_{origin}$;
5:   repeat
6:     $NewC_i = OldC_i - \frac{1}{16}C_i$;
7:     $OldC_i = NewC_i$;
8:     Initialize a new network $\hat{Net}(C_1 \ldots C_N)$;
9:     Train $\hat{Net}$ for 20 epochs;
10:    Evaluate $mAP_{new}$ of $\hat{Net}$;
11:    $mAP_{old} = mAP_{new}$;
12:  until $mAP_{new} < mAP_{origin} - 0.5\%$
13: end for
The details of the proposed progressive channel pruning algorithm are shown in Algorithm 1. We first train the original network $Net$ and evaluate $mAP_{origin}$ before pruning (Lines 2-3). The number of channels of the convolutional layer currently being pruned, $OldC_i$, is recorded (Line 4). During the pruning of convolutional layer $i$, we reduce the number of output channels by $1/16$ of $C_i$ each time, since this pruning step balances pruning speed and accuracy. Note that when the output channels of layer $i$ are pruned, the corresponding input channels of layer $i+1$ also need to be pruned. The number of channels of convolutional layer $i$ is then updated, and a new network $\hat{Net}$ with the reduced number of channels is initialized (Lines 6-8). $\hat{Net}$ is retrained for 20 epochs and the new $mAP_{new}$ is evaluated (Lines 9-11). The pruning procedure for layer $i$ is repeated until $mAP_{new}$ is more than 0.5% lower than the original $mAP_{origin}$, since our experiments show that a threshold of 0.5% ensures that channel pruning does not severely decrease the detection accuracy (Line 12). We then prune the next convolutional layer, until all convolutional layers have been pruned and the pruned network is returned.
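A compact Python rendering of Algorithm 1 is given below; build_net, train, and evaluate_map are hypothetical stand-ins for the actual training pipeline, and we assume the final pruning step that violates the 0.5% threshold is reverted before moving to the next layer.

```python
def progressive_channel_pruning(channels, build_net, train, evaluate_map,
                                tolerance=0.5):
    """Progressively prune each layer's output channels in steps of 1/16 of
    its initial width (Algorithm 1). `channels[i]` is layer i's width."""
    for i in range(len(channels)):
        net = build_net(channels)
        train(net, epochs=20)                  # Line 2
        map_origin = evaluate_map(net)         # Line 3
        step = max(1, channels[i] // 16)       # 1/16 of the initial channel count
        while True:
            channels[i] -= step                # Line 6: NewC_i = OldC_i - C_i/16
            pruned = build_net(channels)       # layer i+1's input shrinks too
            train(pruned, epochs=20)           # Line 9
            if evaluate_map(pruned) < map_origin - tolerance:  # Line 12
                channels[i] += step            # assumed: revert the failing step
                break
    return build_net(channels)
```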
4 EXPERIMENTAL RESULTS
We implemented and evaluated our Micro-YOLO network in Python with the PyTorch (Paszke et al., 2017) library on a 2.50 GHz 12-core Intel Xeon Linux machine with 128 GB of memory and two Nvidia RTX 2080 Ti graphics cards.
Figure 3: Our proposed object detection neural network architecture (the backbone interleaves MBConv 3×3 and MBConv 5×5 blocks with max-pooling layers, with DSConv and 1×1 Conv layers in the detection part). MBConv k×k denotes a mobile inverted bottleneck convolution with squeeze and excitation block with kernel size k×k; DSConv k×k denotes a depth-wise separable convolution with kernel size k×k.
4.1 Micro-YOLO Network Optimization
In our proposed Micro-YOLO network, the choice of convolution type in each layer and of kernel size in each convolutional layer has a great influence on the detection accuracy. Thus, we conduct experiments to determine the architecture of the Micro-YOLO network.
4.1.1 The Choice of Convolution Types
As discussed in Section 3.1, the numbers of parameters in the Conv, DSConv and MBConv layers differ greatly. As shown in Table 1, we compute the number of parameters required for the different layer types and different input channel counts with the same kernel size according to (2)-(7). Note that the number of output channels is twice the number of input channels. As shown in the last two columns of the table, the numbers of parameters used in the MBConv and DSConv layers are significantly smaller than in the Conv layer.
To understand the impact of different convolution types on Model Size, Computation Cost and mAP, we replace the Conv layers of YOLOv3-tiny according to our proposed strategies. Table 2 shows the Model Size and Computation Cost of networks composed of different convolution types, and their mAP evaluated on the COCO dataset. As shown in the table, networks with only DSConv layers have a far smaller Model Size and Computation Cost than networks consisting only of MBConv layers. However, the MBConv layer is more effective at maintaining the mAP, while DSConv can be applied to reduce the number of parameters. Hence, it is necessary to choose an optimal trade-off between the Model Size and mAP of the network.

As shown in Tables 1 and 2, increases in the number of input channels and convolutional layers lead to an increase in Model Size. For example, in the YOLOv3-tiny model, the 10th, 12th, and 14th layers have a total of 6.63M weight parameters, which accounts for 74.95% of the entire network. We therefore use DSConv in the 12th layer, which contains the largest number of parameters, and MBConv in the remaining layers. This reduces the Model Size by 3.46× while the mAP only degrades by 1.7%. The final form of our proposed Micro-YOLO network is shown in Fig. 3.
4.1.2 Kernel Size Exploration
As discussed in Section 3.2, the choice of kernel size is essential to improving mAP. Therefore, we choose the 3rd, 5th, 7th, 9th, and 11th layers, the layers before the detection part of YOLOv3-tiny, to explore the effect of different kernel sizes. For each layer, we choose a kernel size of either 3×3 or 5×5, leading to $2^5 = 32$ different combinations. To save training time, we train each configuration for 20 epochs from scratch and select the best of these combinations. As shown in Figure 4, among the 32 combinations, the networks that interleave 3×3 and 5×5 kernel sizes perform best. The best mAP is achieved by using convolution kernels of sizes 3, 5, 3, 5, 3 in the 3rd, 5th, 7th, 9th, and 11th layers, respectively.
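The exploration itself is a small exhaustive search; a sketch of the enumeration is shown below, where build_net, train, and evaluate_map are again hypothetical stand-ins for the real pipeline.

```python
from itertools import product

def search_kernel_sizes(build_net, train, evaluate_map):
    """Exhaustively search the 2^5 = 32 kernel-size assignments for the
    3rd, 5th, 7th, 9th and 11th layers."""
    best_map, best_kernels = -1.0, None
    for kernels in product((3, 5), repeat=5):
        net = build_net(kernels)
        train(net, epochs=20)      # short training run from scratch
        m = evaluate_map(net)
        if m > best_map:
            best_map, best_kernels = m, kernels
    return best_kernels            # our experiments find (3, 5, 3, 5, 3)
```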
Table 1: Number of parameters required for different convolution types and different input channels with the same kernel size 3×3.

No. of Input    No. of Parameters                  Reduction Multiples¹
Channels        Conv        DSConv     MBConv      DSConv    MBConv
16              4,608       656        3,888       7.02×     1.19×
32              18,432      2,336      14,688      7.89×     1.25×
64              73,728      8,768      57,024      8.41×     1.29×
128             294,912     33,920     224,640     8.69×     1.31×
256             1,179,648   133,376    891,648     8.84×     1.32×
512             4,718,592   528,896    3,552,768   8.92×     1.33×

¹ Reduction Multiples denote the factor by which the number of parameters is reduced compared to the standard convolution.
Table 2: The number of parameters of the YOLOv3-tiny network composed of different convolution types.

Network            Model Size (M)   Computation Cost (GMAC)   mAP %
YOLOv3-tiny        8.85             2.81                      33.1
All DSConv 3×3     1.44             0.52                      24.6
All DSConv 5×5     1.47             0.55                      25.4
All MBConv 3×3     6.45             2.08                      30.4
All MBConv 5×5     6.53             2.16                      31.5
Figure 4: Kernel size exploration result. Different bars indi-
cate different combinations of kernel sizes. For simplicity,
we only show the optimal kernel size combination in red.
4.2 Pruning Results
To further compress our model, we apply our progressive pruning algorithm to the first 7 convolution layers. The pruning results are presented in Table 3, where x/16 indicates the cumulative pruning step of each layer. For example, the 3rd layer contains 32 channels; we first prune 2 channels, i.e., 1/16 of the initial number of channels. At the second step, a total of 4 channels, 2/16 of 32 channels, have been pruned. When 3/16 of the initial channels are pruned, the mAP decreases by 1.4% compared with the initial value, which is greater than 0.5%, so we stop pruning this layer and move on to the next layer.
As shown in Table 3, most of the convolution layers cannot be pruned further once 2/16 of their channels have been pruned; if we continue pruning, the mAP starts to degrade significantly. The results in Table 3 also confirm our conjecture: as the depth of the network and the number of convolutional layer channels increase, the convolutional layer’s “tolerance” to pruning gradually increases, enabling us to prune more channels in deeper layers, such as the 11th and 13th layers. In particular, we can prune as much as 6/16 of the channels, that is, 384 channels, in the 13th layer without decreasing the mAP. However, in the 15th layer we observe an exception, where even 1/16 of the channels cannot be pruned. We suspect the reason is that this layer is too close to the detection layer.
4.3 Benchmark and Comparisons
We compare our proposed Micro-YOLO against YOLO-nano (Wong et al., 2019), YOLO-lite (Huang et al., 2018) and YOLOv3-tiny (Redmon and Farhadi, 2018). We trained all of the networks from scratch for 500,200 batches, similar to the training method used for YOLOv3-tiny (Redmon and Farhadi, 2018). Table 4 reports the Model Size, Computation Cost, mAP on the COCO dataset, and FPS of YOLOv3-tiny, YOLO-lite, YOLO-nano and Micro-YOLO.
Table 3: Pruning results with the progressive channel pruning algorithm.

Layer   Kernel   No. of      mAP %
ID      Size     Channels    Original   1/16   2/16   3/16   4/16   5/16   6/16   7/16
3       3        32          17         16.8   16.8   15.6   -      -      -      -
5       5        64          16.8       16.8   16.4   15.3   -      -      -      -
7       3        128         16.4       16.1   16.2   16.1   16     16.2   15     -
9       5        256         16.2       16.2   16     15.9   14     -      -      -
11      3        512         15.9       15.6   15.5   15     -      -      -      -
13      3        1024        15.5       15.5   15.4   15.5   15.6   15.6   15.7   14.7
15      3        512         15.7       15.0   -      -      -      -      -      -
Table 4: Model’s amount of parameters, computation cost,
mAP, and latency of YOLOv3-tiny, YOLO-lite, YOLO-
nano, and Micro-YOLO (original and pruned). The input
size is 416×416 for all networks.
Model
Model
Size
(M)
Computation
Cost
(GMAC)
mAP %
(CO-
CO)
mAP %
(VOC
2007)
FPS
YOLOv3-tiny 8.85 2.81 33.1 - 313
YOLO-lite 0.46 0.93 12.3 33.6 378
YOLO-nano 3.18 3.49 14.5 69.1 240
Micro-YOLO 2.56 1.10 32.4 - 328
Micro-YOLO
(pruned)
1.92 0.87 29.3 - 357
As compared with the YOLOv3-tiny network, the initial version of our Micro-YOLO already achieves a significant reduction of the parameters by 3.46× and of the number of operations by 2.55×, with a slight decrease of 0.7% in mAP on the COCO dataset. After applying the coarse-grained pruning technique, Micro-YOLO reduces the weight parameters by 4.61× and the computation cost by 3.23×, with a drop of 3.8% in mAP compared with YOLOv3-tiny. The YOLO-lite model has a size of 0.46M parameters, requires a computation cost of 0.93 GMAC, and achieves 12.3% mAP on the COCO dataset. The YOLO-nano model has a size of 3.18M parameters, requires a computation cost of 3.49 GMAC, and achieves 14.5% mAP and 69.1% mAP on the COCO and PASCAL VOC 2007 datasets, respectively; since YOLO-nano is optimized for the PASCAL VOC 2007 dataset, it does not perform very well on the COCO dataset. As for latency, we re-evaluate the FPS of all the networks on a single Nvidia RTX 2080 Ti graphics card. Our Micro-YOLO and pruned Micro-YOLO achieve 328 and 357 FPS, respectively, second only to YOLO-lite.
5 CONCLUSIONS
In this paper, we explore several model compression methods and propose an improved object detection architecture, Micro-YOLO, based on YOLOv3-tiny. We analyze several types of convolutional layers, such as the depth-wise separable convolution (DSConv) and the inverted bottleneck convolution with squeeze and excitation block (MBConv), to determine the optimal layers for our Micro-YOLO network. We also explore the effect of different kernel sizes in these convolutional layers on Micro-YOLO’s performance. Furthermore, we propose a new progressive channel pruning method that minimizes the number of parameters and the computation cost with only a slight mAP reduction relative to the original network. Micro-YOLO requires only 2.56M parameters and 1.10 GMAC of Computation Cost to achieve a mAP of 32.4%, slightly lower than that of the original YOLOv3-tiny network, at 328 FPS. After applying the pruning technique, we further reduce the number of parameters and the computation cost to 1.92M and 0.87 GMAC, with a mAP of 29.3% at 357 FPS. We also compare our work with a variety of other YOLO-based object detection networks and achieve promising results. We believe that our methodology for compressing YOLOv3-tiny is highly applicable to future versions of YOLO and to other object detection models.
ACKNOWLEDGEMENTS
This research is supported in part by the National
Key Research and Development Program of China
under Grant No. 2019YFB2204500 and in part by
the Science, Technology and Innovation Action Plan
of Shanghai Municipality, China under Grant No.
1914220370.
REFERENCES

Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
Borji, A., Cheng, M.-M., Hou, Q., Jiang, H., and Li, J. (2019). Salient object detection: A survey. Computational Visual Media, 5(1):117–150.
Cai, H., Zhu, L., and Han, S. (2019). ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332.
Cheng, Y., Wang, D., Zhou, P., and Zhang, T. (2017). A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
Fernandez-Marques, J., Whatmough, P. N., Mundy, A., and Mattina, M. (2020). Searching for Winograd-aware quantized networks. In MLSys.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE CVPR, pages 580–587.
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. (2019). Searching for MobileNetV3. In IEEE ICCV, pages 1314–1324.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Hu, J., Shen, L., Albanie, S., Sun, G., and Wu, E. (2019). Squeeze-and-excitation networks. IEEE TPAMI.
Huang, R., Pedoeem, J., and Chen, C. (2018). YOLO-LITE: A real-time object detection algorithm optimized for non-GPU computers. In IEEE Big Data, pages 2503–2510.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In ECCV, pages 21–37.
Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. (2019). Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
Pan, M., Zhu, X., Li, Y., Qian, J., and Liu, P. (2020). MRNet: A keypoint guided multi-scale reasoning network for vehicle re-identification. In Yang, H., Pasupa, K., Leung, A. C., Kwok, J. T., Chan, J. H., and King, I., editors, Neural Information Processing, ICONIP 2020, volume 1332, pages 469–478.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
Redmon, J. and Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In IEEE CVPR, pages 4510–4520.
Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML.
Wong, A., Famuori, M., Shafiee, M. J., Li, F., Chwyl, B., and Chung, J. (2019). YOLO Nano: A highly compact you only look once convolutional neural network for object detection. arXiv preprint arXiv:1910.01271.