Neural Network Pruning based on Filter Importance Values Approximated with Monte Carlo Gradient Estimation

Csanád Sándor (1,2), Szabolcs Pável (1,2) and Lehel Csató (1)

(1) Faculty of Mathematics and Informatics, Babeș-Bolyai University, Kogălniceanu 1, Cluj-Napoca, Romania
(2) Robert Bosch SRL, Someșului 14, Cluj-Napoca, Romania

Keywords: Neural Network Pruning, Filter Pruning, Structured Pruning, Neural Network Acceleration.

Abstract: Neural network pruning is an effective way to reduce the memory and time requirements of most deep neural network architectures. Recently developed pruning techniques can remove individual neurons or entire filters from convolutional neural networks, making these “slim” architectures more robust and more resource-efficient. In this paper, we present a simple yet effective method that assigns probabilities to the network units – to filters in convolutional layers and to neurons in fully connected layers – and prunes them based on these values. The probabilities are learned by maximizing the expected value of a score function – calculated from the accuracy – that ranks the network when different units are turned off. Gradients of the probabilities are estimated using Monte Carlo gradient estimation. We conduct experiments on the CIFAR-10 dataset with a small VGG-like architecture as well as with the lightweight version of the ResNet architecture. The results show that our pruning method achieves results comparable with different state-of-the-art algorithms in terms of parameter and floating point operation reduction. In the case of the ResNet-110 architecture, our pruning method removes 72.53% of the floating point operations and 68.89% of the parameters, which marginally surpasses the results of existing pruning methods.

1 INTRODUCTION

Modern deep networks contain tens or hundreds of layers, and within each layer there is a plethora of parameters (He et al., 2016). While these large networks can easily be used with sufficient memory and computing power – generally via GPUs or TPUs –, their use is complicated on resource-limited devices. Commonly used embedded or IoT devices have limited memory and computing power, and they often run on batteries, meaning that energy consumption is also an important factor in the ergonomy of these devices. To reduce the memory, energy and power consumption of these networks, pruning can be applied to them. Studies have shown that more than half of the network parameters can be removed without affecting accuracy (Han et al., 2016b; He et al., 2019; Sandor et al., 2020).

Network pruning can be (1) unstructured or (2) structured. In unstructured pruning,


individual parameters – weights – are removed from the network. This leads to highly compressed architectures with more than 90% of the parameters removed (Han et al., 2016b), but the left-over parameters need to be stored in sparse matrices that require special hardware (Han et al., 2016a) and special libraries (such as sparse BLAS implementations, e.g. cuSPARSE, the CUDA sparse matrix library: https://docs.nvidia.com/cuda/cusparse) to be efficient. However, these resources may not always be available. Structured pruning, on the other hand, focuses on removing groups of parameters (Li et al., 2016; He et al., 2018, 2019): rows or columns from parameter matrices (e.g. kernels), entire neurons or filters – we call these functional units. While fewer parameters are removed, the resulting architecture does not require any special treatment.

This paper focuses on structured pruning of neural networks: we approximate the importance of the network functional units and remove the ones with small scores. Our main contributions are:

• We introduce binary random variables associated with the functional units – we call them masks –,



parameterize and infer these hyper-parameters by optimizing energy functions. We apply the log-derivative trick and Monte Carlo gradient estimation during the optimization.

• We show that the inferred values for the mask parameters can be used for pruning.

• We compare our method with different state-of-the-art pruning algorithms and show that it achieves comparable results with them.

2 PRUNING METHODS

Pruning is an active research field of neural network compression (Blalock et al., 2020). The first pruning techniques were presented in the early 1990s (Le Cun et al., 1990; Hassibi et al., 1993). These methods used the Hessian of the loss function to remove parameters. However, the Hessian matrix calculation requires huge computational power and a lot of memory. Due to these obstacles, such methods are hard to apply to modern deep neural networks.

More recent work uses the magnitude of the weights as a pruning criterion (Han et al., 2015). After pruning, the network is fine-tuned to regain the original accuracy. The intuition is that small weights have small impact, hence their absence will not affect the overall performance. Han et al. (2016b) apply unstructured magnitude pruning and add quantization and Huffman coding to the pipeline. This way the authors managed to reduce the VGG architecture of Simonyan and Zisserman (2014) by a factor of 49.

Li et al. (2016) introduce a filter pruning approach with sensitivity analysis: filters in each layer are sorted and pruned based on their $\ell_1$ norms. This process is stopped if the accuracy of the network drops significantly. Yao et al. (2017) introduce a compressor-critic framework, where the filter importance values are learned by a recurrent neural network. This "compressor" network takes the parameters of the original network and outputs probabilities as importance values for the network units. To train the compressor network, the expected value of the original network's loss is minimized over the probabilities generated by the compressor. He et al. (2018) present a soft filter pruning approach where filters with small $\ell_2$ norm are iteratively set to zero but are retrained afterwards – together with the other filters. This provides a larger optimization space, and the pruning and retraining process depends less on the pretrained model. While the previous works utilize the smaller-norm-less-important criterion, He et al. (2019) prune deep neural networks based on filter redundancy in layers: the geometric median of the filters is calculated and the ones close to this median are removed from the network. Discrimination-aware channel pruning (Liu et al., 2021) introduces additional discrimination-aware losses to select the most discriminative channels. The paper formulates channel pruning as a sparsity-inducing optimization problem and solves the convex objective with a greedy algorithm.

Our method is similar to the work of Yao et al. (2017) in that we use a Monte Carlo method for the gradient estimation; however, we use a simple factor model – based on Bernoulli distributions – to approximate the probabilities, compared to their compressor RNN. This means our model is simpler, more intuitive and requires fewer resources during pruning. The way the pruning is defined could also be used as a dropout mechanism (Srivastava et al., 2014): while dropout uses predefined probabilities to iteratively remove the units, our method learns these values based on the importance of the units.

3 OUR PROPOSED METHOD

Consider a dataset $\mathcal{D}$ that contains $N$ image and label pairs $(\mathbf{x}_i, y_i)_{i=1}^{N}$. Let $\mu_W(\mathbf{x})$ denote a neural network that predicts $y$ for the given input $\mathbf{x}$, where $W$ is the set of network parameters. Here $W$ could "simply" be any vectorized form of the parameters, but in the experiments we used e.g. filter matrices as individual components and assigned a single mask bit to such a subset.

We define network pruning as finding a binary mask $\mathbf{z} \in \{0, 1\}^{|W|}$ that sets part of the functional units to 0, such that the accuracy remains sufficiently high. This binary mask basically defines a subnetwork of the original network. In pruning, the question is always how to find an optimal mask $\mathbf{z}$.

To tackle this problem, we consider $\mathbf{z}$ a vector of binary random variables, where each $z_i$ indicates whether the associated unit is active or not (i.e. is dead). Let $P_{\boldsymbol{\theta}}(\mathbf{z})$ denote the joint probability distribution of the vector $\mathbf{z}$, where $\boldsymbol{\theta}$ are hyper-parameters, and let $s(\mu_W(\mathbf{x}|\mathbf{z}))$ be a score function of the mask $\mathbf{z}$ and the input $\mathbf{x}$. As defined above, $\mathbf{x}$ is an input image, but it could be an image batch as well.

Our goal is to learn the probability distribution $P_{\boldsymbol{\theta}}(\mathbf{z})$ such that the network score is as high as possible. More formally, we want to maximize the expected value of the score with respect to the probability distribution $P_{\boldsymbol{\theta}}(\mathbf{z})$:


$$\boldsymbol{\theta} = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{z} \sim P_{\boldsymbol{\theta}}(\mathbf{z})}\left[ s(\mu_W(\mathbf{x}|\mathbf{z})) \right] = \arg\max_{\boldsymbol{\theta}} S(P_{\boldsymbol{\theta}}), \qquad (1)$$

where we defined the expected score as $S(P_{\boldsymbol{\theta}})$.

To maximize the expected value, we need to optimize the parameterized probability distribution by gradient ascent:

$$\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_{k} + \alpha \nabla_{\boldsymbol{\theta}} S(P_{\boldsymbol{\theta}}) \big|_{\boldsymbol{\theta}_{k}} \qquad (2)$$

However, calculating the expectation is not possible due to the large number of mask combinations (in total $2^{|W|}$ possibilities) for which the network score should be evaluated. Instead, we approximate the gradient by developing Monte Carlo estimators (Robert and Casella, 2010):

$$\nabla_{\boldsymbol{\theta}} S(P_{\boldsymbol{\theta}}) = \nabla_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{z} \sim P_{\boldsymbol{\theta}}(\mathbf{z})}\left[ s(\mu_W(\mathbf{x}|\mathbf{z})) \right] = \int_{\mathbf{z}} \nabla_{\boldsymbol{\theta}} P_{\boldsymbol{\theta}}(\mathbf{z}) \, s(\mu_W(\mathbf{x}|\mathbf{z})) \, d\mathbf{z} \qquad (3)$$

In Eq. (3) the gradient of the probability distribution appears ($\nabla_{\boldsymbol{\theta}} P_{\boldsymbol{\theta}}(\mathbf{z})$). This can be expressed using the log-derivative trick:

$$\nabla_{\boldsymbol{\theta}} P_{\boldsymbol{\theta}}(\mathbf{z}) = P_{\boldsymbol{\theta}}(\mathbf{z}) \, \nabla_{\boldsymbol{\theta}} \log P_{\boldsymbol{\theta}}(\mathbf{z}), \qquad (4)$$

and substituting Eq. (4) into Eq. (3), we get:

$$\nabla_{\boldsymbol{\theta}} S(P_{\boldsymbol{\theta}}) = \int_{\mathbf{z}} P_{\boldsymbol{\theta}}(\mathbf{z}) \, \nabla_{\boldsymbol{\theta}} \log P_{\boldsymbol{\theta}}(\mathbf{z}) \, s(\mu_W(\mathbf{x}|\mathbf{z})) \, d\mathbf{z} = \mathbb{E}_{\mathbf{z} \sim P_{\boldsymbol{\theta}}(\mathbf{z})}\left[ \nabla_{\boldsymbol{\theta}} \log P_{\boldsymbol{\theta}}(\mathbf{z}) \, s(\mu_W(\mathbf{x}|\mathbf{z})) \right], \qquad (5)$$

where we have the product of a probability distribution and a function that we can evaluate. Since in Eq. (5) we have an expected value, we can rewrite the expression and obtain that $\nabla_{\boldsymbol{\theta}} S(P_{\boldsymbol{\theta}})$ is the expected value of the score function times the gradient of the log probability distribution.

Using the Monte Carlo method, we can approximate the gradient by deriving a general-purpose estimator using $N$ samples from the $P_{\boldsymbol{\theta}}(\mathbf{z})$ distribution:

$$\nabla_{\boldsymbol{\theta}} S(P_{\boldsymbol{\theta}}) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol{\theta}} \log P_{\boldsymbol{\theta}}(\mathbf{z}_i) \, s(\mu_W(\mathbf{x}|\mathbf{z}_i)) \qquad (6)$$

Since the estimated gradient can have high variance, the convergence of the optimization can be slow. To tackle this, we apply simple variance reduction techniques following the work of Yao et al. (2017): we subtract the moving average of the score from the actual score value and divide it by $\sqrt{1 - v}$, where $v$ denotes the variance of the score. We use this formulation in our algorithm (see Section 3.3).
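As an illustration, the following is a minimal sketch of this variance reduction step (not the authors' released code); the moving-average decay of 0.9 and the clipping of 1 - v away from zero are our own assumptions, added for numerical safety.

```python
import numpy as np

class ScoreNormalizer:
    """Variance reduction for the Monte Carlo estimator: subtract the moving
    average of the score and divide by sqrt(1 - v), where v is a running
    estimate of the score variance (following the description in the text)."""

    def __init__(self, decay=0.9):
        self.decay = decay   # smoothing factor; assumed, not given in the paper
        self.mean = 0.0      # moving average of the scores
        self.var = 0.0       # moving estimate of the score variance

    def __call__(self, score):
        self.mean = self.decay * self.mean + (1.0 - self.decay) * score
        self.var = self.decay * self.var + (1.0 - self.decay) * (score - self.mean) ** 2
        # max(...) guards against 1 - v becoming non-positive (our safeguard)
        return (score - self.mean) / np.sqrt(max(1.0 - self.var, 1e-6))
```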

3.1 Score of the Network

Eq. (1) contains a function that scores the network when a mask $\mathbf{z}_i$ is applied on it. This score has to be high when the network performs "well" and low otherwise. To calculate the score for a given mask $\mathbf{z}_i$, the network is evaluated on a random image batch from the validation set (meaning that we train the probability distribution on the validation set). We experiment with three different score functions and measure how quickly the probabilities converge to 0 or 1 and how the pruning affects the network accuracy.

The loss-score function is taken from the work of Sandor et al. (2020):

$$s_i = \frac{L_{\max} - L_i}{L_{\max} - L_{\min}}, \qquad (7)$$

where $L_i$ is the network loss with the mask $\mathbf{z}_i$, and $L_{\min}$ and $L_{\max}$ are the minimum and maximum values among the $L_i$ losses. This way the score is 1 when the network loss is the smallest (the network performs well) and 0 when the loss is the highest.

The acc-score uses the accuracy as a score function. Let $X$ denote a random batch of images from the validation set and $\mathrm{Acc}(\mu_W(X|\mathbf{z}_i))$ denote the network accuracy on the image batch when the mask $\mathbf{z}_i$ is applied. Then the score of the mask $\mathbf{z}_i$ is simply:

$$s_i = \mathrm{Acc}(\mu_W(X|\mathbf{z}_i)) \qquad (8)$$

The exp-acc-score applies a scaling and an exponential function to the accuracy:

$$s_i = \exp\left(\frac{\mathrm{Acc}(\mu_W(X|\mathbf{z}_i))}{\beta}\right) \qquad (9)$$

While the accuracy can be used as a score function, it cannot capture fine-grained details: if $\mathbf{z}_i$ and $\mathbf{z}_j$ differ only in a single value, their scores can be very close to each other. This results in only a small difference between the gradients as well, leading to slow convergence during the factor model optimization. The exp-acc-score function increases the distance between the scores when the accuracy values are similar.

The experiments with the different score functions

are presented in section 4.1.
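As a concrete illustration, the three score functions can be written as follows. This is a sketch rather than the authors' implementation: the losses and accuracies are assumed to come from evaluating the masked network $\mu_W(X|\mathbf{z}_i)$ on a validation batch, and the value of $\beta$ is a hypothetical choice, since the paper does not report the one used.

```python
import numpy as np

def loss_scores(losses):
    """loss-score, Eq. (7): 1 for the smallest loss and 0 for the largest,
    computed over the losses of all masks sampled in the current iteration."""
    losses = np.asarray(losses, dtype=np.float64)
    l_min, l_max = losses.min(), losses.max()
    return (l_max - losses) / (l_max - l_min + 1e-12)  # epsilon is our addition

def acc_score(accuracy):
    """acc-score, Eq. (8): the accuracy on the validation batch itself."""
    return accuracy

def exp_acc_score(accuracy, beta=0.05):
    """exp-acc-score, Eq. (9): exponential of the scaled accuracy;
    beta=0.05 is an illustrative value, not reported in the paper."""
    return np.exp(accuracy / beta)
```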

3.2 Probability Distribution

An important question is how to represent the probability distribution over the set of random variables $\mathbf{z}$. For simplicity, we assume independence between the elements and define the probability distribution as a product of Bernoulli distributions:

$$P_{\boldsymbol{\theta}}(\mathbf{z}) = \prod_i p_i^{z_i} (1 - p_i)^{1 - z_i}, \qquad (10)$$


where the probability that $z_i = 1$ depends on the parameter $\theta_i$: $P_{\theta_i}(z_i = 1) = p_i = \sigma(\theta_i)$.

Using the factor model from Eq. (10), the log probability in Eq. (6) can be written as:

$$\log P_{\boldsymbol{\theta}}(\mathbf{z}) = \log \prod_i p_i^{z_i} (1 - p_i)^{1 - z_i} = \sum_i z_i \log p_i + (1 - z_i) \log(1 - p_i). \qquad (11)$$
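For this factor model, the gradient of Eq. (11) with respect to $\theta_i$ has the closed form $z_i - p_i$ when $p_i = \sigma(\theta_i)$, so the estimator of Eq. (6) and the update of Eq. (2) are straightforward to compute. The sketch below is an illustration under these assumptions rather than the authors' code: `score_fn` stands for evaluating the masked network on a validation batch (optionally with the variance reduction described earlier), 50 samples per step matches the experimental setup in Section 4.3, and the learning rate is an arbitrary example value.

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def sample_masks(theta, n_samples):
    """Draw n_samples binary masks from the factor model of Eq. (10)."""
    p = sigmoid(theta)                                    # p_i = sigma(theta_i)
    return (np.random.rand(n_samples, theta.size) < p).astype(np.float64)

def grad_log_prob(theta, masks):
    """Gradient of Eq. (11) w.r.t. theta; for p = sigmoid(theta) it is z - p."""
    return masks - sigmoid(theta)

def update_theta(theta, score_fn, n_samples=50, lr=0.1):
    """One gradient-ascent step of Eq. (2) using the estimator of Eq. (6)."""
    masks = sample_masks(theta, n_samples)
    scores = np.array([score_fn(z) for z in masks])       # s(mu_W(x | z_i))
    grads = grad_log_prob(theta, masks)                   # shape (n_samples, |theta|)
    grad_estimate = (grads * scores[:, None]).mean(axis=0)
    return theta + lr * grad_estimate
```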

3.3 Network Pruning Algorithm

To use the factor model as a probability distribution, we assume independence between the network units (neurons and filters). Since this assumption is clearly not true in the case of a multilayer network, we assume independence only between units from the same layer. This way, we prune the network layer by layer, learning separate probability distributions for each layer. A formalized version is presented in Algorithm 1.

Algorithm 1: Network pruning.

Require: pre-trained network $\mu_W$
1: $l \leftarrow$ index of first or last layer
2: while stopping condition not met do
3:     initialize $P_{\boldsymbol{\theta}}(\mathbf{z})$ for layer $l$
4:     for some predefined number of steps do
5:         sample masks $\{\mathbf{z}_1, \dots, \mathbf{z}_N\}$ from $P_{\boldsymbol{\theta}}(\mathbf{z})$
6:         calculate $\nabla_{\boldsymbol{\theta}} S(P_{\boldsymbol{\theta}})$ from Eq. (6)
7:         update $\boldsymbol{\theta}$ from Eq. (2)
8:     prune layer $l$ based on $P_{\boldsymbol{\theta}}(\mathbf{z})$
9:     fine-tune $\mu_W$
10:    $l \leftarrow$ next layer index
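Putting the pieces together, Algorithm 1 can be sketched in a few lines. This is an illustrative outline under our own assumptions, not the released implementation: it reuses `sigmoid` and `update_theta` from the sketch in Section 3.2, collapses the outer while loop into a single pass over the layers, assumes layer objects expose a `num_units` attribute, and uses the 200 optimization steps and the 0.2 pruning threshold reported in Section 4.1; `make_score_fn`, `prune_layer` and `fine_tune` are hypothetical helpers for score evaluation, removal of low-probability units and a few epochs of retraining.

```python
import numpy as np

# reuses sigmoid() and update_theta() from the sketch in Section 3.2

def prune_network(network, layers, make_score_fn, prune_layer, fine_tune,
                  steps=200, threshold=0.2):
    """Layer-by-layer pruning (Algorithm 1): learn a Bernoulli factor model per
    layer, turn off units with small probability, then fine-tune the network."""
    for layer in layers:                        # single pass instead of the while loop
        theta = np.zeros(layer.num_units)       # p_i = sigmoid(0) = 0.5 initially (assumed)
        score_fn = make_score_fn(network, layer)
        for _ in range(steps):                  # optimize P_theta(z) with Eqs. (6) and (2)
            theta = update_theta(theta, score_fn)
        keep = sigmoid(theta) >= threshold      # drop units whose probability is small
        prune_layer(network, layer, keep)
        fine_tune(network)                      # recover accuracy before the next layer
    return network
```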

4 EXPERIMENTS

We analyze our pruning algorithm on different architectures trained on the CIFAR-10 (Krizhevsky et al., 2009) dataset.

Similar to Frankle and Carbin (2019), we use a small VGG-like (Simonyan and Zisserman, 2014) architecture with two convolutional layers, a max pooling layer and three fully connected layers. Both convolutional layers contain 64 filters of size 3×3, while the fully connected layers have 256, 256 and 10 neurons, respectively. The ReLU activation function is used in the hidden layers. The network is trained on the CIFAR-10 dataset for 10 epochs (using the Adam optimizer (Kingma and Ba, 2015) with a $2 \times 10^{-4}$ learning rate and an early stopping condition) and reaches 69.76% accuracy on the validation dataset and 68.95% on the test dataset.
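For reference, a PyTorch sketch of this small VGG-like network is shown below. Only the filter counts, kernel size and layer widths are taken from the text; the padding, the 2×2 pooling and the flattened input size of the first fully connected layer (for 32×32 CIFAR-10 images) are our assumptions.

```python
import torch.nn as nn

class SmallVGG(nn.Module):
    """Two 64-filter 3x3 convolutions, one max pooling layer and three fully
    connected layers (256, 256, 10) with ReLU in the hidden layers."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),   # padding assumed
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                          # pool size assumed
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(),  # 32x32 input -> 16x16 after pooling
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```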

Figure 1: Left: inferred probabilities for the 74 filters in the first convolutional layer (10 of them are filters with random weights). Right: standard deviation of the scores during the training process. Top: loss-score function, middle: acc-score function, bottom: exp-acc-score function.

4.1 Score Function

To analyze the different score functions presented in section 3.1, we prune the trained VGG-like network. First, we insert 10 extra filters (containing random weights) into the first convolutional layer. Then we prune the network and measure the true positive rate: how many of the extra filters are pruned from the network. A score function is better if the pruning algorithm can identify more filters with random weights.

For each score function, we train our factor model from Eq. (10) for 200 iterations and remove the filters with small probabilities (we use 0.2 as a threshold). We repeat the process until at least 10 filters are removed from the network and measure the validation and test accuracy of the pruned networks. For each of the three score functions we repeat the experiment 5 times and report the average validation and test accuracy, the number of removed filters and the number of true positives (removed filters that contain random weights). Results are reported in Table 1. As the table shows, the pruning method detects part of the inserted random filters; however, the true positive rate and the pruned network accuracy vary. When the network loss or the accuracy is used to calculate the score (first two rows of the table), on average more than 11 filters are removed from the network. However, only around 6-7 of them are true positives, which means that 30–40% of the removed units contain trained weights. In these cases the network accuracy also decreases quite heavily: by 0.91% and 1.76% on the validation set, and by 1.18% and 2.22% on the test set. Results are more promising in the case of the exponential score function: on average 8.4 of the 10 random filters are found by the algorithm, and the pruned network accuracy increases by 0.35% on the validation set while decreasing by only 0.38% on the test set.


Table 1: Pruning results using the three different score functions. The validation and test accuracy, true and false positives (Tp and Fp) are averages from 5 different experiments. We insert 10 random filters in each experiment, thus the ideal case would be that the algorithm prunes only those 10 filters.

Score function   Val. acc. (%)     Test acc. (%)     Tp    Tp+Fp
loss-score       68.84 (↓ 0.91%)   67.76 (↓ 1.18%)   6.8   11.2
acc-score        67.99 (↓ 1.76%)   66.72 (↓ 2.22%)   6.0   11.0
exp-acc-score    70.11 (↑ 0.35%)   68.56 (↓ 0.38%)   8.4   10.2

Figure 2: Top: accuracy change of the pruned networks when different numbers of filters (10, 20, 30) or neurons (20, 40, 60) are inserted into the layers. Bottom: true and false positive rates of the pruning.


In Figure 1, we present the dynamics of the inferred probabilities and scores during the factor model training. The left plots show how the mask probabilities converge to 0 and 1 during this process. It is easy to see that the convergence is much faster in the case of the loss-score and exp-acc-score functions (top and bottom). Using these functions, the values quickly move close to 0 or 1 depending on the filter's importance. The slow convergence with the acc-score function is related to the values of the scores (right plots): during the factor model training, the standard deviation of the scores decreases. This means that the difference between the gradients decreases as well, leading to slow convergence. The standard deviation of the scores is much higher in the case of the loss-score and exp-acc-score functions. This is important: as the probabilities start to converge (the network accuracy approximates the original accuracy), a small change in the mask $\mathbf{z}_i$ leads to a larger change in the score, meaning that the differences between the gradients are larger.

While the dynamics of the probabilities with the loss-score and exp-acc-score functions are similar (top and bottom plots), the values converge a bit more quickly with the latter. Moreover, as Table 1 shows, the results are also better with this function. This is because the exp-acc-score function provides consistent values during the training process, while the values of the loss-score depend on $L_{\min}$ and $L_{\max}$: the same mask $\mathbf{z}_i$ can have different scores in different training iterations as $L_{\min}$ and $L_{\max}$ change – which is very likely, since the masks are randomly sampled. This varying score affects the gradients, which leads to slower convergence.

4.2 Pruning Randomly Inserted Filters from Trained Networks

Based on the experiments presented in section 4.1, we select the exp-acc-score function and examine the pruning algorithm on different layers of the VGG-like network. Similar to section 4.1, we insert randomly initialized filters and neurons into the trained VGG-like network; however, we vary the number of random filters and the target layers as well. In the case of the convolutional layers, we insert 10, 20 and 30 random filters, while in the fully connected layers we insert 20, 40 and 60 random neurons, one after another. We repeat each experiment 5 times and report the average change of the validation and test accuracy (between the pruned and original networks), the average number of removed filters (TP + FP) and the average number of removed random filters (TP). Figure 2 presents the results of this experiment.

The first four plots show the accuracy change of the pruned networks compared to the original network accuracy (trained network, no random filters inserted). Since the probability distribution is optimized to maximize the expected value of the scores calculated on the validation set, the validation accuracy of the pruned network outperforms the validation accuracy of the original network in almost all four cases. This validates that Eq. (6) correctly estimates the gradient and that the gradient method can increase the expected value of the score. In the case of the test accuracy, a drop between 0.5% and 1.2% can be detected, and the gap slowly increases with the number of inserted filters. This means the overfitting becomes stronger as more random filters are inserted into the network.

The bottom plots show the number of removed filters (TP+FP) and the number of removed random filters (TP) for the four experiments. The first observation here is that filters in the first layer are more important than filters in the second layer: while the true positive rate is around 80% in the first layer, it is between 60% and 75% in the second layer. This means that more trained filters are removed from the second layer, but the accuracy values remain similar (or are even better in the case of the second layer). In the fully connected layers the true positive rate decreases below 50%. These layers contain more than 256 neurons, but only a fraction of them contributes to the correct output. The pruning algorithm can "pick" almost randomly from this pool, and the accuracy still remains close to the original accuracy.

In our second experiment we use the same network, but instead of targeting only a single layer, we insert randomly initialized filters into all the network layers. This problem is more challenging since filters influence each other through the layers. As presented in section 3.3, we apply sequential pruning: the algorithm prunes the layers one by one and repeats the process until the target size is reached (Figure 3).

The results show that the algorithm can find more than 75% of the random filters in the convolutional layers and more than 50% of the random neurons in the fully connected layers. While these values are similar to the results of the previous experiment, here the test accuracy drop increases to 1.5–3% (Figure 3, top).

Figure 3: Top: accuracy change of the pruned networks when different numbers of units are inserted into the four layers. Middle and bottom: true and false positive rates of the pruning at different layers.

4.3 Pruning the ResNet Architecture

Next, we test the ResNet (He et al., 2016) architecture. This is an efficient CNN architecture that applies residual blocks and "shortcut connections" for better propagation of the error signal. A residual block contains two sets of convolutional, batch normalization and ReLU layers, such that the output of one layer is fed into the input of the next. To prune the units in this residual block, we insert mask layers following the work of Sandor et al. (2020): the first mask layer is inserted after the first ReLU layer, while the second mask layer is inserted before the shortcut connection.
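A sketch of a residual block with the two mask layers is given below (PyTorch, under our own assumptions about the block layout: equal input and output widths and no downsampling). Each mask is a fixed 0/1 vector applied channel-wise; the masks themselves are not trained by backpropagation but are set from the learned probabilities.

```python
import torch
import torch.nn as nn

class MaskedBasicBlock(nn.Module):
    """ResNet basic block with two channel masks: one after the first ReLU and
    one right before the shortcut addition, following the placement in the text."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        # binary channel masks (1 = keep, 0 = pruned); set from the learned probabilities
        self.register_buffer("mask1", torch.ones(channels))
        self.register_buffer("mask2", torch.ones(channels))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = out * self.mask1.view(1, -1, 1, 1)   # first mask layer, after the first ReLU
        out = self.bn2(self.conv2(out))
        out = out * self.mask2.view(1, -1, 1, 1)   # second mask, before the shortcut addition
        return self.relu(out + x)                  # shortcut connection
```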

Training Details: We experiment with the ResNet-32, 56 and 110 architectures. We train the networks following the work of He et al. (2016), with the following modifications: we decrease the initial 0.1 learning rate to 0.01 and 0.001 at epochs 100 and 150, and stop the training at 200 epochs. During training we apply cropping and horizontal flips as data augmentation. The networks are trained on a 45K training set; a 5K validation set is used for training the parameterized probability distribution.

Pruning Details: Pruning is applied based on the algorithm presented in section 3.3. We start the process from the network's first layer and calculate the probabilities layer by layer. In each layer, we train the probability distribution (Eq. 10) for 200 iterations, such that in each iteration 50 masks are sampled and their corresponding scores are calculated. In each iteration the gradients are estimated – using Eq. (6) – and the parameters are updated – using Eq. (2). After 200 iterations, units with small probability are turned off – here we follow the work of Sandor et al. (2020): we drop the least important units such that the accuracy drop on the validation dataset is less than 1.0%. After a layer is pruned, we apply fine-tuning for 10 epochs.
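The unit selection step can be sketched as a greedy procedure. This is our own reading of the criterion above (following Sandor et al. (2020)), not the authors' exact code; `eval_accuracy` is a hypothetical helper that evaluates the network on the validation set with a given 0/1 mask.

```python
def select_units_to_drop(probs, eval_accuracy, baseline_acc, max_drop=1.0):
    """Turn off units in increasing order of their learned probability while the
    validation accuracy stays within max_drop percentage points of the baseline."""
    order = sorted(range(len(probs)), key=lambda i: probs[i])  # least important first
    mask = [1.0] * len(probs)
    dropped = []
    for i in order:
        mask[i] = 0.0
        if baseline_acc - eval_accuracy(mask) <= max_drop:
            dropped.append(i)      # keep this unit turned off
        else:
            mask[i] = 1.0          # revert: dropping it costs too much accuracy
    return dropped
```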


Table 2: Comparison of pruned ResNet with the results from the literature.

ResNet  Method                         Baseline acc. (%)  Pruned acc. (%)  Acc. drop (%)  FLOPs ↓ (%)  Params. ↓ (%)
32      SFP (He et al., 2018)          92.63              92.08            0.55           41.5         41.24*
32      FPGM (He et al., 2019)         92.63              92.82            -0.19          53.2         53.2*
32      LFE (Sandor et al., 2020)      92.97              92.42            0.55           46.4         49.35
32      Ours                           92.97              92.29            0.68           50.22        43.65
56      (Li et al., 2016)              93.04              93.06            -0.02          27.6         13.7
56      SFP (He et al., 2018)          93.59              93.35            0.1            47.14        52.6*
56      ThiNet (Luo et al., 2019)      93.8               92.98            0.82           49.78        49.67
56      FPGM (He et al., 2019)         93.59              93.49            0.1            47.14        52.6*
56      LFE (Sandor et al., 2020)      93.44              93.18            0.26           57.64        68.14
56      Adapt-DCP (Liu et al., 2021)   93.74              93.77            -0.03          68.48        54.80
56      Ours                           93.44              93.08            0.36           64.22        57.79
110     (Li et al., 2016)              93.53              93.3             0.23           38.6         32.40
110     SFP (He et al., 2018)          93.68              93.86            -0.18          40.8         40.72*
110     FPGM (He et al., 2019)         93.68              93.85            -0.17          52.3         52.7*
110     LFE (Sandor et al., 2020)      94.05              93.48            0.57           63.68        60.08
110     Ours                           94.05              93.45            0.6            72.53        68.89

*Parameter drop percentage is not reported in the paper. These values are calculated from other available information (e.g. "40% of the filters are selected").


Finally, when no more filters can be removed, we retrain the network for 100 epochs, setting the learning rate to 0.1 and decreasing it to 0.01 and 0.001 at epochs 40 and 60.

Results: We report the pruning results for the ResNet architectures in Table 2. The algorithm removes 43.65% of the parameters and 50.22% of the floating point operations (FLOPs) from the ResNet-32 architecture. While these values outperform the results of He et al. (2018) in terms of parameter and FLOPs reduction, they remain below the results of He et al. (2019) and Sandor et al. (2020). While the pruning results are modest on the smaller ResNet, they are more promising on the ResNet-56 and ResNet-110 versions. We manage to remove 64.22% of the FLOPs and 57.79% of the parameters from ResNet-56 with only a 0.36% accuracy drop. These values outperform most of the results reported by the papers selected for comparison: only the FLOPs reduction of Liu et al. (2021) and the parameter reduction of Sandor et al. (2020) outperform our algorithm. In the case of the ResNet-110 architecture, our pruning algorithm removes more than two thirds of the floating point operations (72.53%) and of the parameters (68.89%). These values surpass the results of the other papers by a significant margin.

5 CONCLUSIONS

In this paper, we described a structured pruning algorithm that approximates the importance probability of network units using Monte Carlo gradient estimation. To calculate the importance values, we introduced a function that scores the performance of different subnetworks. A subnetwork is defined by a binary mask that specifies the active and inactive units in the network. Given a set of scores and their corresponding subnetworks – binary masks –, we maximize the expected score of the network by optimizing the probability distribution of the masks using gradients estimated with the Monte Carlo method. Based on the importance values, our method prunes the network and produces a compressed model with parameters stored in smaller, dense matrices. We showed the effectiveness of our pruning algorithm on the CIFAR-10 dataset with a small VGG-like architecture as well as with different versions of the ResNet architecture. The experiments show that our algorithm achieves results comparable with current state-of-the-art pruning methods.

REFERENCES

Blalock, D. W., Ortiz, J. G., Frankle, J., and Guttag, J. (2020). What is the state of neural network pruning? ArXiv, abs/2003.03033.

Frankle, J. and Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR'2019.

Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M., and Dally, W. (2016a). EIE: Efficient inference engine on compressed deep neural network. ISCA'2016, pages 243–254.

Han, S., Mao, H., and Dally, W. J. (2016b). Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In ICLR'2016.

Han, S., Pool, J., Tran, J., and Dally, W. J. (2015). Learning both weights and connections for efficient neural networks. In NIPS'2015, pages 1135–1143, Cambridge, MA, USA.

Hassibi, B., Stork, D. G., Wolff, G., and Watanabe, T. (1993). Optimal brain surgeon: Extensions and performance comparisons. In NIPS'1993, pages 263–270.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR'2016, pages 770–778.

He, Y., Kang, G., Dong, X., Fu, Y., and Yang, Y. (2018). Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI'2018, pages 2234–2240.

He, Y., Liu, P., Wang, Z., Hu, Z., and Yang, Y. (2019). Filter pruning via geometric median for deep convolutional neural networks acceleration. In CVPR'2019.

Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR'2015.

Krizhevsky, A., Nair, V., and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, Faculty of Computer Science, University of Toronto.

Le Cun, Y., Denker, J. S., and Solla, S. A. (1990). Optimal brain damage. In NIPS'1990, pages 598–605.

Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. (2016). Pruning filters for efficient ConvNets. CoRR, abs/1608.08710.

Liu, J., Zhuang, B., Zhuang, Z., Guo, Y., Huang, J., Zhu, J., and Tan, M. (2021). Discrimination-aware network pruning for deep model compression. TPAMI'2021, PP:(early access).

Luo, J.-H., Zhang, H., Zhou, H.-Y., Xie, C.-W., Wu, J., and Lin, W. (2019). ThiNet: Pruning CNN filters for a thinner net. TPAMI'2019, 41(10):2525–2538.

Robert, C. P. and Casella, G. (2010). Monte Carlo Statistical Methods.

Sandor, C., Pavel, S., and Csato, L. (2020). Pruning CNN's with Linear Filter Ensembles. In ECAI'2020, volume 325 of Frontiers in Artificial Intelligence and Applications, pages 1435–1442.

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. JMLR'2014, 15(56):1929–1958.

Yao, S., Zhao, Y., Zhang, A., Su, L., and Abdelzaher, T. (2017). DeepIoT: Compressing deep neural network structures for sensing systems with a compressor-critic framework. In SenSys'2017.
