Revisiting the Deformable Convolution by Visualization
Yuqi Zhang, Yuyang Xie, Linfeng Luo and Fengming Cao
Pingan International Smart City Technologies Ltd., Shenzhen, China
Keywords:
Deformable Convolution, Object Detection, Visualization.
Abstract:
The deformable convolution improves the performance by a large margin across various tasks in computer
vision. The detailed analysis of the deformable convolution attracts less attention than the application of it. To
strengthen the understanding of the deformable convolution, the offset fields of the deformable convolution
in object detectors are visualized with the proposed visualization methods. After projecting the offset fields to the
feature map coordinates, we find that the displacement condenses the features of each object to the object
center and it learns to segment objects even without segmentation annotations. Meanwhile, projecting the
offset fields to the kernel coordinates demonstrates that the displacement inside each kernel is able to predict
the size of the object at that location. The two findings indicate that the offset field learns to predict the location and the size
of the object, which are crucial in understanding the image. The visualization in this work explicitly shows
the power of the deformable convolution by decoding the information in the offset fields. The ablation studies
of the two projections of the offset fields reveal that the projection in the kernel viewpoint contributes the most in current object detectors.
1 INTRODUCTION
The wide applications of deformable convolution op-
eration (Dai et al., 2017; Zhu et al., 2019; Yang et al.,
2019a; Chen et al., 2020; Thomas et al., 2019; Yang
et al., 2019b; Kong et al., 2020; Vu et al., 2019)
show its importance in computer vision. The de-
formable convolution operation is proposed to im-
prove the anchor-based object detector and semantic
segmentation initially (Dai et al., 2017; Zhu et al.,
2019), then is applied in several anchor-free object
detectors (Wang et al., 2019; Yang et al., 2019a;
Chen et al., 2020; Yang et al., 2019b; Kong et al.,
2020), and recently it helps in modelling 3D point
clouds (Thomas et al., 2019). The deformable con-
volution operation is believed to have two function-
alities (Wang et al., 2019): (1) it improves the rep-
resentation of features, (2) it encodes the object ge-
ometry inside its parameters. The deformable convo-
lution has demonstrated its superior performance in
the computer vision tasks. However, the understanding of the deformable convolution is not thorough enough, and one challenge remains in its visualization. Visualization, as the most straightforward approach, can demystify how the black box learns and why the operation works.
For the deformable convolution (Dai et al., 2017),
the output feature map y at the location p_0 is

y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)    (1)

where w is the weight of the kernel and x is the value from the feature map. For a 3×3 kernel with dilation 1, R = {(-1,-1), (-1,0), ..., (0,1), (1,1)}, and Δp_n is the learned offset with respect to each p_n. The main difference between the deformable convolution and the normal convolution is that the deformable convolution contains the dynamic offset field Δp_n. The dynamic Δp_n enables the network to recognize objects
with various geometric variations in images. The dynamic offset field is predicted from the feature maps with a lightweight neural network. Therefore, the explanation of the deformable convolution relies on the offset field Δp_n. Unfortunately, previous attempts to visualize the deformable convolution failed to concentrate on the offset field due to its high dimensionality. Although many works provide visualization methods for convolutional neural networks, such as gradient visualization (Simonyan and Zisserman, 2014), perturbation (Ribeiro et al., 2016), class activation maps (Zhou et al., 2016; Wang et al., 2020), and deconvolution-related methods (Zeiler and Fergus, 2014), the deformable convolution differs from the basic convolution because of its use of the offset field.
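To make Equation 1 concrete, the following is a minimal NumPy sketch of the sampling it describes for a single output location, using bilinear interpolation at the fractional positions p_0 + p_n + Δp_n. The single-channel setting, array shapes, and function names are illustrative assumptions rather than the layout of any particular implementation.

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly sample the 2-D map x at the fractional location (py, px)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < h and 0 <= xx < w:
                val += (1 - abs(py - yy)) * (1 - abs(px - xx)) * x[yy, xx]
    return val

def deform_conv_at(x, w, offsets, p0):
    """Equation (1) at one location: y(p0) = sum_n w(p_n) * x(p0 + p_n + Δp_n).

    x: (H, W) single-channel feature map, w: (3, 3) kernel weights,
    offsets: (9, 2) learned offsets Δp_n in (dy, dx) order, p0: (row, col).
    """
    R = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular 3x3 grid
    y = 0.0
    for n, (dy, dx) in enumerate(R):
        oy, ox = offsets[n]
        y += w[dy + 1, dx + 1] * bilinear(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return y
```

A full implementation (for example torchvision.ops.deform_conv2d) vectorizes this over channels, locations, and batches; the sketch only mirrors the per-location form of Equation 1.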
In this work, to understand the deformable convo-
lution further, we visualize the offset field using
vector analysis. In the visualization, the offset field is
projected to the feature map coordinate and the kernel
coordinate respectively. The two projections effec-
tively separate the learned information into a global
context in image scale and a local context in kernel
scale. The visualization results straightforwardly re-
veal why the deformable convolution surpasses the
normal convolution in object detectors: (1) the de-
formable convolution learns to condense the features
of the object to the object center, (2) the deformable
convolution learns to segment objects even without
segmentation annotations, (3) the deformable convo-
lution learns the size of the object for each feature
point. The ablation studies of the two projections of
the offset fields are also carried out and show that the
projection in the kernel viewpoint contributes more.
2 RELATED WORK
In the beginning, the deformable convolutions were
used in the backbone of the object detectors in (Dai
et al., 2017; Gao et al., 2019) by substituting the nor-
mal convolution operation with the deformable con-
volution. Amazed by its prevailing results on the
COCO detection benchmark (Lin et al., 2014), many
other researchers tried to apply the deformable con-
volution in the region proposal networks (Wang et al.,
2019; Vu et al., 2019) and the bounding box heads
(Yang et al., 2019a; Chen et al., 2020; Kong et al.,
2020; Yang et al., 2019b). With so many applications
of the deformable convolution, the straightforward vi-
sualization of it demands more attention.
The receptive fields and the sampling locations are
used to visualize the effect of the offset field in (Dai
et al., 2017); the visualization connects the sampling
locations to the activation units and shows the power
of the dynamic sampling locations. After that, the ef-
fective receptive fields and effective sampling loca-
tions are used to further analyze the deformable con-
volution in (Zhu et al., 2019; Gao et al., 2019); still,
the visualization results fail to focus on the core of the
deformable convolution, which is the offset field. In
(Wang et al., 2019; Yang et al., 2019a; Chen et al.,
2020; Yang et al., 2019b; Kong et al., 2020), the
offset fields of the deformable convolutions are correlated directly with the size and the position of the object, which casts light on the potential use of the
deformable convolution. To promote the future ap-
plication of the deformable convolution, we believe
the detailed visualization of it will increase its transparency to humans and provide promising insights into its potential use.
3 VISUALIZE THE OFFSET
FIELD
For a deformable convolution operation, Δp_n(p_0) consists of 9 separate offset maps, corresponding to the offset fields of the 9 kernel points. In this work we focus on a single deformable convolution operation and visualize the offset field inside it. When a kernel point p_n slides through the feature map, the offset maps in Figure 1(a)-(i) record its learned offset field. At the beginning of the training process, the offset fields are initialized with zero vectors, and after the training of the detector, most of the offset fields evolve into a quite ordered state, as the red arrows in each subfigure have similar vector values.
Apart from visualizing the offset field as a whole,
the offset field can be projected into two perspectives.
One is from the feature map coordinate and the other
one is from the kernel coordinate. In the feature map
coordinate, the average displacement d_f(p_0) of the offsets Δp_n(p_0) over the kernel points p_n is

d_f(p_0) = \frac{1}{9} \sum_{p_n \in R} \Delta p_n(p_0)    (2)

where R = {(-1,-1), (-1,0), ..., (0,1), (1,1)}. Then, in the kernel coordinate, the displacement field inside each kernel, d_k, is

d_k(p_0, p_n) = \Delta p_n(p_0) - d_f(p_0)    (3)
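As a reference for how the two projections can be computed, the sketch below applies Equations 2 and 3 to an offset field stored as a NumPy array of shape (9, 2, H, W), i.e. nine kernel points with (dy, dx) components at every feature map location; this layout is an assumption made for illustration, not the layout of any specific framework.

```python
import numpy as np

def split_offset_field(offsets):
    """Project the offset field into the feature map and kernel viewpoints.

    offsets: array of shape (9, 2, H, W) holding the learned offsets Δp_n(p_0).
    Returns d_f with shape (2, H, W) and d_k with shape (9, 2, H, W).
    """
    d_f = offsets.mean(axis=0)       # Equation (2): average over the 9 kernel points
    d_k = offsets - d_f[None, ...]   # Equation (3): per-kernel-point residual
    return d_f, d_k
```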
The distribution of d_f(p_0) over the entire feature map can be seen in Figure 2. Observing that the arrows of d_f(p_0) point to the objects' centers, we borrow the concept of divergence from vector analysis to visualize d_f(p_0) better. The divergence div(p_0) of d_f(p_0) is

div(p_0) = \frac{\partial d_{f,x}(p_0)}{\partial x} + \frac{\partial d_{f,y}(p_0)}{\partial y}    (4)

where d_{f,x}(p_0) and d_{f,y}(p_0) are the displacement components along the x and y axes respectively, and ∂/∂x and ∂/∂y are the partial derivative operators along the x and y axes respectively. The divergence operates on a vector field and produces a scalar field representing the volume density of the outward flux of the vector field from an infinitesimal volume around a given point. A positive value means an outward flux of the vector field and a negative value means an inward flux.
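A finite-difference estimate of Equation 4 on the feature map grid can be sketched as follows, assuming d_f has the (2, H, W) layout of the previous sketch.

```python
import numpy as np

def divergence(d_f):
    """Estimate div(p_0) = ∂d_f,x/∂x + ∂d_f,y/∂y as an (H, W) scalar field."""
    d_y, d_x = d_f[0], d_f[1]          # displacement components along y and x
    ddy_dy = np.gradient(d_y, axis=0)  # partial derivative of d_f,y along y
    ddx_dx = np.gradient(d_x, axis=1)  # partial derivative of d_f,x along x
    return ddx_dx + ddy_dy
```

Feature points where this estimate is negative are the locations toward which the displacements flow inward.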
Figure 1: The offset fields of the 3×3 kernel points of a deformable convolution for an example image from the COCO validation dataset. The small arrows in (a)-(i) represent the offset field for each kernel point and the thick arrows in the center show the trend of the offsets for each kernel point.

Figure 2 plots the distributions of d_f for the feature maps, which are learned in the Faster RCNN object detector with deformable convolutions in the backbone, as stated in (Gao et al., 2019), and trained with COCO bounding box annotations only. Two fundamental insights can be found in them. The first insight is that the arrows of d_f from Equation 2 converge to the center of each object, which is captured by the negative values of the divergence from Equation 4; the feature points with negative divergence are shown with red masks in Figure 2.
Converging feature points to the object center is a key
function of the deformable convolution. The converg-
ing of the feature points can be understood as an in-
formation condensation process when the feature map
stride is smaller than the object size. The information
from the feature map with only a part of a zebra in it
is less than the information from the feature map with
the whole zebra. Therefore, after the deformable con-
volution, the feature maps are learned with the con-
densed information which helps the network to under-
stand the image. The second insight is that though the
network is trained without any mask information, the
visualization of the offset field shows that it is able to
learn the segmentation inside the deformable convolu-
tion operation. At different scales of the feature maps, the deformable convolution focuses on different scales of information. For example, the deformable convolution in the feature map generated with a smaller stride locates the stripes of the zebra and the branches of the tree in Figure 2(c), while it locates the zebra and the tree as a whole with a larger stride, as shown in Figure 2(d).
To find out what information is inside the kernel viewpoint, Figure 3 shows the learned distribution of d_k. The radius of each circle is calculated as r = √(width × height), where width is the offset difference along the x axis between the left and right kernel points, and height is the offset difference along the y axis between the top and bottom kernel points. Therefore, the area of each circle in Figure 3 indicates the expansion or shrinkage status of d_k. In other words, the circles reflect how large an area each 3×3 kernel wants to cover. Figure 3 shows that d_k has learned the size of the object for each 3×3 kernel, since the radius of the circle accords with the size of the object behind the feature point. In detail, in Figure 3(a), which contains a dog, a person, a Frisbee, and a tree, the largest circle is inside the tree while the smallest circle is inside the Frisbee, which corresponds to the sizes of the
objects, Size_tree > Size_Frisbee.

Figure 2: The visualization of the offset field in the deformable convolution from the feature map point of view. (a) and (c) are from stage 3, (b), (e), and (f) are from stage 4, and (d) is from stage 5. All are examples of the displacement d_f converging to the centers of the objects.

It should be noted that not all feature points inside the object have the same size, especially when the object spans many feature points. Not every feature point inside the object reflects the real size of the object, except the feature point with the largest circle.
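As a sketch of how the circles in Figure 3 can be derived from d_k, the following assumes the (9, 2, H, W) layout used above, with the nine kernel points enumerated row by row so that indices 1, 3, 5, and 7 are the top, left, right, and bottom points; this indexing convention and the reconstructed radius formula r = √(width × height) are illustrative assumptions.

```python
import numpy as np

def kernel_circle_radius(d_k):
    """Per-location radius r = sqrt(width * height) of the deformed 3x3 kernel.

    d_k: array of shape (9, 2, H, W) in (dy, dx) order, kernel points row by row.
    """
    width = d_k[5, 1] - d_k[3, 1]   # x-offset difference: right minus left point
    height = d_k[7, 0] - d_k[1, 0]  # y-offset difference: bottom minus top point
    # The product can be negative when a kernel shrinks; clip before the square root.
    return np.sqrt(np.clip(width * height, 0.0, None))
```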
The separate visualizations of the offset field into
the feature map coordinate and the kernel coordi-
nate straightforwardly show that the offset field has
learned the position information and the size informa-
tion of the objects in the basic learning of the object
detectors without any direct supervision on the offset
fields. The position of objects is global information, so it can be represented by d_f(p_0), as shown in Figure 2. Meanwhile, the size of the object is rather local information, as shown in Figure 3.
4 ABLATION STUDY OF d_f AND d_k
We trained our detectors on the 118k images of
the COCO 2017 train dataset and evaluated the
performance on the 5k images of the validation
dataset. The standard mean average-precision over
IoU=0.5:0.05:0.95 is used to measure the perfor-
mance of the detectors.
The Faster R-CNN and the Mask R-CNN are cho-
sen as two baselines representing the use of the de-
formable convolution in object detectors. The imple-
mentation is based on the MMDetection framework.
The Faster R-CNN detector is trained with a stochas-
tic gradient descent optimizer over 2 GPUs with a to-
tal batch size of 8 images per mini batch. The Mask
R-CNN detector is trained over 2 GPUs with a to-
tal batch size of 4 images per mini batch. The “1×” schedule is adopted for the learning rate. No test-time augmentation is used, and a non-maximum suppression IoU threshold of 0.5 is employed for both detectors.
Figure 3: The visualization of the offset field in the deformable convolution from the kernel point of view. The size of the circle for each point represents the expansion or shrinkage inside each kernel; in other words, the size of the circle represents the area covered by the offset field inside each 3×3 kernel.

Both d_f and d_k adjust how the convolution sees the feature map. The d_f changes the receptive field globally, while d_k changes it locally inside each kernel. To investigate the performance of d_f and d_k in the offset fields of the deformable convolution, experiments are conducted on the Mask RCNN and Faster RCNN detectors with either d_f or d_k. The detectors are still trained end to end, and the new offset fields are calculated by Equation 2 and Equation 3 respectively. The performance is shown in Table 1. The major contribution of the deformable convolution comes from d_k, while d_f also improves the performance compared with the baseline detectors without the deformable convolution. Although the visualizations of d_k and d_f in Figure 3 and Figure 2 show that both components have learned information about the object, the ablation studies show that the local information encoded in d_k is more significant than that in d_f, and the global information in d_f may not be well utilized in the following parts of the networks.
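As a rough sketch of how the two restricted offset fields used in this ablation might be constructed from a predicted offset field with the (9, 2, H, W) layout used above, one can keep only the mean displacement of Equation 2 or only the zero-mean residual of Equation 3. This is an illustrative reconstruction of the setup described in the text, not the authors' implementation.

```python
import numpy as np

def df_only(offsets):
    """Offset field in which every kernel point uses the shared mean displacement d_f."""
    d_f = offsets.mean(axis=0)                         # Equation (2)
    return np.broadcast_to(d_f, offsets.shape).copy()  # same d_f for all 9 points

def dk_only(offsets):
    """Offset field that keeps only the per-kernel-point part d_k."""
    return offsets - offsets.mean(axis=0)              # Equation (3)
```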
Table 1: The effect of d_f and d_k.

baseline      d_f   d_k   AP_bbox   AP_segm
Mask RCNN     -     -     0.382     0.347
Mask RCNN     X     -     0.388     0.351
Mask RCNN     -     X     0.418     0.375
Mask RCNN     X     X     0.420     0.376
Faster RCNN   -     -     0.374     NA
Faster RCNN   X     -     0.380     NA
Faster RCNN   -     X     0.413     NA
Faster RCNN   X     X     0.416     NA
5 FUTURE WORK
The visualization and analysis in this work show the
effects of the deformable convolution in predicting
the size and the position of the object. It should be
noted that the deformable convolution learns the two
sources of information without direct supervision and
the automatically learned information is all from its
natural usage of the offset field. Therefore, the fu-
ture work includes (1) the study of the behaviour of
the deformable convolution when direct supervision
is enforced on the offset field; (2) utilization of the
two properties of the offset fields in the prediction
branches of bounding box and mask.
6 CONCLUSION
In summary, the detailed visualization and analysis of
the offset field are made to promote the straightfor-
ward understanding of the deformable convolution.
For the object detector, even a single deformable convolution conveys the information of object position and object size. We show this directly by visualizing
the offset field in the feature map viewpoint and the
kernel viewpoint separately. The position of the ob-
ject is global information which is in the feature map
viewpoint while the size of the object is local infor-
mation which is in the kernel viewpoint. The effect of
the offset field in the two viewpoints is investigated
separately, and the results show that the component in the kernel viewpoint improves the deformable convolution more.
REFERENCES
Chen, Y., Zhang, Z., Cao, Y., Wang, L., Lin, S., and
Hu, H. (2020). Reppoints v2: Verification meets
regression for object detection. arXiv preprint
arXiv:2007.08508.
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and
Wei, Y. (2017). Deformable convolutional networks.
In Proceedings of the IEEE international conference
on computer vision, pages 764–773.
Gao, H., Zhu, X., Lin, S., and Dai, J. (2019). Deformable
kernels: Adapting effective receptive fields for object
deformation. arXiv preprint arXiv:1910.02940.
Kong, T., Sun, F., Liu, H., Jiang, Y., Li, L., and Shi,
J. (2020). Foveabox: Beyound anchor-based object
detection. IEEE Transactions on Image Processing,
29:7389–7398.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Dollár, P., and Zitnick, C. L. (2014).
Microsoft coco: Common objects in context. In Euro-
pean conference on computer vision, pages 740–755.
Springer.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any
classifier. In Proceedings of the 22nd ACM SIGKDD
international conference on knowledge discovery and
data mining, pages 1135–1144.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Thomas, H., Qi, C. R., Deschaud, J.-E., Marcotegui, B.,
Goulette, F., and Guibas, L. J. (2019). Kpconv: Flex-
ible and deformable convolution for point clouds. In
Proceedings of the IEEE International Conference on
Computer Vision, pages 6411–6420.
Vu, T., Jang, H., Pham, T. X., and Yoo, C. (2019). Cas-
cade rpn: Delving into high-quality region proposal
network with adaptive convolution. In Advances in
Neural Information Processing Systems, pages 1432–
1442.
Wang, H., Wang, Z., Du, M., Yang, F., Zhang, Z., Ding, S.,
Mardziel, P., and Hu, X. (2020). Score-cam: Score-
weighted visual explanations for convolutional neu-
ral networks. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
Workshops, pages 24–25.
Wang, J., Chen, K., Yang, S., Loy, C. C., and Lin, D. (2019).
Region proposal by guided anchoring. In Proceedings
of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 2965–2974.
Yang, Z., Liu, S., Hu, H., Wang, L., and Lin, S. (2019a).
Reppoints: Point set representation for object detec-
tion. In Proceedings of the IEEE International Con-
ference on Computer Vision, pages 9657–9666.
Yang, Z., Xu, Y., Xue, H., Zhang, Z., Urtasun, R., Wang, L.,
Lin, S., and Hu, H. (2019b). Dense reppoints: Rep-
resenting visual objects with dense point sets. arXiv
preprint arXiv:1912.11473.
Zeiler, M. D. and Fergus, R. (2014). Visualizing and under-
standing convolutional networks. In European confer-
ence on computer vision, pages 818–833. Springer.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Tor-
ralba, A. (2016). Learning deep features for discrim-
inative localization. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 2921–2929.
Zhu, X., Hu, H., Lin, S., and Dai, J. (2019). Deformable
convnets v2: More deformable, better results. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 9308–9316.