Object Detection for Autonomous Driving: Motion-aid Feature
Calibration Network
Dongfang Liu, Yaqin Wang and Eric T. Matson
Computer and Information Technology, Purdue University, West Lafayette, IN, U.S.A.
Keywords:
Autonomous Driving, Video Object Detection, Pixel Feature Calibration, Instance Feature Calibration.
Abstract:
Object detection is a critical task for autonomous driving. The latest progress in deep learning research for object detection has contributed substantially to the development of autonomous driving. However, directly transferring state-of-the-art object detectors from images to video is problematic. Object appearances in videos exhibit more variation, e.g., video defocus, motion blur, and truncation. Such variations occur far less often in still images and can compromise detection results. To address these problems, we build a fast and accurate deep learning framework, the motion-aid feature calibration network (MFCN), for video detection. Our model leverages the temporal coherence of video features under motion. It calibrates and aggregates features of detected objects from previous frames along their spatial changes to improve the feature representation of the current frame. The whole architecture is trained end-to-end, which boosts detection accuracy. Validation on the KITTI and ImageNet datasets shows that MFCN improves the baseline results by 10.3% and 9.9% mAP, respectively. Compared with other state-of-the-art models, MFCN achieves a leading performance in the KITTI benchmark competition. The results indicate the effectiveness of the proposed model, which could facilitate autonomous driving systems.
1 INTRODUCTION
Object detection is a primary task for the perception of an autonomous car on its way to full autonomy (Liu et al., 2019)(Wang et al., 2019)(Ramanagopal et al., 2018). In recent years, the development of deep learning networks has rendered great progress in object detection that can be adopted for autonomous driving (Ren et al., 2015)(Lin et al., 2017b)(Zeng et al., 2016). The current state-of-the-art detectors achieve recognizable success on static images (Chen et al., 2017), but their performance on live-stream video must improve greatly to meet the needs of autonomous driving (Zhu et al., 2017b). In autonomous driving, the camera sensor constantly produces a live video stream whose frames may include a large number of object appearances with drastic variations; for instance, motion blur and video defocus are frequently observed. Under these circumstances, still-image object detection may fail and generate unstable results.
To address these challenges in video object detection, one intuitive solution is to analyze the temporal and spatial coherence in videos and employ information from nearby frames (Du et al., 2017). Since videos generally include rich imagery information, it is easy to identify the same object instance in multiple frames within a short time. (Kang et al., 2018)(Han et al., 2016)(Lee et al., 2016) exploit such temporal information in existing videos to detect objects in a simple way. These studies first apply object detectors to individual frames and then assemble the inference results in a dedicated post-processing step across the temporal dimension. Following the same principle, (Han et al., 2016)(Kang et al., 2018)(Kang et al., 2016)(Feichtenhofer et al., 2017) propose hand-crafted bounding box association rules to improve the final predictions. Although the approaches mentioned above generate promising results, they are post-processing schemes rather than principled learning techniques.
To address the limitations of existing work, we further exploit temporal information and seek more robust feature calibration to boost the accuracy of video object detection. In this study, we present the motion-aid feature calibration network (MFCN), a fast, accurate, end-to-end framework for video object detection. We consider the effect of movement on features to facilitate per-frame feature learning. We leverage temporal coherence and calibrate pixel
and instance features across frames to promote the feature representation of the current frame. Accordingly, the final detection is effectively improved.
The key contributions of our work are threefold. First, we introduce a fast, accurate, end-to-end framework for video object detection that could be adopted by autonomous driving systems. Second, we use an ablation study to verify the improvements of the proposed model over a strong single-frame baseline and the contribution of each module in the proposed model. Third, we conduct extensive experiments to evaluate the performance of the proposed model by comparing it with state-of-the-art systems. The results indicate that our model is competitive with the existing best approaches for video object detection.
The rest of the paper is structured as follows. In Section II, a thorough discussion of related work is presented, which builds the foundation of this research. Next, we articulate the technical details of the proposed model in Section III. In Section IV, we present the results of extensive experiments with our proposed model on two benchmark datasets. Finally, in Section V, we conclude this study and address future work.
2 RELEVANT RESEARCH
With the development of deep learning for computer vision, object detection has been extended from still images to video (Zhu et al., 2017b)(Zhu et al., 2018)(Zhang et al., 2018). One example is that video object detection competitions, such as ImageNet (Russakovsky et al., 2015), have drawn wide attention in both academia and industry. For video object detection, there are two typical approaches in existing work: 1. post processing; and 2. principled learning. We elaborate on them in the following subsections.
2.1 Post Processing
Much existing work leverages state-of-the-art single-image object detection systems to tackle object detection in the video domain. (Uijlings et al., 2013)(Zitnick and Dollár, 2014) directly applied object detectors to individual frames and then assembled the inference results to generate the final predictions. These explorations were effective but naïve because they completely ignored the temporal dimension of the sequential video frames. On the contrary, (Han et al., 2016)(Lee et al., 2016)(Feichtenhofer et al., 2017) incorporated temporal information during post-processing and effectively improved the detection results within each individual frame. Once region proposals and their corresponding class scores for each frame were obtained, (Han et al., 2016)(Lee et al., 2016)(Feichtenhofer et al., 2017) implemented algorithms that select boxes to maximize a sequence score, as sketched below. The selected boxes are then employed to suppress overlapping boxes in the corresponding frames in order to boost weaker detections. The post-processing approach provides valuable insights for applications related to video surveillance. However, due to the nature of its offline design, the post-processing approach casts a cloud over object detection for autonomous driving because it cannot be adopted directly in a real-time driving system. To facilitate autonomous driving, we seek answers from the principled learning approach, which can perform real-time inference for object detection on videos.
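To make the sequence-score idea concrete, the following minimal sketch follows the spirit of methods such as Seq-NMS (Han et al., 2016): link detections across consecutive frames when they overlap, find the highest-scoring temporal chain by dynamic programming, and use it for rescoring and suppression. The function names, the linking threshold, and the simplification to a single full-length chain are illustrative assumptions, not the published implementation.

```python
# Illustrative sketch of sequence-score selection across frames (Seq-NMS-style).
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def best_sequence(boxes_per_frame, scores_per_frame, link_iou=0.5):
    """Return per-frame box indices of the highest-scoring linked sequence."""
    n_frames = len(boxes_per_frame)
    # acc[t][k]: best accumulated score of a chain ending at box k of frame t
    acc = [np.array(s, dtype=float) for s in scores_per_frame]
    prev = [np.full(len(s), -1, dtype=int) for s in scores_per_frame]
    for t in range(1, n_frames):
        for k, box in enumerate(boxes_per_frame[t]):
            for j, prev_box in enumerate(boxes_per_frame[t - 1]):
                cand = acc[t - 1][j] + scores_per_frame[t][k]
                if iou(box, prev_box) >= link_iou and cand > acc[t][k]:
                    acc[t][k] = cand
                    prev[t][k] = j
    # Backtrack the best chain from the last frame.
    chain = [int(np.argmax(acc[-1]))]
    for t in range(n_frames - 1, 0, -1):
        j = int(prev[t][chain[-1]])
        if j < 0:          # chain broken; the sketch keeps this case simple
            break
        chain.append(j)
    return list(reversed(chain))
```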
2.2 Principled Learning
Current state-of-the-art detectors generally embrace an identical two-stage structure. First, a deep convolutional neural network (CNN) captures the salient features of the input image and generates a set of intermediate convolutional feature maps accordingly (Krizhevsky et al., 2012)(Simonyan and Zisserman, 2014)(He et al., 2016). Then, a detection network predicts the detection results based on the feature maps (Lin et al., 2017a)(He et al., 2015)(Liu et al., 2016). The first stage accounts for the major computational cost compared to the final detection task. The intermediate convolutional feature maps cover the same spatial extent as the input image but at a much smaller resolution. Such low-resolution feature maps can be cheaply propagated by spatial warping from nearby frames (Ren et al., 2017)(Ma et al., 2015). To leverage this property of CNNs, (Zhu et al., 2017b)(Kang et al., 2016)(Kang et al., 2018)(Zhu et al., 2017a) proposed end-to-end frameworks that use feature propagation to enhance the features of individual frames in videos. (Zhu et al., 2017b)(Zhu et al., 2017a) applied an optical flow network (Ilg et al., 2017) to estimate per-pixel motion. Since optical flow estimation and feature propagation are much faster than convolutional feature extraction, (Zhu et al., 2017b)(Zhu et al., 2017a) avoided the computational bottleneck and achieved significant runtime improvement. Since both the motion flow and the object detection were predicted by deep networks, the entire architecture was trained end-to-end and the detection accuracy and efficiency were significantly boosted (Zhu et al., 2017b)(Zhu et al., 2017a). Similarly, (Kang et al., 2016)(Kang et al., 2018) presented a novel approach using tubelet proposal networks, which
extracted multi-frame features to predict the pattern of object motion. By calculating the spatial changes, (Kang et al., 2016)(Kang et al., 2018) utilized the spatiotemporal proposals for the final detection task. The results from (Kang et al., 2016)(Kang et al., 2018) demonstrated state-of-the-art performance in object detection on videos.
All the above methods, however, focus on sparse pixel-level propagation. This approach encounters difficulties when the appearance or scale of objects changes dramatically within the video. With inaccurate motion estimation, the pixel-level propagation approach (Zhu et al., 2017b)(Zhu et al., 2017a) may struggle to produce desirable results. In addition, propagating pixel features from only a few designated frames may be insufficient for the final detection task if an object instance appears for a period shorter than the propagation range. In our work, we calibrate features of interest at both the pixel and the instance level. We consider object motion in our calibration, so we can accurately aggregate features of interest from consecutive frames instead of using sparse frames as a reference. Compared to pixel-feature propagation, our approach is more robust to appearance variations in video frames.
3 TECHNICAL APPROACH
In this section, we present the details of our technical approach, which includes the model design and the inference of the model. The model design introduces the implementation of our model and the sub-networks of the overall deep neural network; the inference of the model explains the prediction procedure of the trained model.
3.1 Model Design
The core concept of MFCN is to calibrate imagery features at the pixel level and the instance level along with the motion from frames $t-i$ to the current frame $t$ to improve the detection results. Following the popular CNN architectures for detection (Ren et al., 2015)(Liu et al., 2016), our model has a feature extraction sub-network $N_{ext}$ to produce a number of intermediate feature maps and an object detection sub-network $N_{det}$ to process the extracted feature maps. When the camera sensor produces video frames $I_t$, $t = 1, 2, \ldots$, the feature extraction sub-network $N_{ext}$ processes the input frame and produces a number of intermediate feature maps, i.e., $f_t = N_{ext}(I_t)$. The feature maps are fed into the object detection sub-network $N_{det}$, i.e., $d_t = N_{det}(f_t)$. The output $d_t$ includes the position-sensitive score maps and the proposals of the input frame.
MFCN includes three contributing components: 1. pixel-feature calibration; 2. instance-feature calibration; and 3. adaptive weighting. For pixel-feature calibration, $N_{ext}$ constantly receives single-frame inputs and generates the intermediate feature maps. We employ a motion estimation sub-network to estimate the motion tendency of feature maps across frames. Based on the motion estimation, we employ a bilinear warping function $W(\cdot,\cdot)$ to warp pixel features from previous frames as $f_{t-i \to t}$. Meanwhile, we adopt a fully convolutional network $\varepsilon(\cdot)$ to compute embedding features from $t-i$ to $t$ for a similarity measurement, which yields the adaptive weight for each feature map from $t-i$. Once we have the adaptive weights, we aggregate each feature map from $f_{t-i}$ to $f_t$ and calibrate the feature of the current frame as $f_{t,final}$.
Next, for instance-feature calibration, we use the object detection sub-network to obtain position-sensitive score maps and instance proposals from frame $t-i$ to frame $t$. Since each proposal has ground-truth information, we use a spatial-change regression network $R(\cdot)$ to estimate the movement of each proposal from its corresponding location at frame $t-i$ to frame $t$. With the estimated proposal movement, we calibrate each instance feature accordingly from frame $t-i$ to the current frame $t$. Once the calibration is done, we average the aggregated instance features from $t-i$ to obtain the final result for frame $t$. The procedures for pixel-feature calibration, adaptive weight calculation, and instance-feature calibration are elaborated below.
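To make the notation concrete, the following sketch shows the per-frame data flow described above with placeholder sub-networks. The shapes, channel counts, and function names are illustrative assumptions, not the exact architecture.

```python
# Illustrative per-frame data flow of MFCN: f_t = N_ext(I_t), d_t = N_det(f_t).
# The callables below are stand-ins (assumed interfaces), not the actual networks.
import numpy as np

def n_ext(frame):
    """Stand-in for the feature extraction sub-network N_ext (e.g., a ResNet-101 trunk)."""
    h, w, _ = frame.shape
    return np.zeros((h // 16, w // 16, 1024), dtype=np.float32)  # coarse feature map

def n_det(feature_map):
    """Stand-in for the detection sub-network N_det (R-FCN-style head)."""
    score_maps = np.zeros(feature_map.shape[:2] + (7 * 7 * 3,), dtype=np.float32)
    proposals = np.zeros((300, 4), dtype=np.float32)  # (x, y, w, h) per proposal
    return score_maps, proposals

frame_t = np.zeros((384, 1280, 3), dtype=np.uint8)   # current frame I_t
f_t = n_ext(frame_t)                                  # intermediate feature maps f_t
d_t = n_det(f_t)                                      # score maps and proposals d_t
```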
3.1.1 Pixel-feature Calibration
We model the pixel-level calibration from a previous frame $I_{t-i}$ to the current frame $I_t$ with motion estimation. We denote $F(\cdot,\cdot)$ as the motion estimation function. $F(I_{t-i}, I_t)$ calculates the spatial changes from frame $I_t$ to $I_{t-i}$. We leverage the motion estimation to infer the location $p_{t-i}$ in the previous frame $t-i$ that corresponds to the location $p_t$ in the current frame $t$. Let $\Delta p$ be the estimated displacement between $p_{t-i}$ and $p_t$, so that $\Delta p = F(I_{t-i}, I_t)(p_t)$. The calibrated feature map is defined as:
$f_{t-i \to t}(p_t) = \sum_{q} G(q, p_{t-i} + \Delta p)\, f_{t-i}(q).$   (1)
In (1), $q$ enumerates all spatial locations on the feature map $f_{t-i}$, and $G(\cdot,\cdot)$ denotes the bilinear interpolation kernel, which is composed of two one-dimensional kernels:

$G(q, p + \Delta p) = g(q_x, p_x + \Delta p_x) \cdot g(q_y, p_y + \Delta p_y).$   (2)
Then, the calibrated feature maps from previous frames are warped into the current frame accordingly. The warping process is as follows:

$f_{t-i \to t} = W(f_{t-i}, F(I_{t-i}, I_t)).$   (3)
In (3), $f_{t-i}$ is the feature map produced by $N_{ext}$ at frame $t-i$, $f_{t-i \to t}$ denotes the features warped from time $t-i$ to time $t$, and $W(\cdot,\cdot)$ is a bilinear warping function applied to every channel of the feature maps across frames. Through this feature warping process, the current frame aggregates multiple feature maps from previous frames. To process the aggregated features, we average them to update the feature of the current frame:
$f_{t,final} = \sum_{j=t-i}^{t} \omega_{j \to t}\, f_{j \to t}$   (4)
In (4), $\omega_{j \to t}$ is the adaptive weight, which decides the contribution of each warped feature map, $f_{j \to t}$ are the warped features, $f_{t,final}$ is the calibrated feature for the current frame, and $i$ specifies the calibration range from the previous frames to the current frame ($i = 3$ by default).
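To make Eqs. (1)-(4) concrete, here is a minimal NumPy sketch of bilinear warping of a feature map by an estimated displacement field and the weighted aggregation of the warped maps. The array layout (H, W, C) and the flow convention are assumptions made for illustration; the actual model estimates the displacements with a flow network.

```python
# Minimal sketch of Eqs. (1)-(4): warp a previous feature map to the current frame
# with a per-pixel displacement field, then aggregate the warped maps with weights.
import numpy as np

def bilinear_warp(feat_prev, flow):
    """Warp feat_prev (H, W, C) using flow (H, W, 2), where flow[y, x] gives the
    displacement from the current-frame location to the sampling point in the
    previous frame (the bilinear kernel G of Eqs. (1)-(2))."""
    h, w, _ = feat_prev.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling locations p_{t-i} + Δp in the previous frame.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, w - 1), np.clip(y0 + 1, 0, h - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    # Two separable 1-D kernels g(., .) combined as in Eq. (2).
    return (feat_prev[y0, x0] * (1 - wx) * (1 - wy) +
            feat_prev[y0, x1] * wx * (1 - wy) +
            feat_prev[y1, x0] * (1 - wx) * wy +
            feat_prev[y1, x1] * wx * wy)

def aggregate(warped_feats, weights):
    """Eq. (4): weighted sum of warped feature maps; weights has shape (N, H, W)."""
    stacked = np.stack(warped_feats)                        # (N, H, W, C)
    return (stacked * np.stack(weights)[..., None]).sum(axis=0)
```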
3.1.2 Adaptive Weight Calculation
The adaptive weight indicates the importance of each warped feature from a previous frame to the current frame. At location $p_{t-i}$, if the feature $f_{t-i}(p_{t-i})$ is close to the feature $f_t(p_t)$ of the current frame, a larger weight is assigned to it; otherwise, a smaller weight is assigned. To calculate the adaptive weight, we employ the cosine similarity metric (Liu et al., 2016) to measure the similarity between the previous frame $t-i$ and the current frame. In the implementation, we apply a three-layer fully convolutional network $\varepsilon(\cdot)$ to $f_{t-i \to t}$ and $f_t$, which projects them into embedding features for the similarity measurement. The measurement is defined as:

$f^{s}_{t-i \to t},\; f^{s}_{t} = \varepsilon(f_{t-i \to t}, f_t)$   (5)
In (5), $f^{s}_{t-i \to t}$ and $f^{s}_{t}$ are the embedding features used for the similarity measurement. The adaptive weight is then defined as:

$\omega_{t-i \to t} = \exp\!\left(\dfrac{f^{s}_{t-i \to t} \cdot f^{s}_{t}}{|f^{s}_{t-i \to t}|\,|f^{s}_{t}|}\right).$   (6)

The adaptive weight $\omega_{t-i \to t}$ is normalized at every spatial location over the previous frames from $I_{t-i}$ to $I_t$.
3.1.3 Instance-feature Calibration
After updating the feature map for the current frame, we feed $f_{t,final}$ into the detection sub-network $N_{det}$ and obtain an improved detection result $d_t$ for the current frame $t$. Instance-feature calibration then calibrates features at the instance level across frames. The final detection result at frame $t$ is calculated by aggregating the calibrated instance features, with proposal movements estimated from previous frames.
To compute the movement of individual proposals from frame $t-i$ to $t$, we utilize the RoI pooling operation to extract the pooled features $m^{n}_{t-i \to t}$ of the $n$-th proposal $r^{n}_{t-i}$ at location $(x^{n}_{t}, y^{n}_{t}, w^{n}_{t}, h^{n}_{t})$ in the current frame $t$:

$m^{n}_{t-i \to t} = \phi\!\left(F(I_{t-i}, I_t), (x^{n}_{t}, y^{n}_{t}, w^{n}_{t}, h^{n}_{t})\right).$   (7)
In (7), $\phi(\cdot,\cdot)$ is the RoI pooling operation from (Girshick, 2015) and $F(I_{t-i}, I_t)$ is the motion estimation function from the pixel-feature calibration. Using the RoI pooling operation, we convert the features inside each proposal's region of interest into a small feature map.
After the RoI pooling operation, we build a spatial-change regression network $R(\cdot)$ to predict the movement of proposals between frame $t-i$ and $t$ based on $m^{n}_{t-i \to t}$. $R(\cdot)$ is defined as:

$(\Delta^{n}_{x, t-i \to t}, \Delta^{n}_{y, t-i \to t}, \Delta^{n}_{w, t-i \to t}, \Delta^{n}_{h, t-i \to t}) = R(m^{n}_{t-i \to t}).$   (8)
In (8), $(\Delta^{n}_{x, t-i \to t}, \Delta^{n}_{y, t-i \to t}, \Delta^{n}_{w, t-i \to t}, \Delta^{n}_{h, t-i \to t})$ are the spatial changes that define the movement of each proposal between frame $t-i$ and frame $t$.
In our study, if a proposal overlaps the ground truth by more than 0.8 in intersection-over-union (IoU), the regression instance is considered a positive proposal with a ground-truth movement. Only the positive proposals are learned across frames. In the same vein, we recursively produce the calibrated ground truth of each instance with spatial changes up to the current frame $t$. A sketch of this IoU test is given below.
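The following is a minimal sketch of the positive-proposal test just described. The box format (x, y, w, h) follows the paper's notation; the helper names are illustrative.

```python
# Sketch of the positive-proposal test: a proposal counts as positive for the
# movement regression only if its IoU with a ground-truth box exceeds 0.8.
def iou_xywh(a, b):
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def is_positive(proposal, gt_boxes, thresh=0.8):
    return any(iou_xywh(proposal, gt) > thresh for gt in gt_boxes)
```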
After we obtain the spatial changes of each individual proposal above, we can calibrate each proposal from previous frames. The calibrated proposal is defined as:

$x^{n}_{t-i \to t} = \Delta^{n}_{x, t-i \to t} \times w^{n}_{t} + x^{n}_{t}$
$y^{n}_{t-i \to t} = \Delta^{n}_{y, t-i \to t} \times h^{n}_{t} + y^{n}_{t}$
$w^{n}_{t-i \to t} = \exp(\Delta^{n}_{w, t-i \to t}) \times w^{n}_{t}$
$h^{n}_{t-i \to t} = \exp(\Delta^{n}_{h, t-i \to t}) \times h^{n}_{t}.$   (9)
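A small sketch of Eq. (9) follows: it decodes the regressed spatial changes into a calibrated proposal, anchored on the proposal's (x, y, w, h) in frame t. This mirrors standard bounding-box regression decoding; the function name is illustrative.

```python
# Sketch of Eq. (9): decode (Δx, Δy, Δw, Δh) from R(·) into a calibrated proposal.
import math

def calibrate_proposal(box_t, deltas):
    x, y, w, h = box_t                     # proposal location in the current frame t
    dx, dy, dw, dh = deltas                # output of the regression network R(·)
    return (dx * w + x,                    # x_{t-i -> t}
            dy * h + y,                    # y_{t-i -> t}
            math.exp(dw) * w,              # w_{t-i -> t}
            math.exp(dh) * h)              # h_{t-i -> t}
```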
Finally, we aggregate the calibrated proposals from $t-i$ to the current frame $t$ and employ an averaging operation to obtain the final detection result. The final detection for the $n$-th proposal is defined as:

$d^{n}_{t,final} = \dfrac{\sum_{j=t-i}^{t} \psi\!\left(d_j, (x^{n}_{j \to t}, y^{n}_{j \to t}, w^{n}_{j \to t}, h^{n}_{j \to t})\right)}{i + 1}.$   (10)
In (10), $d_j$ denotes the position-sensitive score maps from previous frames, $\psi$ is the position-sensitive pooling operation from (Dai et al., 2016), and $i$ is the calibration range ($i = 3$ by default).

Algorithm 1: Inference Process of MFCN.
1:  input: frames $\{I_t\}$    / Image sensor produces the input
2:  $f_t = N_{ext}(I_t)$    / Extract feature maps of the input frame
3:  for $j = t-i$ to $t$ do    / Initiate pixel feature buffer
4:      $f_{j \to t} = W(f_j, F(I_j, I_t))$    / Warp frame $j$ to frame $t$
5:      $f^{s}_{j \to t}, f^{s}_{t} = \varepsilon(f_{j \to t}, f_t)$    / Compute embedding features
6:      $\omega_{j \to t} = \exp\!\left(\frac{f^{s}_{j \to t} \cdot f^{s}_{t}}{|f^{s}_{j \to t}|\,|f^{s}_{t}|}\right)$    / Compute weight for aggregation
7:  end for
8:  $f_{t,final} = \sum_{j=t-i}^{t} \omega_{j \to t} f_{j \to t}$    / Calibrate pixel features for frame $t$
9:  $d_t = N_{det}(f_{t,final})$    / Produce score maps for frame $t$
10: for $j = t-i$ to $t$ do    / Initiate instance feature buffer
11:     for $i = 0$ to $n$ do    / Initiate the calibration of all $n$ proposals from the score maps
12:         $(x^{i}_{j}, y^{i}_{j}, w^{i}_{j}, h^{i}_{j}) = d^{i}_{j}$    / $i$-th proposal from score maps $d_j$
13:         $m^{i}_{j \to t} = \phi(F(I_j, I_t), (x^{i}_{t}, y^{i}_{t}, w^{i}_{t}, h^{i}_{t}))$    / Pooled features of the $i$-th proposal
14:         $(\Delta^{i}_{x, j \to t}, \Delta^{i}_{y, j \to t}, \Delta^{i}_{w, j \to t}, \Delta^{i}_{h, j \to t}) = R(m^{i}_{j \to t})$    / Predict the proposal movement
15:         $(x^{i}_{j}, y^{i}_{j}, w^{i}_{j}, h^{i}_{j}) = (\Delta^{i}_{x, j \to t} \times w^{i}_{t} + x^{i}_{t}, \ldots)$    / Calibrate the $i$-th proposal
16:     end for
17: end for
18: $d^{n}_{t,final} = \sum_{j=t-i}^{t} \psi(d_j, (x^{n}_{j \to t}, y^{n}_{j \to t}, w^{n}_{j \to t}, h^{n}_{j \to t})) / (i+1)$    / Calibrate instance features for frame $t$
19: output: detection results $\{d^{n}_{t,final}\}$
3.2 Inference Details
We summarize the MFCN inference process in Algorithm 1. The camera sensor constantly produces image frames $I_t$ as input. In the pixel-feature calibration process, the feature extraction sub-network $N_{ext}$ first processes each input frame $I_t$ and produces a number of intermediate feature maps $f_t$ (L2 in Algorithm 1). Based on the designated feature calibration range $i$, the feature maps from frame $t-i$ to frame $t$ are warped to update the current frame $t$ (L4 in Algorithm 1). Meanwhile, we apply a three-layer fully convolutional network to extract the embedded features and use them in the similarity measurement to compute the adaptive weights (L5 to L6 in Algorithm 1). The pixel features of the current frame $t$ are calibrated by aggregating the warped features with the adaptive weights (L8 in Algorithm 1). After this process, we calculate the proposal movements and calibrate each proposal accordingly between frame $t-i$ and frame $t$ (L9 to L17 in Algorithm 1). Once we have the proposal movements, we adopt a position-sensitive pooling operation to process the calibrated proposals and the associated score maps. The final result $d^{n}_{t,final}$ is computed by averaging the instance features between frame $t-i$ and $t$.
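The condensed sketch below mirrors Algorithm 1 end-to-end. The sub-networks (`n_ext`, `n_det`, `flow_net`, `adaptive_weights`, `bilinear_warp`, `roi_pool`, `regress`, `calibrate_proposal`, `ps_pool`) are assumed callables standing in for $N_{ext}$, $N_{det}$, $F$, $\varepsilon$, $W$, $\phi$, $R$, and $\psi$, and `cache` is assumed to hold the feature maps and detections already computed for previous frames, as a streaming system would keep them in a buffer. This is a sketch of the inference flow, not the released implementation.

```python
# Condensed sketch of Algorithm 1 (MFCN inference) with placeholder sub-networks.
import numpy as np

def mfcn_inference(frames, cache, t, i, n_ext, n_det, flow_net, adaptive_weights,
                   bilinear_warp, roi_pool, regress, calibrate_proposal, ps_pool):
    f_t = n_ext(frames[t])                                        # L2: f_t
    cache[t] = {"feat": f_t}
    warped, flows = [], {}
    for j in range(t - i, t + 1):                                 # L3-L7: pixel buffer
        flows[j] = flow_net(frames[j], frames[t])                 # F(I_j, I_t)
        warped.append(bilinear_warp(cache[j]["feat"], flows[j]))  # L4: f_{j->t}
    weights = adaptive_weights(warped, f_t)                       # L5-L6: w_{j->t}
    f_final = (np.stack(warped) * weights[..., None]).sum(0)      # L8: f_{t,final}
    score_maps_t, proposals_t = n_det(f_final)                    # L9: d_t
    cache[t].update(score_maps=score_maps_t, proposals=proposals_t)
    pooled_scores = []
    for j in range(t - i, t + 1):                                 # L10-L17: instance buffer
        boxes = []
        for box in cache[j]["proposals"]:                         # L12: proposals of d_j
            m = roi_pool(flows[j], box)                           # L13: m_{j->t}
            deltas = regress(m)                                   # L14: R(m_{j->t})
            boxes.append(calibrate_proposal(box, deltas))         # L15: Eq. (9)
        pooled_scores.append(ps_pool(cache[j]["score_maps"], boxes))  # psi(d_j, ...)
    return sum(pooled_scores) / (i + 1)                           # L18: d_{t,final}
```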
4 EVALUATION AND RESULTS

In this section, we introduce the experiment setup, the ablation study, the comparison with state-of-the-art models, and the calibration range evaluation. Collectively, the experiment design and results demonstrate the effectiveness of our model.

4.1 Experiment Setup
We evaluate the proposed model on the KITTI dataset (Geiger et al., 2013) and the ImageNet VID dataset (Russakovsky et al., 2015). For KITTI, we adopt video sequences from the raw data. Note that we simplify the KITTI classes into car and person for training and evaluation. For the ImageNet VID dataset, we purposefully select videos focusing on car and person (on bicycle and motorcycle). We employ the ImageNet VID dataset because it includes more challenging vehicular instances with motion.
In our study, all the modules described above are trained end-to-end. Experiments are performed on a workstation with 8 NVIDIA GTX 1080Ti GPUs
and an Intel Core i7-4790 CPU. For training on the KITTI dataset, we employ the SGD optimizer with a batch size of 8 on 8 GPUs; 20K iterations are performed with a learning rate of $10^{-3}$ and a decay rate of $10^{-4}$. For training on the ImageNet dataset, we also employ the SGD optimizer with a batch size of 8 on 8 GPUs; 30K iterations are performed with a learning rate of $10^{-4}$ and a decay rate of $10^{-5}$. For the feature extraction and motion estimation sub-networks, we adopt weights pre-trained on ImageNet classification for ResNet-101 and weights pre-trained on the Flying Chairs dataset (Dosovitskiy et al., 2015) for FlowNet. In the evaluation, we follow the protocol in (Geiger et al., 2013) and use mean average precision (mAP) as the evaluation metric.
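For reference, the training hyperparameters reported above are collected in the configuration sketch below. Only the values come from the text; the dictionary layout itself is illustrative.

```python
# Training hyperparameters as reported in the experiment setup (layout is illustrative).
TRAIN_CONFIG = {
    "kitti": {
        "optimizer": "SGD",
        "batch_size": 8,          # spread across 8 GPUs
        "iterations": 20_000,
        "learning_rate": 1e-3,
        "decay_rate": 1e-4,
    },
    "imagenet_vid": {
        "optimizer": "SGD",
        "batch_size": 8,
        "iterations": 30_000,
        "learning_rate": 1e-4,
        "decay_rate": 1e-5,
    },
    "pretrained": {
        "backbone": "ResNet-101 (ImageNet classification)",
        "flow": "FlowNet (Flying Chairs)",
    },
}
```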
4.2 Ablation Study
We first conduct an ablation study to evaluate the improvements of the proposed model over a strong single-frame baseline and the pixel-feature aggregation approach. The results are listed in Table 1.
Method (a) is the single-frame baseline. We adopt the state-of-the-art R-FCN (Dai et al., 2016) with ResNet-101 (He et al., 2016) as the baseline. Method (a) achieves 77.1% and 75.7% mAP on the KITTI and ImageNet datasets, respectively.
Method (b) adds pixel-feature aggregation on top of the baseline detector. We adopt FlowNet2 (Ilg et al., 2017) to estimate the motion. Compared to method (a), the mAP of method (b) increases by 7% and 5.9% on KITTI and ImageNet, respectively. Accordingly, the average precision for car and person also increases over the baseline detection. The results indicate that pixel-feature aggregation helps promote detection accuracy over the baseline detector.
Method (c) includes both pixel-feature aggregation and instance-feature aggregation, but not the adaptive weight. For method (c), we employ a plain averaging operation for pixel-feature calibration instead of the adaptive weight. Compared to the previous two methods, method (c) achieves a higher mAP as well as higher average precision (AP) for car and person on the two datasets. The mAP of method (c) is 8.9% and 1.9% higher than methods (a) and (b), respectively, on KITTI, and 8.25% and 2.35% higher than methods (a) and (b), respectively, on ImageNet. By adopting instance-feature aggregation, we observe improvements in detection accuracy over the two previous methods.
Method (d) is the proposed model, which includes all the contributing modules. Compared to method (c), we add the adaptive weight for detection. On both the KITTI and ImageNet datasets, method (d) achieves the highest mAP as well as the highest average precision (AP) for car and person compared to its counterparts. The mAP of method (d) is 10.3%, 3.3%, and 1.4% higher than methods (a), (b), and (c), respectively, on KITTI, and 9.9%, 4%, and 1.65% higher than its counterparts, respectively, on ImageNet. By adopting pixel and instance feature calibration along with the adaptive weight, we achieve the best detection accuracy.
The major challenges for video detection include motion blur, video defocus, truncation, occlusion, and rare poses of an instance. When encountering these problems, the baseline detector generally produces incorrect inferences or struggles to generate an inference at all; the pixel-feature aggregation approach frequently generates a loose bounding box that is not accurate enough in the evaluation; the proposed model outperforms both counterparts in such circumstances.
4.3 Comparison with the
State-of-the-Art Models
We further evaluate our model by comparing it with state-of-the-art models. We select four leading models from the KITTI benchmark report (Geiger et al., 2013) whose code is publicly available. We run the selected models and our model on the same workstation in order to have a fair comparison.
Table 2 summarizes the detection performance of each model. Based on the results, our model achieves the highest mAP compared to its counterparts. For the car class, our model is 1.13% higher than the best comparison model in this category, while for the person class, our model leads the top model by 1.42% in accuracy. The results indicate that the proposed model has the best overall performance among the leading models from the KITTI benchmark competition. Hence we argue that our proposed approach is competitive with the existing best approaches.

Table 1: The Validation Results from Ablation Study (check marks indicate which components are enabled; improvements over the baseline are given in parentheses).

KITTI
Methods                        a       b              c              d
Baseline                       ✓       ✓              ✓              ✓
Pixel-feature calibration      -       ✓              ✓              ✓
Instance-feature calibration   -       -              ✓              ✓
Adaptive weight                -       -              -              ✓
AP(%) for Person               73.3    78.5 (+5.2)    80.9 (+7.6)    82.9 (+9.6)
AP(%) for Car                  80.9    89.7 (+8.8)    91.1 (+10.2)   91.9 (+11.0)
mAP(%)                         77.1    84.1 (+7.0)    86.0 (+8.9)    87.4 (+10.3)

ImageNet
Methods                        a       b              c              d
Baseline                       ✓       ✓              ✓              ✓
Pixel-feature calibration      -       ✓              ✓              ✓
Instance-feature calibration   -       -              ✓              ✓
Adaptive weight                -       -              -              ✓
AP(%) for Person               76.3    79.5 (+3.2)    80.1 (+3.8)    81.7 (+5.4)
AP(%) for Car                  75.1    83.7 (+8.6)    87.8 (+12.7)   89.5 (+14.4)
mAP(%)                         75.7    81.6 (+5.9)    83.95 (+8.25)  85.6 (+9.9)

Table 2: Comparison on the KITTI Benchmark (the rank of each comparison model is given in parentheses).

Metric     MFCN      TuSimple     RRC          sensekitti   F-PointNet
Car        91.90%    90.77% (1)   86.86% (2)   86.62% (3)   84.82% (4)
Person     82.90%    80.26% (2)   78.08% (3)   70.13% (4)   81.48% (1)
mAP(%)     87.40%    85.52% (1)   82.47% (3)   78.38% (4)   83.15% (2)

Table 3: Calibration Range Evaluation.

Aggregation Range   1      3 (default)   5      7      9      11
mAP(%)              82.1   87.4          88.9   89.2   89.3   89.3
Runtime (fps)       36.2   35.8          25.2   17.7   9.3    5.7
4.4 Calibration Range Evaluation
For our model, the calibration range is a contributing parameter for the performance of the final detection because it decides the amount of information used for propagation and computation. An excessive aggregation range may cause an unaffordable computational cost, while an insufficient aggregation range may not fully optimize the detection results. Hence, we study the impact of the aggregation range on the speed-accuracy tradeoff.
The results are summarized in Table 3. Notice that the detection accuracy increases at a modest rate as the calibration range increases from 3 to 11. Considering the accuracy-runtime tradeoff, the model with the default aggregation range has the best overall performance. The runtimes for calibration ranges between 1 and 5 span 36.2 to 25.2 fps. This inference speed reaches the frame rate of a live-stream video input, which generally has 25 to 30 frames per second. The results indicate that our model is capable of processing live-stream video inputs for detection purposes; a small sketch of choosing the range under a frame-rate budget follows.
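The snippet below illustrates the accuracy-runtime tradeoff just discussed: pick the largest calibration range whose measured runtime still meets a live-stream budget. The numbers come from Table 3; the helper itself is illustrative and not part of the model.

```python
# Pick the calibration range with the best mAP that still meets a frame-rate budget.
RANGE_EVAL = {1: (82.1, 36.2), 3: (87.4, 35.8), 5: (88.9, 25.2),
              7: (89.2, 17.7), 9: (89.3, 9.3), 11: (89.3, 5.7)}   # range: (mAP, fps)

def pick_range(min_fps=25.0):
    feasible = {r: (m, f) for r, (m, f) in RANGE_EVAL.items() if f >= min_fps}
    return max(feasible, key=lambda r: feasible[r][0])             # best mAP within budget

print(pick_range())   # -> 5 at a 25 fps budget; a 30 fps budget selects the default range 3
```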
The above results provide some useful clues for practical applications. However, MFCN has only been tested on two datasets, so the results may be more heuristic than general. We plan to explore the model more thoroughly in future work.
5 CONCLUSION AND FUTURE
WORK
In this study, we propose the motion-aid feature calibration network (MFCN) for video object detection, which is based on an end-to-end learning approach. Motion estimation is the main tool we use to explore the temporal coherence of features. Multi-frame features with spatial changes are calibrated in a principled way. We conducted a large number of experiments, which indicate that our model surpasses the existing state-of-the-art models on both the KITTI and ImageNet benchmarks. The results show that motion-modeled feature calibration improves the feature representation and facilitates the final detection in video.
However, video object detection still has much room for improvement. For safe autonomous driving, detection accuracy must reach a very high level, and the accuracy we achieve is still some distance from that target. We believe that more efficient and accurate motion estimation and object detection networks may benefit video object detection further. These unresolved issues may be instructive for future work.
REFERENCES
Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. (2017). Multi-
view 3d object detection network for autonomous
driving. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
1907–1915.
Dai, J., Li, Y., He, K., and Sun, J. (2016). R-fcn: Object de-
tection via region-based fully convolutional networks.
In Advances in neural information processing sys-
tems, pages 379–387.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas,
C., Golkov, V., Van Der Smagt, P., Cremers, D., and
Brox, T. (2015). Flownet: Learning optical flow with
convolutional networks. In Proceedings of the IEEE
international conference on computer vision, pages
2758–2766.
Du, Y., Yuan, C., Li, B., Hu, W., and Maybank, S. (2017).
Spatio-temporal self-organizing map deep network for
dynamic object detection from videos. In Proceedings
of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 5475–5484.
Feichtenhofer, C., Pinz, A., and Zisserman, A. (2017). De-
tect to track and track to detect. In Proceedings of the
IEEE International Conference on Computer Vision,
pages 3038–3046.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013).
Vision meets robotics: The kitti dataset. The Inter-
national Journal of Robotics Research, 32(11):1231–
1237.
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages
1440–1448.
Han, W., Khorrami, P., Paine, T. L., Ramachandran, P.,
Babaeizadeh, M., Shi, H., Li, J., Yan, S., and Huang,
T. S. (2016). Seq-nms for video object detection.
arXiv preprint arXiv:1602.08465.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyra-
mid pooling in deep convolutional networks for visual
recognition. IEEE transactions on pattern analysis
and machine intelligence, 37(9):1904–1916.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A.,
and Brox, T. (2017). Flownet 2.0: Evolution of optical
flow estimation with deep networks. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 2462–2470.
Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T.,
Zhang, C., Wang, Z., Wang, R., Wang, X., et al.
(2018). T-cnn: Tubelets with convolutional neural net-
works for object detection from videos. IEEE Trans-
actions on Circuits and Systems for Video Technology,
28(10):2896–2907.
Kang, K., Ouyang, W., Li, H., and Wang, X. (2016). Ob-
ject detection from video tubelets with convolutional
neural networks. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 817–825.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Lee, B., Erdenee, E., Jin, S., Nam, M. Y., Jung, Y. G., and
Rhee, P. K. (2016). Multi-class multi-object tracking
using changing point detection. In European Confer-
ence on Computer Vision, pages 68–83. Springer.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B.,
and Belongie, S. (2017a). Feature pyramid networks
for object detection. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 2117–2125.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2017b). Focal loss for dense object detection. In
Proceedings of the IEEE international conference on
computer vision, pages 2980–2988.
Liu, D., Wang, Y., Chen, T., and Matson, E. T. (2019). Ap-
plication of color filter and k-means clustering filter
fusion in lane detection for self-driving car. In Pro-
ceedings of The Third IEEE International Conference
on Robotic Computing.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. In European conference on com-
puter vision, pages 21–37. Springer.
Ma, C., Huang, J.-B., Yang, X., and Yang, M.-H. (2015).
Hierarchical convolutional features for visual track-
ing. In Proceedings of the IEEE international con-
ference on computer vision, pages 3074–3082.
Ramanagopal, M. S., Anderson, C., Vasudevan, R., and
Johnson-Roberson, M. (2018). Failing to learn: au-
tonomously identifying perception failures for self-
driving cars. IEEE Robotics and Automation Letters,
3(4):3860–3867.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. In Advances in neural information
processing systems, pages 91–99.
Ren, S., He, K., Girshick, R., Zhang, X., and Sun, J. (2017).
Object detection networks on convolutional feature
maps. IEEE transactions on pattern analysis and ma-
chine intelligence, 39(7):1476–1481.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al. (2015). Imagenet large scale visual
recognition challenge. International journal of com-
puter vision, 115(3):211–252.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Uijlings, J. R., Van De Sande, K. E., Gevers, T., and
Smeulders, A. W. (2013). Selective search for object
recognition. International journal of computer vision,
104(2):154–171.
Wang, Y., Liu, D., Jeon, H., Chu, Z., and Matson, E.
(2019). End-to-end learning approach for autonomous
driving: A convolutional neural network model. In
Proceedings of the 11th International Conference
on Agents and Artificial Intelligence - Volume 2:
ICAART, pages 833–839. INSTICC, SciTePress.
Zeng, X., Ouyang, W., Yang, B., Yan, J., and Wang, X.
(2016). Gated bi-directional cnn for object detection.
In European Conference on Computer Vision, pages
354–369. Springer.
Zhang, X., Zhou, X., Lin, M., and Sun, J. (2018). Shuf-
flenet: An extremely efficient convolutional neural
network for mobile devices. In Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition, pages 6848–6856.
Zhu, X., Dai, J., Yuan, L., and Wei, Y. (2018). Towards high
performance video object detection. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 7210–7218.
Zhu, X., Wang, Y., Dai, J., Yuan, L., and Wei, Y. (2017a).
Flow-guided feature aggregation for video object de-
tection. In Proceedings of the IEEE International
Conference on Computer Vision, pages 408–417.
Zhu, X., Xiong, Y., Dai, J., Yuan, L., and Wei, Y. (2017b).
Deep feature flow for video recognition. In Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 2349–2358.
Zitnick, C. L. and Dollár, P. (2014). Edge boxes: Locating
object proposals from edges. In European conference
on computer vision, pages 391–405. Springer.