Object Detection for Autonomous Driving: Motion-aid Feature
Calibration Network
Dongfang Liu, Yaqin Wang and Eric T. Matson
Computer and Information Technology, Purdue University, West Lafayette, IN, U.S.A.
Keywords:
Autonomous Driving, Video Object Detection, Pixel Feature Calibration, Instance Feature Calibration.
Abstract:
Object detection is a critical task for autonomous driving. The latest progress in deep learning research for object detection has contributed substantially to the development of autonomous driving. However, directly transferring state-of-the-art object detectors from images to video is problematic. Object appearances in videos exhibit more variation, e.g., video defocus, motion blur, and truncation. Such variations occur far less often in still images and can compromise detection results. To address these problems, we build a fast and accurate deep learning framework, the motion-aid feature calibration network (MFCN), for video detection. Our model leverages the temporal coherence of video features under motion. It calibrates and aggregates features of detected objects from previous frames along their spatial changes to improve the feature representation of the current frame. The whole architecture is trained end-to-end, which boosts detection accuracy. Validation on the KITTI and ImageNet datasets shows that MFCN improves the baseline results by 10.3% and 9.9% mAP, respectively. Compared with other state-of-the-art models, MFCN achieves a leading performance in the KITTI benchmark competition. The results indicate the effectiveness of the proposed model, which could facilitate autonomous driving systems.
1 INTRODUCTION
Object detection is a primary task for the perception of an autonomous car on its way to full autonomy (Liu et al., 2019)(Wang et al., 2019)(Ramanagopal et al., 2018). In recent years, the development of deep learning networks has rendered great progress in object detection that can be adopted for autonomous driving (Ren et al., 2015)(Lin et al., 2017b)(Zeng et al., 2016). The current state-of-the-art detectors achieve recognizable success on static images (Chen et al., 2017), but their performance on live-stream video must improve greatly to meet the needs of autonomous driving (Zhu et al., 2017b). In autonomous driving, the camera sensor constantly produces a live video stream whose frames may include a large number of object appearances with drastic variations; for instance, motion blur and video defocus are frequently observed. Under these circumstances, still-image object detection may fail and generate unstable results.
To address these challenges in video object detection, one intuitive solution is to analyze the temporal and spatial coherence in videos and employ information from nearby frames (Du et al., 2017). Since videos generally include rich imagery information, it is easy to identify the same object instance in multiple frames within a short time. (Kang et al., 2018)(Han et al., 2016)(Lee et al., 2016) exploit such temporal information in existing videos to detect objects in a simple way. These studies first apply object detectors to individual frames and then assemble the inference results in a dedicated post-processing step across the temporal dimension. Following the same principle, (Han et al., 2016)(Kang et al., 2018)(Kang et al., 2016)(Feichtenhofer et al., 2017) propose hand-crafted bounding box association rules to improve the final predictions. Although the approaches mentioned above generate promising results, they are post-processing schemes rather than principled learning techniques.
To address the limitations of existing work, we further exploit temporal information and seek more robust feature calibration to boost the accuracy of video object detection. In this study, we present the motion-aid feature calibration network (MFCN), a fast, accurate, end-to-end framework for video object detection. We consider the effect of movement on features to facilitate per-frame feature learning. We leverage temporal coherence and calibrate pixel
and instance features across frames to promote the feature representation of the current frame. Accordingly, the final detection is effectively improved.
The key contributions of our work are threefold. First, we introduce a fast, accurate, end-to-end framework for video object detection that could be adopted by autonomous driving systems. Second, we use an ablation study to verify the improvements of the proposed model over a strong single-frame baseline and the contribution of each module in the proposed model. Third, we conduct extensive experiments to evaluate the performance of the proposed model by comparing it with state-of-the-art systems. The results indicate that our model is competitive with the existing best approaches for video object detection.
The rest of the paper is structured as follows. In Section II, a thorough discussion of related work is presented, which builds the foundation of this research. Next, we articulate the technical details of the proposed model in Section III. In Section IV, we present the results of extensive experiments with our proposed model on two benchmark datasets. Finally, in Section V, we conclude this study and address future work.
2 RELEVANT RESEARCH
With the development of deep learning for computer vision, object detection has been extended from still images to video (Zhu et al., 2017b)(Zhu et al., 2018)(Zhang et al., 2018). One example is that video object detection competitions, such as ImageNet (Russakovsky et al., 2015), have drawn wide attention in both academia and industry. For video object detection, there are two typical approaches in existing work: 1. post processing; and 2. principled learning. We elaborate on them in the following subsections.
2.1 Post Processing
Much existing work leverages state-of-the-art single-image object detection systems to tackle object detection in the video domain. (Uijlings et al., 2013)(Zitnick and Dollár, 2014) directly applied object detectors to individual frames and then assembled the inference results to generate the final predictions. These explorations were effective but naïve because they completely ignored the temporal dimension of the sequential video frames. On the contrary, (Han et al., 2016)(Lee et al., 2016)(Feichtenhofer et al., 2017) incorporated temporal information during post-processing and effectively improved the detection results within each individual frame. Once region proposals and their corresponding class scores for each frame were obtained, (Han et al., 2016)(Lee et al., 2016)(Feichtenhofer et al., 2017) implemented algorithms that select boxes to maximize a sequence score, as sketched below. The selected boxes are then employed to suppress overlapping boxes in the corresponding frames in order to boost weaker detections. The post-processing approach provides valuable insights for applications related to video surveillance. However, due to the nature of its offline design, the post-processing approach casts a cloud over object detection for autonomous driving because it cannot be adopted directly in a real-time driving system. To facilitate autonomous driving, we seek answers from the principled learning approach, which can perform real-time inference for object detection on videos.
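To make the sequence-score idea concrete, the following minimal sketch follows the spirit of methods such as Seq-NMS (Han et al., 2016): link detections across consecutive frames when they overlap, find the highest-scoring temporal chain by dynamic programming, and use it for rescoring and suppression. The function names, the linking threshold, and the simplification to a single full-length chain are illustrative assumptions, not the published implementation.

```python
# Illustrative sketch of sequence-score selection across frames (Seq-NMS-style).
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def best_sequence(boxes_per_frame, scores_per_frame, link_iou=0.5):
    """Return per-frame box indices of the highest-scoring linked sequence."""
    n_frames = len(boxes_per_frame)
    # acc[t][k]: best accumulated score of a chain ending at box k of frame t
    acc = [np.array(s, dtype=float) for s in scores_per_frame]
    prev = [np.full(len(s), -1, dtype=int) for s in scores_per_frame]
    for t in range(1, n_frames):
        for k, box in enumerate(boxes_per_frame[t]):
            for j, prev_box in enumerate(boxes_per_frame[t - 1]):
                cand = acc[t - 1][j] + scores_per_frame[t][k]
                if iou(box, prev_box) >= link_iou and cand > acc[t][k]:
                    acc[t][k] = cand
                    prev[t][k] = j
    # Backtrack the best chain from the last frame.
    chain = [int(np.argmax(acc[-1]))]
    for t in range(n_frames - 1, 0, -1):
        j = int(prev[t][chain[-1]])
        if j < 0:          # chain broken; the sketch keeps this case simple
            break
        chain.append(j)
    return list(reversed(chain))
```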
2.2 Principled Learning
Current state-of-the-art detectors generally embrace an identical two-stage structure. First, a deep convolutional neural network (CNN) captures the salient features of the input image and generates a set of intermediate convolutional feature maps accordingly (Krizhevsky et al., 2012)(Simonyan and Zisserman, 2014)(He et al., 2016). Then, a detection network predicts the detection results based on the feature maps (Lin et al., 2017a)(He et al., 2015)(Liu et al., 2016). The first stage accounts for the major computational cost compared to the final detection task. The intermediate convolutional feature maps cover the same spatial extent as the input image but at a much smaller resolution. Such low-resolution feature maps can be cheaply propagated by spatial warping from nearby frames (Ren et al., 2017)(Ma et al., 2015). To leverage this property of CNNs, (Zhu et al., 2017b)(Kang et al., 2016)(Kang et al., 2018)(Zhu et al., 2017a) proposed end-to-end frameworks that use feature propagation to enhance the features of individual frames in videos. (Zhu et al., 2017b)(Zhu et al., 2017a) applied an optical flow network (Ilg et al., 2017) to estimate per-pixel motion. Since optical flow estimation and feature propagation are much faster than convolutional feature extraction, (Zhu et al., 2017b)(Zhu et al., 2017a) avoided the computational bottleneck and achieved significant runtime improvement. Since both the motion flow and the object detection were predicted by deep networks, the entire architecture was trained end-to-end and the detection accuracy and efficiency were significantly boosted (Zhu et al., 2017b)(Zhu et al., 2017a). Similarly, (Kang et al., 2016)(Kang et al., 2018) presented a novel approach using tubelet proposal networks, which
extracted multi-frame features to predict the pattern of object motion. By calculating the spatial changes, (Kang et al., 2016)(Kang et al., 2018) utilized the spatiotemporal proposals for the final detection task. The results from (Kang et al., 2016)(Kang et al., 2018) demonstrated state-of-the-art performance in object detection on videos.
All the above methods, however, focus on sparse pixel-level propagation. This approach encounters difficulties when the appearance or scale of objects changes dramatically within the video. With inaccurate motion estimation, the pixel-level propagation approach (Zhu et al., 2017b)(Zhu et al., 2017a) may struggle to produce desirable results. In addition, propagating pixel features from only a few designated frames may be insufficient for the final detection task if an object instance appears for a period shorter than the propagation range. In our work, we calibrate features of interest at both the pixel and the instance level. We consider object motion in our calibration, so we can accurately aggregate features of interest from consecutive frames instead of using sparse frames as a reference. Compared to pixel-feature propagation, our approach is more robust to appearance variations in video frames.
3 TECHNICAL APPROACH
In this section, we present the details of our technical approach, which includes the model design and the inference of the model. The model design introduces the implementation of our model and the sub-networks of the overall deep neural network; the inference of the model explains the prediction procedure of the trained model.
3.1 Model Design
The core concept of MFCN is to calibrate imagery features at the pixel level and the instance level along with the motion from frames $t-i$ to the current frame $t$ to improve the detection results. Following the popular CNN architectures for detection (Ren et al., 2015)(Liu et al., 2016), our model has a feature extraction sub-network $N_{ext}$ to produce a number of intermediate feature maps and an object detection sub-network $N_{det}$ to process the extracted feature maps. When the camera sensor produces video frames $I_t$, $t = 1, 2, \ldots$, the feature extraction sub-network $N_{ext}$ processes the input frame and produces a number of intermediate feature maps, i.e., $f_t = N_{ext}(I_t)$. The feature maps are fed into the object detection sub-network $N_{det}$, i.e., $d_t = N_{det}(f_t)$. The output $d_t$ includes the position-sensitive score maps and the proposals of the input frame.
MFCN includes three contributing components: 1. pixel-feature calibration; 2. instance-feature calibration; and 3. adaptive weighting. For pixel-feature calibration, $N_{ext}$ constantly receives single-frame inputs and generates the intermediate feature maps. We employ a motion estimation sub-network to estimate the motion tendency of feature maps across frames. Based on the motion estimation, we employ a bilinear warping function $W(\cdot,\cdot)$ to warp pixel features from previous frames as $f_{t-i \to t}$. Meanwhile, we adopt a fully convolutional network $\varepsilon(\cdot)$ to compute embedding features from $t-i$ to $t$ for a similarity measurement, which yields the adaptive weight for each feature map from $t-i$. Once we have the adaptive weights, we aggregate each feature map from $f_{t-i}$ to $f_t$ and calibrate the feature of the current frame as $f_{t,final}$.
Next, for instance-feature calibration, we use the object detection sub-network to obtain position-sensitive score maps and instance proposals from frame $t-i$ to frame $t$. Since each proposal has ground-truth information, we use a spatial-change regression network $R(\cdot)$ to estimate the movement of each proposal from its corresponding location at frame $t-i$ to frame $t$. With the estimated proposal movement, we calibrate each instance feature accordingly from frame $t-i$ to the current frame $t$. Once the calibration is done, we average the aggregated instance features from $t-i$ to obtain the final result for frame $t$. The procedures for pixel-feature calibration, adaptive weight calculation, and instance-feature calibration are elaborated below.
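To make the notation concrete, the following sketch shows the per-frame data flow described above with placeholder sub-networks. The shapes, channel counts, and function names are illustrative assumptions, not the exact architecture.

```python
# Illustrative per-frame data flow of MFCN: f_t = N_ext(I_t), d_t = N_det(f_t).
# The callables below are stand-ins (assumed interfaces), not the actual networks.
import numpy as np

def n_ext(frame):
    """Stand-in for the feature extraction sub-network N_ext (e.g., a ResNet-101 trunk)."""
    h, w, _ = frame.shape
    return np.zeros((h // 16, w // 16, 1024), dtype=np.float32)  # coarse feature map

def n_det(feature_map):
    """Stand-in for the detection sub-network N_det (R-FCN-style head)."""
    score_maps = np.zeros(feature_map.shape[:2] + (7 * 7 * 3,), dtype=np.float32)
    proposals = np.zeros((300, 4), dtype=np.float32)  # (x, y, w, h) per proposal
    return score_maps, proposals

frame_t = np.zeros((384, 1280, 3), dtype=np.uint8)   # current frame I_t
f_t = n_ext(frame_t)                                  # intermediate feature maps f_t
d_t = n_det(f_t)                                      # score maps and proposals d_t
```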
3.1.1 Pixel-feature Calibration
We model the pixel-level calibration from a previous frame $I_{t-i}$ to the current frame $I_t$ with motion estimation. We denote $F(\cdot,\cdot)$ as the motion estimation function. $F(I_{t-i}, I_t)$ calculates the spatial changes from frame $I_t$ to $I_{t-i}$. We leverage the motion estimation to infer the location $p_{t-i}$ in the previous frame $t-i$ that corresponds to the location $p_t$ in the current frame $t$. Let $\Delta p$ be the estimated displacement between $p_{t-i}$ and $p_t$, so that $\Delta p = F(I_{t-i}, I_t)(p_t)$. The calibrated feature map is defined as:
$f_{t-i \to t}(p_t) = \sum_{q} G(q, p_{t-i} + \Delta p)\, f_{t-i}(q).$   (1)
In (1), $q$ enumerates all spatial locations on the feature map $f_{t-i}$, and $G(\cdot,\cdot)$ denotes the bilinear interpolation kernel, which is composed of two one-dimensional kernels:

$G(q, p + \Delta p) = g(q_x, p_x + \Delta p_x) \cdot g(q_y, p_y + \Delta p_y).$   (2)
Then, the calibrated feature maps from previous frames are warped into the current frame accordingly. The warping process is as follows:

$f_{t-i \to t} = W(f_{t-i}, F(I_{t-i}, I_t)).$   (3)
In (3), $f_{t-i}$ is the feature map produced by $N_{ext}$ at frame $t-i$, $f_{t-i \to t}$ denotes the features warped from time $t-i$ to time $t$, and $W(\cdot,\cdot)$ is a bilinear warping function applied to every channel of the feature maps across frames. Through this feature warping process, the current frame aggregates multiple feature maps from previous frames. To process the aggregated features, we average them to update the feature of the current frame:
$f_{t,final} = \sum_{j=t-i}^{t} \omega_{j \to t}\, f_{j \to t}$   (4)
In (4), $\omega_{j \to t}$ is the adaptive weight, which decides the contribution of each warped feature map, $f_{j \to t}$ are the warped features, $f_{t,final}$ is the calibrated feature for the current frame, and $i$ specifies the calibration range from the previous frames to the current frame ($i = 3$ by default).
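To make Eqs. (1)-(4) concrete, here is a minimal NumPy sketch of bilinear warping of a feature map by an estimated displacement field and the weighted aggregation of the warped maps. The array layout (H, W, C) and the flow convention are assumptions made for illustration; the actual model estimates the displacements with a flow network.

```python
# Minimal sketch of Eqs. (1)-(4): warp a previous feature map to the current frame
# with a per-pixel displacement field, then aggregate the warped maps with weights.
import numpy as np

def bilinear_warp(feat_prev, flow):
    """Warp feat_prev (H, W, C) using flow (H, W, 2), where flow[y, x] gives the
    displacement from the current-frame location to the sampling point in the
    previous frame (the bilinear kernel G of Eqs. (1)-(2))."""
    h, w, _ = feat_prev.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling locations p_{t-i} + Δp in the previous frame.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, w - 1), np.clip(y0 + 1, 0, h - 1)
    wx, wy = (sx - x0)[..., None], (sy - y0)[..., None]
    # Two separable 1-D kernels g(., .) combined as in Eq. (2).
    return (feat_prev[y0, x0] * (1 - wx) * (1 - wy) +
            feat_prev[y0, x1] * wx * (1 - wy) +
            feat_prev[y1, x0] * (1 - wx) * wy +
            feat_prev[y1, x1] * wx * wy)

def aggregate(warped_feats, weights):
    """Eq. (4): weighted sum of warped feature maps; weights has shape (N, H, W)."""
    stacked = np.stack(warped_feats)                        # (N, H, W, C)
    return (stacked * np.stack(weights)[..., None]).sum(axis=0)
```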
3.1.2 Adaptive Weight Calculation
The adaptive weight indicates the importance of each warped feature from a previous frame to the current frame. At location $p_{t-i}$, if the feature $f_{t-i}(p_{t-i})$ is close to the feature $f_t(p_t)$ of the current frame, a larger weight is assigned to it; otherwise, a smaller weight is assigned. To calculate the adaptive weight, we employ the cosine similarity metric (Liu et al., 2016) to measure the similarity between the previous frame $t-i$ and the current frame. In the implementation, we apply a three-layer fully convolutional network $\varepsilon(\cdot)$ to $f_{t-i \to t}$ and $f_t$, which projects them into embedding features for the similarity measurement. The measurement is defined as:

$f^{s}_{t-i \to t},\; f^{s}_{t} = \varepsilon(f_{t-i \to t}, f_t)$   (5)
In (5), $f^{s}_{t-i \to t}$ and $f^{s}_{t}$ are the embedding features used for the similarity measurement. The adaptive weight is then defined as:

$\omega_{t-i \to t} = \exp\!\left(\dfrac{f^{s}_{t-i \to t} \cdot f^{s}_{t}}{|f^{s}_{t-i \to t}|\,|f^{s}_{t}|}\right).$   (6)

The adaptive weight $\omega_{t-i \to t}$ is normalized at every spatial location over the previous frames from $I_{t-i}$ to $I_t$.
3.1.3 Instance-feature Calibration
After updating the feature map for the current frame, we feed $f_{t,final}$ into the detection sub-network $N_{det}$ and obtain an improved detection result $d_t$ for the current frame $t$. Instance-feature calibration then calibrates features at the instance level across frames. The final detection result at frame $t$ is calculated by aggregating the calibrated instance features, with proposal movements estimated from previous frames.
To compute the movement of individual proposals from frame $t-i$ to $t$, we utilize the RoI pooling operation to extract the pooled features $m^{n}_{t-i \to t}$ of the $n$-th proposal $r^{n}_{t-i}$ at location $(x^{n}_{t}, y^{n}_{t}, w^{n}_{t}, h^{n}_{t})$ in the current frame $t$:

$m^{n}_{t-i \to t} = \phi\!\left(F(I_{t-i}, I_t), (x^{n}_{t}, y^{n}_{t}, w^{n}_{t}, h^{n}_{t})\right).$   (7)
In (7), $\phi(\cdot,\cdot)$ is the RoI pooling operation from (Girshick, 2015) and $F(I_{t-i}, I_t)$ is the motion estimation function from the pixel-feature calibration. Using the RoI pooling operation, we convert the features inside each proposal's region of interest into a small feature map.
After the RoI pooling operation, we build a spatial-change regression network $R(\cdot)$ to predict the movement of proposals between frame $t-i$ and $t$ based on $m^{n}_{t-i \to t}$. $R(\cdot)$ is defined as:

$(\Delta^{n}_{x, t-i \to t}, \Delta^{n}_{y, t-i \to t}, \Delta^{n}_{w, t-i \to t}, \Delta^{n}_{h, t-i \to t}) = R(m^{n}_{t-i \to t}).$   (8)
In (8), $(\Delta^{n}_{x, t-i \to t}, \Delta^{n}_{y, t-i \to t}, \Delta^{n}_{w, t-i \to t}, \Delta^{n}_{h, t-i \to t})$ are the spatial changes that define the movement of each proposal between frame $t-i$ and frame $t$.
In our study, if a proposal overlaps the ground truth by more than 0.8 in intersection-over-union (IoU), the regression instance is considered a positive proposal with a ground-truth movement. Only the positive proposals are learned across frames. In the same vein, we recursively produce the calibrated ground truth of each instance with spatial changes up to the current frame $t$. A sketch of this IoU test is given below.
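The following is a minimal sketch of the positive-proposal test just described. The box format (x, y, w, h) follows the paper's notation; the helper names are illustrative.

```python
# Sketch of the positive-proposal test: a proposal counts as positive for the
# movement regression only if its IoU with a ground-truth box exceeds 0.8.
def iou_xywh(a, b):
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def is_positive(proposal, gt_boxes, thresh=0.8):
    return any(iou_xywh(proposal, gt) > thresh for gt in gt_boxes)
```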
After we obtain the spatial changes of each individual proposal above, we can calibrate each proposal from previous frames. The calibrated proposal is defined as:

$x^{n}_{t-i \to t} = \Delta^{n}_{x, t-i \to t} \times w^{n}_{t} + x^{n}_{t}$
$y^{n}_{t-i \to t} = \Delta^{n}_{y, t-i \to t} \times h^{n}_{t} + y^{n}_{t}$
$w^{n}_{t-i \to t} = \exp(\Delta^{n}_{w, t-i \to t}) \times w^{n}_{t}$
$h^{n}_{t-i \to t} = \exp(\Delta^{n}_{h, t-i \to t}) \times h^{n}_{t}.$   (9)
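A small sketch of Eq. (9) follows: it decodes the regressed spatial changes into a calibrated proposal, anchored on the proposal's (x, y, w, h) in frame t. This mirrors standard bounding-box regression decoding; the function name is illustrative.

```python
# Sketch of Eq. (9): decode (Δx, Δy, Δw, Δh) from R(·) into a calibrated proposal.
import math

def calibrate_proposal(box_t, deltas):
    x, y, w, h = box_t                     # proposal location in the current frame t
    dx, dy, dw, dh = deltas                # output of the regression network R(·)
    return (dx * w + x,                    # x_{t-i -> t}
            dy * h + y,                    # y_{t-i -> t}
            math.exp(dw) * w,              # w_{t-i -> t}
            math.exp(dh) * h)              # h_{t-i -> t}
```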
Finally, we aggregate the calibrated proposals from $t-i$ to the current frame $t$ and employ an averaging operation to obtain the final detection result. The final detection for the $n$-th proposal is defined as:

$d^{n}_{t,final} = \dfrac{\sum_{j=t-i}^{t} \psi\!\left(d_j, (x^{n}_{j \to t}, y^{n}_{j \to t}, w^{n}_{j \to t}, h^{n}_{j \to t})\right)}{i + 1}.$   (10)
In (10), $d_j$ denotes the position-sensitive score maps from previous frames, $\psi$ is the position-sensitive pooling operation from (Dai et al., 2016), and $i$ is the calibration range ($i = 3$ by default).

Algorithm 1: Inference Process of MFCN.
1:  input: frames $\{I_t\}$    / Image sensor produces the input
2:  $f_t = N_{ext}(I_t)$    / Extract feature maps of the input frame
3:  for $j = t-i$ to $t$ do    / Initiate pixel feature buffer
4:      $f_{j \to t} = W(f_j, F(I_j, I_t))$    / Warp frame $j$ to frame $t$
5:      $f^{s}_{j \to t}, f^{s}_{t} = \varepsilon(f_{j \to t}, f_t)$    / Compute embedding features
6:      $\omega_{j \to t} = \exp\!\left(\frac{f^{s}_{j \to t} \cdot f^{s}_{t}}{|f^{s}_{j \to t}|\,|f^{s}_{t}|}\right)$    / Compute weight for aggregation
7:  end for
8:  $f_{t,final} = \sum_{j=t-i}^{t} \omega_{j \to t} f_{j \to t}$    / Calibrate pixel features for frame $t$
9:  $d_t = N_{det}(f_{t,final})$    / Produce score maps for frame $t$
10: for $j = t-i$ to $t$ do    / Initiate instance feature buffer
11:     for $i = 0$ to $n$ do    / Initiate the calibration of all $n$ proposals from the score maps
12:         $(x^{i}_{j}, y^{i}_{j}, w^{i}_{j}, h^{i}_{j}) = d^{i}_{j}$    / $i$-th proposal from score maps $d_j$
13:         $m^{i}_{j \to t} = \phi(F(I_j, I_t), (x^{i}_{t}, y^{i}_{t}, w^{i}_{t}, h^{i}_{t}))$    / Pooled features of the $i$-th proposal
14:         $(\Delta^{i}_{x, j \to t}, \Delta^{i}_{y, j \to t}, \Delta^{i}_{w, j \to t}, \Delta^{i}_{h, j \to t}) = R(m^{i}_{j \to t})$    / Predict the proposal movement
15:         $(x^{i}_{j}, y^{i}_{j}, w^{i}_{j}, h^{i}_{j}) = (\Delta^{i}_{x, j \to t} \times w^{i}_{t} + x^{i}_{t}, \ldots)$    / Calibrate the $i$-th proposal
16:     end for
17: end for
18: $d^{n}_{t,final} = \sum_{j=t-i}^{t} \psi(d_j, (x^{n}_{j \to t}, y^{n}_{j \to t}, w^{n}_{j \to t}, h^{n}_{j \to t})) / (i+1)$    / Calibrate instance features for frame $t$
19: output: detection results $\{d^{n}_{t,final}\}$
3.2 Inference Details
We summarize the MFCN inference process in Algorithm 1. The camera sensor constantly produces image frames $I_t$ as input. In the pixel-feature calibration process, the feature extraction sub-network $N_{ext}$ first processes each input frame $I_t$ and produces a number of intermediate feature maps $f_t$ (L2 in Algorithm 1). Based on the designated feature calibration range $i$, the feature maps from frame $t-i$ to frame $t$ are warped to update the current frame $t$ (L4 in Algorithm 1). Meanwhile, we apply a three-layer fully convolutional network to extract the embedded features and use them in the similarity measurement to compute the adaptive weights (L5 to L6 in Algorithm 1). The pixel features of the current frame $t$ are calibrated by aggregating the warped features with the adaptive weights (L8 in Algorithm 1). After this process, we calculate the proposal movements and calibrate each proposal accordingly between frame $t-i$ and frame $t$ (L9 to L17 in Algorithm 1). Once we have the proposal movements, we adopt a position-sensitive pooling operation to process the calibrated proposals and the associated score maps. The final result $d^{n}_{t,final}$ is computed by averaging the instance features between frame $t-i$ and $t$.
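The condensed sketch below mirrors Algorithm 1 end-to-end. The sub-networks (`n_ext`, `n_det`, `flow_net`, `adaptive_weights`, `bilinear_warp`, `roi_pool`, `regress`, `calibrate_proposal`, `ps_pool`) are assumed callables standing in for $N_{ext}$, $N_{det}$, $F$, $\varepsilon$, $W$, $\phi$, $R$, and $\psi$, and `cache` is assumed to hold the feature maps and detections already computed for previous frames, as a streaming system would keep them in a buffer. This is a sketch of the inference flow, not the released implementation.

```python
# Condensed sketch of Algorithm 1 (MFCN inference) with placeholder sub-networks.
import numpy as np

def mfcn_inference(frames, cache, t, i, n_ext, n_det, flow_net, adaptive_weights,
                   bilinear_warp, roi_pool, regress, calibrate_proposal, ps_pool):
    f_t = n_ext(frames[t])                                        # L2: f_t
    cache[t] = {"feat": f_t}
    warped, flows = [], {}
    for j in range(t - i, t + 1):                                 # L3-L7: pixel buffer
        flows[j] = flow_net(frames[j], frames[t])                 # F(I_j, I_t)
        warped.append(bilinear_warp(cache[j]["feat"], flows[j]))  # L4: f_{j->t}
    weights = adaptive_weights(warped, f_t)                       # L5-L6: w_{j->t}
    f_final = (np.stack(warped) * weights[..., None]).sum(0)      # L8: f_{t,final}
    score_maps_t, proposals_t = n_det(f_final)                    # L9: d_t
    cache[t].update(score_maps=score_maps_t, proposals=proposals_t)
    pooled_scores = []
    for j in range(t - i, t + 1):                                 # L10-L17: instance buffer
        boxes = []
        for box in cache[j]["proposals"]:                         # L12: proposals of d_j
            m = roi_pool(flows[j], box)                           # L13: m_{j->t}
            deltas = regress(m)                                   # L14: R(m_{j->t})
            boxes.append(calibrate_proposal(box, deltas))         # L15: Eq. (9)
        pooled_scores.append(ps_pool(cache[j]["score_maps"], boxes))  # psi(d_j, ...)
    return sum(pooled_scores) / (i + 1)                           # L18: d_{t,final}
```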
4 EVALUATION AND RESULTS

In this section, we introduce the experiment setup, the ablation study, the comparison with state-of-the-art models, and the calibration range evaluation. Collectively, the experiment design and results demonstrate the effectiveness of our model.

4.1 Experiment Setup
We evaluate the proposed model on the KITTI dataset (Geiger et al., 2013) and the ImageNet VID dataset (Russakovsky et al., 2015). For KITTI, we adopt video sequences from the raw data. Note that we simplify the KITTI classes into car and person for training and evaluation. For the ImageNet VID dataset, we purposefully select videos focusing on car and person (on bicycle and motorcycle). We employ the ImageNet VID dataset because it includes more challenging vehicular instances with motion.
In our study, all the modules described above are trained end-to-end. Experiments are performed on a workstation with 8 NVIDIA GTX 1080Ti GPUs
and an Intel Core i7-4790 CPU. For training on the KITTI dataset, we employ the SGD optimizer with a batch size of 8 on 8 GPUs; 20K iterations are performed with a learning rate of $10^{-3}$ and a decay rate of $10^{-4}$. For training on the ImageNet dataset, we also employ the SGD optimizer with a batch size of 8 on 8 GPUs; 30K iterations are performed with a learning rate of $10^{-4}$ and a decay rate of $10^{-5}$. For the feature extraction and motion estimation sub-networks, we adopt weights pre-trained on ImageNet classification for ResNet-101 and weights pre-trained on the Flying Chairs dataset (Dosovitskiy et al., 2015) for FlowNet. In the evaluation, we follow the protocol in (Geiger et al., 2013) and use mean average precision (mAP) as the evaluation metric.
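For reference, the training hyperparameters reported above are collected in the configuration sketch below. Only the values come from the text; the dictionary layout itself is illustrative.

```python
# Training hyperparameters as reported in the experiment setup (layout is illustrative).
TRAIN_CONFIG = {
    "kitti": {
        "optimizer": "SGD",
        "batch_size": 8,          # spread across 8 GPUs
        "iterations": 20_000,
        "learning_rate": 1e-3,
        "decay_rate": 1e-4,
    },
    "imagenet_vid": {
        "optimizer": "SGD",
        "batch_size": 8,
        "iterations": 30_000,
        "learning_rate": 1e-4,
        "decay_rate": 1e-5,
    },
    "pretrained": {
        "backbone": "ResNet-101 (ImageNet classification)",
        "flow": "FlowNet (Flying Chairs)",
    },
}
```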
4.2 Ablation Study
We first conduct an ablation study to evaluate the improvements of the proposed model over a strong single-frame baseline and the pixel-feature aggregation approach. The results are listed in Table 1.
Method (a) is the single-frame baseline. We adopt the state-of-the-art R-FCN (Dai et al., 2016) with ResNet-101 (He et al., 2016) as the baseline. Method (a) achieves 77.1% and 75.7% mAP on the KITTI and ImageNet datasets, respectively.
Method (b) adds pixel-feature aggregation on top of the baseline detector. We adopt FlowNet2 (Ilg et al., 2017) to estimate the motion. Compared to method (a), the mAP of method (b) increases by 7% and 5.9% on KITTI and ImageNet, respectively. Accordingly, the average precision for car and person also increases over the baseline detection. The results indicate that pixel-feature aggregation helps promote detection accuracy over the baseline detector.
Method (c) includes both pixel-feature aggregation and instance-feature aggregation, but not the adaptive weight. For method (c), we employ a plain averaging operation for pixel-feature calibration instead of the adaptive weight. Compared to the previous two methods, method (c) achieves a higher mAP as well as higher average precision (AP) for car and person on the two datasets. The mAP of method (c) is 8.9% and 1.9% higher than methods (a) and (b), respectively, on KITTI, and 8.25% and 2.35% higher than methods (a) and (b), respectively, on ImageNet. By adopting instance-feature aggregation, we observe improvements in detection accuracy over the two previous methods.
Method (d) is the proposed model, which includes all the contributing modules. Compared to method (c), we add the adaptive weight for detection. On both the KITTI and ImageNet datasets, method (d) achieves the highest mAP as well as the highest average precision (AP) for car and person compared to its counterparts. The mAP of method (d) is 10.3%, 3.3%, and 1.4% higher than methods (a), (b), and (c), respectively, on KITTI, and 9.9%, 4%, and 1.65% higher than its counterparts, respectively, on ImageNet. By adopting pixel and instance feature calibration along with the adaptive weight, we achieve the best detection accuracy.
The major challenges for video detection include motion blur, video defocus, truncation, occlusion, and rare poses of an instance. When encountering these problems, the baseline detector generally produces incorrect inferences or struggles to generate an inference at all; the pixel-feature aggregation approach frequently generates a loose bounding box that is not accurate enough in the evaluation; the proposed model outperforms both counterparts in such circumstances.
4.3 Comparison with the
State-of-the-Art Models
We further evaluate our model by comparing it with state-of-the-art models. We select four leading models from the KITTI benchmark report (Geiger et al., 2013) whose code is publicly available. We run the selected models and our model on the same workstation in order to have a fair comparison.
Table 2 summarizes the detection performance of each model. Based on the results, our model achieves the highest mAP compared to its counterparts. For the car class, our model is 1.13% higher than the best comparison model in this category, while for the person class, our model leads the top model by 1.42% in accuracy. The results indicate that the proposed model has the best overall performance among the leading models from the KITTI benchmark competition. Hence we argue that our proposed approach is competitive with the existing best approaches.

Table 1: The Validation Results from Ablation Study (check marks indicate which components are enabled; improvements over the baseline are given in parentheses).

KITTI
Methods                        a       b              c              d
Baseline                       ✓       ✓              ✓              ✓
Pixel-feature calibration      -       ✓              ✓              ✓
Instance-feature calibration   -       -              ✓              ✓
Adaptive weight                -       -              -              ✓
AP(%) for Person               73.3    78.5 (+5.2)    80.9 (+7.6)    82.9 (+9.6)
AP(%) for Car                  80.9    89.7 (+8.8)    91.1 (+10.2)   91.9 (+11.0)
mAP(%)                         77.1    84.1 (+7.0)    86.0 (+8.9)    87.4 (+10.3)

ImageNet
Methods                        a       b              c              d
Baseline                       ✓       ✓              ✓              ✓
Pixel-feature calibration      -       ✓              ✓              ✓
Instance-feature calibration   -       -              ✓              ✓
Adaptive weight                -       -              -              ✓
AP(%) for Person               76.3    79.5 (+3.2)    80.1 (+3.8)    81.7 (+5.4)
AP(%) for Car                  75.1    83.7 (+8.6)    87.8 (+12.7)   89.5 (+14.4)
mAP(%)                         75.7    81.6 (+5.9)    83.95 (+8.25)  85.6 (+9.9)

Table 2: Comparison on the KITTI Benchmark (the rank of each comparison model is given in parentheses).

Metric     MFCN      TuSimple     RRC          sensekitti   F-PointNet
Car        91.90%    90.77% (1)   86.86% (2)   86.62% (3)   84.82% (4)
Person     82.90%    80.26% (2)   78.08% (3)   70.13% (4)   81.48% (1)
mAP(%)     87.40%    85.52% (1)   82.47% (3)   78.38% (4)   83.15% (2)

Table 3: Calibration Range Evaluation.

Aggregation Range   1      3 (default)   5      7      9      11
mAP(%)              82.1   87.4          88.9   89.2   89.3   89.3
Runtime (fps)       36.2   35.8          25.2   17.7   9.3    5.7
4.4 Calibration Range Evaluation
For our model, the calibration range is a contributing parameter for the performance of the final detection because it decides the amount of information used for propagation and computation. An excessive aggregation range may cause an unaffordable computational cost, while an insufficient aggregation range may not fully optimize the detection results. Hence, we study the impact of the aggregation range on the speed-accuracy tradeoff.
The results are summarized in Table 3. Notice that the detection accuracy increases at a modest rate as the calibration range increases from 3 to 11. Considering the accuracy-runtime tradeoff, the model with the default aggregation range has the best overall performance. The runtimes for calibration ranges between 1 and 5 span 36.2 to 25.2 fps. This inference speed reaches the frame rate of a live-stream video input, which generally has 25 to 30 frames per second. The results indicate that our model is capable of processing live-stream video inputs for detection purposes; a small sketch of choosing the range under a frame-rate budget follows.
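The snippet below illustrates the accuracy-runtime tradeoff just discussed: pick the largest calibration range whose measured runtime still meets a live-stream budget. The numbers come from Table 3; the helper itself is illustrative and not part of the model.

```python
# Pick the calibration range with the best mAP that still meets a frame-rate budget.
RANGE_EVAL = {1: (82.1, 36.2), 3: (87.4, 35.8), 5: (88.9, 25.2),
              7: (89.2, 17.7), 9: (89.3, 9.3), 11: (89.3, 5.7)}   # range: (mAP, fps)

def pick_range(min_fps=25.0):
    feasible = {r: (m, f) for r, (m, f) in RANGE_EVAL.items() if f >= min_fps}
    return max(feasible, key=lambda r: feasible[r][0])             # best mAP within budget

print(pick_range())   # -> 5 at a 25 fps budget; a 30 fps budget selects the default range 3
```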
The above results provide some useful clues for practical applications. However, MFCN has only been tested on two datasets, so the results may be more heuristic than general. We plan to explore the model more thoroughly in future work.
5 CONCLUSION AND FUTURE
WORK
In this study, we propose the motion-aid feature calibration network (MFCN) for video object detection, which is based on an end-to-end learning approach. Motion estimation is the main tool we use to explore the temporal coherence of features. Multi-frame features with spatial changes are calibrated in a principled way. We conducted a large number of experiments, which indicate that our model surpasses the existing state-of-the-art models on both the KITTI and ImageNet benchmarks. The results show that motion-modeled feature calibration improves the feature representation and facilitates the final detection in video.
However, video object detection still has much room for improvement. For safe autonomous driving, detection accuracy must reach a very high level, and the accuracy we achieve is still some distance from that target. We believe that more efficient and accurate motion estimation and object detection networks may benefit video object detection further. These unresolved issues may be instructive for future work.
REFERENCES
Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. (2017). Multi-
view 3d object detection network for autonomous
driving. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
1907–1915.
Dai, J., Li, Y., He, K., and Sun, J. (2016). R-fcn: Object de-
tection via region-based fully convolutional networks.
In Advances in neural information processing sys-
tems, pages 379–387.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas,
C., Golkov, V., Van Der Smagt, P., Cremers, D., and
Brox, T. (2015). Flownet: Learning optical flow with
convolutional networks. In Proceedings of the IEEE
international conference on computer vision, pages
2758–2766.
Du, Y., Yuan, C., Li, B., Hu, W., and Maybank, S. (2017).
Spatio-temporal self-organizing map deep network for
dynamic object detection from videos. In Proceedings
of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 5475–5484.
Feichtenhofer, C., Pinz, A., and Zisserman, A. (2017). De-
tect to track and track to detect. In Proceedings of the
IEEE International Conference on Computer Vision,
pages 3038–3046.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013).
Vision meets robotics: The kitti dataset. The Inter-
national Journal of Robotics Research, 32(11):1231–
1237.
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE
international conference on computer vision, pages
1440–1448.
Han, W., Khorrami, P., Paine, T. L., Ramachandran, P.,
Babaeizadeh, M., Shi, H., Li, J., Yan, S., and Huang,
T. S. (2016). Seq-nms for video object detection.
arXiv preprint arXiv:1602.08465.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Spatial pyra-
mid pooling in deep convolutional networks for visual
recognition. IEEE transactions on pattern analysis
and machine intelligence, 37(9):1904–1916.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A.,
and Brox, T. (2017). Flownet 2.0: Evolution of optical
flow estimation with deep networks. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 2462–2470.
Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T.,
Zhang, C., Wang, Z., Wang, R., Wang, X., et al.
(2018). T-cnn: Tubelets with convolutional neural net-
works for object detection from videos. IEEE Trans-
actions on Circuits and Systems for Video Technology,
28(10):2896–2907.
Kang, K., Ouyang, W., Li, H., and Wang, X. (2016). Ob-
ject detection from video tubelets with convolutional
neural networks. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 817–825.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Lee, B., Erdenee, E., Jin, S., Nam, M. Y., Jung, Y. G., and
Rhee, P. K. (2016). Multi-class multi-object tracking
using changing point detection. In European Confer-
ence on Computer Vision, pages 68–83. Springer.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B.,
and Belongie, S. (2017a). Feature pyramid networks
for object detection. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 2117–2125.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2017b). Focal loss for dense object detection. In
Proceedings of the IEEE international conference on
computer vision, pages 2980–2988.
Liu, D., Wang, Y., Chen, T., and Matson, E. T. (2019). Ap-
plication of color filter and k-means clustering filter
fusion in lane detection for self-driving car. In Pro-
ceedings of The Third IEEE International Conference
on Robotic Computing.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. In European conference on com-
puter vision, pages 21–37. Springer.
Ma, C., Huang, J.-B., Yang, X., and Yang, M.-H. (2015).
Hierarchical convolutional features for visual track-
ing. In Proceedings of the IEEE international con-
ference on computer vision, pages 3074–3082.
Ramanagopal, M. S., Anderson, C., Vasudevan, R., and
Johnson-Roberson, M. (2018). Failing to learn: au-
tonomously identifying perception failures for self-
driving cars. IEEE Robotics and Automation Letters,
3(4):3860–3867.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks. In Advances in neural information
processing systems, pages 91–99.
Ren, S., He, K., Girshick, R., Zhang, X., and Sun, J. (2017).
Object detection networks on convolutional feature
maps. IEEE transactions on pattern analysis and ma-
chine intelligence, 39(7):1476–1481.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al. (2015). Imagenet large scale visual
recognition challenge. International journal of com-
puter vision, 115(3):211–252.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Uijlings, J. R., Van De Sande, K. E., Gevers, T., and
Smeulders, A. W. (2013). Selective search for object
recognition. International journal of computer vision,
104(2):154–171.
Wang, Y., Liu, D., Jeon, H., Chu, Z., and Matson, E.
(2019). End-to-end learning approach for autonomous
driving: A convolutional neural network model. In
Proceedings of the 11th International Conference
on Agents and Artificial Intelligence - Volume 2:
ICAART, pages 833–839. INSTICC, SciTePress.
Zeng, X., Ouyang, W., Yang, B., Yan, J., and Wang, X.
(2016). Gated bi-directional cnn for object detection.
In European Conference on Computer Vision, pages
354–369. Springer.
Zhang, X., Zhou, X., Lin, M., and Sun, J. (2018). Shuf-
flenet: An extremely efficient convolutional neural
network for mobile devices. In Proceedings of the
IEEE Conference on Computer Vision and Pattern
Recognition, pages 6848–6856.
Zhu, X., Dai, J., Yuan, L., and Wei, Y. (2018). Towards high
performance video object detection. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pages 7210–7218.
Zhu, X., Wang, Y., Dai, J., Yuan, L., and Wei, Y. (2017a).
Flow-guided feature aggregation for video object de-
tection. In Proceedings of the IEEE International
Conference on Computer Vision, pages 408–417.
Zhu, X., Xiong, Y., Dai, J., Yuan, L., and Wei, Y. (2017b).
Deep feature flow for video recognition. In Proceed-
ings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 2349–2358.
Zitnick, C. L. and Dollár, P. (2014). Edge boxes: Locating
object proposals from edges. In European conference
on computer vision, pages 391–405. Springer.