Joint Semantic and Motion Segmentation for Dynamic Scenes using Deep
Convolutional Networks
Nazrul Haque, Dinesh Reddy and K. Madhava Krishna
International Institute of Information Technology, Hyderabad, India
nazrul.athar@research.iiit.ac.in, dinesh.andromeda@gmail.com, mkrishna@iiit.ac.in
Keywords:
Monocular Semantic Motion Segmentation, Scene Understanding, Convolutional Neural Networks.
Abstract:
Dynamic scene understanding is a challenging problem, and motion segmentation plays a crucial role in solving
it. Incorporating semantics and motion enhances the overall perception of a dynamic scene. For applications
of outdoor robotic navigation, joint learning methods have not been extensively used for extracting spatio-
temporal features or adding different priors into the formulation. The task becomes even more challenging
when no stereo information is available. This paper proposes an approach to fuse semantic features and
motion cues using CNNs, to address the problem of monocular semantic motion segmentation. We deduce
semantic and motion labels by integrating optical flow as a constraint, together with semantic features, into a
dilated convolution network. The pipeline consists of three main stages, i.e., feature extraction, feature amplification
and multi-scale context aggregation, to fuse the semantics and flow features. Our joint formulation shows
significant improvements in monocular motion segmentation over state-of-the-art methods on the challenging
KITTI tracking dataset.
1 INTRODUCTION
Visual understanding of dynamic scenes is a critical
component of an autonomous outdoor navigation sys-
tem. Interpreting a scene involves associating a semantic concept, also referred to as a label, with each image pixel. These semantics can then be incorporated at a higher level to reason about the image holistically. Traditional scene understanding approaches (Chen et al., 2014) (Athanasiadis et al., 2007) (Shotton et al., 2008) have focused on extracting pixel-level semantic labels, and have demonstrated superior performance in static scenes. Motion and semantics provide complementary cues about a dynamic scene, and can be used to generate a comprehensive understanding of the scene. Some recent approaches (Reddy et al., 2014) (Wedel et al., 2009) leverage stereo information to incorporate motion cues into the scene understanding framework.
We focus on the problem of obtaining semantic
motion segmentation from monocular images. Recent success in scene understanding using convolutional neural networks motivated us to extend existing models that perform semantic segmentation to incorporate
motion cues. The success of deep neural network ar-
chitectures can be attributed to the efficient learning
and inference mechanisms employed. Learning in-
volves determining a set of parameters using multi-
ple iterations of stochastic gradient descent over ran-
domly sampled batches of labeled images, and infer-
ence on a target image involves only a forward pass
of the image through the network.
Figure 1: The final output of our semantic motion segmentation approach. Semantic labels are prefixed with motion labels, such as Moving Car and Stationary Pedestrian. (Best viewed in color.)
Deep learning architectures used for scene un-
derstanding incorporate semantic labels for learning
scene descriptions. We aim to generate richer descrip-
tions by prefixing motion labels to semantics such
as ’Moving Car’ and ’Stationary Car’, and do so
in a joint framework. Currently, deep architectures
model either motion (Fischer et al., 2015) or seman-
tics (Long et al., 2015) in an exclusive manner. To
the best of our knowledge, this is the first effort to-
wards seamlessly integrating motion cues with deep
architectures that are trained to predict only seman-
tics. The proposed joint learning pipeline is efficient,
and learning can be performed end-to-end. Fig. 1
shows a sample output of the proposed framework.
In settings where images are obtained from a monocular camera, motion detection has typically been tackled by taking into account the optical flow between two subsequent images, which tends to fail with large camera displacements. For outdoor robotic vision, such camera displacement is unavoidable. This has been addressed in (Tourani and Krishna, 2016), where motion models are generated and merged using trajectory clustering into different affine motion subspaces. However, the moving object proposals generated from such a prior model are sparse collections of points lying on the object, resulting in a sparse motion segmentation. (Fragkiadaki et al., 2015) exploit appearance similarity to capture parts of moving objects using a two-stream CNN with optical flow, and rank spatio-temporal segments over a video sequence by mapping clustered trajectories to pixel tubes. In contrast, our approach performs joint optimization of pixel-wise motion and semantic labels, owing to the fact that they are interrelated. An intuitive example of this relation is that the likelihood of a moving car or moving pedestrian is higher than that of a moving tree or wall. To exploit this correlation, our pipeline integrates semantic and motion cues in three stages, namely feature extraction, feature amplification and multi-scale context aggregation. The proposed approach is shown to be effective for motion segmentation on outdoor scenes, even with a moving camera.
In summary, the key contributions of our work are as follows.
- We present an end-to-end convolutional neural network architecture that performs joint learning of motion and semantic labels from monocular images.
- We provide a novel method for the seamless integration of motion cues with networks trained for predicting semantic labels.
- We present results on several sequences of the challenging KITTI benchmark and achieve results superior to the state of the art.
The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 presents the architecture and approach used for joint learning of motion and semantic labels. In Section 4, we summarize the experiments carried out, the dataset used, and the training procedure for our joint module. We also show the evaluation and comparison of our approach in Section 4.3.
2 RELATED WORK
A fair amount of literature exists in the field of semantic and motion understanding of scenes. Traditional approaches for semantic segmentation involve extracting features from an image and using different methods to classify each pixel. Multiple works have trained models for semantic labels (Fields, 2001) (Reddy et al., 2014) (Russell et al., 2009) (Koltun, 2011). However, with the emerging era of deep learning, a large body of literature on semantic segmentation has shown substantial improvements over previous baselines. Approaches using deep convolutional neural networks (LeCun et al., 1989) have been shown to outperform most other methods on the basic problems of vision. The literature includes works by (Lin et al., 2015) (Liu et al., 2015) (Dai et al., 2015), where techniques such as bounding-box supervision, deep networks followed by CRF formulations, and MRFs were put to use, achieving significant results. Further, (Long et al., 2015) adapted the VGG-Net model (Simonyan and Zisserman, 2014b) to predict pixel-to-pixel semantic labels, with fusion at pool layers for output up-sampling. Yu and Koltun (Yu and Koltun, 2015) proposed an adaptation of the VGG-16 architecture for systematic expansion of receptive fields using dilated convolutions for dense image segmentation, giving more accurate results than prior adaptations. The approach carries over a global perspective without loss in resolution using repeated deep convolutional layers.
Motion segmentation has been extensively addressed, particularly for outdoor robotic navigation. Most works use geometric constraints to attain significant accuracy. In the seminal work by (Elhamifar and Vidal, 2009), trajectory points were modeled as sparse combinations of evaluated trajectories. (Tourani and Krishna, 2016) used in-frame shear constraints to generate and merge affine models, achieving state-of-the-art results in sparse motion segmentation using a monocular camera. Recently, many deep convolutional networks have been used to learn motion labels (Rozantsev et al., 2014) (Fragkiadaki et al., 2015) (Tokmakov et al., 2016) for motion segmentation. Although they work very well, they suffer from the unavailability of large datasets or rely on stereo information, therefore proving ineffective for monocular systems. (Fragkiadaki et al., 2015) presents state-of-the-art results in the detection of per-frame moving object proposals.
Figure 2: Illustration of the proposed approach. Images at t and t+1 are provided to the network (A). The dilated network is fine-tuned with the motion labels added, and the last convolution features are extracted (B). Optical flow between the two frames (C) is scaled and resized to the size of the feature maps (D). The dilated features are amplified with the optical flow magnitude by an element-wise product (E). Further, the convolution layers are frozen and the fully connected layers are fine-tuned. The augmented feature maps are further enhanced by end-to-end training with the Context Module, learning dependencies between object classes and motion labels. The predictions obtained from the softmax layer are upsampled to give a joint label to each pixel (F). (Best viewed in color.)
The work emphasizes segmentation on monocular uncalibrated video sequences by ranking heuristics and regression using a two-stream network with optical flow, followed by supervoxel projection.
Joint classification of semantic and motion labels is relatively new in the field, and much of the work has been carried out by (Reddy et al., 2014) using a dense CRF joint formulation on stereo image sequences. This, however, would prove ineffective in monocular settings, as it relies heavily on depth information. We draw an analogy from works (Fischer et al., 2015) (Simonyan and Zisserman, 2014a) (Karpathy et al., 2014) (Park et al., 2016) where two parallel streams of convolutional neural networks are fused for action recognition in videos or for generating optical flow. Due to the unavailability of large-scale datasets for semantic motion segmentation, training such a network from scratch is infeasible. Instead, we adapt the concept of feature amplification highlighted in (Park et al., 2016) to our problem in a joint formulation, resulting in an end-to-end model for semantic motion segmentation. We outperform state-of-the-art results for monocular motion segmentation using our joint model.
3 MONOCULAR SEMANTIC
MOTION SEGMENTATION
In this section, we present our semantic motion seg-
mentation framework. A joint formulation is pro-
posed for the overall learning task and is composed of
three main modules, viz. features from dilated convo-
lutions, feature amplification, and multi-scale context
aggregation. We also provide an illustration of our
approach in Fig. 2.
3.1 Features from Dilated Convolutions
To obtain semantic features, we use a neural net-
work architecture which employs dilated convolu-
tions, specifically engineered for dense predictions.
Originally proposed in (Yu and Koltun, 2015), a dilated convolution operator is a traditional convolution operator modified to apply a filter at different ranges using different dilation factors.
In relation to a discrete function $H : \mathbb{Z}^2 \rightarrow \mathbb{R}$ and a discrete filter $q : \Omega_s \rightarrow \mathbb{R}$ of size $(2s+1)^2$, where $\Omega_s = [-s, s]^2 \cap \mathbb{Z}^2$, the convolution operator is defined as:

$$(H *_d q)(c) = \sum_{r + dt = c} H(r)\, q(t) \qquad (1)$$

where $d$ is the dilation factor. Such an operator $*_d$ is referred to as a $d$-dilated convolution. The operator can be intuitively understood as follows. Given a 1D signal $f$ and a kernel $q$, with dilated convolution the kernel touches the signal at every $d$-th entry.
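For concreteness, the following is a minimal NumPy sketch of Eq. (1) in one dimension (not part of the paper's implementation); the function name and example values are illustrative, and out-of-range samples are treated as zeros.

```python
import numpy as np

def dilated_conv1d(signal, kernel, d):
    """1D d-dilated convolution following Eq. (1): out[c] = sum_t signal[c - d*t] * kernel[t],
    with the kernel indexed over t in [-s, s] and zero padding at the borders."""
    s = (len(kernel) - 1) // 2
    n = len(signal)
    out = np.zeros(n)
    for c in range(n):
        for t in range(-s, s + 1):
            r = c - d * t                  # r + d*t = c, as in Eq. (1)
            if 0 <= r < n:
                out[c] += signal[r] * kernel[t + s]
    return out

f = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
q = np.array([1., 0., -1.])
print(dilated_conv1d(f, q, d=1))  # ordinary convolution
print(dilated_conv1d(f, q, d=2))  # same kernel, receptive field widened from 3 to 5
```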
Expanding the receptive fields of existing pooled architectures leads to a correspondingly ungainly increase in the number of parameters. The architecture proposed by Yu and Koltun (Yu and Koltun, 2015) builds on the fact that dilated convolutions sustain exponential expansion of the effective receptive field without loss in coverage. While pooling architectures lose resolution, the dilated architecture can be initialized with the same parameters while producing higher-resolution output.
Joint Semantic and Motion Segmentation for Dynamic Scenes using Deep Convolutional Networks
77
Figure 3: Network architecture - w × h s: layer with kernels of width w, height h, and stride s. The dilation factor of a layer, if any, is shown on top of the layer. The number of channels in the output of each layer is shown below it. For instance, the fully connected layer has 4096 channels in its output block.
3.1.1 Network Architecture
Our network architecture is primarily adapted from the VGG-16 framework proposed by (Simonyan and Zisserman, 2014b), with modifications from the work of Yu and Koltun on dilated convolutions. The VGG-16 architecture comprises a stack of convolutions followed by three fully-connected layers. This was tailored for dense predictions by Long et al. (Long et al., 2015). The architecture proposed by Long et al. includes two major changes. First, the inner product layers are converted to convolutions, which removes the restriction on the size of the input image, since the architecture no longer contains any inner product layers. Second, an upsampling layer is introduced, which restores the spatial resolution of the output through a learned operation. The upsampling operation is carried out at different intermediate layers and the results are fused to obtain dense predictions. This allows the architecture to predict finer details with global, high-level information in place.
We adapt the fully convolutional network of Long et al. and integrate the modifications proposed by (Yu and Koltun, 2015). Our network architecture is shown in Fig. 3. The last two pooling layers of the VGG-16 architecture (Simonyan and Zisserman, 2014b) are removed. Furthermore, for each removed pooling layer, the subsequent convolution layers are given a dilation factor of 2. This enables the network to generate high-resolution features with the same initialization parameters.
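As an illustration of this modification, the following PyTorch-style sketch (the paper's implementation uses Caffe) shows the tail of a VGG-16 feature extractor with the last two pooling layers dropped and the subsequent convolutions dilated by a factor of 2; the channel counts follow VGG-16, while the exact layer arrangement is assumed for illustration.

```python
import torch.nn as nn

def dilated_vgg_tail(in_channels=256):
    """Tail of a VGG-16 feature extractor with pool4/pool5 removed.
    The conv5 block compensates with dilation 2, so pretrained 3x3
    kernels can be reused while the output resolution is preserved."""
    return nn.Sequential(
        # conv4 block (the pool4 layer that used to follow it is removed)
        nn.Conv2d(in_channels, 512, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
        # conv5 block: dilation 2 replaces the lost downsampling
        nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
        nn.Conv2d(512, 512, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
    )
```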
3.1.2 Network Initialization
We present a novel method for initialization of the
ConvNet for obtaining convolution features. We use
the model by (Yu and Koltun, 2015), pre-trained for
semantic labels. For training with joint semantic and
motion labels, we modify the final convolution layer
and change the number of outputs to (C+M), where C
is the number of semantic labels predicted by the di-
lated ConvNet and M is the number of motion labels.
For instance, M can be 2, the two labels being moving car and moving pedestrian in an outdoor scene. Furthermore, we copy the weights from the pretrained dilated ConvNet to the modified network for all layers except the final layer. For the final convolution layer, we copy the weights as follows:

weights['final'][i].data[1:C, :, :, :] = weights_p['final'][i].data

where i ∈ {0, 1} (1 for weights and 0 for bias), weights_p is the pre-trained weights array of the dilated network for semantic features, and weights is the weights array of the modified network. The weights for the M motion labels in the final convolution layer are initialized using Xavier initialization (Glorot and Bengio, 2010).
We propose that this initialization scheme works well for training with fairly small annotated datasets. The proposed initialization mitigates the limitation posed by the unavailability of a large-scale annotated dataset for training with joint semantic and motion labels. The network is trained on our annotated dataset with the given initialization. Furthermore, the 'Convolution features' (see Fig. 3) from the network are extracted for joint learning with flow features. The joint labels obtained from the softmax layer form our Baseline results for future comparisons.
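The initialization can be sketched in a framework-agnostic way as follows; the array shapes and the helper name are hypothetical, and in the paper the copy is performed directly on the weight blobs of the final Caffe layer.

```python
import numpy as np

def init_final_layer(pre_w, pre_b, M):
    """Initialize a (C+M)-channel final convolution: copy the C pretrained
    semantic filters, Xavier-initialize the M new motion filters.
    pre_w: (C, ch, kh, kw) pretrained weights, pre_b: (C,) pretrained biases."""
    C, ch, kh, kw = pre_w.shape
    w = np.empty((C + M, ch, kh, kw), dtype=pre_w.dtype)
    b = np.zeros(C + M, dtype=pre_b.dtype)
    w[:C], b[:C] = pre_w, pre_b                  # reuse semantic filters as-is
    fan_in, fan_out = ch * kh * kw, (C + M) * kh * kw
    limit = np.sqrt(6.0 / (fan_in + fan_out))    # Xavier/Glorot uniform bound
    w[C:] = np.random.uniform(-limit, limit, size=(M, ch, kh, kw))
    return w, b
```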
3.2 Feature Amplification
We leverage optical flow for learning motion cues in an image. Conventionally, two-stream networks have proved useful for this task, where one stream learns semantic features from the RGB image input while the other learns motion cues. The features from the two streams are fused at an intermediate layer for joint learning. However, the unavailability of a large annotated dataset with joint semantic and motion labels is a major bottleneck for learning with two-stream architectures. Akin to the ideas proposed in (Park et al., 2016), we present an approach for learning the relationship between the semantic and motion class of an object. The method proposed in (Park et al., 2016) is used primarily for action recognition: features from the last convolution layer of a convolutional network trained for semantic features are amplified using the optical flow magnitude to identify the moving parts of an image before the fully connected layers are evaluated. We extend the underlying idea to the task of semantic motion segmentation. An intuitive reasoning behind such an adaptation is the similarity, in both problems, of recognizing motion cues and integrating them with semantic features.
Figure 4: Fine-tuning the fully connected layers using amplified features, as described in (Park et al., 2016). Amplified features are obtained by taking the element-wise product of the scaled optical flow magnitude with the feature maps obtained from the last convolution layer.
We propose a method to augment the feature maps obtained from the last convolution layer of our dilated network (see Fig. 3) in order to incorporate motion cues. Optical flow is generated between the consecutive frames at t and t+1. Next, we compute the Euclidean norm of the flow vectors and normalize the magnitudes to the range [1, 2]. With the flow information in hand, we quantize the scaled magnitudes and convert them to a grayscale image. The image is then resized to the size of the feature maps of the last convolution layer of our spatial network. Given a 900x900 RGB image as input, our dilated network outputs feature maps of dimension 512x90x90; hence, the flow image is resized to 90x90. Thereafter, an element-wise product is taken between the flow image and each feature map in the stack. The reason for scaling the magnitudes from 1 rather than 0 is to avoid zeroing out the feature values obtained from the spatial network, which are equally important. Further, we freeze the convolution layers of the network and fine-tune the fully connected layers with the amplified feature maps as their input. The amplification process is visualized in Fig. 4. The semantic features are enhanced with motion cues as a consequence of feature amplification. The amplification benefits from the temporal consistency incorporated via optical flow and from the difference in flow magnitude between moving objects and their surroundings. Object boundaries are also retained, since the amplification operates over the baseline semantic motion features, thereby handling inaccuracies at optical flow boundaries. Label probabilities obtained from the softmax layer after fine-tuning are up-sampled to obtain dense predictions with joint labels. The resulting predictions form our Joint results for evaluation.
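The amplification step can be sketched as below under the stated assumptions (the quantization to a grayscale image is folded into the scaling for brevity, and OpenCV is used only as a convenient resizing routine); the function name and shapes are illustrative.

```python
import cv2
import numpy as np

def amplify_features(feature_maps, flow):
    """Optical-flow feature amplification.
    feature_maps: (512, 90, 90) activations of the last convolution layer.
    flow: (H, W, 2) dense optical flow between frames t and t+1."""
    mag = np.linalg.norm(flow, axis=2)                              # per-pixel Euclidean norm
    mag = 1.0 + (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)  # scale to [1, 2]
    h, w = feature_maps.shape[1:]
    mag = cv2.resize(mag, (w, h), interpolation=cv2.INTER_LINEAR)   # match the 90x90 maps
    return feature_maps * mag[None, :, :]                           # element-wise amplification
```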
3.3 Multi-Scale Context Aggregation
We use the context module introduced by Yu and Koltun (Yu and Koltun, 2015) to enhance the amplified features. The architecture was proposed as an extension to existing CNN architectures for an overall increase in accuracy for dense predictions. The module improves the feature maps by successive dilated convolutions, supporting exponential expansion of the receptive field without losing resolution. This is achieved by continuously increasing the dilation factor with layer depth. The architecture consists of 7 convolution layers, dilated with factors 1, 1, 2, 4, 8, 16 and 1; each of these layers applies 3x3 convolutions with the specified dilation factor. The module aggregates contextual information at multiple scales and outputs feature maps of the same size as the input by padding the intermediate layers.
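A PyTorch-style sketch of such a module is shown below for illustration (the paper's implementation uses Caffe), following the dilation schedule described above; the ReLU after each layer is an assumption.

```python
import torch.nn as nn

def context_module(channels):
    """Multi-scale context module: seven 3x3 convolutions with dilations
    1, 1, 2, 4, 8, 16 and 1; padding equal to the dilation keeps the
    spatial size of the feature maps unchanged."""
    layers = []
    for d in (1, 1, 2, 4, 8, 16, 1):
        layers += [nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                   nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```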
The weights in this module are initialized with a
form of identity initialization, commonly used for re-
current networks. In mathematical terms:
$$q_j(t, i) = \mathbf{1}_{[t = 0]}\, \mathbf{1}_{[i = j]} \qquad (2)$$

where $i$ and $j$ index the input and output feature maps respectively. An identity initialization of this form creates filters that simply relay their inputs to the next layer.
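A minimal sketch of this initialization for a bank of 3x3 filters is shown below; the helper name is hypothetical.

```python
import numpy as np

def identity_init(num_maps, ksize=3):
    """Identity initialization of Eq. (2): output map j relays input map i = j,
    i.e. a 1 is placed at the filter center (t = 0) only when i equals j."""
    w = np.zeros((num_maps, num_maps, ksize, ksize))
    center = ksize // 2
    for j in range(num_maps):
        w[j, j, center, center] = 1.0
    return w
```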
We learn the parameters of the context module with our amplified feature maps as its input. The fully connected and softmax layers from our network are appended to the module, and we obtain joint label predictions from its softmax layer.
4 EXPERIMENTS AND
EVALUATION
Experiments were carried out with pre-existing architectures adapted to our problem. The concept of two-stream architectures has recently been used in the field of action recognition, where spatial and temporal nets are combined at the fully connected layer. We tailored this architecture to our problem: two VGG-16 (Simonyan and Zisserman, 2014b) networks, with the image and the optical flow as input to the respective streams, were trained on our annotated dataset. The weights of both streams were initialized with VGG models pre-trained for the semantic segmentation task. We also adapted FlowNet (Fischer et al., 2015), which has been shown to outperform the state of the art in learning optical flow, to our problem. The network was initialized with pretrained FlowNet-C weights and trained on our annotated dataset with the images at t and t+1 as inputs to the two streams.
However, neither formulation worked well in combining motion cues with the semantic information at hand. We attribute this to the failure of CNNs to learn and extract useful features from smaller datasets, and collecting large-scale scene datasets with joint semantic and motion labels is very expensive. In contrast, our joint learning approach reduces the burden of learning motion features from scratch with large labeled datasets and proves effective with fairly small annotated datasets.
In this section, we describe the details of ConvNet training and the evaluations on the KITTI tracking dataset.
4.1 Dataset
We use the renowned KITTI dataset (Geiger et al., 2012) for the evaluation of our approach. The dataset contains over 40,000 images taken by a camera mounted on a car driving through European roads. The driving sequences contain images from residential and urban scenes, making it a challenging dataset. The dataset was chosen to showcase the proficiency of our approach on outdoor scenes with multiple moving cars, which are uncommon in other datasets. 40 images were chosen from each of five sequences, giving 200 images for training. Each image was manually annotated with 13 labels; specifically, Building, Vegetation, Sky, Car, Sign, Road, Pedestrian, Fence, Pole, Sidewalk, Cyclist, and Moving Car and Moving Pedestrian for objects in motion. For testing, 60 images from the KITTI tracking sequences were chosen as a validation set and annotated with the same label spectrum. For the validation set, we use challenging sequences with multiple moving cars and ensure no overlap between the training and validation sets. We use DeepFlow (Weinzaepfel et al., 2013), known for its state-of-the-art results on the KITTI benchmark, for dense optical flow computation. We plan to release the code, trained models and dataset with joint labels to encourage future work in the field.
4.2 Learning
In this section we describe the training procedure for our proposed approach. Our implementation is based on the publicly available Caffe (Jia et al., 2014) framework. First, we describe the input to the data channel of the network; this applies to all modules in our proposed method. The input image resolution is 1242 x 375, obtained from the KITTI tracking dataset. Images are padded using reflection padding and 900x900 random crops are sampled, which then undergo randomized horizontal flipping. Each input batch contains crops from randomly selected images of the training dataset. This constitutes the input to the module.
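A sketch of this input pipeline is given below, assuming the label map and flow magnitude are cropped and flipped identically to the RGB image; the helper name and the exact padding split are illustrative.

```python
import numpy as np

def sample_crop(image, flow_mag, crop=900):
    """Reflection-pad a 1242x375 KITTI frame, take a random 900x900 crop,
    and apply the same crop and random horizontal flip to the flow magnitude."""
    h, w = image.shape[:2]
    pad_h, pad_w = max(0, crop - h), max(0, crop - w)
    pads = ((pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2))
    img = np.pad(image, pads + ((0, 0),), mode='reflect')
    mag = np.pad(flow_mag, pads, mode='reflect')
    y = np.random.randint(0, img.shape[0] - crop + 1)
    x = np.random.randint(0, img.shape[1] - crop + 1)
    img, mag = img[y:y + crop, x:x + crop], mag[y:y + crop, x:x + crop]
    if np.random.rand() < 0.5:                      # randomized horizontal flip
        img, mag = img[:, ::-1], mag[:, ::-1]
    return img, mag
```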
Training: Training is performed in three stages. First, the network architecture (see Fig. 3) is fine-tuned with the motion labels added, to obtain convolution features and a weight initialization for joint training with flow features. The learning rate and momentum were set to $10^{-4}$ and 0.9, respectively. Training was carried out for 10,000 iterations with batch size 1, using stochastic gradient descent. The dense predictions obtained from this module form our baseline for further comparisons. We use these learned weights to train the joint model with the augmented feature maps as input. The optical flow magnitude is computed between the frames at t and t+1, and the flow image is padded and cropped to 900x900 using the same crop as the RGB image. The convolution layers are then frozen and the network is trained with the amplified feature maps as input to the fully connected layers. Training was carried out for 10,000 iterations; the other parameters stay the same.

Then, the context module is plugged into the architecture and end-to-end training is performed for 20,000 iterations with batch size 1. The learning rate and momentum are set to $10^{-5}$ and 0.99, respectively. We refer to the joint label predictions obtained from the softmax layer of this model as Joint+Context.
Figure 5: Output labels from each stage of our end-to-end module (panels: Input, Baseline, Joint, +Context). The image is taken from our KITTI tracking test dataset. The baseline predictions assign wrong labels to patches of the moving car, and the motion labels of the car [cyan] improve significantly with our joint model. (Best viewed in color.)
4.3 Results
We evaluate the proposed approach on our manually annotated KITTI tracking test dataset.
Figure 6: Qualitative evaluation - motion segmentation on our KITTI test dataset. The images on the left contain a single moving car; those on the right contain multiple moving cars. Blue pixels represent stationary regions and red pixels depict motion. We compare our approach with the STMOP-M moving object proposals (Fragkiadaki et al., 2015). In the figure, GT-M is the ground truth motion annotation, STMOP-M is the output from (Fragkiadaki et al., 2015), and OURS-M is the motion segmentation obtained from the proposed approach. In contrast to STMOP-M, where over-segmentation and false positive cases are observed on the roads and fence, our proposed approach yields better segmentation and motion boundaries for cars in motion. (Best viewed in color.)
The test images (see Sec. 4.1), chosen from different sequences, pose challenging scenarios for motion segmentation with multiple moving objects. There are also prominent cases where moving cars lie in the camera subspace. To demonstrate qualitative results, we take four sequences consisting of 116, 143, 309 and 46 images. Qualitative results are provided in Fig. 7, Fig. 8 and, on complete sequences, in the supplementary video.
To the best of our knowledge, there is no available monocular joint semantic and motion baseline. Hence, we show independent semantic and motion evaluations against the existing state of the art in the respective fields. For instance, for a pixel bearing the joint label 'Moving Car', we call 'Car' the semantic label or object class and 'Moving' the motion class of the pixel. Comparative evaluations are carried out for semantic segmentation and monocular motion segmentation. For joint semantic and motion labels, however, we evaluate against manually annotated Ground Truth labels.
4.3.1 Qualitative Evaluation
In this section, we show our results with joint labels for the different stages proposed in the paper, in comparison to Ground Truth.
Table 1: Quantitative evaluation on our KITTI tracking test dataset. We compare the PPV (positive predictive value) of our approach with the state-of-the-art sparse motion segmentation methods SHEAR-M (Tourani and Krishna, 2016) and STMOP-M (Fragkiadaki et al., 2015). We achieve a 4.9% gain in this metric over the existing state of the art.

Model | Stationary | Moving
STMOP-M | 98.34 | 83.91
SHEAR-M | 99.85 | 84.37
Ours (Joint+Context) | 99.55 | 89.28
We also show a qualitative assessment of motion segmentation in monocular settings against STMOP-M (Fragkiadaki et al., 2015). In the figures, the 'Stationary Car' and 'Stationary Pedestrian' labels are abbreviated as 'Car' and 'Pedestrian' respectively, while a label is prefixed with 'Moving' in the case of motion.
Motion: We show improvements over our baseline results in Fig. 5. The baseline results label parts of the moving car as stationary. However, with optical-flow-based feature amplification, the pixels of cars in motion are rectified as moving. Further, via feature enhancement with the Context Module, the labels improve significantly. We attribute the improvements from feature amplification to the fact that temporal consistency has been incorporated into the baseline using optical flow.
Figure 7: Qualitative evaluation of joint labels against Ground Truth annotations on our KITTI test dataset. Top to bottom: (1) input image from the KITTI sequences; (2) Ground Truth for semantic motion segmentation; (3) Baseline predictions: joint labels using dilated convolution; (4) Joint module: results obtained after feature amplification with optical flow; (5) Context module: joint predictions after feature enhancement with the context module. (Best viewed in color.)
In addition, the proposed feature amplification yields a clear demarcation between the boundaries of moving objects and their stationary surroundings, owing to the variance in flow vector magnitude, which is carried into the final segmentation.
STMOP (Fragkiadaki et al., 2015) generates moving object proposals on video sequences. We use the available code to generate proposals on the KITTI sequences and, for a fair comparison, take the proposals with the best supervoxel projection on the objects. We show our monocular motion segmentation results in comparison to Ground Truth and the STMOP (Fragkiadaki et al., 2015) moving object proposals. In Fig. 6, which contains images with single and multiple moving cars, STMOP-M leads to over-segmentation, while our approach correctly segments the moving cars and removes the spurious segments on the road and fence. In these cases STMOP fails in outdoor robotic scenarios, essentially due to large camera motion and optical flow bleeding, while our approach uses semantic priors and benefits from the correlation between motion and semantics.
Joint Semantic and Motion: We also evaluate our approach against the Ground Truth semantic motion labeling. In the sequences shown in Fig. 7, the Baseline results incorrectly label patches of the moving car closer to the camera (in Sequence 1) as stationary (seen in violet). A similar observation holds for the baseline results on the moving car in Sequence 2. These patches are rectified as moving as a consequence of joint training with amplified features, which again reiterates the utility of joint learning and inference between motion and semantic cues. The remaining improper patches on the moving cars are further rectified by the context module using multi-scale context aggregation. To assess joint segmentation beyond car scenes, we consider a sequence (Sequence 3) from KITTI with moving pedestrians; the results for each stage are depicted in Fig. 7. Parts of the moving pedestrians on the left are labeled as stationary in the baseline results, and the joint learning with context aggregation corrects the motion labels of the pedestrians. For pedestrians far away from the camera, false positives are observed from our approach in tiny patches, due to inconsistency in the optical flow magnitude of pedestrians at large distances from the camera. Further, to show the consistency of our joint labels in challenging outdoor scenes, we present joint semantic motion results on both highway and city street scenes in Fig. 8.
Table 2: Quantitative analysis of motion label predictions compared with STMOP. Left: on our annotated KITTI (tracking sequence 4) test set, consisting of a lone moving object. Right: on our annotated KITTI images consisting of multiple moving cars. We compare our results with the (Fragkiadaki et al., 2015) moving object proposals.

Left (single moving object):
Model | Stationary | Moving
(Fragkiadaki et al., 2015) | 97.75 | 62.97
Baseline | 99.44 | 76.36
Joint | 99.35 | 81.94
Joint+Context | 99.28 | 83.69

Right (multiple moving cars):
Model | Stationary | Moving
(Fragkiadaki et al., 2015) | 97.63 | 44.53
Baseline | 99.05 | 66.23
Joint | 99.03 | 70.67
Joint+Context | 98.97 | 71.98
Table 3: Quantitative evaluation of semantic label predictions from our proposed approach - Joint+Context (Ours-S) on our KITTI test dataset. We compare our method with DeepLab-LFOV (Chen et al., 2014) and Segnet (Badrinarayanan et al., 2015), known for semantic segmentation on outdoor driving scenes.

Method | Building | Vegetation | Sky | Car | Sign | Road | Pedestrian | Fence | Pole | Sidewalk | Cyclist | mean IOU
Segnet | 66.70 | 78.11 | 89.32 | 69.74 | 12.45 | 71.69 | 12.09 | 25.03 | 21.12 | 44.01 | 11.2 | 45.61
Deeplab | 73.35 | 84.17 | 91.33 | 70.76 | 7.66 | 69.63 | 24.41 | 68.30 | 16.51 | 26.14 | 13.53 | 49.62
Ours-S | 78.52 | 84.99 | 90.07 | 88.18 | 19.28 | 75.82 | 8.46 | 76.60 | 29.31 | 36.84 | 66.70 | 59.53
4.3.2 Quantitative Evaluation
In this section, we perform a quantitative assessment of both semantic and motion segmentation. We show evaluations against (Tourani and Krishna, 2016) and (Fragkiadaki et al., 2015). For semantic segmentation, we compare our results with (Chen et al., 2015) and (Badrinarayanan et al., 2015), which have shown results for semantic segmentation on driving scenes.

Motion: For the quantitative evaluation of motion segmentation, we compare our results with the STMOP moving object proposals. The evaluation cross-verifies each predicted pixel against the corresponding ground truth motion label, stationary or moving, and is carried out in two settings. First, we compare our dense motion segmentation with the STMOP moving object proposals, using intersection over union as the evaluation metric for dense motion segmentation. The metric is defined as TP/(TP+FP+FN), where TP denotes true positives, FP false positives and FN false negatives. Table 2 summarizes our quantitative motion segmentation evaluation.
The assessment of dense motion segmentation is done in two broad categories, i.e., on annotated sequences with a lone moving object and on sequences with multiple objects in motion. In the case of a single moving car, we achieved 70.67% accuracy in the detection of the moving car with our joint module, while STMOP yields 59.97% detection accuracy. The increase in accuracy is attributed to the incorporated correlation between labels and motion. Further, using context aggregation, the context module yields an additional improvement. In the case of multiple moving objects, STMOP yields 41.53% accuracy; the decrease for STMOP is due to the large camera motion observed in these scenes, while our joint module provides a 70.67% success rate. The joint learning exploits the fact that the likelihood of a moving tree or pole is lower than that of a moving car or person, resulting in a substantial improvement in motion segmentation.
Another key observation is a slight decrease in the stationary accuracy relative to our baseline results. This is because different objects can exhibit different optical flow depending on their depth from the camera, even though they share the same global motion. The decrease is, however, marginal, as shown in Table 2. We also evaluate motion segmentation against the existing state of the art in sparse monocular motion segmentation. The IoU metric used for dense motion segmentation is not known to be used for sparse evaluations; therefore, for a fair comparison with sparse segmentation, we use positive predictive value (PPV), or precision, TP/(TP+FP), as the evaluation metric. The results are summarized in Table 1. We gain 4.9% in motion label precision over the state-of-the-art SHEAR-M (Tourani and Krishna, 2016) on our test dataset.
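Both metrics can be computed as in the short sketch below; the helper name is hypothetical and the inputs are assumed to be binary moving/stationary masks.

```python
import numpy as np

def motion_metrics(pred, gt):
    """IoU and PPV (precision) for the 'moving' class.
    pred, gt: boolean arrays where True marks a pixel predicted/annotated as moving."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / float(tp + fp + fn)   # dense-segmentation metric (Table 2)
    ppv = tp / float(tp + fp)        # precision, used for the sparse comparison (Table 1)
    return iou, ppv
```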
Semantics: For the quantitative evaluation of semantic image segmentation, we use per-class Intersection over Union, similar to the metric used for dense motion segmentation. This is done for 11 semantic labels on our KITTI test dataset. We perform a quantitative semantic evaluation of our approach against Segnet (Badrinarayanan et al., 2015) and DeepLab-LFOV (Chen et al., 2015), both of which have shown results on outdoor driving scenes such as KITTI.
Figure 8: Joint semantic and motion labeling obtained from the proposed approach on challenging urban scenes. From left to right: a highway scene, city streets, and a driving scene with relatively little traffic. The joint labels obtained in these settings depict the robustness and consistency of our proposed approach. (Best viewed in color.)
For comparison with (Chen et al., 2015), we use the publicly available model pre-trained on the PASCAL dataset and fine-tune it on our KITTI training dataset. We run both algorithms, Segnet and DeepLab, on our KITTI test dataset. The semantic label accuracy of the models on the test set is reported in Table 3. Our approach (Joint+Context) outperforms the other two architectures. This is because the dilated architecture produces higher-resolution output, crucial for dense prediction, in comparison to the strided and pooled architectures of the other methods.
5 CONCLUSIONS
In this paper, we have proposed a joint approach to predict semantic and motion labels using a monocular camera. We incorporate spatial and temporal information to learn object class and motion labels jointly. Evaluations show an increase in pixel-wise motion segmentation accuracy without using stereo information. We learn pixel-wise labels without the need to train temporal networks for motion cues, which has proved to be a pitfall given the unavailability of large annotated datasets. To contribute to and encourage future work on monocular semantic motion segmentation, we plan to release the annotated dataset and trained models.

We believe that the proposed work can be extended to pixel-wise labelling of individual moving objects. The end-to-end system can be used for better dynamic scene understanding in complex outdoor environments.
ACKNOWLEDGEMENTS
We would like to thank J. Krishna Murthy for provid-
ing insights into the formulation. We are also grate-
ful to Parv Parkhiya and Aman Bansal for help with
dataset annotation on KITTI Tracking benchmark.
REFERENCES
Athanasiadis, T., Mylonas, P., Avrithis, Y., and Kollias, S.
(2007). Semantic image segmentation and object la-
beling. IEEE transactions on circuits and systems for
video technology, 17(3):298–312.
Badrinarayanan, V., Handa, A., and Cipolla, R. (2015). Seg-
net: A deep convolutional encoder-decoder architec-
ture for robust semantic pixel-wise labelling. arXiv
preprint arXiv:1505.07293.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. L. (2014). Semantic image segmentation
with deep convolutional nets and fully connected crfs.
arXiv preprint arXiv:1412.7062.
Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and
Yuille, A. L. (2015). Semantic image segmentation
with deep convolutional nets and fully connected crfs.
In ICLR.
Dai, J., He, K., and Sun, J. (2015). Boxsup: Exploiting
bounding boxes to supervise convolutional networks
for semantic segmentation. In Proceedings of the
IEEE International Conference on Computer Vision,
pages 1635–1643.
Elhamifar, E. and Vidal, R. (2009). Sparse subspace clus-
tering. In Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on, pages 2790–
2797. IEEE.
Fields, R. (2001). Probabilistic models for segmenting and
labeling sequence data. In ICML 2001.
Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., van der Smagt, P., Cremers, D., and Brox, T. (2015). Flownet: Learning optical flow with convolutional networks. arXiv preprint arXiv:1504.06852.
Fragkiadaki, K., Arbeláez, P., Felsen, P., and Malik, J. (2015). Learning to segment moving objects in videos. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4083–4090. IEEE.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? the kitti vision benchmark
suite. In Conference on Computer Vision and Pattern
Recognition (CVPR).
Glorot, X. and Bengio, Y. (2010). Understanding the dif-
ficulty of training deep feedforward neural networks.
In Aistats, volume 9, pages 249–256.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.,
Girshick, R., Guadarrama, S., and Darrell, T. (2014).
Caffe: Convolutional architecture for fast feature em-
bedding. arXiv preprint arXiv:1408.5093.
Karpathy, A., Toderici, G., Shetty, S., Leung, T., Suk-
thankar, R., and Fei-Fei, L. (2014). Large-scale video
classification with convolutional neural networks. In
Proceedings of the IEEE conference on Computer Vi-
sion and Pattern Recognition, pages 1725–1732.
Koltun, V. (2011). Efficient inference in fully connected
crfs with gaussian edge potentials. Adv. Neural Inf.
Process. Syst.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard,
R. E., Hubbard, W., and Jackel, L. D. (1989). Back-
propagation applied to handwritten zip code recogni-
tion. Neural computation, 1(4):541–551.
Lin, G., Shen, C., Reid, I., et al. (2015). Efficient piece-
wise training of deep structured models for semantic
segmentation. arXiv preprint arXiv:1504.01013.
Liu, Z., Li, X., Luo, P., Loy, C.-C., and Tang, X. (2015). Se-
mantic image segmentation via deep parsing network.
In Proceedings of the IEEE International Conference
on Computer Vision, pages 1377–1385.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-
volutional networks for semantic segmentation. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 3431–3440.
Park, E., Han, X., Berg, T. L., and Berg, A. C. (2016). Com-
bining multiple sources of knowledge in deep cnns for
action recognition. In 2016 IEEE Winter Conference
on Applications of Computer Vision (WACV), pages
1–8. IEEE.
Reddy, N. D., Singhal, P., and Krishna, K. M. (2014). Se-
mantic motion segmentation using dense crf formula-
tion. In Proceedings of the 2014 Indian Conference
on Computer Vision Graphics and Image Processing,
page 56. ACM.
Rozantsev, A., Lepetit, V., and Fua, P. (2014). Flying ob-
jects detection from a single moving camera. arXiv
preprint arXiv:1411.7715.
Russell, C., Kohli, P., Torr, P. H., et al. (2009). Associative
hierarchical crfs for object class image segmentation.
In 2009 IEEE 12th International Conference on Com-
puter Vision, pages 739–746. IEEE.
Shotton, J., Johnson, M., and Cipolla, R. (2008). Semantic
texton forests for image categorization and segmen-
tation. In Computer vision and pattern recognition,
2008. CVPR 2008. IEEE Conference on, pages 1–8.
IEEE.
Simonyan, K. and Zisserman, A. (2014a). Two-stream con-
volutional networks for action recognition in videos.
In Advances in Neural Information Processing Sys-
tems, pages 568–576.
Simonyan, K. and Zisserman, A. (2014b). Very deep con-
volutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556.
Tokmakov, P., Alahari, K., and Schmid, C. (2016). Weakly-
supervised semantic segmentation using motion cues.
arXiv preprint arXiv:1603.07188.
Tourani, S. and Krishna, K. M. (2016). Using in-frame
shear constraints for monocular motion segmentation
of rigid bodies. Journal of Intelligent & Robotic Sys-
tems, 82(2):237–255.
Wedel, A., Meißner, A., Rabe, C., Franke, U., and Cremers,
D. (2009). Detection and segmentation of indepen-
dently moving objects from dense scene flow. In Inter-
national Workshop on Energy Minimization Methods
in Computer Vision and Pattern Recognition, pages
14–27. Springer.
Weinzaepfel, P., Revaud, J., Harchaoui, Z., and Schmid, C.
(2013). Deepflow: Large displacement optical flow
with deep matching. In Proceedings of the IEEE
International Conference on Computer Vision, pages
1385–1392.
Yu, F. and Koltun, V. (2015). Multi-scale context ag-
gregation by dilated convolutions. arXiv preprint
arXiv:1511.07122.