Independently Moving Object Trajectories from Sequential Hierarchical Ransac

Mikael Persson^a and Per-Erik Forssén^b
Department of Electrical Engineering, Linköping University, Sweden
Keywords: Robot Navigation, Moving Object Trajectory Estimation, Visual Odometry, SLAM.
Abstract:
Safe robot navigation in a dynamic environment requires the trajectories of each independently moving object (IMO). We present the novel and effective system Sequential Hierarchical Ransac Estimation (Shire) designed for this purpose. The system uses a stereo camera stream to find the objects and trajectories in real time. Shire detects moving objects using geometric consistency and finds their trajectories using bundle adjustment. Relying on geometric consistency allows the system to handle objects regardless of semantic class, unlike approaches based on semantic segmentation. Most Visual Odometry (VO) systems are inherently limited to a single motion by the choice of tracker. This limitation allows for efficient and robust ego-motion estimation in real time, but precludes tracking the multiple motions sought. Shire instead uses a generic tracker and achieves accurate VO and IMO estimates using track analysis. This removes the restriction to a single motion while retaining the real-time performance required for live navigation. We evaluate the system by bounding box intersection over union and ID persistence on a public dataset, collected from an autonomous test vehicle driving in real traffic. We also show the velocities of estimated IMOs. We investigate variations of the system that provide trade-offs between accuracy, performance and limitations.
1 INTRODUCTION
Navigation in an environment with uncoordinated, Independently Moving Objects (IMOs) is a challenging problem. A key component of the biological solution to this problem is stereo vision. Humans robustly identify moving objects. We do this for objects in categories never before seen, and use what is seen over time to predict where objects are moving. It should therefore be possible to find, estimate and predict moving objects in a manner useful for navigation using a stereo camera.
In this work we focus on the real-time detection
and trajectory estimation of moving objects in stereo
video. In the image space this gives us instance
segmentation for rigid independently moving objects
as shown in Figure 1. We seek the trajectories in
6D pose-space over time in order to be able to plan
around them. While a robot would benefit from a
multitude of sensors such as lidar and IMU, we limit
ourselves to a stereo sensor. This reduces the com-
plexity of the system and the number of dependencies.
^a https://orcid.org/0000-0002-5931-9396
^b https://orcid.org/0000-0002-5698-5983
Figure 1: Shire result. World tracks in red, IMOs marked by
bounding boxes along with the tracks that are considered as
belonging to the objects. Their full 6D trajectories in pose
space are estimated in each frame. Shire does not use deep
networks or semantic segmentation.
We further limit ourselves to rigid objects, such as ve-
hicles, as they are of key importance to autonomous
driving. The principle however generalizes to piece-
wise rigid objects and temporarily rigid objects. This
covers many real obstacles, as can be seen in Figure 4.
We present a novel, fast, accurate, and robust so-
lution for real stereo sequences. The solution is based
on a causal greedy probabilistic approach to the as-
signment and cluster count problems. The solution
exploits the sequential nature of the track data and the
typical distribution of moving objects in real scenes.
The proposed system extends the feature tracking type of visual odometry system (Klein and Murray, 2007; Badino et al., 2013; Mur-Artal and Tardós, 2017; Persson et al., 2015; Cvišić et al., 2017) to
detect and estimate multiple trajectories. The imple-
mentation is based on CV4X (Persson et al., 2015)
which achieved state of the art results on the KITTI
egomotion benchmark and can thus provide reliable
egomotion. We now extend this system to also es-
timate trajectories of independently moving objects.
The proposed system uses geometric consistency over
time to detect objects. This takes the form of repro-
jection error minimization to assign tracks and esti-
mate object states over time. As the world is also de-
tected as a rigidly moving object, a useful by-product
of the system is the ego-motion trajectory. We call the
system described in Section 3 Sequential Hierarchi-
cal Ransac Estimation (Shire). Shire performs well
in practice, in particular for nearby and fast moving
objects. These objects are exactly those that are im-
portant for dynamic obstacle avoidance. Shire is real-
time (30 Hz) on a standard desktop CPU.
The system is evaluated using a novel dataset,
which we make available (Persson, 2020). Ideally
we would have liked to evaluate our method by com-
paring the trajectories of the estimated IMOs to their
ground truth. To the best of our knowledge no suitable
dataset is available for this, nor can we generate such
ground truth for our dataset. The dataset is collected
from our experimental vehicle, when driving in real
traffic. It covers inner city, country road, and highway.
We evaluate using bounding box instance segmenta-
tion of moving objects. This acts as a proxy for de-
tection, ID persistence, and estimation. This dataset
has been preprocessed and we provide rectified im-
ages, estimated disparity and semantic segmentation
as well as manually annotated IMOs. At the dataset
link, you can also find evaluation software, the Shire
code, and an example video.
2 RELATED WORK
IMO detection and trajectory estimation could be ap-
proached by scene-flow methods. Scene-flow is the
observed 3D motion per pixel. These methods im-
plicitly segment the flow into rigid objects. Scene-
flow approaches can be categorized as classic or deep
learning based methods.
Classic methods are well represented by Piece-
wise Rigid Scene Flow (Vogel and Roth, 2015). The
method uses classic flow, classic stereo and (clas-
sic) superpixel-segmentation. These are used to form
a regularized sceneflow cost function. This cost is
then optimized over a single stereo image pair using
Gauss-Newton (GN). The recent Deep Rigid Scene-
flow (Ma et al., 2019) method is similar in many ways.
The method uses a flow network, a stereo network,
and a semantic-segmentation network. The networks
are used to form a similar cost function. The cost
is then optimized over a single stereo image pair us-
ing GN, unrolled for differentiability. Replacing each
component with its deep learning based variants re-
quires supervised training. This implies the need for
large datasets with both semantic and 3D correspon-
dences ground truth. The cost of such is the main dis-
advantage of the modern method. However, compar-
ison on the KITTI sceneflow benchmark shows that
the latter method significantly outperforms the for-
mer. However, while the deep learning approach is
faster at 746ms compared to 3 minutes on 0.5Mpixel
images, both methods are far from real time. Both
methods could hypothetically be extended to use mul-
tiple images. However, it is unclear how to do so with-
out further increasing the computational cost. The
methods are potentially useful as input to the pro-
posed system. We conclude that they are currently
too slow for our target application.
Another deep learning approach which targets a
similar problem is MOTSFusion (Luiten et al., 2019).
This method aims to identify and separate cars, both
still and moving. The method uses deep optical
flow, bounding box detections, semantic segmenta-
tion, and deep stereo. The method also performs egomotion compensation using ORBSLAM. Next, a per-track geometrically aware bootstrap tracking method is used to associate tracks over time. This results in strong ID propagation
performance for the object detections. The method
also achieves good results on the MOTS benchmark.
The purpose here is tracking rather than 6D trajectory estimation, though the method could be extended or used as input. Similar to the deep rigid sceneflow, the issues are with general moving objects and computational cost. MOTSFusion operating at
0.5MPixel takes 440ms per frame, and while this may
be applicable in some cases, it is a long time in colli-
sion avoidance.
The good performance of the deep learning ap-
proaches comes at a price. Relying on bounding box
and/or segmentation limits the systems to the classes
for which data is available. For new cameras or scene
content, the deep learning methods require finetuning. Even if only to adapt to minor differences in resolution, scene and image characteristics, this requires ground truth data we do not have. By contrast, the classic methods typically
only require manual tuning of a few parameters.
A different approach to the problem of IMO state
estimation is found in the 6DVision system (Rabe,
2012). The system fuses stereo and KLT tracks over
time using a set of Extended Kalman filters over 3D
position (with a constant velocity state). Each filter set is initialized using nearby feature states. The filters of each filter set have varying state priors, measurement noise and process noise. The main advantage of this method is its low computational cost. The
main downside is that individual tracks are noisy and
that the filter sets take a long time to converge. In
practice, the state estimates are strongly smoothed.
This leads to decent results for long tracks, but the
system often fails to account for rapid changes.
In this work we introduce a new driving dataset
for the purposes of evaluating the system. Compared
to KITTI our dataset has higher resolution (2Mpixel
vs 0.5Mpixel) and higher framerate (30fps vs 10fps).
Using our own dataset allows us to ensure highly ac-
curate calibration which is required for our method.
We cannot evaluate on the sceneflow benchmarks
such as KITTI as it requires dense sceneflow, which
our method does not provide. Nor can we meaning-
fully submit to the KITTI (Geiger et al., 2012) MOTS
benchmark. This is because the MOTS benchmark
does not distinguish moving objects from stationary
ones. In practice the latter overwhelmingly outnum-
ber the former. We have, however, run our system on these datasets. See Section 4.3 for results and further discussion.
Real time stereo VO systems lie on a scale from
direct to feature based. Direct systems estimate ge-
ometry directly from pixel intensities and pixelwise
depths. Feature based systems use higher level im-
age features such as patches and descriptors. The
association process of direct systems relies heavily
on assumed geometry. This is because pixel inten-
sities are ambiguous, and the geometry constrains the
problem. Relying on predicted geometry in this way
is a form of geometrically supported tracking. The
simplest case of which would be to search for cor-
respondences along a predicted epipolar line. The
DSO systems of (Engel et al., 2018) and the stereo
variant by (Wang et al., 2017) are typical direct VO
systems. A major advantage of single geometry sup-
ported tracking is that it allows the use of line fea-
tures. This improves the VO systems robustness to
inarticulate environments and simplifies the tracking.
The downside is that it fundamentally limits this type
of system. While DSO could hypothetically be re-
cursively applied in a similar manner to our system,
in practice we have found that initializing the DSO
requires a single dominant motion in view which pre-
cludes this.
Shire is an extension and generalization of the
well studied point feature type visual odometry sys-
tem class. The basis for our method is found in
systems such as PTAM (Klein and Murray, 2007),
MFI (Badino et al., 2013), ORBSLAM (Mur-Artal and Tardós, 2017), Cv4x (Persson et al., 2015) and SOFT (Cvišić et al., 2017). These methods emphasize
the use of inlier/outlier identification and windowed
bundle adjustment. They also use measurement virtualization, keyframing, and non-causal processing. Finally, most point feature based systems use geometrically supported tracking. Rather than being the core of the tracker, its focus is often on improved robustness against structural aliasing. But it is also used to extend tracks, which in turn reduces trajectory drift.
A core assumption of these methods is that they are
limited to a single rigid object in view. In turn this
means that the tracking can and should detectably fail
for everything else. Due to the features used, feature
based systems are inherently robust against the ba-
sic aperture problem. This means that the integration
with the tracker is less restrictive. For all VO sys-
tems, geometrically supported tracking is key to good
performance. How limiting this is varies however.
Out of the systems considered, the bootstrap tracking by matching of Cv4x is perhaps the least restrictive. Cv4x essentially iterates selecting the best correspondence and finding the pose. This places no restrictions on the initial search. Without the hard reliance on the single motion assumption, it is suited for our purpose. We strip down these systems into a sim-
ple core suited for fast causal local mapping and gen-
eralize this core system to support IMOs. We also in-
vestigate how to apply geometrically supported track-
ing to Shire with Shire-retrack.
3 METHOD
3.1 Visual Odometry Overview
Our system is an extension of the feature tracking
type of visual odometry system (Klein and Murray,
2007; Badino et al., 2013; Mur-Artal and Tardós, 2017; Persson et al., 2015; Cvišić et al., 2017). Tracks
are key to such systems and can be defined as the
sequence of image coordinates of a repeatably lo-
calizable feature (landmark). An example would be
the projections of a 3D point with distinct appear-
ance. Landmark tracking visual odometry systems
exploit that the tracks lie on a single static rigid ob-
ject (world). This facilitates identifying outliers and
reducing tracking errors.
First, landmarks are detected and then tracked in later frames. A cost corresponding to Eqn 1 is then minimized. This computes the egomotion trajectory {P_t} and the local 3D point cloud {x_i}.
$$\min_{P_t,\, x_i} \sum_{t,i} \rho\left(y_{t,i} - \Theta_\theta(P_t\, x_i)\right) \qquad (1)$$

where ρ is a robust error function, y_{t,i} is the observation of landmark x_i at time t, P_t is the camera pose, i.e. the rigid transform, at time t, for t ∈ {t_0, t_1, ..., t_N}, and Θ_θ is the projection operator suited for x_i, where θ is the intrinsic parameters of the camera.
In the simplest case, x ∈ R³ is a 3D point in world coordinates. The observation noise is Gaussian plus outliers, the cameras are calibrated, and the correspondences are given.
Typically, ρ is the L2 loss combined with an outlier rejection scheme based on reprojection error. The orientation part of the pose is represented as a unit quaternion, primarily to avoid the gimbal lock problem of Euler angles.
The cost in Eqn 1 is non-convex, but optimization will generally converge if the data is balanced and well initialized. A good way to initialize is by sequentially growing the cost, i.e. adding one image at a time. This is further improved by computing PnP to find the initial pose of each new frame before bundle adjustment.
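To make the structure of Eqn 1 concrete, the following is a minimal Python sketch of such a robustified cost, assuming pinhole intrinsics K and axis-angle pose parameters for brevity (the paper represents orientations as unit quaternions). All names are illustrative and not taken from the Shire source code.

import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, n_poses, n_points, observations, K):
    """Stack reprojection residuals over all (t, i) pairs.

    observations: list of (t, i, y) with y the 2D measurement of point i
    in frame t. params holds n_poses * 6 pose values (rotation vector
    plus translation) followed by n_points * 3 point coordinates.
    """
    poses = params[:n_poses * 6].reshape(n_poses, 6)
    points = params[n_poses * 6:].reshape(n_points, 3)
    res = []
    for t, i, y in observations:
        R = Rotation.from_rotvec(poses[t, :3]).as_matrix()
        x_cam = R @ points[i] + poses[t, 3:]        # P_t x_i
        proj = (K @ x_cam)[:2] / x_cam[2]           # Theta(P_t x_i)
        res.append(y - proj)
    return np.concatenate(res)

# least_squares with loss="huber" plays the role of the robust error rho.
# A real system would grow the problem one image at a time and seed each
# new pose with PnP before optimizing:
# sol = least_squares(residuals, x0, loss="huber",
#                     args=(n_poses, n_points, observations, K))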
3.2 Stereo Point Feature Visual
Odometry
Shire is best understood as an extension to point fea-
ture visual odometry, and for that reason we will first
introduce a base method.
Given a sequence of rectified, calibrated stereo
images, iterating Algorithm 1 provides a basic, but
good, ego-motion trajectory estimate as indicated by
(Persson et al., 2015; Cvišić et al., 2017; Mur-Artal and Tardós, 2017).
Specifically we use: the SGM from (Hirschmüller, 2008) to compute disparity, the KLT implementation from (Rabe, 2012), the P3P and PnP methods from (Persson and Nordberg, 2018), and ANMS from (Gauglitz et al., 2011).
The stereo reprojection error ε in Eqn 2 for each track includes the disparity d(y) ≥ 0. It also requires the points to be in front of each observing camera.

$$\varepsilon(y, d, P, x) = \left(y - \Theta_l(Px)\right)^2 + \lambda\left(d - \left[\Theta_l(Px) - \Theta_r(P_{rl}\,Px)\right]_0\right)^2 \qquad (2)$$

where Θ_l, Θ_r are the camera intrinsics, and P_rl is the relative pose of the stereo rig. Subscript 0 denotes using the horizontal coordinate only.
Algorithm 1: Stereo Point Feature Visual Odometry.
1: Let: Left, right images at time t: LR_t
2: Let: Tracks of object o at time t: tracks_{o,t}
3: Let: Pose at time t: Pose_t
4: # Compute disparity
5: depth_t = SGM(LR_t)
6: # Track features using KLT
7: tracks_t ← tracker(LR_t, tracks_{t−1}, depth_t(LR_t))
8: # Compute pose: Ransac P3P and refinement
9: P_t ← robust PnP(tracks_t)
10: windowed bundle adjustment(P_:, x_:)
11: tracks_t ← inliers(tracks_t : ε(track_t) < τ_inlier)
12: # Detect new features
13: tracks_t ← detector(LR_t, depth_t)
14: tracks_t ← ANMS(tracks_t)
The parameter λ = 0.5 is a balancing weight cor-
recting for the relative variance of the tracker and dis-
parity methods. We found this value by comparing
the relative errors when viewing a static scene. The
method is fairly robust to different values, however; λ ∈ [0.1, 2] provides similar results if the other thresholds are adjusted accordingly.
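As a concrete reference, the following is a direct transcription of Eqn 2 into Python, assuming theta_l and theta_r are callables that project camera-frame 3D points to pixels for the left and right cameras. The names and interface are illustrative, not the actual Shire implementation.

import numpy as np

LAMBDA = 0.5  # balancing weight between tracker and disparity variance

def stereo_reprojection_error(y, d, P, x, theta_l, theta_r, P_rl):
    """Squared stereo reprojection error for one observation (Eqn 2).

    y: observed 2D feature position, d: observed disparity (d >= 0),
    P: 4x4 camera-from-object pose, x: homogeneous 3D point in object
    coordinates, P_rl: right-from-left relative pose of the stereo rig.
    """
    x_l = P @ x                    # point in the left camera frame
    x_r = P_rl @ x_l               # point in the right camera frame
    assert x_l[2] > 0 and x_r[2] > 0, "point must be in front of both cameras"
    pred_l = theta_l(x_l)
    pred_r = theta_r(x_r)
    pred_disp = pred_l[0] - pred_r[0]   # subscript 0: horizontal coordinate
    return np.sum((y - pred_l) ** 2) + LAMBDA * (d - pred_disp) ** 2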
We minimize the squared L2 norm of the stereo reprojection error, as it achieved good results and performance:

$$\min_{P_t,\, x_i} \sum_{t,i} \varepsilon\left(y_{t,i}, d_{t,i}, P_t, x_i\right) \qquad (3)$$
We have found that the quality and robustness of the Algorithm 1 method are bounded by the tracker and disparity estimates (assuming pixel accurate calibration and typical scenes).
Algorithm 1 is not limited by geometrically sup-
ported tracking. Therefore we can extend it to support
multiple independently moving objects.
Many different tracking, disparity or optical flow
methods could be used, but this is not the focus of this
work. Instead we use real time implementations of
standard methods. It is common to use keyframing to
facilitate loop closure or reduce computational cost,
but for clarity this is omitted here.
Adaptive Non-Maxima Suppression (ANMS) is
often overlooked in VO descriptions (Gauglitz et al.,
2011). We explicitly include ANMS as it is key to
good performance both computationally and in terms
of accuracy. ANMS conditions the optimization prob-
lem, as nearby observations will have correlated er-
rors. The importance of ANMS also indicates that
there is a limit to the accuracy gain possible by in-
creasing the track density. In other words, observa-
tions must cover a minimum amount of image for
accurate pose estimation, regardless of track density.
This applies to both Shire and Algorithm 1. Typical ANMS parameters are: radius = 1% of image width, 100 minimum tracks, and a strength score which is (track age, cornerness score).
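For illustration, a minimal greedy ANMS sketch follows, assuming features are pre-sorted by strength (track age, then cornerness, as above). This is a hypothetical helper, not the (Gauglitz et al., 2011) implementation the system uses.

import numpy as np

def anms(positions, order, radius, min_count=100):
    """Keep only the strongest feature within each suppression radius.

    positions: (N, 2) pixel coordinates, order: indices sorted strongest
    first, radius: suppression radius in pixels (~1% of image width).
    """
    kept = []
    for idx in order:
        p = positions[idx]
        if all(np.linalg.norm(p - positions[k]) >= radius for k in kept):
            kept.append(idx)
    # If suppression removed too much, a real system would shrink the
    # radius and retry until at least min_count features survive.
    return kept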
We use the high performance sparsity aware Ceres
Solver (Agarwal et al., 2020) to optimize the cost.
3.3 Extension to IMOs
We view reprojection error based inlier/outlier classification as a world/not-world assignment operator. Thus we extend this notion to the simultaneous assignment and optimization of multiple rigid objects. A simple idea, which nevertheless results in a challenging optimization problem:
$$\min_{P_{c,t},\, x_{c,i},\, c_{is}} \sum_{t,\, i \in c_{is}} \rho\left(y_{t,i}, d, P_{c,t}, x_{c,i}\right) + C(c_{is}) \qquad (4)$$

Here ρ is a robust and weighted stereo reprojection error, similar to Eqn 2. The pose P_{c,t} is that of the camera in IMO c's coordinate system at time t, and x_{c,i} is the 3D point coordinates in IMO c's coordinates. The set of features belonging to object c is c_{is}. Finally, the clustering penalty C(c_{is}) accounts for knowledge of the distribution of objects, without which a global minimum is trivially found by putting every point in its own object.
From an abstract perspective, the problem is one of association, clustering and minimization. The clustering space is non-metric due to the use of reprojection errors as distance to cluster. This violates the assumptions of many greedy solvers such as k-means. Further complicating things, the number of clusters and their distribution is unknown. In short, the problem is challenging.
Restricting ourselves to rigid objects, there is no
fundamental difference between the estimation of the
IMO trajectory with respect to the static world, and
the estimation of the IMO trajectory with respect to
the moving rigid object. Thus if the cluster associa-
tion of each track is known, the stereo VO method of
subsection 3.2 can be applied to each IMO indepen-
dently.
Similar to the static world inlier/outlier case we
can assign tracks to IMOs using a noise model,
and when there is sufficient evidence, initialize new
IMOs. The key observation on which the system is
built, is that while tracker errors leave many track as-
signments ambiguous, an often sufficient subset can
be unambiguously assigned in each frame. Thus by
accumulating information over time we causally de-
tect, assign and estimate in real time. In essence we have three possible states for a track: inlier, ambiguous, and outlier.
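A minimal sketch of this three-state decision follows, assuming scores are maximum reprojection errors per (object, track) pair and using the thresholds given in Section 3.4.1; function and variable names are illustrative.

TAU_INLIER, TAU_OUTLIER = 2.5, 4.0  # pixels

def track_state(scores):
    """Classify one track given its score against every current object.

    scores: dict object_id -> max reprojection error over the window.
    Returns ("inlier", obj), ("ambiguous", None) or ("outlier", None).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    best_obj, best = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else float("inf")
    if best >= TAU_INLIER:
        return "outlier", None      # fits no object: candidate for a new IMO
    if second > TAU_OUTLIER:
        return "inlier", best_obj   # unambiguous: assign to the best object
    return "ambiguous", None        # keep unassigned, revisit in later frames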
3.4 Shire
Given a sequence of rectified, calibrated stereo im-
ages, Shire iterates the steps in Algorithm 2. This
algorithm segments the scene into rigid IMOs and es-
timates their full 6D pose trajectories. The variable
objects consists of both the object pose trajectory and
the 3D coordinates of landmarks associated with the
object.
Algorithm 2: Shire.
1: Let: Objects at time t: objects_t
2: Let: Tracks of object o at time t: tracks_{o,t}
3: Let: Reprojection error of track tr for object o given optimized track state: ε_{o,tr}
4: score(o, tr) = max_t(ε_{o,t}(tr))
5: inlier(o, tr) = score(o, tr) < τ_inlier
6: found(o) = #tracks_{o,t} > IMO minimum tracks
7: Track, depth, detect (as per 3.2)
8: Unassigned: us_t ← us_{t−1} ∪ new tracks
9: for each o ∈ objects_{t−1} do
10:   poses_{o,t} ← PnP(objects_{0:t−1}, tracks_{o,t})
11:   tracks_{o,t} ← tracks_{o,t−1} in inlier(o, tracks_{o,t−1})
12:   us_t ← tracks_{o,t−1} not in inlier(o, tracks_{o,t−1})
13:   objects_t ← found(o)
14:   windowed bundle adjustment(o)
15:   os = not inlier(o, tracks_{o,t})
16:   if o ≠ world then
17:     Split Tracks: us_t ← lasttwo(os)
18:   end if
19: end for
20: Let: cs – Potential IMO candidates
21: for each tr ∈ us_t do
22:   rs ← sort(score(o, tr) ∀ o ∈ objects_t)
23:   if rs[0] ≥ τ_inlier then cs_t ← tr
24:   else if rs[1] > τ_outlier then tracks_{o,t} ← tr
25:   end if
26: end for
27:
28: Initialize New Object(cs_t)
29: cs_t.x ← disparity_{t−1}(cs_t)
30: candidate: c ← PNP(cs_t)
31: inliers: ins ← cs_t s.t. rpe_{c,cs_t} < τ_inlier
32: if #inliers > τ_creation count then
33:   Optimize candidate trajectory and points
34:   objects_t ← candidate
35: end if
Additional details in the source code.
3.4.1 Assignment
The track score for a trajectory is the maximum of
the reprojection errors among the observations given
the optimal state. The optimal state is the state which minimizes the squared sum of reprojection errors for that trajectory over the optimization window. Computing track scores for all trajectories is comparatively costly, thus this is only performed for tracks which are unassigned. The inlier test requires that the reprojection error satisfies the inlier threshold, and is not ambiguous with respect to the second best possible assignment. Additionally, for IMOs, it is required that the point is within three meters of the object center. For an IMO we require a minimum track count of τ_creation count = 20. For the inlier and outlier thresholds we set τ_inlier = 2.5 pixels and τ_outlier = 4 pixels. The thresholds were found by looking at error rates of tracks with known assignment. They coarsely correspond to a probability of P(observation|from object) = 0.8 and P(observation|from object) = 0.05, respectively.
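A compact sketch of the score and inlier test described above; the window maximum, the thresholds and the three-meter check follow the text, while the function names and the flat argument interface are illustrative.

def track_score(window_errors):
    """Track score: the maximum reprojection error over the window."""
    return max(window_errors)

def passes_inlier_test(score, second_best_score, dist_to_center, is_world,
                       tau_inlier=2.5, tau_outlier=4.0, max_dist=3.0):
    """Inlier threshold plus ambiguity check; the distance check is IMO-only."""
    if score >= tau_inlier or second_best_score <= tau_outlier:
        return False               # error too large, or assignment ambiguous
    return is_world or dist_to_center < max_dist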
3.4.2 Detection
The Ransac PnP solver is likely to find IMOs in size order, beginning with the first object, which is considered the world. Visualization aside, the world is generally treated the same as the IMOs, with two exceptions: Outliers which successfully pass the inlier check for the world are assigned to it regardless of ambiguity and regardless of distance to its mass center. This prevents the formation of competing static objects should an IMO come to a full stop.
Assignment to IMOs is only considered if the IMO's median world coordinate mass center is within five meters of the point's world coordinates in the latest frame.
Initialization of a new object is performed as for the first two frames in stereo VO, with its origin set to identity at the timestamp of the previous frame. After initialization, any tracks of an object that do not satisfy the assignment test to that object are unassigned. Objects with fewer than ten remaining tracks are discarded, and their tracks are split and unassigned.
IMOs typically cover a small part of the image, limiting the number of tracks available. This in turn necessitates denser tracking to meet the minimum robust track count for detection and estimation. To account for this we use a small ANMS radius of 12 pixels. Experimentally we have found that this is a good trade-off between speed and robustness. Increasing the radius increases computational speed, but impacts the detection of smaller objects if increased above 20 pixels. Lowering the radius increases computational cost without an observed associated gain in performance.
The system initializes at most one new object each
frame in order to limit max per frame processing time.
Should performance permit, it is straightforward to
initialize multiple objects by iterating this step.
Ransac PnP with nonlinear refinement is used because it is robust against the high outlier ratios that occur when searching among candidates formed from several not yet identified IMOs.
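The candidate initialization can be pictured as the following hypothesize-and-verify sketch, assuming a P3P solver in the style of (Persson and Nordberg, 2018) behind the solve_p3p callable; the names, iteration count and interfaces are illustrative, not the Shire implementation.

import random

def init_candidate(points_3d, observations, solve_p3p, reproj_err,
                   iters=200, tau_inlier=2.5, tau_create=20):
    """RANSAC over unassigned tracks; robust to the high outlier ratios
    that occur when several not-yet-detected IMOs share the candidate set."""
    best_pose, best_inliers = None, []
    n = len(points_3d)
    for _ in range(iters):
        sample = random.sample(range(n), 3)
        # solve_p3p returns up to four pose hypotheses for a minimal sample
        for pose in solve_p3p([points_3d[i] for i in sample],
                              [observations[i] for i in sample]):
            inliers = [i for i in range(n)
                       if reproj_err(pose, points_3d[i], observations[i])
                       < tau_inlier]
            if len(inliers) > len(best_inliers):
                best_pose, best_inliers = pose, inliers
    if len(best_inliers) > tau_create:   # enough support: create a new IMO
        return best_pose, best_inliers
    return None, []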
3.5 Computational Speed
The tracking and disparity can be performed in advance, in parallel, on CUDA, taking 30 ms for 2Mpixel images.
In order to speed up optimization, bundle adjust-
ment of trajectories is performed over a maximum of
500 tracks. These tracks are found by increasing ra-
dius ANMS. The remainder of the tracks are subse-
quently optimized using fixed poses. To further im-
prove performance, all optimizations and error com-
putations use a window size of five.
The baseline version of Shire operates at 30fps
with two frames of latency between input and trajec-
tory estimates. Shire+retrack requires an additional 12 ms.
3.6 Semantic Segmentation
While we generally argue in favor of the generic geo-
metric approach, in the specific case of vehicles, fast
semantic segmentation networks are available. There-
fore we investigate how the method can be improved
by semantic labels. In the variant of our method called
Shire-labels we use the FCN (Cordts, 2017; Long
et al., 2015). The segmentation network was trained
on Cityscapes (Cordts et al., 2016).
In Shire-labels, only tracks that lie on vehicles can be assigned to IMOs. The semantic segmentation was performed offline in batches, but it may be feasible to run it in real time online. The evaluation shows a minor improvement to intersection over union (IoU) scores. Qualitatively we see a significant reduction in false positives.
4 EXPERIMENTS
We investigate four variations of the Shire sys-
tem. Shire is the baseline and the fastest variant.
Shire+labels uses semantic segmentation to limit the
search for candidate tracks. Shire+labels only consid-
ers tracks where the starting pixel is labeled as a car
or truck as IMO candidates. Shire+retrack uses a sec-
ond tracking pass to refine the track associations for
lost tracks. The second tracking pass uses predictions
based on the estimated world and object motions. The
new track result is kept if it is consistent with its prediction. This retracking step comes at the cost of an additional tracking pass. However, unlike the base tracking pass, it only uses the bottom pyramid level and is slightly faster. Shire+retrack+labels uses both labels and the retrack step.
4.1 Dataset and Evaluation
The lack of benchmarks for the identification of mov-
ing objects makes evaluation of IMO estimation a
challenge. Object motion 6D trajectories are gener-
ally difficult to acquire for real sequences. The avail-
able benchmarks are focused on bounding box seg-
mentation of semantic classes. This makes them ill
suited for evaluating the quality of a system which
needs motion, but not semantic class. We propose to
evaluate the system by bounding box IoU and num-
ber of ID switches. Specifically we are interested in
the results on the 2Mpixel stereo dataset we collected
from a Daimler test vehicle. We annotate 7k images
for quantitative evaluation.
We performed manual bounding box instance seg-
mentation for IMOs. The criteria for an object to be
annotated as an IMO were as follows: The object
is a car or truck. The object boundary must move
more than two pixels from frame to frame. This is af-
ter manual egomotion compensation i.e. compared to
known static background. The object must be bigger
than 5% of the image and closer than 70 meters. The
distance was determined using manually verified dis-
parity. Further, the object must be in unoccluded view
for at least five sequential frames. Objects that come
to a complete stop for more than five frames are no
longer considered IMOs. Unoccluded view requires
that 70% or more of the object must be visible. Any
IMO which fully leaves view is considered as having
a new IMO ID. Not all IMOs are annotated.
This IMO selection is conservative in the sense
that the system can detect smaller, shorter lived and
occluded IMOs. The IMO requirements are made
in part because the annotation of such IMOs is of-
ten ambiguous. Their prevalence in the dataset would
also make annotation too costly. This means that false
positives cannot be automatically evaluated. As a re-
sult the primary scores of interest are intersection over
union and false negatives.
Bounding boxes for each proposed object are computed in each frame as the range of the associated track positions. The proposals are then associated to the ground truth boxes by IoU order in a one-to-one greedy fashion. The greedy assignment is in descending GT bounding box area order, with the added constraint of a minimum IoU score of > 0.2.
Table 1: Per frame computational cost. Note that this is only the cost of the Shire algorithm. The first tracking, disparity and semantic segmentation are excluded, as these can be performed in parallel one frame in advance. However, the second tracking pass required by the retrack method is included in the timing, as this cannot be performed in parallel.

Method            median time   mean time
shire             27 ms         30 ms
+labels           26 ms         29 ms
+retrack          43 ms         45 ms
+retrack+labels   40 ms         42 ms
Table 2: Number of ID switches and number of false negatives. The total number of annotation bounding boxes is 9380, over 7000 images.

Method            switches   missed
shire             4          1430
+labels           4          1730
+retrack          3          1505
+retrack+labels   11         902
The result is summarized in Figure 2.
We compute ID switches as the number of times the IMO ID of the top IoU choice for a GT IMO changes. An object must be found at least once to be included in the computation. If an object is lost and then found again any number of frames later, each repetition counts as one ID switch. The evaluation software implementation is available.
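For reference, a minimal sketch of this greedy one-to-one matching, assuming axis-aligned boxes as (x0, y0, x1, y1) tuples; the released evaluation software is the authoritative implementation.

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def match_frame(gt_boxes, proposals, min_iou=0.2):
    """Greedily match proposals to GT in descending GT-area order."""
    order = sorted(range(len(gt_boxes)),
                   key=lambda g: -(gt_boxes[g][2] - gt_boxes[g][0])
                                 * (gt_boxes[g][3] - gt_boxes[g][1]))
    taken, matches = set(), {}
    for g in order:
        scored = sorted(((iou(gt_boxes[g], p), j)
                         for j, p in enumerate(proposals) if j not in taken),
                        reverse=True)
        if scored and scored[0][0] > min_iou:
            matches[g] = scored[0][1]   # best still-free proposal for this GT
            taken.add(scored[0][1])
    return matches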
4.2 Results
Table 1 shows that our method performs in real time on an i7 3 GHz CPU. Table 2 shows that ID switches are infrequent, but also indicates that more ID switches happen for harder to detect IMOs.
Figure 2 shows excellent performance for the combined variant. It also shows that the labeled variant is better at finding object boundaries. Without label information, slightly more IMOs are found at low IoU scores. Most of these are false positives that happen to overlap with a GT object. Overall performance is fairly good for detected IMOs. Unfortunately, a significant fraction are not detected. IMO shadows play a role in reducing IoU accuracy, as shown in Figure 5.
The experiments show that the purely geometric
Shire achieves moderate success in the generic set-
ting. This result is significantly improved when com-
bined with the segmentation labels, though this is hid-
den by partial overlaps with false positives at lower
IoU scores. This result is further improved by retrack-
ing, though less so in the purely geometric case.
The main difference between using segmentation and not lies in the effect of shadows and outliers
Figure 2: Intersection over Union (IoU) for the best one-to-one match for each GT IMO. We see that using the label information (red) provides good results when a high overlap is required, but misses slightly more objects (≈30%). Bootstrap (blue) provides a direct improvement over Shire (yellow). Combined, they achieve a significant improvement (purple). Note that not all IMOs are detected in every frame. Roughly 16-30% are not found, most of which lie at far distances at the beginning or end of a GT IMO.
Figure 3: Shire results. On the left, top to bottom: first frame where the cyan car is found; last frame before it is lost; first frame where the blue car is found; last frame before it is lost. On the right: the two cars' mass centers in world coordinates over time, shown as dots, and the vehicle egomotion poses, shown as blue/red coordinate systems. Egomotion is from the bottom left to the upper right; IMOs are driving in the opposite direction. Best viewed electronically.
on bounding boxes. There is also a small impact on
speed and an increase in the number of false positives.
A simple heuristic was used to reduce the impact
of outliers that fit the geometry. Tracks used for find-
ing the bounding box must be tracked for at least three
frames. This heuristic applies only if at least half of
Figure 4: Shire result. A pedestrian IMO detected by the system. While violating the rigid object assumption, people, and most other non-rigid IMOs, are often locally rigid. At 100x200 pixels the pedestrian is too small to be fractured into several IMOs; instead it is intermittently lost and re-detected. The ability to detect generic IMOs is an advantage of the purely geometric method.
Figure 5: Shire results. Red tracks are world; cyan, yellow and pink are IMOs. The track on the cyan car's shadow causes the IMO bounding box to be significantly extended for the car. This happens because the sun is distant, and the car moves in the plane onto which the shadow is projected. This is a downside of the purely geometric method, as we do not collide with shadows.
the tracks on the object have been tracked for three
frames. We note that better heuristics could be de-
vised. In particular an analysis of the plane of the
motion could be used to identify tracks on shadows.
However this is left as future work.
We observe that the system excels at detecting and segmenting fast or nearby objects, but struggles with distant ones. This expected result is nevertheless beneficial in terms of which objects are important for safe navigation.
Like most feature tracking VO systems, the method is bounded in performance by the tracker. The tracker struggles if the object track lengths approach or exceed the maximum supported distance. This is partly due to the direct search distance limitation, and partly due to large appearance changes caused by perspective. The feature detector also struggles with low contrast or blurry images, such as those caused by dark regions in otherwise sunny scenes, and rain.
The system provides decent ID propagation for IMOs. The primary error source is when two
Figure 6: Starting bottom left, the RGB coordinate axis
shows the car trajectory. The colored dots show the various
IMOs encountered and their estimated trajectories. Some
are driving in the lane to the left, others along the same lane.
Best viewed electronically.
IMOs cover separate parts of, and therefore compete for, a single GT object. More concerning is that the system fails to detect a fair number of the annotated objects. There are two main causes for false negatives. The first is an insufficient deviation from the world geometry tracking. This happens for objects that are too far away, especially if they are driving away from the observer. A single car driving away in the same lane at a distance above 35 m accounts for nearly a third of all missed detections. The second is that the
tracker or stereo has failed, which happens for objects
that are moving or warping too fast in the image. This
is a fundamental failure of the tracker and could be
addressed by better tracking.
4.3 MOTS and KITTI Results
Shire sample results on the MOTS and KITTI se-
quences are shown in Figure 7. The Shire parameters
are based on tracker and disparity system properties. Therefore we expect them to work for the MOTS and KITTI datasets as well as for ours, and thus the same parameters are used throughout.
Qualitatively, the performance is slightly worse compared to results on our own dataset, primarily in terms of the number of false positives. This may be partially explained by the frame-rate difference, as the
Figure 7: Top image shows performance on the KITTI
benchmark, bottom image shows performance on the
MOTS benchmark. Note that the color image from MOTS
was converted to grey scale for illustration purposes.
expected parallax increases, though the differences in sensor characteristics, baseline and resolution may be contributing. We also note that performance is subpar compared to the MOTSFusion system, though our system is more than ten times faster. The performance on MOTS varies significantly between sequences, but not with scene complexity as might be expected. A possible explanation is that the calibration is slightly worse for some sequences. A calibration error of 1-3 pixels instead of 0-1 pixels would be sufficient to cause this, as Shire is sensitive to calibration errors due to the assignment mechanism. This problem does
not appear to apply to the KITTI sequences which use
the grey scale cameras instead of the RGB cameras.
Since the grey cameras were calibrated for the VO
challenge, but the RGB cameras have primarily been
used for deep learning, we would not be surprised if
they are better calibrated.
4.4 Conclusion
We have presented the stereo based Sequential Hierarchical Ransac Estimation (Shire). Shire detects and estimates, in real time, the 6D pose trajectories of multiple moving objects, as well as the egomotion, using geometric consistency alone. The system can benefit from semantic segmentation but, unlike similar methods, does not require it. Shire can be used to predict object trajectories for path planning with dynamic obstacles. It can also detect spatiotemporally separable objects, which form an interesting semantic class, or be used to generate ground truth for optical flow, including on moving objects, though we leave the training of self-supervised systems using Shire as future work.
We provide the core Shire code, and the dataset used
for evaluation, as well as annotation and evaluation
tools (Persson, 2020).
REFERENCES
Agarwal, S., Mierle, K., and Others (2020). Ceres solver.
http://ceres-solver.org.
Badino, H., Yamamoto, A., and Kanade, T. (2013). Visual
odometry by multi-frame feature integration. In First
International Workshop on Computer Vision for Au-
tonomous Driving at ICCV.
Cordts, M. (2017). Understanding Cityscapes: Efficient Urban Semantic Scene Understanding. PhD thesis, Technische Universität Darmstadt.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler,
M., Benenson, R., Franke, U., Roth, S., and Schiele,
B. (2016). The cityscapes dataset for semantic urban
scene understanding. In Proc. of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Cvišić, I., Ćesić, J., Marković, I., and Petrović, I. (2017). SOFT-SLAM: Computationally efficient stereo visual SLAM for autonomous UAVs.
Engel, J., Koltun, V., and Cremers, D. (2018). Direct sparse
odometry. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 40(3):611–625.
Gauglitz, S., Foschini, L., Turk, M., and Hollerer, T. (2011).
Efficiently selecting spatially distributed keypoints for
visual tracking. In Image Processing (ICIP), 2011
18th IEEE International Conference on, pages 1869–
1872.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? the kitti vision benchmark
suite. In Conference on Computer Vision and Pattern
Recognition (CVPR).
Hirschmüller, H. (2008). Stereo processing by semi-global matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:328–341.
Klein, G. and Murray, D. (2007). Parallel tracking and
mapping for small AR workspaces. In Proc. Sixth
IEEE and ACM International Symposium on Mixed
and Augmented Reality (ISMAR’07), Nara, Japan.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-
volutional networks for semantic segmentation. In
The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR).
Luiten, J., Fischer, T., and Leibe, B. (2019). Track to recon-
struct and reconstruct to track.
Ma, W.-C., Wang, S., Hu, R., Xiong, Y., and Urtasun, R.
(2019). Deep rigid instance scene flow.
Mur-Artal, R. and Tardós, J. D. (2017). ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262.
Persson, M. (2020). IMO dataset. https://www.cvl.isy.liu.
se/research/datasets/imo-dataset/.
Persson, M. and Nordberg, K. (2018). Lambda twist:
An accurate fast robust perspective three point (p3p)
solver. In The European Conference on Computer Vi-
sion (ECCV).
Persson, M., Piccini, T., Mester, R., and Felsberg, M.
(2015). Robust stereo visual odometry from monoc-
ular techniques. In IEEE Intelligent Vehicles Sympo-
sium.
Rabe, C. (2012). Detection of Moving Objects by Spatio-Temporal Motion Analysis: Real-time Motion Estimation for Driver Assistance Systems. Südwestdeutscher Verlag für Hochschulschriften. ISBN 978-3-8381-3219-8.
Vogel, C. and Roth, S. (2015). 3d scene flow estimation
with a piecewise rigid scene model. International
Journal of Computer Vision, 115.
Wang, R., Schwörer, M., and Cremers, D. (2017). Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras. In International Conference on Computer Vision (ICCV17).