Monocular 3D Detection and reID-enhanced Tracking of Multiple Traffic Participants

Alexander Sing (https://orcid.org/0000-0002-3340-1789), Csaba Beleznai (https://orcid.org/0000-0003-1880-2979) and Kai Göbel (https://orcid.org/0000-0001-5074-3652)

Center for Vision, Automation and Control, AIT Austrian Institute of Technology, Giefinggasse 6, 1210 Vienna, Austria
Keywords:
Monocular 3D Detection, Multi-target Tracking, Target Re-ID, KITTI Tracking Benchmark.
Abstract:
Autonomous driving is becoming a major scientific challenge and an applied domain of significant impact, also triggering a demand for enhanced safety of vulnerable road users such as cyclists and pedestrians. Recent developments in Deep Learning have demonstrated that monocular 3D pose estimation is a potential detection modality in safety-related task domains such as perception for autonomous driving and automated traffic monitoring. Deep Learning offers enhanced ways to represent targets in terms of their location, shape, appearance and motion. Learning can capture the significant variations seen in the training data while retaining class- or target-specific cues. Learning even allows for discovering specific correlations within an image of a 3D scene, as a perspective image contains many hints about an object's 3D location, orientation, size and identity. In this paper we propose an attention-based representational enhancement to improve the spatial accuracy of 3D pose estimation and the temporal stability of multi-target tracking. The presented methodology is evaluated on the KITTI multi-target tracking benchmark. It demonstrates competitive results against other recent techniques and improvements over a baseline relying solely on a Kalman-Filter-based kinematic association step.
1 INTRODUCTION
Spatial awareness and reasoning are fundamental
traits of modern vision-based robotic systems. How-
ever, monocular (single view) vision-based percep-
tion is associated with ambiguities such as depth-scale
ambiguity or viewpoint invariance. These ambiguities
arise from projecting the 3D world onto a 2D imaging
plane, where multiple 3D scene configurations can re-
sult in the same projected image if scale information
is not known a priori. Many ambiguities associated with a single view can be mitigated if the view corresponds to a street-level observer observing common object types with more-or-less known dimensions. In such cases, the learning task can be formulated such that 2D image content is regressed to a set of 3D object locations on a ground plane with an estimated heading orientation.
Image-based 3D object detection and pose pa-
rameter regression are typical multi-task learning
problems as they require classifying image content
into classes while also regressing their 3D bounding
boxes. Association of detected objects to consistent
motion trajectories can be facilitated by including a
re-identification task (reID), which distils each ob-
ject’s appearance into a compact and discriminative
feature set (Wang et al., 2020).
This work proposes a representation-enhanced
end-to-end Deep Learning approach for 3D pose-
aware multiple-object detection and tracking, using
only monocular RGB images as input. Its represen-
tational concept is based on an encoder-decoder type
multi-task learning scheme while also integrating re-
cent representational breakthroughs devised explicitly
for coping with spatial ambiguities and association
uncertainties. The tracking integrates a reID approach
that utilizes a Transformer Encoder (Vaswani et al.,
2017) with deformable attention (Zhu et al., 2020) to
obtain target-specific appearance features using a spa-
tially delocalized exploration and correlation scheme.
The proposed methodology is evaluated for the multi-
target tracking task employing the KITTI benchmark-
ing scheme and compared to several recent competing
algorithmic concepts.
The paper is structured as follows: in Section 2
we describe related works. Section 3 presents the pro-
posed methodology, which is evaluated and discussed
in Section 4. Finally, Section 5 concludes the paper.
2 RELATED WORK
This section provides a brief overview of relevant state-of-the-art approaches to monocular 3D object
detection and tracking. First, basic concepts of 2D
object detection schemes are presented. Next, based
on the described core concepts, prevailing represen-
tational extensions towards monocular 3D detection
and parameter regression are characterized.
Object Detection: Object detection frameworks can
be categorized by their architectural aspects. Accord-
ingly, two-stage, single-stage and anchorless detec-
tors can be distinguished.
Two-stage Detectors: As the earliest detection con-
cepts, two-stage detectors first generate region pro-
posals and then classify them as either one of the pre-
defined object categories or background. Classified
regions of interest (ROIs) can be further refined, and
additional attributes can be predicted. The earliest
model using this approach is the Region-based con-
volutional neural network (R-CNN) (Girshick et al.,
2014), followed by the Fast R-CNN (Girshick, 2015)
and Faster R-CNN (Ren et al., 2017) models.
Single-stage Detectors: Considerations of computa-
tional simplification have led to single-stage detec-
tors, which combine the two steps into a single step
and view object detection as a regression problem.
This concept was introduced by YOLO (Redmon
et al., 2016), which divides the image into an S×S grid and predicts class probabilities for each cell. Combin-
ing both stages into one yielded an improved run-time
performance while retaining competitive accuracy. Its
successor versions (Redmon and Farhadi, 2017; Red-
mon and Farhadi, 2018) still achieve state-of-the-art
results.
Anchorless Detectors: Both R-CNN and YOLO are
anchor-based detectors, as they operate with a fixed
number of region candidates for predicting objects
and refining their delineations. Anchorless detectors
on the other hand do not use such pre-defined boxes.
Instead, they formulate box parameters (such as the
center or corners) as key-points and regress corre-
sponding bounding box parameters directly. Among
the first models adopting this approach were Cor-
nerNet (Law and Deng, 2018) and CenterNet (Zhou
et al., 2019). In these works, the network output is a down-sampled dense grid with predictions of relevant object points at each cell. This simple infer-
ence scheme is complemented by additional object
attribute estimators. CenterNet achieved state-of-the-
art results while significantly reducing run-time.
Monocular 3D Object Detection: Common 3D ob-
ject detection schemes predict 3D object attributes
on a common ground plane, hence extending the 2D
Figure 1: Projection of a 3D camera-centered point p_c onto the sensor plane at p. [x_c, y_c, z_c] denote the camera coordinate system, [x_s, y_s] the image coordinate system.
bounding box representation. The correlation be-
tween the image space and 3D world space is learned
end-to-end from annotated training data. In the next
section, a concise overview on different representa-
tions used for monocular 3D detection is given.
Concepts: Image formation can be commonly ap-
proximated by a linear pinhole camera model, pro-
jecting points from a 3D space onto a 2D image plane
(Figure 1). Thereby, camera-centric world coordi-
nates are commonly used (Szeliski, 2022) to define
the spatial relation with respect to an object. Given a
set of image-space coordinates for an object, the re-
construction of the corresponding 3D coordinates re-
quires the depth (distance) of the object location. This
information, however, is not contained in a monocular
image and therefore needs to be directly estimated via
a learned model. Additionally, there is an ambiguity between object size and depth: from the image alone, it is impossible to tell whether an object is small and close or large and far away.
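To make the geometry concrete, the following minimal sketch (with hypothetical intrinsics and helper names; not part of the proposed method) projects a camera-centered 3D point onto the image plane and back-projects it, which is only possible once an external depth estimate is supplied:

import numpy as np

# Hypothetical pinhole intrinsics (focal lengths and principal point in pixels).
K = np.array([[721.5, 0.0, 621.0],
              [0.0, 721.5, 187.5],
              [0.0, 0.0, 1.0]])

def project(p_c, K):
    """Project a camera-centered 3D point [x_c, y_c, z_c] to pixel coordinates [x_s, y_s]."""
    uvw = K @ p_c
    return uvw[:2] / uvw[2]            # the depth z_c is divided out, so scale is lost

def back_project(p_s, z, K):
    """Recover the 3D point from pixel coordinates; requires an external depth estimate z."""
    x_s, y_s = p_s
    x_c = (x_s - K[0, 2]) * z / K[0, 0]
    y_c = (y_s - K[1, 2]) * z / K[1, 1]
    return np.array([x_c, y_c, z])

p_c = np.array([2.0, 1.5, 20.0])       # a point 20 m in front of the camera
p_s = project(p_c, K)
print(back_project(p_s, 20.0, K))      # recovers p_c only because the true depth is given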
Representations: Early neural approaches for monoc-
ular 3D object detection used a two-stage approach
containing a template matching step (Chabot et al.,
2017). Image content within detected objects was matched with templates, leading to inherent limitations regarding the number of template models. Other approaches transform the street-level front-view image into other data spaces, such as a birds-eye view
(BEV) (Kim and Kum, 2019; Roddick et al., 2018;
Srivastava et al., 2019) or a pseudo point cloud (Weng
and Kitani, 2019); the latter computed via a monocu-
lar depth estimation network (Godard et al., 2017). In
the BEV space, the task is reduced to an oriented 2D
bounding box detection, while the pseudo point cloud
representation enables the use of off-the-shelf LiDAR
detection schemes.
Our proposed method also follows the strategy
of directly regressing spatial and class-specific object
properties. Typical schemes adopting the regression
concept are CenterNet, regressing 2D bounding box
parameters, and SS3D (Jörgensen et al., 2019), which
additionally regresses distance, dimensions, observa-
tion angle and projected 3D bounding box corners.
Multi-Target Tracking: Multi-target tracking aims
to associate and partition time-consecutive detection
responses such that each partition belongs to the same
target identity. Its computational scheme can be online or offline, depending on whether access is given only to the most recent observations or also to all previous ones. In the following, typical location-, motion- and appearance-based cues which support the underlying association task are described:
Location- and Motion-based Methods: Physical con-
straints governing object motion typically define
strong criteria which aid association. A common ap-
proach is to treat tracking as a filtering process, where association (Kuhn, 1955) and prediction steps (e.g., by a Kalman Filter (Kalman et al., 1960)) alternate during the tracking process. Modern variants, such as
SORT (Bewley et al., 2016) and AB3DMOT (Weng
et al., 2020) employ a similar concept. Motion-based
approaches tend to have issues in crowded scenes and in the presence of low frame rates, where the location/motion correlation of targets degrades.
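As an illustration of such a motion-based association step (a sketch in the spirit of SORT, with assumed box format and gating threshold; not the exact baseline evaluated later in this paper), Kalman-predicted track boxes can be matched to new detections via the Hungarian algorithm on an IoU cost:

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detected_boxes, iou_threshold=0.3):
    """Hungarian matching between Kalman-predicted track boxes and new detections."""
    cost = np.array([[1.0 - iou(p, d) for d in detected_boxes] for p in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_threshold]
    unmatched_tracks = set(range(len(predicted_boxes))) - {r for r, _ in matches}
    unmatched_dets = set(range(len(detected_boxes))) - {c for _, c in matches}
    return matches, unmatched_tracks, unmatched_dets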
Appearance-based Methods: The abstraction capabil-
ity of neural representations offers powerful means to
capture the appearance of targets and use it in an asso-
ciation step. Architectural concepts range from a se-
quential multi-stage (detect, encode and associate) to
parallel, multi-branch (Voigtlaender et al., 2019) ap-
proaches. Such feature embeddings, or the similarities derived from them, can again be used in a conventional association step. In contrast to motion-based cues, appearance-based methods tend to be robust to detection gaps and larger inter-frame displacements.
Since object detection and tracking are mutually
supporting intermediate processing steps, it is intu-
itive to formulate them as a jointly trained multi-task
problem. In our work we adopt a tight coupling be-
tween these tasks, by using a common backbone for
feature computation, followed by task-specific sub-
networks. Our proposed scheme devises mutually
supporting representations, which attempt to simul-
taneously meet 2D projective, 3D BEV and target-specificity constraints.
3 METHODS
In this section, we describe the proposed monocular
3D multi-target object detection and tracking frame-
work, including the proposed enhancements regard-
ing the computational backbone network and the reID
branch. Additionally, the loss functions used for the
optimization process are detailed.
Monocular 3D Multi-target Detection & Tracking:
The aim of the proposed detection and tracking net-
work is to simultaneously predict the 3D location, di-
mension, orientation, and ID of objects given a se-
quence of monocular RGB images as input. The over-
all network architecture is extended from CenterNet, which identifies objects via their centers; detection responses are obtained by predicting a confidence heatmap. Several regression heads are used to
obtain the desired 3D bounding box of each object.
A Transformer Encoder network calculates an em-
bedding for each object that aids in identifying each
unique object instance across a sequence. The overall
architecture is illustrated in Figure 2.

Figure 2: Illustration of the overall network architecture. Highlighted boxes emphasize the proposed novelties: yellow represents the attention-enhanced backbone, red illustrates the Robust Kullback-Leibler loss and green the Transformer Encoder reID subnetwork.
Attention Enhanced Backbone: The proposed frame-
work uses the hierarchical layer fusion network DLA-
34 (Yu et al., 2018) as its computational backbone net-
work. The hierarchical aggregation connections are
replaced by deformable convolution (Zhu et al., 2019)
layers as in CenterNet. To further improve the ability
to consider long-range dependencies within the im-
age, the deformable convolution layers have been en-
hanced with an additional attention layer that only utilizes the information contained in the key, since the query content and the relative position are already covered by the deformable convolution (Zhu et al., 2019).
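One possible reading of this key-content-only term, sketched below under our own assumptions (the layer name and structure are illustrative; the paper does not publish code), is a global-context-style weighting in which attention weights are computed from the key features alone and the aggregated context is added onto the backbone features:

import torch
import torch.nn as nn

class KeyContentAttention(nn.Module):
    """Key-content-only attention term: the weights depend solely on the key features,
    not on query content or relative position (a sketch, not the authors' exact layer)."""
    def __init__(self, channels):
        super().__init__()
        self.key_score = nn.Conv2d(channels, 1, kernel_size=1)      # saliency per spatial key
        self.value_proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                        # x: (B, C, H, W) backbone feature map
        b, c, h, w = x.shape
        weights = torch.softmax(self.key_score(x).view(b, 1, h * w), dim=-1)   # (B, 1, HW)
        values = self.value_proj(x).view(b, c, h * w)                          # (B, C, HW)
        context = torch.bmm(values, weights.transpose(1, 2)).view(b, c, 1, 1)  # (B, C, 1, 1)
        return x + context                       # add global context to every position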
Object Detection and Representation: The 3D infor-
mation of an object is encoded via seven parameters:
[x,y,z,h,w, l,α]. Here, x, y and z are the 3D location
of the object center, h, w and l represent the object
dimensions and α denotes the apparent yaw angle of
the object on the ground plane. Object centers are
predicted using the position on the heatmap combined
with a regressed offset. To obtain more accurate pre-
dictions in cases where objects are located near the
image border, the decoupled representation in com-
bination with the edge fusion module proposed by
(Zhang et al., 2021a) was also incorporated into the
model. In addition to the location prediction from the
heatmap, the offset due to discretization is predicted
as well. The network predicts the apparent yaw angle
directly. The depth is predicted using two approaches,
mutually supporting an accurate depth estimation: On
one hand, the depth is directly regressed. On the other
hand, the corner points of the 3D bounding box are
utilized to validate a second depth prediction. This
is achieved by using the relative proportion between
pixel height and estimated object height. Given the camera's focal length, the depth of the vertical line from a top corner to the corresponding bottom corner
of the bounding box can be calculated as:
z_l = \frac{f \times H}{h_l}    (1)

where f is the focal length, H the predicted object height, h_l the pixel height of the vertical edge of the bounding box and z_l the resulting depth. The
weighted sum of the calculated depths is the final pre-
diction. Here, the weights are given by the inverse of
an uncertainty prediction that the model regresses for each depth.
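A minimal sketch of this inverse-uncertainty weighting (our own illustration with assumed variable names and values, not the authors' code): the directly regressed depth and the geometric estimates from Equation (1) are fused as follows.

import numpy as np

def fuse_depths(depths, uncertainties):
    """Combine depth hypotheses by inverse-uncertainty weighting. Equation (1) provides the
    geometric hypotheses z_l = f * H / h_l; one entry is the directly regressed depth."""
    depths = np.asarray(depths, dtype=float)
    weights = 1.0 / np.asarray(uncertainties, dtype=float)
    return float(np.sum(weights * depths) / np.sum(weights))

# Example: directly regressed depth plus two height-based estimates (illustrative values).
f, H = 721.5, 1.55                                   # focal length [px], object height [m]
geometric = [f * H / h_l for h_l in (52.0, 50.5)]    # pixel heights of two vertical edges
print(fuse_depths([21.0] + geometric, [0.8, 1.2, 1.4]))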
ReID and Tracking. The proposed method follows
a joint detection and embedding approach and incor-
porates a reID branch atop the backbone feature extractor, similar to FairMOT (Zhang et al., 2021c). How-
ever, instead of using a Convolutional Neural Net-
work (CNN) as a subnetwork to extract the embed-
dings, a deformable Transformer Encoder (Zhu et al.,
2020) is used to capture long-range relationships be-
tween the extracted features. The extracted features
are enriched with a positional embedding to preserve
spatial relations. Three separate linear layers gener-
ate sampling offsets, attention weights and feature values. The sampling offsets are used to obtain key sample values, which are then multiplied by the attention weights and aggregated. A
final linear layer generates the outputs. The distances
between the generated embeddings are used to calcu-
late a cost matrix between existing tracks and new de-
tection responses. Assignment is then accomplished
using the Hungarian algorithm (Kuhn, 1955).
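The association step can be sketched as follows; the cosine distance and the gating threshold are assumptions for illustration, as the paper does not specify the exact distance measure:

import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_cost(track_embs, det_embs):
    """Cost matrix of (1 - cosine similarity) between track and detection embeddings."""
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    return 1.0 - t @ d.T

def match(track_embs, det_embs, max_cost=0.4):
    """Assign detections to existing tracks with the Hungarian algorithm (Kuhn, 1955)."""
    cost = cosine_cost(track_embs, det_embs)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]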
Loss Functions. A weighted sum of multiple loss
functions is used to train the proposed framework. In
this section, the individual loss components are de-
scribed.
As in (Zhou et al., 2019), the penalty-reduced focal
loss is used for object center estimation. It is defined
as:
f = -\frac{1}{N} \left(1 - \hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right)    (2)

g = -\frac{1}{N} \left(1 - Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1 - \hat{Y}_{xyc}\right)    (3)

L_{centre} = \sum_{xyc} \begin{cases} f & \text{if } Y_{xyc} = 1 \\ g & \text{otherwise} \end{cases}    (4)

where the sum runs over all heatmap locations and classes, N is the number of objects in the image, and α and β are hyperparameters which define the degree of penalty reduction and down-weighting of easy examples. For all experiments, α = 2 and β = 4 are used.
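For illustration, a compact PyTorch rendering of this penalty-reduced focal loss (a sketch consistent with the CenterNet formulation; tensor names are our own):

import torch

def center_focal_loss(pred, target, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss over a predicted heatmap `pred` and a Gaussian-splatted
    ground-truth heatmap `target`, both of shape (B, C, H, W) with values in (0, 1)."""
    pos = target.eq(1).float()                      # cells containing an object centre
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1.0 - eps)
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - target) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)              # N = number of object centres
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos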
To compensate for the discretization, spatial offsets
are learned and optimized as in (Zhang et al., 2021a):
L_{offset} = \begin{cases} \left|\hat{\delta}_{inside} - \delta_{inside}\right| & \text{if inside} \\ \log\left(1 + \left|\hat{\delta}_{outside} - \delta_{outside}\right|\right) & \text{otherwise} \end{cases}    (5)
To obtain the orientation, the apparent angle \hat{\alpha} is directly regressed and clamped to [-\pi, \pi]. It is optimized using an L2 loss as follows:

L_{angle} = \left(\sin(\hat{\alpha}) - \sin(\alpha)\right)^2 + \left(\cos(\hat{\alpha}) - \cos(\alpha)\right)^2    (6)
with α being the ground truth angle.
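Equation (6) translates directly into code, e.g. (assumed tensor names):

import torch

def angle_loss(alpha_pred, alpha_gt):
    """L2 loss on the (sin, cos) encoding of the apparent yaw angle, as in Equation (6)."""
    return ((torch.sin(alpha_pred) - torch.sin(alpha_gt)) ** 2 +
            (torch.cos(alpha_pred) - torch.cos(alpha_gt)) ** 2).mean()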
The dimensions are regressed as offsets \hat{\delta}_k deviating from the class average, using the L1 loss:

L_{dimension} = \sum_{k \in \{h, w, l\}} \left|\bar{k}_c \, e^{\hat{\delta}_k} - k\right|    (7)

where \bar{k}_c denotes the class-wise average of dimension k.
The corner points of the projected 3D bounding box are regressed as offsets from the discretized object center and optimized using an L1 loss. However, only those corner points that are visible in the image are penalized, as indicated by the indicator function I_{inside}(k_i). The loss function becomes:

L_{keypoints} = \frac{\sum_{i=0}^{N_k} I_{inside}(k_i) \left|\hat{\delta}_{k_i} - \delta_{k_i}\right|}{\sum_{i=0}^{N_k} I_{inside}(k_i)}    (8)
Additionally including a loss function for a 2D
bounding box regression task has been shown to also
improve the 3D detection performance (Zhang et al.,
2021a). Therefore, the 2D bounding box estimation
task is also added, which predicts spatial offsets from
the object center. The generalized Intersection-over-
Union loss (GIoU) is used, yielding a 2D bounding box loss L_{2D}:

GIoU = IoU - \frac{|A_c \setminus U|}{|A_c|}    (9)

L_{2D} = 1 - GIoU    (10)

where A_c is the smallest enclosing box of the predicted and ground-truth boxes and U their union.
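A small reference implementation of this 2D term (our own sketch; boxes are assumed to be given as [x1, y1, x2, y2]):

def giou_loss(a, b):
    """L_2D = 1 - GIoU for two axis-aligned boxes [x1, y1, x2, y2]."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / (union + 1e-9)
    # Area of the smallest enclosing box A_c
    enc = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    giou = iou - (enc - union) / (enc + 1e-9)
    return 1.0 - giou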
For the depth estimation, instead of the Laplacian Kullback-Leibler loss (as used in MonoFlex), we use the Robust Kullback-Leibler loss proposed by (Chen et al., 2021) to overcome the issue of an increasing gradient during training as the predicted uncertainty decreases. It is defined as:

L_{robust\,KL} = \frac{1}{\hat{w}} \begin{cases} \frac{1}{2} e^2 + \log\sigma & \text{if } |e| \le \sqrt{2} \\ \sqrt{2}\,|e| - 1 + \log\sigma & \text{if } |e| > \sqrt{2} \end{cases}    (11)

\hat{w} \leftarrow \alpha\,\hat{w} + (1 - \alpha)\,\frac{1}{N} \sum_{i=0}^{N} \frac{1}{\sigma_i}    (12)

where e = |\hat{y} - y| / \sigma is the L1 error of the prediction normalized by σ, σ denotes the predicted uncertainty of the estimation, and N is the number of predictions made. α is a hyperparameter that determines the impact of new observations on the exponential moving average \hat{w} of the inverse uncertainties.
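A possible PyTorch rendering of Equations (11)-(12) (our own sketch; the value of α and the helper names are assumptions, and the moving-average update is kept outside the autograd graph):

import torch

def robust_kl_loss(pred, target, log_sigma, w_hat):
    """Uncertainty-aware depth loss following Equations (11)-(12); `w_hat` is a running scalar."""
    sigma = torch.exp(log_sigma)
    e = torch.abs(pred - target) / sigma                     # normalized L1 error
    quad = 0.5 * e ** 2 + log_sigma                          # branch for |e| <= sqrt(2)
    lin = (2.0 ** 0.5) * e - 1.0 + log_sigma                 # branch for |e| >  sqrt(2)
    per_sample = torch.where(e <= 2.0 ** 0.5, quad, lin)
    return per_sample.mean() / w_hat

def update_w_hat(w_hat, sigma, alpha=0.99):
    """Exponential moving average of the mean inverse uncertainty, Equation (12)."""
    with torch.no_grad():
        return alpha * w_hat + (1.0 - alpha) * (1.0 / sigma).mean().item()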
Finally, the eight corner points \hat{v}_i of the predicted 3D bounding box are compared, in terms of their 3D spatial deviation, to the corresponding ground-truth corner points v_i. The resulting spatial discrepancy gives rise to an L1 loss, whose optimization enforces a strong depth-based criterion and thus supports the different sub-tasks:

L_{bounding\,box} = \sum_{i=0}^{7} \left|\hat{v}_i - v_i\right|    (13)
The reID branch is treated as a classification during
training, where the generated embeddings are the in-
put to a linear layer that outputs a probability for each of the K unique object instances in the training dataset. To optimize this task, the cross-entropy loss is used:
L_{reID} = -\sum_{i=0}^{N} \sum_{k=0}^{K} L_i(k) \log\left(p(k)\right)    (14)

with p(k) being the predicted probability that the detection is object k, L_i(k) the one-hot encoded representation of the ground-truth label and N the total number of detection responses in the image.
4 RESULTS & DISCUSSION
This section presents training details and obtained ex-
perimental results of the proposed framework. Based
on quantitative and qualitative results for the detection
and tracking quality, we discuss the observed algo-
rithmic qualities and encountered failure modes. Fi-
nally, the impact of the proposed additions is demon-
strated in an ablation study.
4.1 Training & Evaluation Metrics
For all experiments, the model was trained on the
KITTI dataset (Geiger et al., 2012), using the train-
ing (5,027 images) / validation (2,981 images) split
proposed by (Voigtlaender et al., 2019). The input im-
ages were padded to 384 × 1280 px and AdamW was used as the optimizer with an initial learning rate of 3 × 10^{-4}, decaying by a factor of 10 at epochs 80 and 90. Overall, the network was trained for 100 epochs.
The input data was augmented using random horizon-
tal flips. Each of the prediction heads consists of two
convolutional layers with a batch norm and a ReLU
activation in between. Training involves three object
categories: passenger cars, cyclists, and pedestrians.
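The optimization setup can be reproduced roughly as follows (a sketch; the model is a placeholder and weight-decay settings are not specified in the paper):

import torch

model = torch.nn.Linear(4, 4)  # placeholder for the full detection & tracking network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# Decay the learning rate by a factor of 10 at epochs 80 and 90; train for 100 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 90], gamma=0.1)

for epoch in range(100):
    # for images, targets in train_loader:   # random horizontal flips as augmentation
    #     loss = model_forward_and_loss(images, targets)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()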
Training a multi-task regression network naturally comes with specific data requirements. Joint optimization for detection and tracking in 3D space requires datasets that simultaneously provide category labels, 3D spatial annotations and tracking information. The employed dataset provides this information; however, more data and/or increased diversity would probably lead to an even better accu-
racy. For fair comparison, all experiments and com-
parisons use the same data and data split.
Table 1: Monocular 3D object detection results of the proposed method compared to state-of-the-art methods.

Method                                                      AP_3D Easy / Moderate / Hard
SMOKE (Liu et al., 2020) 14.76 12.85 11.50
MonoGeo (Zhang et al., 2021b) 18.45 14.48 12.87
Ground-aware Monocular 3D Obj. Det. (Liu et al., 2021) 23.63 16.16 12.06
MonoFlex (Zhang et al., 2021a) 23.64 17.51 14.83
Proposed Method 20.56 15.00 11.79
For object detection, the average precision on the
car class for the three difficulty levels (easy, moderate, hard, as defined by the KITTI benchmark) is reported (see Table 1). The IoU threshold used is 0.7. The main evaluation metric used for tracking is the
2021). Additionally, the detection accuracy (DetA)
and the association accuracy (AssA) are reported to
better grasp the contributions of the two components.
The three metrics are defined as follows:
HOTA_\alpha = \sqrt{\frac{\sum_{c \in \{TP\}} A(c)}{|TP| + |FN| + |FP|}}    (15)

A(c) = \frac{|TPA(c)|}{|TPA(c)| + |FNA(c)| + |FPA(c)|}    (16)

DetA_\alpha = \frac{|TP|}{|TP| + |FN| + |FP|}    (17)

AssA_\alpha = \frac{1}{|TP|} \sum_{c \in \{TP\}} A(c)    (18)
α is the confidence threshold at which detections
are counted as positive. TP, FN and FP are true pos-
itive, false negative and false positive detections re-
spectively, while TPA(c), FNA(c) and FPA(c) denote
the true positive, false negative and false positive as-
sociations of a true positive detection. c are the ele-
ments of the set of true positive detections. The final
metric is approximated by averaging over 19 different
thresholds instead of using the integral:
HOTA = \frac{1}{19} \sum_{\alpha \in \{0.05, \dots, 0.95\}} HOTA_\alpha    (19)

DetA = \frac{1}{19} \sum_{\alpha \in \{0.05, \dots, 0.95\}} DetA_\alpha    (20)

AssA = \frac{1}{19} \sum_{\alpha \in \{0.05, \dots, 0.95\}} AssA_\alpha    (21)
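For concreteness, a small sketch computing these scores at a single threshold from already-matched results (our own helpers; the TP/FN/FP counts and per-TP association triples are assumed to be available):

def deta_alpha(tp, fn, fp):
    """Detection accuracy at a single confidence threshold alpha, Equation (17)."""
    return tp / (tp + fn + fp)

def assa_alpha(tpa_fna_fpa):
    """Association accuracy, Equation (18): mean of A(c) over all true-positive detections,
    where each entry is a (TPA, FNA, FPA) triple for one TP detection c."""
    scores = [tpa / (tpa + fna + fpa) for tpa, fna, fpa in tpa_fna_fpa]
    return sum(scores) / len(scores)

def hota_alpha(tp, fn, fp, tpa_fna_fpa):
    """HOTA at one threshold, Equation (15), which equals sqrt(DetA_alpha * AssA_alpha)."""
    return (deta_alpha(tp, fn, fp) * assa_alpha(tpa_fna_fpa)) ** 0.5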
4.2 Quantitative Results
Table 1 presents a detection accuracy comparison in
terms of AP scores to some recent competing ap-
proaches. Scores in this table originate from an
IoU evaluation within a metric 3D space, therefore
slight regression errors in depth, location or orienta-
tion quickly lead to a strong decay in the computed
score. As can be seen from the table, the obtained detection accuracy is comparable to the current state of the art. For hard examples, the proposed scheme exhibits some weakness, mainly due to objects which are small and located near the image border, where enforcing depth constraints is less effective.
Table 2 compares the tracking results on the car
class to a baseline method, which solely relies on a
track association based on IoU with predictions from
a Kalman filter. The results show that the proposed method improves over the baseline, especially regarding the AssA criterion, where the difference is most pronounced. We attribute this improvement to the introduced reID representation, which helps especially for vehicles in the view periphery, far from the image center. In this region, upon translatory motion or in case of a turning camera (observer vehicle), motion-based association quickly degrades, and reID features mitigate this problem to a certain extent.
Table 2: The monocular 3D multi-target tracking results of
the proposed method compared to the Kalman Baseline.
Method HOTA DetA AssA
Baseline (Kalman Filter) 30.86 22.79 42.68
Proposed Method 30.96 22.89 42.81
4.3 Qualitative Results
In Figure 3, qualitative results on the KITTI vali-
dation dataset are shown. Additionally, in Figure 4
the proposed method is compared qualitatively to the
Kalman Filter baseline. As can be seen from these figures, a rapid change in the underlying motion characteristics of the observed vehicle leads to association problems. The Kalman Filter, as it is
based on purely kinematic attributes, struggles with
such sudden changes in motion. In contrast, the reID
network manages to correctly match the detection re-
sponses and establish consistent trajectories.
Monocular 3D Detection and reID-enhanced Tracking of Multiple Traffic Participants
431
Figure 3: Qualitative tracking results of the proposed method on the KITTI validation set. Bounding boxes of the same colour denote the same assigned track identity.
Table 3: The impact of the proposed changes to the 3D multi-target tracking results.
Method HOTA DetA AssA
Without attention in backbone 30.04 21.88 42.34
Without Robust KL loss 29.57 21.23 42.57
With vanilla CNNs instead of Transformer Encoder 30.25 21.77 43.5
Full model 30.96 22.89 42.81
4.4 Ablation Study
To demonstrate the impact of the various proposed
improvements, Table 3 shows the results obtained
without the individual changes. All experiments were
conducted on the KITTI validation set and to combat
run-to-run variance, the results have been averaged
over three runs. As one can see, the additional attention in the backbone feature extractor had its largest impact on the DetA metric. The use of the robust Kullback-Leibler loss led to the greatest overall improvement and also affected the DetA the most. Finally,
the standard CNNs had a better AssA than the pro-
posed method but due to their negative impact on the
Figure 4: Comparison between the proposed reID approach (left) and the simple Kalman Filter baseline (right) on the KITTI validation set.
DetA, the overall HOTA was still lower compared to
the Transformer Encoder reID network.
5 CONCLUSION
In this work, we proposed a representation-enhanced
end-to-end Deep Learning approach for 3D multi-
target detection and tracking that utilizes a Trans-
former Encoder sub-network to extract representative
reID features. This approach facilitates an improved
tracking performance compared to a motion-based
baseline on the KITTI benchmark dataset. Regarding
monocular 3D object detection, the proposed method
is competitive with current SOTA models.
ACKNOWLEDGEMENTS
This work was carried out within the Bike2CAV
project (project Nr. 879632), which is funded by the
Austrian Federal Ministry for Climate Action, En-
vironment, Energy, Mobility, Innovation and Tech-
nology (BMK) under the “Future Mobility” program
and is managed by the Austrian Research Promotion
Agency (FFG).
REFERENCES
Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B.
(2016). Simple online and realtime tracking. In IEEE
International Conference in Image Processing (ICIP),
pages 3464–3468.
Chabot, F., Chaouch, M., Rabarisoa, J., Teuliere, C., and
Chateau, T. (2017). Deep manta: A coarse-to-fine
many-task network for joint 2d and 3d vehicle anal-
ysis from monocular image. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 2040–2049.
Chen, H., Huang, Y., Tian, W., Gao, Z., and Xiong, L.
(2021). Monorun: Monocular 3d object detection
by reconstruction and uncertainty propagation. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
10379–10388.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? the kitti vision benchmark
suite. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 3354–3361.
Girshick, R. (2015). Fast r-cnn. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 1440–1448.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).
Rich feature hierarchies for accurate object detection
and semantic segmentation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 580–587.
Godard, C., Mac Aodha, O., and Brostow, G. J. (2017). Un-
supervised monocular depth estimation with left-right
consistency. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 270–279.
Jörgensen, E., Zach, C., and Kahl, F. (2019). Monocular 3d object detection and box fitting trained end-to-end using intersection-over-union loss. arXiv preprint arXiv:1906.08070.
Kalman, R. E. et al. (1960). A new approach to linear fil-
tering and prediction problems. Journal of Basic En-
gineering, 82(1):35–45.
Kim, Y. and Kum, D. (2019). Deep learning based vehi-
cle position and orientation estimation via inverse per-
spective mapping image. In IEEE Intelligent Vehicles
Symposium (IV), pages 317–323.
Kuhn, H. W. (1955). The hungarian method for the assign-
ment problem. Naval Research Logistics Quarterly,
2(1-2):83–97.
Law, H. and Deng, J. (2018). Cornernet: Detecting objects
as paired keypoints. In European Conference on Com-
puter Vision (ECCV), pages 734–750.
Liu, Y., Yixuan, Y., and Liu, M. (2021). Ground-
aware monocular 3d object detection for autonomous
driving. IEEE Robotics and Automation Letters,
6(2):919–926.
Liu, Z., Wu, Z., and Tóth, R. (2020). Smoke: Single-stage
monocular 3d object detection via keypoint estima-
tion. In IEEE/CVF Conference on Computer Vision
and Pattern Recognition Workshops (CVPRW), pages
4289–4298.
Luiten, J., Ošep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., and Leibe, B. (2021). Hota: A higher
order metric for evaluating multi-object tracking. In-
ternational Journal of Computer Vision, 129(2):548–
578.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.
(2016). You only look once: Unified, real-time ob-
ject detection. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 779–788.
Redmon, J. and Farhadi, A. (2017). Yolo9000: better, faster,
stronger. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 7263–
7271.
Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental
improvement. arXiv preprint arXiv:1804.02767.
Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster r-
cnn: Towards real-time object detection with region
proposal networks. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 39(6):1137–1149.
Roddick, T., Kendall, A., and Cipolla, R. (2018). Ortho-
graphic feature transform for monocular 3d object de-
tection. arXiv preprint arXiv:1811.08188.
Srivastava, S., Jurie, F., and Sharma, G. (2019). Learning 2d
to 3d lifting for object detection in 3d for autonomous
vehicles. In IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pages 4504–
4511.
Szeliski, R. (2022). Computer Vision. Springer Interna-
tional Publishing, Cham.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in Neural
Information Processing Systems, 30.
Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B.
B. G., Geiger, A., and Leibe, B. (2019). Mots: Multi-
object tracking and segmentation. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 7942–7951.
Wang, Z., Zheng, L., Liu, Y., Li, Y., and Wang, S. (2020).
Towards real-time multi-object tracking. In European
Conference on Computer Vision (ECCV), pages 107–
122.
Weng, X. and Kitani, K. (2019). Monocular 3d object de-
tection with pseudo-lidar point cloud. In IEEE/CVF
International Conference on Computer Vision Work-
shop (ICCVW), pages 857–866.
Weng, X., Wang, J., Held, D., and Kitani, K. (2020).
Ab3dmot: A baseline for 3d multi-object track-
ing and new evaluation metrics. arXiv preprint
arXiv:2008.08063.
Yu, F., Wang, D., Shelhamer, E., and Darrell, T. (2018).
Deep layer aggregation. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 2403–2412.
Zhang, Y., Lu, J., and Zhou, J. (2021a). Objects are differ-
ent: Flexible monocular 3d object detection. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 3289–
3298.
Zhang, Y., Ma, X., Yi, S., Hou, J., Wang, Z., Ouyang, W.,
and Xu, D. (2021b). Learning geometry-guided depth
via projective modeling for monocular 3d object de-
tection. arXiv preprint arXiv:2107.13931.
Zhang, Y., Wang, C., Wang, X., Zeng, W., and Liu, W.
(2021c). Fairmot: On the fairness of detection and
re-identification in multiple object tracking. Inter-
national Journal of Computer Vision, 129(11):3069–
3087.
Zhou, X., Wang, D., and Krähenbühl, P. (2019). Objects as
points. arXiv preprint arXiv:1904.07850.
Zhu, X., Hu, H., Lin, S., and Dai, J. (2019). Deformable
convnets v2: More deformable, better results. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 9308–
9316.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J.
(2020). Deformable detr: Deformable transform-
ers for end-to-end object detection. arXiv preprint
arXiv:2010.04159.