BASE: Probably a Better Approach to Visual Multi-Object Tracking
Martin Vonheim Larsen^{1,2,a}, Sigmund Rolfsjord^{1,2,b}, Daniel Gusland^{1,2,c}, Jörgen Ahlberg^{3,2,d} and Kim Mathiassen^{1,2,e}
1 Norwegian Defence Research Establishment, Norway
2 University of Oslo, Norway
3 Linköping University, Norway
a https://orcid.org/0000-0002-3008-7712
b https://orcid.org/0009-0004-6118-8593
c https://orcid.org/0009-0001-6351-8128
d https://orcid.org/0000-0002-6763-5487
e https://orcid.org/0000-0003-3747-5934
This work was funded by The University Center at Kjeller, and by projects 1505 and 1688 at the Norwegian Defence Research Establishment.
Keywords:
Visual Multi-Object Tracking, Probabilistic Tracking, Distance-Aware, Association-Less Track Management.
Abstract:
The field of visual object tracking is dominated by methods that combine simple tracking algorithms and ad hoc schemes. Probabilistic tracking algorithms, which are leading in other fields, are surprisingly absent from the leaderboards. We found that accounting for distance in target kinematics, exploiting detector confidence and modelling non-uniform clutter characteristics is critical for a probabilistic tracker to work in visual tracking. Previous probabilistic methods fail to address most or all these aspects, which we believe is why they fall so far behind current state-of-the-art (SOTA) methods (there are no probabilistic trackers in the MOT17 top 100). To rekindle progress among probabilistic approaches, we propose a set of pragmatic models addressing these challenges, and demonstrate how they can be incorporated into a probabilistic framework. We present BASE (Bayesian Approximation Single-hypothesis Estimator), a simple, performant and easily extendible visual tracker, achieving SOTA on MOT17 and MOT20 without using Re-Id.
Code available at https://github.com/ffi-no/paper-base-visapp-2024.
1 INTRODUCTION
Visual Multi-Object Tracking (VMOT)
is the task of estimating the location of objects over
time in a video sequence while maintaining a unique
ID for each target. Popular VMOT benchmarks (Leal-Taixé et al., 2015; Dendorfer et al., 2020; Sun et al.,
2022) are currently dominated by methods which
combine simple tracking algorithms with a stack of
ad hoc specializations to visual tracking (Zhang et al.,
2021; Du et al., 2022; Aharon et al., 2022; Yang et al.,
2023). These simple-yet-effective trackers cut corners
using hard logic, for instance by ignoring less-certain
detections, leaving performance on the table. Mean-
while, probabilistic trackers are ubiquitous in more
mature fields such as radar and sonar tracking, as
they avoid most of these hard choices and can better
exploit available information. This raises the ques-
tion: Why are probabilistic methods outperformed by
ad hoc approaches on VMOT? We believe the prob-
abilistic approaches have overlooked a few key as-
pects specific to VMOT in their adaptations of exist-
ing tracking theory.
In visual tracking, the perspective imaging results
in target kinematics and clutter (false detections) char-
acteristics that are radically different from those seen
in radar or sonar tracking. When representing tar-
get kinematics in image plane coordinates, we should
expect objects near the camera to appear more agile
than those we see from afar. Similarly, we should ex-
pect the density of new targets and clutter detections
to be much greater for distant objects, based on the
simple fact that objects take up less space in the im-
age when they are farther away. Accounting for these
non-uniform effects is necessary to succeed in proba-
bilistic visual tracking.
The dominating ad hoc trackers are built pragmat-
ically from the ground up for VMOT. One example
of this is the ubiquitous use of intersection-over-union
(IOU) as a similarity metric for detection-to-track
matching. Although chosen for its simplicity, the IOU
metric has a nice side-effect in that it gives more leeway for larger bounding boxes than for smaller ones.

Figure 1: MOTA comparison of our proposed method BASE and top-performing trackers on the MOT17 and MOT20 benchmarks. Trackers in green use a fixed set of parameters across the test set, while those in grey tune parameters for each test sequence. Yellow trackers do not report whether parameters are kept constant on the test set, and have not published code reproducing the results.
Leading approaches (Aharon et al., 2022; Yang et al.,
2023) also use object size to change model dynamics
when estimating target motion. Avoiding the draw-
backs of modeling distance explicitly, these pragmatic
mechanisms lead to target models that are distance-
aware, with good performance both for near and dis-
tant objects.
The advantage probabilistic trackers hold over ad
hoc methods lies in their ability to better balance dif-
ferent types of information. Which detection is likely
to originate from the current object? The one clos-
est in position, the one with the highest confidence, or
perhaps the one most similar in size or appearance?
Ad hoc methods either ignore parts of this informa-
tion, or resort to some arbitrary weighting between
them. Similarly, ad hoc methods typically accept tar-
gets once they have been detected a fixed number of
times, ignoring how consistent the detections were, or
what confidence level they had. A strong probabilistic
approach would instead model the relevant aspects of
this information, and take decisions based on what is
most likely.
Despite the more complex structure of probabilis-
tic trackers, we find that the leading approaches (Fu
et al., 2019; Song et al., 2019; Jinlong Yang et al.,
2022; Baisa, 2021b) omit or fail to model the as-
pects we see as critical to visual tracking. To lift prob-
abilistic trackers to the performance of the leading
ad hoc methods, we propose Bayesian Approxima-
tion Single-hypothesis Estimator (BASE), a minimal-
ist probabilistic take on visual tracking. As outlined
in Fig. 2, BASE replaces the key components of a tra-
ditional single-hypothesis tracker (SHT) with proba-
bilistic counterparts. The novelty of our approach is
accounting for the non-uniform kinematics and distri-
bution of clutter experienced in visual tracking, in a
unified probabilistic manner. Specifically, our main
contributions are:
An efficient distance-aware motion model.
Pragmatic modeling of new targets and clutter de-
tections, suited for VMOT.
A new association-less probabilistic track man-
agement scheme, suited for crowded scenes.
Methods for automatically estimating model pa-
rameters from training data.
Section 2 gives a brief review of related work, fol-
lowed by Sec. 3 which revisits the traditional single-
hypothesis tracking pipeline. Section 4 describes our
proposed method, BASE, and Sec. 5 details our ex-
periments performed on MOT17, MOT20 and Dance-
Track. Section 6 contains a summary and conclu-
sions.
2 RELATED WORK
Visual Multi-Object Tracking. The simple and
performant trackers that are currently state-of-the-
art (SOTA) are mostly based on or inspired by Sim-
ple Online and Realtime Tracking (SORT) (Bewley
et al., 2016). SORT started a
strong line of pragmatic trackers (Wojke et al., 2018;
Du et al., 2022; Zhang et al., 2021; Aharon et al.,
2022; Cao et al., 2022; Nasseri et al., 2022; Yang
et al., 2023; Stadler and Beyerer, 2022). These pub-
lications have a strong empirical focus, identifying
more or less standalone components that improve performance.

Figure 2: BASE builds on a traditional single-hypothesis tracker (SHT) architecture, but uses probabilistic formulations for all aspects of gating, association and track management, as well as a distance-aware motion model. These probabilistic formulations enable BASE to express nuances in detection confidence and detection-to-track match, which threshold-based approaches lack. For a given use case, BASE requires modeling the motion, clutter, and target birth characteristics of the camera and detection algorithm used, which we can automatically fit using typical training datasets.

Based on their research, a set of core
components have emerged that are common among
strong visual trackers. Some of these components
are Global Nearest Neighbor (GNN) matching based
on IOU-related association metrics, Kalman filtering
for track prediction, two-stage track formation, ego-
motion compensation, post-interpolation of missed
segments, and most importantly, a strong detector.
End-to-End Trackers. These trackers are very attractive,
as they can be adapted to new applications without
architectural changes. They can also take advantage
of information in ways that are hard to express in
code. Although many end-to-end trackers exhibit im-
pressive results (Sun et al., 2020; Zhu et al., 2021;
Zeng et al., 2021; Yang et al., 2021; Yan et al., 2022; Wang et al., 2020; Zhang
et al., 2023), they lag behind the best tracking-by-
detection methods on many VMOT benchmarks like
MOTChallenge. This is perhaps due to limited train-
ing data and difficulties in training motion models
with a short time horizon (Zhu et al., 2021). These is-
sues may be resolved in the future, but we still believe
there is room for other approaches, as data restrictions
will likely persist in many niche applications.
Probabilistic Visual Tracking. Many probabilistic
trackers in the MOTChallenges build on the ideas of
Bar-Shalom (Bar-Shalom and Tse, 1975; Bar-Shalom
et al., 2007) and Blackman (Blackman and Popoli,
1999). While they mostly rely on advanced state man-
agement schemes like MHT (Reid, 1979; Kim et al.,
2015) or PHD (Mahler and Martin, 2003), they of-
ten resort to non-probabilistic methods for associa-
tion and for combining different types of informa-
tion, such as appearance or shape. In fact, the most
successful probabilistic attempts fall back to non-
probabilistic association methods like IOU matching.
Meanwhile, approaches which rely solely on proba-
bilistic schemes (Baisa, 2021b), score slightly worse
than even the extremely simple IOUTracker (Bochin-
ski et al., 2017).
Surprisingly, most probabilistic approaches do not
leverage detector confidence beyond basic detection
thresholding (Sanchez-Matilla et al., 2016; Fu et al.,
2019; Jinlong Yang et al., 2022; Fu et al., 2018;
Aguilar et al., 2022; Baisa, 2021b). Some approaches
utilize confidence score for track initiation only (Song
et al., 2019; Baisa, 2021a; Baisa, 2019). Meanwhile,
Wojke et al. (Wojke and Paulus, 2017) demonstrated
a significant boost in performance by integrating de-
tector confidence into a PHD filter.
3 PROBABILISTIC
SINGLE-HYPOTHESIS
TRACKING (SHT) REVISITED
To bridge the gap between the highly specialized
ad hoc methods and the overly general probabilis-
tic trackers, the single-hypothesis tracker (SHT) is a
good starting point. It is arguably one of the simplest
tracking methods that can also be made to leverage
most of the key building blocks of probabilistic track-
ers.
Traditional SHT can be summarized as developing
a single set of track hypotheses through the following
steps for each new piece of sensor data:
1. Predict existing tracks to the current timestep.
2. Gate the detections by disregarding detection-to-track pairs that are considered too unlikely.
3. Associate detections to existing track hypotheses,
choosing the overall most likely set of matches.
4. Update the state estimate of existing tracks using
measurement data from the associated detections.
5. Manage Tracks by initializing new candi-
date tracks for unmatched detections, validating
promising tracks, and discarding unlikely tracks.
This structure is popular among the top-performing
visual trackers, which implement most (or all) these
steps using ad hoc modelling (Zhang et al., 2021;
Aharon et al., 2022; Du et al., 2022). Probabilistic
SHT aims to formulate gating, association and track
management in terms of probability. These formula-
tions often also require strict probabilistic modeling
of the underlying state (predict and update). Black-
man (Blackman and Popoli, 1999) gives an excellent
in-depth walkthrough of probabilistic SHT, but we
will here go through the most critical aspects.
3.1 Probabilistic Gating and
Association
The ad hoc trackers discussed in Sec. 2 typically
perform gating by thresholding the detection-to-track
IOU distance, and association by minimizing the
overall IOU distance. The benefit of formulating
these mechanisms in terms of probability is that it of-
fers a clear path to incorporate aspects such as the
quality of the track estimates, the detector perfor-
mance, and the clutter characteristics.
For a given true track state $x_i$, we assume we have an estimate $\hat{x}_i \sim \mathcal{N}(x_i, P_i)$ and that corresponding detections $z_j$ are generated as

$$z_j = H x_i + w_j, \quad w_j \sim \mathcal{N}(0, R_j). \tag{1}$$
Here, $H$ is the measurement function, $w_j$ is white measurement noise, and $P_i$ and $R_j$ are known covariances. We can then consider the detection-to-track innovation

$$\hat{y}_{ij} = z_j - H\hat{x}_i \sim \mathcal{N}(0, S_{ij}), \tag{2}$$

with $S_{ij} = R_j + H P_i H^\top$.
For gating, a traditional approach (see (Blackman and Popoli, 1999, Sec. 6.3.2) for details) is to threshold the match-to-noise likelihood ratio:

$$\frac{\mathcal{N}\big(\hat{y}_{ij};\, 0,\, S_{ij}\big)}{\lambda_C + \lambda_{NT}} \geq \frac{1 - P_G}{P_G}, \tag{3}$$

where $\lambda_{NT}$ and $\lambda_C$ are the densities of new targets and clutter, and $P_G$ is the desired gating confidence level.
For probabilistic association, we want to find the "most likely" set of detection-to-track pairs $A = \{(j, i)\}$. A common approach, which ignores the detector performance and clutter characteristics, is to formulate this as a linear-sum assignment problem over the negative logarithm of the innovation likelihood:

$$A^* = \arg\min_A \sum_{(j,i) \in A} \hat{y}_{ij}^\top S_{ij}^{-1} \hat{y}_{ij} + \log |S_{ij}|. \tag{4}$$
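As a concrete illustration, the sketch below implements the gating test of Eq. (3) and the assignment problem of Eq. (4) with SciPy. The array layout, function name and the large stand-in cost are our own assumptions, not part of any published BASE code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import multivariate_normal

def gate_and_associate(y, S, lam_c, lam_nt, p_gate):
    """y: (n_trk, n_det, d) innovations; S: (n_trk, n_det, d, d) covariances."""
    n_trk, n_det, d = y.shape
    cost = np.full((n_trk, n_det), np.inf)
    for i in range(n_trk):
        for j in range(n_det):
            S_ij = S[i, j]
            lik = multivariate_normal.pdf(y[i, j], mean=np.zeros(d), cov=S_ij)
            # Eq. (3): keep the pair only if the match-to-noise ratio is high enough.
            if lik / (lam_c + lam_nt) >= (1.0 - p_gate) / p_gate:
                # Eq. (4): Mahalanobis distance plus log-determinant as cost.
                maha = y[i, j] @ np.linalg.solve(S_ij, y[i, j])
                cost[i, j] = maha + np.linalg.slogdet(S_ij)[1]
    passed = np.isfinite(cost)
    cost[~passed] = 1e9  # large but finite, so gated-out pairs are never chosen
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if passed[i, j]]
```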
3.2 Probabilistic Track Management
Where ad-hoc trackers employ schemes such as
counting the number of recent detections to assess
whether a track hypothesis should be discarded, prob-
abilistic SHT estimates the probability that each hy-
pothesis is valid. This is done by assessing the event
$H_i$: $x_i$ was either not detected, or explained by a detection from a real target, in each frame.

For each track $i$ we then maintain a likelihood ratio (LR) $LR_i$, weighing evidence for and against $H_i$, as

$$LR_i \triangleq \frac{p(Z_k \mid H_i)}{p(Z_k \mid \bar{H}_i)} \cdots \frac{p(Z_1 \mid H_i)}{p(Z_1 \mid \bar{H}_i)} \cdot \frac{\Pr\{H_i\}}{\Pr\{\bar{H}_i\}}, \tag{5}$$

where $Z_k$ represents the sensor data at time $k$, and $\bar{H}_i$ is the logical complement of $H_i$. At each timestep $k$,
depending on whether the track was detected or not,
Blackman writes the corresponding LR factor as
$$\frac{p(Z_k \mid H_i)}{p(Z_k \mid \bar{H}_i)} \approx \begin{cases} \dfrac{P_D\, p(\hat{y}_{ij})}{\lambda_C} \cdot \dfrac{p(z_S \mid \mathrm{Det}, H)}{p(z_S \mid \mathrm{Det}, \bar{H})}, & \text{if assoc. to } j \\[1ex] 1 - P_D, & \text{otherwise.} \end{cases} \tag{6}$$
Here, $P_D$ is the prior detection probability, which is typically modelled as a constant. $LR_S = \frac{p(z_S \mid \mathrm{Det}, H)}{p(z_S \mid \mathrm{Det}, \bar{H})}$ is the "signal-related" LR, typically derived from the SNR of the given detection method. We then consider a track hypothesis $x_i$ to be "unconfirmed" until $LR_i$ passes some given threshold, and discard the hypothesis if $LR_i$ falls below some other threshold.
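To make the recursion concrete, a minimal sketch of this LR-based track management is given below, on log form for numerical stability. The $P_D$ value, thresholds and function names are illustrative assumptions.

```python
import numpy as np

P_D = 0.9                        # assumed prior detection probability
LOG_LR_CONFIRM = np.log(100.0)   # illustrative confirmation threshold
LOG_LR_DISCARD = np.log(0.01)    # illustrative discard threshold

def update_log_lr(log_lr, detected, p_innov=None, lam_c=None, lr_signal=1.0):
    """One recursion step of Eq. (5), using the per-frame factors of Eq. (6)."""
    if detected:
        # Associated case: P_D * p(y_ij) / lambda_C, times the signal-related LR.
        log_lr += np.log(P_D * p_innov / lam_c * lr_signal)
    else:
        # Missed-detection case.
        log_lr += np.log(1.0 - P_D)
    return log_lr

def track_status(log_lr):
    if log_lr >= LOG_LR_CONFIRM:
        return "confirmed"
    if log_lr <= LOG_LR_DISCARD:
        return "discard"
    return "unconfirmed"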
4 THE BASE VISUAL TRACKER
In this section, we present BASE, a minimalist prob-
abilistic take on visual tracking. We design BASE
as a probabilistic SHT with the necessary extensions
to sufficiently model the visual tracking problem, as
shown in Figure 2.
To accommodate the non-uniform motion and
clutter encountered in visual tracking, we develop
the distance-aware 2D motion model in Sec. 4.4 and
model detector performance and clutter behaviour in
Sec. 4.5. In Sec. 4.6 we propose an automatic proce-
dure to estimate the parameters of these models. The
traditional SHT pipeline discussed in Sec. 3 cannot
fully exploit these models in all aspects of gating, as-
sociation and track management. We therefore com-
pute the association probability in Sec. 4.1, before us-
ing it to reformulate gating and association in Sec. 4.2
and track management in Sec. 4.3.
4.1 Explicitly Approximating the
Association Probability
To account for non-uniform densities of new tar-
gets and clutter, the traditional association score from
Eq. (4) is insufficient. Consider a track which is pre-
sented with two measurements that have identical sta-
tistical distance, but where one measurement has far
greater risk of being clutter than the other. Intuitively,
we should prefer to associate the track with the mea-
surement least likely to be clutter.
Instead of using the traditional association score
from Eq. (4), we shall compute the full association
probability for each detection/track pair. First, we de-
fine the event
$A_{ij}$: $z_j$ originates from the real target represented by $x_i$.

In our single-hypothesis paradigm, each detection $z_j$ must originate either from a target for which we have a hypothesis, from a newly appeared target, or from clutter. Given the corresponding measurement $z_j$ we use

$$\lambda_{EX}(z_j) = \lambda_{NT}(z_j) + \lambda_C(z_j) \tag{7}$$
to denote the corresponding extraneous measurement density. Modelling $\lambda_{NT}$ and $\lambda_C$ is detailed in Sec. 4.5.
We can now write the association probability as

$$\Pr\big\{A_{ij} \mid \hat{x}_i, z_j\big\} = \frac{p(\hat{x}_i, z_j \mid A_{ij})}{\lambda_{EX}(z_j) + \sum_l p(\hat{x}_l, z_j \mid A_{lj})}. \tag{8}$$

Here, $p(\hat{x}_i, z_j \mid A_{ij})$ is the joint likelihood of the current track state $\hat{x}_i$ and the observed measurement $z_j$, assuming that $z_j$ originates from $x_i$. Inspired by (Blackman and Popoli, 1999), we model our measurements as consisting of a state-related measurement (modelled in Sec. 4.4) and an independent confidence measurement (modelled in Sec. 4.5). Since we will only model the bounding box state of each track, we write

$$p(\hat{x}_i, z_j \mid A_{ij}) \approx p\big(\hat{x}_i^{(bb)}, z_j^{(bb)} \mid A_{ij}\big)\, p\big(z_j^{(c)}\big). \tag{9}$$
4.2 Probabilistic Gating and
Association
Instead of using the traditional Eq. (3) for gating and Eq. (4) for association scores, we base both gating and association on Eq. (8). We compute association scores as

$$-\log \Pr\big\{A_{ij} \mid \hat{x}_i, z_j\big\} \tag{10}$$

and perform gating using

$$\Pr\big\{A_{ij} \mid \hat{x}_i, z_j\big\} \geq \frac{1 - P_G}{P_G}. \tag{11}$$
The introduction of Eq. (8) results in stricter gates than before, both for measurements in crowded regions and for less confident measurements. For association, Eq. (8) also takes the extraneous measurement density into account, which is critical to properly balance measurements that have vastly different $\lambda_{EX}$. Using Eq. (10), the linear-sum assignment will now find the set of associations with the overall lowest probability of containing a misassociation, whereas the traditional variant (Eq. (4)) finds the most likely association only with regard to the predicted states versus the observed measurements.
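A compact sketch of this probabilistic gating and association follows, assuming the joint likelihoods of Eq. (9) and the extraneous densities of Eq. (7) are already computed; the array shapes, numerical floor and stand-in cost are our assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def association_probabilities(joint_lik, lam_ex):
    """Eq. (8). joint_lik: (n_trk, n_det) values of p(x_i, z_j | A_ij);
    lam_ex: (n_det,) extraneous measurement densities."""
    denom = lam_ex[None, :] + joint_lik.sum(axis=0, keepdims=True)
    return joint_lik / denom

def gate_and_assign(joint_lik, lam_ex, p_gate):
    pr = association_probabilities(joint_lik, lam_ex)
    gated = pr >= (1.0 - p_gate) / p_gate                         # Eq. (11)
    cost = np.where(gated, -np.log(np.maximum(pr, 1e-300)), 1e9)  # Eq. (10)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if gated[i, j]]
```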
4.3 Probabilistic Track Management
In BASE, we will build the track management around
the probability that at least one of the measurements
originates from a given track $x_i$ in the current frame, namely

$$\tilde{P}_i = 1 - \prod_j \Big(1 - \Pr\big\{A_{ij} \mid \hat{x}_i, z_j\big\}\Big). \tag{12}$$

To rewrite Eq. (5) using $\tilde{P}_i$, we first define the event

$$D_i^{(k)}: x_i \text{ was detected in frame } k. \tag{13}$$
Re-ordering Eq. (5) with Bayes' rule, we can write the LR contribution of frame $k$ as

$$\frac{p(H_i \mid Z_k)}{p(\bar{H}_i \mid Z_k)} = \frac{p\big(H_i, D_i^{(k)} \mid Z_k\big) + p\big(H_i, \bar{D}_i^{(k)} \mid Z_k\big)}{p\big(\bar{H}_i, D_i^{(k)} \mid Z_k\big) + p\big(\bar{H}_i, \bar{D}_i^{(k)} \mid Z_k\big)} \tag{14}$$

$$= \frac{\tilde{P}_i^{(k)} + (1 - P_D)\big(1 - \tilde{P}_i^{(k)}\big)}{P_D\,\big(1 - \tilde{P}_i^{(k)}\big)}, \tag{15}$$
where we have used that $D_i^{(k)} \Rightarrow H_i$, and that

$$P_D \approx \Pr\big\{D_i^{(k)} \mid H_i\big\}. \tag{16}$$
Where the traditional LR update (Eq. (6)) only
uses the associated measurement, Eq. (15) collects
contributions from all measurements, normalized
across all tracks. This association-less track LR bet-
ter handles the cases where several tracks have signif-
icant stakes in a given measurement, which is often
the case in crowded visual tracking, as illustrated in
Fig. 3. Since this computation is well suited for GPU
acceleration and can be run in parallel with the asso-
ciation, the increased computational burden is more
than made up for in practice.
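The per-frame update is then only a few lines; the sketch below assumes the association probability matrix from Eq. (8) is available, and adds a numerical guard of our own.

```python
import numpy as np

def lr_factor(pr, p_d):
    """Per-frame LR contribution for all tracks at once (Eqs. (12) and (15)).
    pr: (n_trk, n_det) association probabilities from Eq. (8)."""
    # Eq. (12): probability that at least one measurement stems from each track.
    p_tilde = 1.0 - np.prod(1.0 - pr, axis=1)
    p_tilde = np.clip(p_tilde, 0.0, 1.0 - 1e-12)  # guard the division below
    # Eq. (15): evidence for vs. against each track hypothesis in this frame.
    return (p_tilde + (1.0 - p_d) * (1.0 - p_tilde)) / (p_d * (1.0 - p_tilde))

# Accumulated per track in the log domain:
#   log_lr += np.log(lr_factor(pr, P_D))
```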
4.4 A Distance-Aware Planar Motion
Model for Bounding-Box Objects
We shall stick to a traditional linear Kalman filter
(KF) setup to model measurements and target mo-
tion. For each target, we model the bounding box
center, width and height in the current image plane
using pixel coordinates. We assume a nearly constant
velocity (NCV) model for the bounding box center $(c_x, c_y)$, and use a nearly constant state model for the box size. As a state vector we choose

$$x = \big(c_x,\ c_y,\ \dot{c}_x,\ \dot{c}_y,\ w,\ h\big)^\top, \tag{17}$$

where $\dot{c}_x, \dot{c}_y$ denote the center velocity, and $w$ and $h$ are the width and height. We model each NCV block using
$$F_{cv} = \begin{pmatrix} 1 & \delta_t \\ 0 & 1 \end{pmatrix}, \quad Q_{cv} = \begin{pmatrix} \delta_t^3/3 & \delta_t^2/2 \\ \delta_t^2/2 & \delta_t \end{pmatrix}, \tag{18}$$

for a given timestep $\delta_t$, as in (Blackman and Popoli, 1999, Sec. 4.2.2). We use $F_{cv}^2$ and $Q_{cv}^2$ to denote the 2D composition of $F_{cv}$ and $Q_{cv}$.
We correct for camera ego-motion by first aligning each new image to the previous one using (Evangelidis and Psarakis, 2008). Between each pair of consecutive images we obtain a transform which predicts pixels in time as

$$p_k = W_k p_{k-1} + t_k. \tag{19}$$

Using the notation $T_k = \mathrm{diag}(W_k, W_k, W_k)$ and $F_k = \mathrm{diag}\big(F_{cv}^2, I_2\big)$, we can write the state transition as

$$x_k = T_k (F_k x_{k-1} + v_k) + I_{6\times 2}\, t_k, \tag{20}$$

with white $v_k \sim \mathcal{N}(0, Q_k)$, where

$$Q_k = \sigma_k\, \mathrm{diag}\big(Q_{cv}^2, I_2\big)\, \sigma_k. \tag{21}$$
The key to making the model distance-aware is to scale $\sigma_k$ with the previous object width, $w_{k-1}$, similar to what is done in (Aharon et al., 2022):

$$\sigma_k = w_{k-1}\, \mathrm{diag}\big(\sigma_{ca},\ \sigma_{ca},\ \sigma_{ca},\ \sigma_{ca},\ \sigma_{sr},\ \sigma_{sr}\big), \tag{22}$$

where $\sigma_{ca}$ and $\sigma_{sr}$ are the standard deviations of the center acceleration noise and the size rate noise, respectively.
For the sensor model we assume that we make observations corrupted by white Gaussian noise as

$$z_k = (x_k)_{c_x, c_y, w, h} + w_k, \quad w_k \sim \mathcal{N}(0, R_k). \tag{23}$$

The sensor and transition models together describe a linear Gaussian system, which is suitable for estimation using a KF. When initializing new tracks, we will in addition to the above employ

$$\hat{x}_0 = \big(z,\ 0,\ 0\big)^\top \tag{24}$$
$$\hat{P}_0 = \mathrm{diag}(R, P_{cr}) \tag{25}$$

as a prior for the center rate, where $R$ and $P_{cr}$ are the to-be-estimated measurement and initial center rate covariances.
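A sketch of how these matrices might be assembled is given below. The ego-motion terms $T_k$ and $t_k$ of Eq. (20) are omitted for brevity, the block layout follows the state ordering of Eq. (17), and the function name and parameter values are our own assumptions.

```python
import numpy as np

def motion_matrices(dt, w_prev, sigma_ca, sigma_sr):
    """Transition and process-noise matrices for the state of Eq. (17)."""
    # Eq. (18): per-axis nearly-constant-velocity blocks.
    F_cv = np.array([[1.0, dt], [0.0, 1.0]])
    Q_cv = np.array([[dt**3 / 3.0, dt**2 / 2.0],
                     [dt**2 / 2.0, dt]])
    # 2D composition for the center states, random walk for (w, h).
    F = np.block([[np.kron(F_cv, np.eye(2)), np.zeros((4, 2))],
                  [np.zeros((2, 4)),         np.eye(2)]])
    Q_base = np.block([[np.kron(Q_cv, np.eye(2)), np.zeros((4, 2))],
                       [np.zeros((2, 4)),         np.eye(2)]])
    # Eq. (22): width-scaled noise makes the model distance-aware.
    sigma = w_prev * np.array([sigma_ca] * 4 + [sigma_sr] * 2)
    Q = np.diag(sigma) @ Q_base @ np.diag(sigma)
    return F, Q

# Measurement matrix of Eq. (23): observe (c_x, c_y, w, h) directly.
H = np.zeros((4, 6))
H[0, 0] = H[1, 1] = H[2, 4] = H[3, 5] = 1.0
```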
4.5 Modelling Detector Performance
In probabilistic tracking, the extraneous measurement density ($\lambda_{EX}$ from Eq. (7)) and the detection confidence ($p(z_j^{(c)})$ from Eq. (9)) are typically ignored or treated as constants. However, these quantities describe critical aspects of the detector performance that should affect the tracking. Properly modeling these parameters allows us to quickly establish tracks in the simple cases, while still avoiding false tracks in questionable cases.
We begin by modeling the detector confidence $z^{(c)}$ in Eq. (9) through histogram binning of inlier and all detections on the training set (see Sec. 4.6 for details) as

$$p\big(z_j^{(c)}\big) \approx \frac{\mathrm{hist}_{\mathrm{inlier}}(c_j, w_j)}{\mathrm{hist}_{\mathrm{all}}(c_j, w_j)}, \tag{26}$$

where $w_j$ is the measured width, and $c_j$ is the predicted confidence from the detector. Figure 5 shows the resulting likelihood on MOT17, which is clearly non-uniform.
Next, we shall model $\lambda_{EX}(z)$, which is the density of clutter measurements and measurements due to newly appeared targets. Through experimentation, we have found that object size alone is a good discriminator for $\lambda_{EX}$. We therefore use scaled histogram binning over object width across all training-set detections as a pragmatic model:

$$\lambda_{EX}(z_j) \approx c_{EX}\, \mathrm{hist}_w(w_j), \tag{27}$$
where $w_j$ is the measured width and $c_{EX}$ is a constant we estimate in Sec. 4.6. Figure 6 shows a log plot of $\mathrm{hist}_w$ for MOT17, which is clearly skewed towards smaller boxes.

Figure 3: Our proposed association-less track management from Eq. (15) vs. traditional association-based track management on "MOT20-07". Panels: (a) Frame 118, association-less; (b) Frame 118, association-based; (c) Frame 281, association-less; (d) Frame 281, association-based. Select tracks are highlighted in color based on track ID, while boxes of unrelated tracks are drawn in gray. In crowded areas, the detector tends to generate clutter detections, which can lead to false tracks, as in Fig. 3b. Meanwhile, detections of partially occluded objects tend to have low confidence, which can cause the tracker to require more frames before a track is established, as in Fig. 3d. Our association-less track management enables the tracker to be conservative in establishing tracks in crowded areas, and at the same time aggressive for solitary objects. Meanwhile, association-based track management must more carefully balance the risk of false tracks against delayed track establishment.
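Both models reduce to lookups in precomputed histograms. A minimal sketch, with bin edges and helper names that are illustrative assumptions, might be:

```python
import numpy as np

W_EDGES = np.geomspace(4.0, 512.0, 16)  # box-width bins (pixels), illustrative
C_EDGES = np.linspace(0.0, 1.0, 11)     # detector-confidence bins, illustrative

def fit_confidence_model(c_inlier, w_inlier, c_all, w_all):
    """Eq. (26): ratio of inlier to all detections per (confidence, width) bin."""
    h_in, _, _ = np.histogram2d(c_inlier, w_inlier, bins=(C_EDGES, W_EDGES))
    h_all, _, _ = np.histogram2d(c_all, w_all, bins=(C_EDGES, W_EDGES))
    return h_in / np.maximum(h_all, 1.0)

def fit_clutter_model(w_all, c_ex):
    """Eq. (27): width histogram of all detections, scaled by c_EX."""
    h_w, _ = np.histogram(w_all, bins=W_EDGES, density=True)
    return c_ex * h_w

def bin_index(edges, value):
    """Clipped bin lookup shared by both models at tracking time."""
    return int(np.clip(np.searchsorted(edges, value) - 1, 0, len(edges) - 2))
```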
4.6 Automatically Estimating Model
Parameters from Training Data
To start using the proposed motion and sensor model, we need the parameters $\sigma_{ca}$, $\sigma_{sr}$, $R$ and $P_{cr}$. Fortunately, these can be estimated from a dataset consisting of detections and ground-truth tracks, such as those provided in the MOTChallenges.
The ground truth bounding boxes are given as $g_i^{(k)} = (c_x, c_y, w, h)$ for each true target $x_i$ in each frame $k$ where $x_i$ is present. We start with the prior center rate covariance $P_{cr}$, which can be estimated from the ground truth tracks alone. To avoid potential errors in the camera ego-motion correction tainting the ground truth data, we only use sequences where the camera is stationary. We then estimate $P_{cr}$ as
$$P_{cr} = \frac{1}{n_g - 1} \sum_i \frac{\big(g_i^{(k_2)} - g_i^{(k_1)}\big)\big(g_i^{(k_2)} - g_i^{(k_1)}\big)^\top}{\big(t_i^{(k_2)} - t_i^{(k_1)}\big)^2}, \tag{28}$$

where $k_1$ and $k_2$ index the frames where target $i$ appears for the first and second time, $t_i^{(k_2)} - t_i^{(k_1)}$ is the time elapsed between said frames, and $n_g$ is the total number of targets.
The detection bounding boxes are given as a set $z_j = (\hat{c}_x, \hat{c}_y, \hat{w}, \hat{h}) \in Z$ for each frame. To leverage the detections in parameter estimation, we first attempt to associate detections $z_j$ to ground truth targets $g_i$ using IOU. We only consider pairs where $\mathrm{iou}(z_j, g_i) > 0.7$, and match $z_j$ to $g_i$ if $z_j$ is the closest to $g_i$ and vice versa. We denote the resulting set of associations $A = \big\{\big(z_{j_l}, g_{i_l}\big)\big\}_l$.
Using the associated detections and ground truth targets, we once again use stationary sequences, and estimate $R$ as

$$R = \frac{1}{n_a - 1} \sum_k \sum_{(z_j^{(k)}, g_i^{(k)}) \in A} \big(z_j^{(k)} - g_i^{(k)}\big)\big(z_j^{(k)} - g_i^{(k)}\big)^\top, \tag{29}$$

where $n_a$ is the total number of associations.
The histograms over all detections as a function of box width ($\mathrm{hist}_w(w)$) and as a function of both predicted confidence and box width ($\mathrm{hist}_{\mathrm{all}}(c, w)$) can be computed directly from the training set. We compute the inlier histogram ($\mathrm{hist}_{\mathrm{inlier}}(c, w)$) using only the associated detections in $A$.
To estimate $\sigma_{ca}$ and $\sigma_{sr}$ we employ maximum likelihood estimation (MLE) based on $A$ and the proposed motion and sensor models from Sec. 4.4, as outlined in (Brekke, 2019). Finally, we find $c_{EX}$ by a parameter search where we run the tracker on the full training set.
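The covariance estimates of Eqs. (28) and (29) are simple sample statistics over the matched pairs. A sketch, assuming the mutual-best IOU matching has already been done upstream, could be:

```python
import numpy as np

def estimate_R(dets, gts):
    """Eq. (29): measurement covariance from matched (c_x, c_y, w, h) pairs.
    dets, gts: (n, 4) arrays of IOU-matched detections and ground truths."""
    err = dets - gts
    return err.T @ err / (len(err) - 1)

def estimate_P_cr(first_c, second_c, dt):
    """Eq. (28): prior center-rate covariance from each target's first two
    observations. first_c, second_c: (n, 2) centers; dt: (n,) elapsed times."""
    rate = (second_c - first_c) / dt[:, None]
    return rate.T @ rate / (len(rate) - 1)
```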
5 EXPERIMENTS
To assess the effectiveness of our proposed minimalist probabilistic visual tracker, we evaluate BASE
on the MOT17 (Milan et al., 2016), MOT20 (Dendor-
fer et al., 2020) and DanceTrack (Sun et al., 2022)
benchmarks. Since we are primarily interested in
validating the probabilistic backbone, we opt not to
use Re-Id or other appearance features. We will fo-
cus on the MOTChallenge benchmarks, which are
ideal for demonstrating a minimal probabilistic visual
tracker as they contain single-camera footage with
simple ego-motion and no complex movement pat-
terns. The DanceTrack dataset contains much more
sudden movement of arms and legs, and would bene-
fit from a more specialized motion model.
5.1 MOTChallenge Caveats
Although the MOTChallenge benchmarks enable ob-
jective comparison of tracking algorithms, there are a
few noteworthy differences in practices that color the
results.
Not All Results Use Global Parameters. As dis-
cussed in (Cetintas et al., 2023), several submissions
boost performance by tuning separate parameter sets
for each sequence in the test set. Unless explicitly
stated or evident from published source code, we
cannot ascertain which practice is used for a given
method.
Post-Tracking Interpolation across frames where
objects are not observed is ubiquitous among all top-scoring methods on both MOT17 and MOT20. Approaches that interpolate over a fixed number of frames still seem to consider themselves "online".
The public Detection Leaderboard seems useful to
compare trackers on equal terms. However, all the top
submissions in this category still use image data to ex-
tract additional detections or Re-ID, greatly obscuring
the results.
5.2 Detector
We use the YOLOX (Ge et al., 2021) detector with
a detection threshold of 0.1 for all benchmarks. For
DanceTrack we use the officially trained weights. For
MOT17 and MOT20 we trained the model in a similar fashion as ByteTrack (Zhang et al., 2021), with a combination of MOT17, CityPersons (Zhang et al., 2017), CrowdHuman (Shao et al., 2018) and ETHZ (Ess et al., 2008). However, as illustrated in Fig. 4, the original ByteTrack detector has been severely overfitted to propose detections even for occluded objects. We therefore excluded fully occluded and crowd targets in the training process to avoid this type of overfitting.

Figure 4: An image from "MOT17-05-FRCNN" processed using the pretrained detector from ByteTrack (Zhang et al., 2021). This detector is overfitted to "see through" severe occlusions, resulting in a large number of clutter detections.

Figure 5: The $p(z_j^{(c)})$ histogram from Eq. (26) for the MOT17 training set. Each cell is the number of inlier detections (as defined in Sec. 4.6) divided by the number of detections within the corresponding box-width/score bin.
5.3 Parameter Estimation
For each of the three benchmarks we select a global
set of parameters that are used across all sequences.
Following the scheme outlined in Secs. 4.5 and 4.6 we compute $P_{cr}$, $R$, $p(z^{(c)})$ and the histograms. $\sigma_{ca}$ and $\sigma_{sr}$ are estimated using MLE on inlier detections, and $c_{EX}$ is found by a parameter search using full tracking on the respective training sets.

For all benchmarks we use $P_D = 0.95$. MOT17 and MOT20 use a canonical $P_G = 10^{-3}$. DanceTrack has very few distracting elements, and we have seen improved performance using $P_G = 10^{-6}$.
Table 1: SOTA and select methods on key benchmarks, sorted by MOT17 MOTA. The column with grey background shows top results using strictly public detections on MOT17. Results which tune parameters for each sequence in the test set are shown in grey font.

                                            |      |    MOT17    |    MOT20    |  DanceTrack | MOT17 Public
Method                                      | ReID | MOTA | HOTA | MOTA | HOTA | MOTA | HOTA | MOTA | HOTA
PHD GM (Sanchez-Matilla et al., 2020)       |  -   |  -   |  -   |  -   |  -   |  -   |  -   | 48.8 |  -
GMPHDO. (Song et al., 2019)                 |  -   |  -   |  -   |  -   |  -   |  -   |  -   | 49.9 |  -
OCSORT (Cao et al., 2022)                   |  -   | 78.0 | 63.2 | 75.7 | 62.4 | 92.0 | 55.7 |  -   |  -
MOTRv2 (Zhang et al., 2023)                 |      | 78.6 | 62.0 | 76.2 | 60.3 | 92.1 | 73.4 |  -   |  -
StrongSort (Du et al., 2022)                |      | 79.6 | 64.4 | 73.8 | 62.6 |  -   |  -   |  -   |  -
FineTrack (Ren et al., 2023)                |      | 80.0 | 64.3 | 77.9 | 63.6 | 89.9 | 52.7 |  -   |  -
ByteTrack (Zhang et al., 2021)              |  -   | 80.3 | 63.1 | 77.8 | 61.3 | 90.9 | 51.9 |  -   |  -
BoT-SORT (Aharon et al., 2022)              |      | 80.5 | 65.0 | 77.8 | 63.3 |  -   |  -   |  -   |  -
SUSHI (Cetintas et al., 2023)               |      | 81.1 | 66.5 | 74.3 | 64.3 | 88.7 | 63.3 |  -   |  -
MotionTrack (Qin et al., 2023)              |      | 81.1 | 65.1 | 78.0 | 62.8 |  -   |  -   |  -   |  -
CBIoU (Yang et al., 2023)                   |      | 81.1 | 64.1 |  -   |  -   | 91.6 | 60.6 |  -   |  -
ImprAsso (Stadler and Beyerer, 2023)        |      | 82.2 | 66.4 | 78.6 | 64.6 |  -   |  -   |  -   |  -
BASE (ours)                                 |  -   | 81.9 | 64.5 | 78.2 | 63.5 | 91.7 | 56.4 | 51.8 | 43.6
Figure 6: The $\mathrm{hist}_w(w_j)$ histogram from Eq. (27) for the MOT17 training set. Each cell is simply the density of detections within the corresponding box-width bin.
5.4 Post-Processing
The MOTChallenge ground truth used for scoring
contains both visible and occluded targets, which
makes it important to maintain tracks through occlu-
sions. True real-time trackers will always have a dis-
advantage in this regard, as they cannot correct al-
ready reported trajectories when old targets reappear.
To counter this, it has become a common practice among the top-performing trackers to apply interpo-
lation across such occlusion gaps in a post-processing
step. With this post-processing the tracker can still
run live, but will have some fixed delay.
We employ the interpolation post-processing as
described in (Zhang et al., 2021), in addition to what
we call look-ahead. With look-ahead we delay re-
porting of tracks by a fixed number of frames, but
use the estimated track hypothesis likelihood of the
newest processed frame to determine whether the
track should be reported. The primary benefit of this
is that we can report tracks with certainty upon first
detection, even though the track hypothesis likelihood
requires a few frames to accumulate. Without look-
ahead, the metrics used in the MOTChallenges force
the tracker to establish tracks on the very first detec-
tions, severely limiting the ability of the probabilistic
track management to filter clutter.
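For reference, the gap interpolation itself is straightforward. A minimal sketch, with an illustrative maximum gap length, might be:

```python
import numpy as np

MAX_GAP = 20  # illustrative maximum occlusion gap to interpolate, in frames

def interpolate_gaps(frames, boxes):
    """Linearly fill missed frames between consecutive reports of one track.
    frames: sorted (n,) int array; boxes: (n, 4) array of (c_x, c_y, w, h)."""
    out_f, out_b = [frames[0]], [boxes[0]]
    for k in range(1, len(frames)):
        gap = frames[k] - frames[k - 1]
        if 1 < gap <= MAX_GAP:
            for g in range(1, gap):
                a = g / gap
                out_f.append(frames[k - 1] + g)
                out_b.append((1.0 - a) * boxes[k - 1] + a * boxes[k])
        out_f.append(frames[k])
        out_b.append(boxes[k])
    return np.array(out_f), np.array(out_b)
```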
5.5 MOT17 Results
Our overall results for MOT17 are presented in Fig. 1
and Tab. 1. On the test set, BASE (excl. the detector)
ran at 331Hz on an AMD 5950x. Among all sub-
missions with publications and which use global pa-
rameters, BASE ranks first with an 81.9 MOTA score
and third with a 64.5 HOTA score. In MOTA, BASE
is only surpassed by ImprAsso (Stadler and Beyerer,
2023), which transparently reports using per-sequence
tuning on the test set. We strongly suspect that BASE
would receive a significant performance boost using a
similar tuning scheme, but we insist on using global
parameters. Compared to leading trackers, like BoT-
SORT (Aharon et al., 2022), BASE performs exceed-
ingly well on sequences with small objects and per-
sistent clutter detections, like sequence 01 (72.4 vs
63.4 MOTA) and 14 (68.1 vs 53.5 MOTA). We be-
lieve this owes to BASE’s probabilistic track manage-
ment, which better captures the nuance between faint
detections of small objects and more inconsistent clut-
ter detections.
Compared to BoT-SORT, BASE performs worse
on sequences with long occlusions paired with large
camera motion, like sequence 06 (64.8 vs 66.6
MOTA). We suspect this is due to our lack of Re-ID,
as other motion-only approaches, such as ByteTrack
and OC-SORT, also perform poorly on this sequence
(60.2 and 57.3 MOTA).
Table 2: Ablation results on the MOT17 validation set. The experiments quantify the effects of our proposed dynamic clutter density model, the distance-aware motion model, the probabilistic association and the use of detector confidence.

  | Dynamic clutter | Distance-aware | Association type | Detection confidence | MOTA | HOTA
1 |        -        |       ✓        |  Probabilistic   |          -           | 69.3 | 65.7
2 |        -        |       -        |  Probabilistic   |          -           | 71.7 | 65.7
  |        -        |       -        |  IOU             |          -           | 73.7 | 65.4
3 |        ✓        |       -        |  Probabilistic   |          -           | 75.6 | 66.4
4 |        ✓        |       -        |  IOU             |          -           | 76.0 | 66.1
5 |        ✓        |       ✓        |  IOU             |          -           | 76.5 | 66.2
6 |        ✓        |       ✓        |  Probabilistic   |          -           | 77.1 | 68.2
BoT-SORT (Aharon et al., 2022)                                                 | 78.5 | 69.2
7 |        ✓        |       ✓        |  Probabilistic   |         Raw          | 80.8 | 69.8
8 |        ✓        |       ✓        |  Probabilistic   |      Calibrated      | 81.6 | 70.2
The gray section of Tab. 1 reports results strictly
using only the public detections of MOT17. Here,
BASE outperforms vastly more complicated and
computationally costly probabilistic trackers, even
without visual odometry (VO).
5.6 MOT20 Results
The results for MOT20 are also shown in Fig. 1
and Tab. 1. Among submissions with publications
and which use global parameters, BASE ranks first
in MOTA, and third in HOTA. This is somewhat sur-
prising, given that several of the other top methods
employ Re-Id, which seems particularly promising on
MOT20. The high HOTA-score might indicate that an
empirically tuned motion model helps prevent mixing
up targets during occlusions. On the test set, BASE
(excl. the detector) ran at 39Hz on an AMD 5950x.
Several of the top-performing trackers employ
per-sequence parameters, which seems to be partic-
ularly beneficial on MOT20. Sequences 04 and 07,
the scenes of which are also featured in the training
set, seem to warrant conservative tracker parameters.
Meanwhile, sequences 06 and 08 seem to benefit from
a more sensitive tracker.
5.7 DanceTrack
DanceTrack (Sun et al., 2022) is an interesting dataset
as it poses quite different challenges than MOT17 and
MOT20, with highly irregular motion but often rela-
tively easily detectable targets. Since BASE’s motion
model assumes continuous motion and slow changes,
the sudden stretching of arms and changes in posture typical of this dataset seem particularly ill-suited for our model. Table 1 shows our results, as well as the SOTA methods also using the public detector. Our method outperforms ByteTrack (Zhang et al., 2021) and FineTrack (Ren et al., 2023) but falls behind C-BIoU, which leverages a more specialized motion model. OC-SORT (Cao et al., 2022), which also uses a motion model adapted to DanceTrack, performs similarly to BASE. MOTRv2 (Zhang et al., 2023) outperforms all these approaches by a large margin, illustrating that an end-to-end approach may be a particularly good fit for DanceTrack. However, as it uses additional training data, the MOTRv2 results are not directly comparable to the other methods.
5.8 Ablation Study
In this section, we study the effects of the key compo-
nents of BASE, namely the probabilistic association
(vs IOU-based association), the distance-aware mo-
tion model (vs naive motion model), dynamic clut-
ter estimation (vs constant $\lambda_{EX}$) and the histogram-
calibrated detector confidence (vs ignoring or using
raw confidence). We use the same YOLOX abla-
tion model from (Zhang et al., 2021), so the abla-
tion results are directly comparable to those of BoT-
SORT (Aharon et al., 2022) and ByteTrack (Zhang
et al., 2021). The model was trained on Crowdhuman
(Shao et al., 2018) and the train half of the MOT17
training set. We fit all BASE-specific parameters us-
ing only train from MOT17, while the experiments
were run on the val half of the training set. The re-
sults are shown in Tab. 2.
Comparing rows 1 and 2, we see that when using
a constant $\lambda_{EX}$ instead of dynamic clutter (Eq. (27)),
the distance-aware motion model actually makes the
method perform worse. Meanwhile, comparing rows
3 and 6, we see that the distance-aware motion model
gives a significant boost once the dynamic clutter
model is in place. Since the distance-aware motion
model increases the position uncertainty for tracks
with large bounding boxes, such tracks struggle to
build confidence and match with detections when the
clutter density is kept constant.
We also see that probabilistic association with a
naive motion model (row 3) is outperformed by IOU-
based association (row 4). The IOU-based association
intrinsically offers some compensation for distance,
since larger boxes are allowed to miss by more pixels
while still achieving the same IOU as smaller boxes.
The poor result with the naive motion model (row
3) indicates that having some distance-aware mech-
anism is indeed necessary in visual tracking. Mean-
while, we also see that our proposed distance-aware
motion model with probabilistic association (row 6)
performs even better, indicating that BASE is able to
further exploit the distance information.
We observe a significant improvement in perfor-
mance when calibrated confidence is used (row 8)
compared to raw confidence (row 7). Ignoring the
confidence score altogether (row 6) results in worse
performance than BoT-SORT. A possible explanation
for this is that the BoT-SORT dual threshold approach
can extract some, but not all, of the confidence score
information. Since all proposed components are nec-
essary to reach SOTA in the probabilistic paradigm,
we consider BASE as a minimalist approach.
6 CONCLUSION
This paper demonstrates that a probabilistic tracker
can achieve SOTA on popular VMOT benchmarks.
Our proposed method, BASE, is the top-performing
method on the MOT17 and MOT20 benchmarks
and has competitive results on the more specialized
DanceTrack benchmark. Through our ablation study,
we show that a distance-aware motion model is nec-
essary for probabilistic association to perform well in
visual tracking, and that a dynamic clutter model is
needed to make such motion models work. Previous
attempts at probabilistic visual trackers omit (at least)
the distance compensation, which we believe is why
they fall behind more ad-hoc visual trackers that em-
ploy IOU-based association.
With BASE, we are merely scratching the surface
of what is possible with probabilistic approaches to
visual tracking. Starting from a minimalist proba-
bilistic SHT foundation, we show that properly mod-
eling motion, clutter, and detector confidence is all
it takes for a probabilistic tracker to surpass the cur-
rent SOTA. The probabilistic framework opens the
gate for advanced core tracking algorithms and opti-
mally exploiting multiple object features, such as vi-
sual appearance/Re-ID. Therefore, BASE is probably
a better starting point for new visual trackers.
REFERENCES
Aguilar, C., Ortner, M., and Zerubia, J. (2022). Small Ob-
ject Detection and Tracking in Satellite Videos With
Motion Informed-CNN and GM-PHD Filter. Fron-
tiers in Signal Processing, 2(April):1–15.
Aharon, N., Orfaig, R., and Bobrovsky, B.-Z. (2022). BoT-
SORT: Robust Associations Multi-Pedestrian Track-
ing. arXiv preprint arXiv:2206.14651.
Baisa, N. L. (2019). Online Multi-object Visual Tracking
using a GM-PHD Filter with Deep Appearance Learn-
ing. FUSION 2019 - 22nd International Conference
on Information Fusion.
Baisa, N. L. (2021a). Occlusion-robust online multi-object
visual tracking using a GM-PHD filter with CNN-
based re-identification. Journal of Visual Communica-
tion and Image Representation, 80(January):103279.
Baisa, N. L. (2021b). Robust online multi-target visual
tracking using a HISP filter with discriminative deep
appearance learning. Journal of Visual Communica-
tion and Image Representation, 77.
Bar-Shalom, Y., Blackman, S. S., and Fitzgerald, R. J.
(2007). Dimensionless score function for multiple hy-
pothesis tracking. IEEE Transactions on Aerospace
and Electronic Systems, 43(1):392–400.
Bar-Shalom, Y. and Tse, E. (1975). Tracking in a cluttered
environment with probabilistic data association. Au-
tomatica, 11(5).
Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B.
(2016). Simple Online and Realtime Tracking. IEEE
international conference on image processing (ICIP).
Blackman, S. S. and Popoli, R. (1999). Design and analy-
sis of modern tracking systems, volume 1999. Artech
House Publishers.
Bochinski, E., Eiselein, V., and Sikora, T. (2017). High-
Speed tracking-by-detection without using image in-
formation. In 2017 14th IEEE International Confer-
ence on Advanced Video and Signal Based Surveil-
lance, AVSS 2017.
Brekke, E. F. (2019). Fundamentals of Sensor Fusion: Tar-
get tracking, Navigation and SLAM. NTNU.
Cao, J., Weng, X., Khirodkar, R., Pang, J., and Ki-
tani, K. (2022). Observation-Centric SORT: Rethink-
ing SORT for Robust Multi-Object Tracking. arXiv
preprint arXiv:2203.14360.
Cetintas, O., Brasó, G., and Leal-Taixé, L. (2023). Unifying Short and Long-Term Tracking With Graph Hierarchies.
Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., and Leal-Taixé, L. (2020). MOT20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003.
Du, Y., Song, Y., Yang, B., and Zhao, Y. (2022). Strong-
SORT: Make DeepSORT Great Again. arXiv.
Ess, A., Leibe, B., Schindler, K., and Van Gool, L. (2008).
A mobile vision system for robust multi-person track-
ing. 26th IEEE Conference on Computer Vision and
Pattern Recognition, CVPR.
Evangelidis, G. and Psarakis, E. (2008). Parametric Image
Alignment Using Enhanced Correlation Coefficient
Maximization. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 30(10):1858–1865.
Fu, Z., Angelini, F., Chambers, J., and Naqvi, S. M. (2019).
Multi-Level Cooperative Fusion of GM-PHD Filters
for Online Multiple Human Tracking. IEEE Transac-
tions on Multimedia, 21(9):2277–2291.
Fu, Z., Feng, P., Angelini, F., Chambers, J., and Naqvi,
S. M. (2018). Particle PHD Filter Based Multiple Hu-
man Tracking Using Online Group-Structured Dictio-
nary Learning. IEEE Access, 6.
Ge, Z., Liu, S., Wang, F., Li, Z., and Sun, J. (2021).
YOLOX: Exceeding YOLO Series in 2021. arXiv.
Jinlong Yang, Peng Ni, Jiani Miao, and Hongwei Ge (2022).
Improving visual multi-object tracking algorithm via
integrating GM-PHD and correlation filters.
Kim, C., Li, F., Ciptadi, A., and Rehg, J. M. (2015). Multi-
ple Hypothesis Tracking Revisited. In 2015 IEEE In-
ternational Conference on Computer Vision (ICCV),
pages 4696–4704. IEEE.
Leal-Taixé, L., Milan, A., Reid, I., Roth, S., and Schindler, K. (2015). MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking. arXiv.
Mahler, R. P. S. and Martin, L. (2003). Multitar-
get Bayes Filtering via First-Order Multitarget Mo-
ments. IEEE TRANSACTIONS ON AEROSPACE
AND ELECTRONIC SYSTEMS, 39(4).
Milan, A., Leal-Taixe, L., Reid, I., Roth, S., and Schindler,
K. (2016). MOT16: A Benchmark for Multi-Object
Tracking. arXiv.
Nasseri, M. H., Babaee, M., Moradi, H., and Hosseini, R.
(2022). Fast Online and Relational Tracking. arXiv.
Qin, Z., Zhou, S., Wang, L., Duan, J., Hua, G., and Tang,
W. (2023). MotionTrack: Learning Robust Short-term
and Long-term Motions for Multi-Object Tracking. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 17939–
17948.
Reid, D. B. (1979). An Algorithm for Tracking Multiple
Targets. IEEE Transactions on Automatic Control,
24(6):843–854.
Ren, H., Han, S., Ding, H., Zhang, Z., Wang, H., and
Wang, F. (2023). Focus On Details: Online Multi-
object Tracking with Diverse Fine-grained Represen-
tation. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
11289–11298.
Sanchez-Matilla, R. and Cavallaro, A. (2020). Mo-
tion Prediction for First-Person Vision Multi-object
Tracking. In Lecture Notes in Computer Science (in-
cluding subseries Lecture Notes in Artificial Intelli-
gence and Lecture Notes in Bioinformatics), volume
12538 LNCS, pages 485–499. Springer Science and
Business Media Deutschland GmbH.
Sanchez-Matilla, R., Poiesi, F., and Cavallaro, A. (2016).
Online multi-target tracking with strong and weak de-
tections. Lecture Notes in Computer Science (in-
cluding subseries Lecture Notes in Artificial Intel-
ligence and Lecture Notes in Bioinformatics), 9914
LNCS:84–99.
Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., and
Sun, J. (2018). CrowdHuman: A Benchmark for De-
tecting Human in a Crowd. arXiv, pages 1–9.
Song, Y.-M., Yoon, K., Yoon, Y.-C., Yow, K. C., and Jeon,
M. (2019). Online Multi-Object Tracking With GM-
PHD Filter and Occlusion Group Management. IEEE
Access, 7:165103–165121.
Stadler, D. and Beyerer, J. (2022). Modelling Ambiguous
Assignments for Multi-Person Tracking in Crowds.
Proceedings - 2022 IEEE/CVF Winter Conference on
Applications of Computer Vision Workshops, WACVW
2022, pages 133–142.
Stadler, D. and Beyerer, J. (2023). An Improved Asso-
ciation Pipeline for Multi-Person Tracking. In 2023
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition Workshops (CVPRW), pages 3170–
3179. IEEE.
Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., and
Luo, P. (2022). DanceTrack: Multi-Object Tracking
in Uniform Appearance and Diverse Motion.
Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z.,
Wang, C., and Luo, P. (2020). TransTrack: Multiple
Object Tracking with Transformer. arXiv.
Wang, Y., Kitani, K., and Weng, X. (2020). Joint Object De-
tection and Multi-Object Tracking with Graph Neural
Networks. arXiv.
Wojke, N., Bewley, A., and Paulus, D. (2018). Simple on-
line and realtime tracking with a deep association met-
ric. Proceedings - International Conference on Image
Processing, ICIP, 2017-Septe:3645–3649.
Wojke, N. and Paulus, D. (2017). Confidence-Aware prob-
ability hypothesis density filter for visual multi-object
tracking. VISIGRAPP 2017 - Proceedings of the 12th
International Joint Conference on Computer Vision,
Imaging and Computer Graphics Theory and Appli-
cations, 6(Visigrapp):132–139.
Yan, B., Jiang, Y., Sun, P., Wang, D., Yuan, Z., Luo, P., and
Lu, H. (2022). Towards Grand Unification of Object
Tracking. arXiv.
Yang, F., Chang, X., Sakti, S., Wu, Y., and Nakamura, S.
(2021). ReMOT: A model-agnostic refinement for
multiple object tracking. Image and Vision Comput-
ing, 106:104091.
Yang, F., Odashima, S., Masui, S., and Jiang, S. (2023). Hard To Track Objects With Irregular Motions and Similar Appearances? Make It Easier by Buffering the Matching Space. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4799–4808.
Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., and
Wei, Y. (2021). MOTR: End-to-End Multiple-Object
Tracking with Transformer. arXiv.
Zhang, S., Benenson, R., and Schiele, B. (2017). CityPer-
sons: A diverse dataset for pedestrian detection. Pro-
ceedings - 30th IEEE Conference on Computer Vi-
sion and Pattern Recognition, CVPR 2017, 2017-
Janua:4457–4465.
Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., Liu,
W., and Wang, X. (2021). ByteTrack: Multi-Object
Tracking by Associating Every Detection Box. ECCV
2022, Proceedings.
Zhang, Y., Wang, T., and Zhang, X. (2023). MOTRv2:
Bootstrapping End-to-End Multi-Object Tracking by
Pretrained Object Detectors. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 22056–22065.
Zhu, T., Hiller, M., Ehsanpour, M., Ma, R., Drummond, T.,
Reid, I., and Rezatofighi, H. (2021). Looking Beyond
Two Frames: End-to-End Multi-Object Tracking Us-
ing Spatial and Temporal Transformers. arXiv, pages
1–20.