MULTI-CAMERA PEOPLE TRACKING WITH HIERARCHICAL

LIKELIHOOD GRIDS

Lili Chen, Giorgio Panin and Alois Knoll

Fakult

at f

ur Informatik, Technische Universit

at M

unchen, Boltzmannstrasse 3, 85748 Garching bei M

unchen, Germany

Keywords:

Edge-based background subtraction, Hierarchical likelihood grids, Oriented distance transform, Data associa-

tion, Multi-view and Multi-target tracking.

Abstract:

In this paper, we present a grid-based tracking by detection methodology, applied to 3D people tracking for

multi-camera video surveillance. In particular, frame-by-frame detection is performed by means of hierar-

chical likelihood grids, using edge matching through the oriented distance transform on each camera view

and a simple person model, followed by likelihood grids clustering in state-space. Subsequently, the tracking

module performs a global nearest neighbor data association, in order to initiate, maintain and terminate tracks

automatically. The proposed system can easily include additional features, such as color or background sub-

traction, it can be scaled to more camera views, and it can be used to track other items as well. We demonstrate

it through experiments in indoor sequences, using a calibrated multi-camera setup.

1 INTRODUCTION

Nowadays automatic visual surveillance is becoming

increasingly popular, because of its wide applications

in indoor and outdoor environments with security re-

quirements. Usually there are two major problems

in an automatic surveillance system: one is to detect

moving targets, and the other is to keep them tracked

throughout the sequence. As the most representa-

tive application, detecting and tracking people is ob-

viously the most challenging and attractive topic, due

to people’s huge variations in physical appearance,

pose, movement and interaction. Therefore, people

detection and tracking receives a signiﬁcant amount

of attention in the area of research and development.

Although some systems have been successfully

developed towards this challenging task, it still re-

mains difﬁcult to detect and track multiple people pre-

cisely and automatically, only using generic models

in a cluttered scene. This paper addresses the prob-

lem of employing a grid-based tracking-by-detection

methodology, with a very simple shape model. The

primary goal of our paper is to develop a fully auto-

matic system for tracking multiple people in an over-

lapping, multi-camera environment, providing a 3D

output robust to mutual occlusion between interacting

people.

As a commonly used technique for segmenting

out objects of interest, background subtraction has

achieved a signiﬁcant success in ﬁxed camera scenar-

ios. Most of the methods work by comparing color

or intensities of pixels in the incoming video frame

to the reference image (Stauffer and Grimson, 2000;

Wren et al., 1997; Eng et al., 2004). However, it

has the drawback of being susceptible to illumination

changes, and provides a less precise localization. In

contrast, we propose here an edge-based background

subtraction, which employs the Canny edge map to-

gether with Sobel gradients, because edges are more

precisely and stably localized, to a better extent in

presence of illumination changes, so that the model

has not to be adapted so often.

A second contribution of our system is frame-by-

frame detection by means of hierarchical likelihood

grids. This scheme, adapted from (Stenger et al.,

2006), takes the advantage of multi-resolution grids

that can, precisely and efﬁciently locate targets in

cluttered scenes, without prior knowledge of their po-

sition. In particular, we compute the likelihood by

edge matching through the oriented distance trans-

form, which matches not only the location of edge

points but also their orientation. And the likelihood

is ﬁrst computed on a coarse grid, then reﬁned on

the next level only the locations where likelihoods

are higher than a given threshold. Subsequently, we

perform state-space clustering on the high-resolution

grid, in order to ﬁnd the relevant peaks, possibly as-

sociated to people.

474

Chen L., Panin G. and Knoll A..

MULTI-CAMERA PEOPLE TRACKING WITH HIERARCHICAL LIKELIHOOD GRIDS.

DOI: 10.5220/0003316904740483

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2011), pages 474-483

ISBN: 978-989-8425-47-8

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

The third main issue consists in associating de-

tected peaks to tracks, which is a classical data as-

sociation problem, where a track can be updated by at

most one measurement, and a measurement can be

assigned to at most one track. Several approaches

have been developed for this purpose, the most rep-

resentative ones being (Fortmann et al., 1983; Reid,

1979); however, in place of complex methods, which

require more complex models and parameter tuning,

and further increase the computational complexity,

our tracking module employs a Global Nearest Neigh-

bor (GNN) approach in order to initiate, maintain and

terminate tracks automatically.

The remainder of the paper is organized as fol-

lows: Section 2 reviews the state of the art and related

work to our paper. Section 3 describes the general

system overview with hardware setup and algorith-

mic ﬂow of software. In Section 4, we provide the

detailed detection procedure, including models, edge-

based background subtraction, hierarchical grid eval-

uation as well as model-based contour matching and

state-space clustering. Tracking by data association is

presented in Section 5. The experimental results are

discussed in Section 6. Finally, Section 7 summarizes

the paper and proposes future development roads.

2 RELATED WORK

A vast amount of literature has been published on

people detection and tracking. We can mainly clas-

sify it into four categories: region-based approaches,

which are based on the variation of image regions

in motion (Khan et al., 2001); feature-based (Wren

et al., 1997; Fieguth and Terzopoulos, 1997; Li et al.,

2003), that usually utilize information about color,

texture, etc.; contour-based (Isard and Blake, 1996;

Nguyen et al., 2002; Roh et al., 2007), that make use

of the bounding contours to represent the target out-

line; and model-based methods (Gavrila and Davis,

1996; Andriluka et al., 2010) that explicitly require

a 2D or 3D model of a person for tracking. How-

ever, a too detailed review of all the approaches is

beyond the scope of our paper, therefore, in the fol-

lowing we will focus on people tracking-by-detection

methodologies, more related to our work. There has

been a number of literature on this approach (Okuma

et al., 2004; Leibe et al., 2007; Wu and Nevatia,

2007), where detection of people in individual frame,

as well as data association between detections in dif-

ferent frames, are the most challenging and ambigu-

ous issues(Andriluka et al., 2008).

Template-based methods have yielded nice results

for locating targets with no prior knowledge in a clut-

tered scene. In (Gavrila, 2000), the efﬁciency of this

method is illustrated, by using about 4,500 templates

to match pedestrians in images. The core is the idea

of using a Chamfer distance measure, so that match-

ing a template with the DT image results in a similar-

ity measure, that is a smooth function of the template

transformation parameters. Meanwhile this approach

enables the use of an efﬁcient search algorithm that

locks onto the correct solution. However, if only com-

puting the location of edge pixels without considering

their orientation when computing distance transform,

it inevitably leads to a high rate of false alarms in pres-

ence of clutter.

Another highlight of this system is the utiliza-

tion of a template hierarchy, which is generated au-

tomatically from available examples, and formed by

a bottom-up approach, using a partitioned clustering

algorithm. It only searches locations where the dis-

tance measure is under a given threshold, so that a

speed-up of three orders of magnitude, compared to

exhaustive searching, is demonstrated.

This idea was taken further by (Stenger et al.,

2006), that however does not build the template hi-

erarchy (or tree) by bottom-up clustering, rather by

partitioning a state-space represented with an inte-

gral grid. The grid is hierarchically partitioned as the

search descends into each region, so that regions at

the leaf-level deﬁne the ﬁnest partition. This method

is demonstrated to be capable of covering 3D mo-

tion, even with self-occlusion. Unfortunately, both

approaches need a very speciﬁc model, only valid for

a speciﬁc target.

Once the measurements have been obtained from

the frame-by-frame detection, data association can

be applied to solve the problem of measurement

to track assignment. A simple nearest-neighbor

approach(Bar-Shalom and Fortmann, 1988) uses only

the closest observation to any predicted state in order

to perform the measurement update, and it is com-

monly used for MTT systems because of its fast com-

putation. More complex approaches, such as Joint

Probabilistic Data Association Filter (JPDA) (Fort-

mann et al., 1983) and Multiple Hypothesis Tracking

(MHT) (Reid, 1979) solve this problem by maintain-

ing multiple hypotheses, until enough measurements

can be collected to resolve the ambiguity. In partic-

ular, (Fortmann et al., 1983) combines all of the po-

tential measurements into one weighted average, be-

fore associating it to the track, in a single update. By

contrast, (Reid, 1979) calculates every possible up-

date hypothesis, with a track, formed by previous hy-

potheses associated to the target. Both methods are

known to be quite complex, and require a careful im-

plementation in terms of parameters; in particular, the

MULTI-CAMERA PEOPLE TRACKING WITH HIERARCHICAL LIKELIHOOD GRIDS

475

latter cannot avoid the drawback of an exponentially

growing computational complexity, with the number

of targets and measurements involved in the resolu-

tion situation, so that sub-optimal solutions must be

sought (Cox and Hingorani, 1996).

3 SYSTEM OVERVIEW

In this section, we describe the hardware setup and

present an overview of our tracking system, which

will be discussed in more detail afterwards.

World Coordinate

Camera Coordinate

4 Cameras on Ceiling

Scene

Figure 1: Hardware setup.

The overall setup is depicted in Figure 1. Four

uEye usb cameras, with a resolution of 752 ×480, are

mounted overhead on the corners of the ceiling, each

of them observing the same 3D scene synchronously

from different viewpoints. Furthermore, all the four

cameras are connected to one multi-core PC. A nec-

essary step before being able to get accurate 3D in-

formation, is calibration of the intrinsic and extrinsic

camera parameters, that we perform with the Matlab

Calibration Toolbox

, with respect to a world coordi-

nates system placed on the ﬂoor.

The detection and tracking software is designed

and implemented in the OpenTL framework

(Panin

et al., 2008; Panin, 2011), which is a structured,

general purpose architecture for model-based visual

tracking. We provide the block diagram of our track-

ing system in Figure 2, that consists of two main pro-

cessing modules. Ofﬂine, we use a certain number

of background frames to learn the background model.

Moreover, grid states are sampled for each level, and

the silhouettes are generated by projecting the exter-

nal contours of the cylinder shape and keeping, for

each contour and each camera view, a list of pixel

positions and normals. Online, we have three main

sub-modules: pre-process, detection and tracking.

http://www.vision.caltech.edu/bouguetj/calib doc/

http://www.opentl.org

Background

learning

Generate

silhouettes

Likelihood

computing

Clustering in

state space

Data

association

OFFLINE

TRACKING

ONLINE

Tracked

Result

Background

subtraction

Compute

OrientedDT

PRE_PROCESS

Background

frames

Sample

grid states

DETECTION

Track

maintenance

Input video

Figure 2: Block diagram of the tracking system.

In the pre-process part, for each camera view fore-

ground contours are segmented by edge-based back-

ground subtraction, using the learned model. After-

wards, we compute an oriented distance transform

onto this image, in order to match, for each tem-

plate, both the location and the orientation of its con-

tours. In particular, the oriented DT is efﬁciently com-

puted over a ﬁnite set of orientations, so that the im-

age is sampled over parallel scan lines that are pre-

computed. The advantage of using both edge posi-

tion and orientation, during background subtraction

as well as template matching, is a strong reduction of

false alarms, i.e. false edge matching, that would arise

when using only positional information.

Detection part ﬁrst computes the likelihoods by

matching projected templates and oriented DT for

each camera view, where the likelihoods are com-

puted on the coarse grid ﬁrstly, then reﬁned on the

next resolution only the locations where the likeli-

hood is higher than a given threshold, the joint likeli-

hoods can simply be multiplied then. The object-level

measurements, or target hypotheses, are obtained by

means of likelihood grid clustering, that is performed

by Gaussian ﬁltering of the high-resolution grid, and

local maxima detection. Finally, the tracking module

performs measurement-to-target association with the

Global Nearest Neighbor approach, in order to initi-

ate, maintain and terminate tracks automatically.

4 PEOPLE DETECTION

In this section, we provide more details about peo-

ple detection, that serves as one of our key building

blocks for our system.

4.1 Construction of Template Hierarchy

The idea to construct a template hierarchy is inspired

by the paper (Stenger et al., 2006), as well as by

the system developed by (Gavrila, 2000), extended

VISAPP 2011 - International Conference on Computer Vision Theory and Applications

476

to multiple views, multiple targets, and with a more

general template.

Assuming there are L levels of search, the state

space is partitioned with a coarse-to-ﬁne strategy. A

graphical illustration is shown in Figure 3.

Figure 3: Grid based state space with hierarchical partition.

Each discrete region



i,l



i=1

, where N

is the

number of cells at level l, is sampled at its center, be-

fore the template hierarchy is generated. Meanwhile,

we connect regions at a child level with its parent cell,

by computing the nearest-neighbor in state-space, as

well as its nearest neighbors within the same level, as

it will be described in Section 4.4, in order to smooth

the grid likelihoods.

After sampling the grid, templates are generated

by rendering the 3D model at each state, under the re-

spective camera projection. The model chosen in our

approach is a simple cylinder, undergoing (x,y) trans-

lation on the ﬂoor, while its silhouette is generated by

projecting the external contour. An example is shown

in Figure 4, while a partial view of the hierarchy of

silhouettes is depicted in Figure 5.

(a) (b) (c)

Figure 4: Our model. (a) Discretized cylinder. (b) Projected

external contour. (c) Silhouette with normals.

For each silhouette, the position of each point as

well as its normal are collected, as it will be described

further in Section 4.3. As already emphasized, both

grid sampling and template hierarchy generation are

performed ofﬂine.

Figure 5: Hierarchy with silhouette of cylinder.

4.2 Background Learning and

Foreground Segmentation

In order to match the image data with the templates,

we ﬁrst apply an edge-based background subtraction.

This approach can be divided into two phases:

background learning (ofﬂine) and foreground seg-

mentation (online). In the ﬁrst phase, we utilize a cer-

tain number N of frames without people, in order to

learn the background model. Let Θ

(t), G

(t)

respectively be the Canny edge map, and Sobel x-

gradient and y-gradient images, detected at frame

(t). The Canny map Θ

is accumulated by bi-

nary OR, from frame Θ

(I)

(1), . . . , Θ

(I)

(N), while So-

bel gradients are accumulated in a running average

over the same frames. At the end, we normalize the

accumulated Sobel image

+ G

= 1, ∀(x, y) (1)

Subsequently, standard distance transform is applied

to the accumulated background Canny map, and

thresholded to a few pixels, providing a binary mask

∈

{

0, 1

}

, where potential background edges are

found.

Online, from the foreground Canny map and So-

bel gradients Θ

(t), G

f x

(t), G

f y

(t) of camera frame

(t), we test the position and orientation of each edge

pixel: edges Θ

(t) 6= 0 that lie near to a background

edge Θ

6= 0 are candidate for removal.

Then, we further test these edges for orientation

with the Sobel masks, and if the scalar product is

higher than another threshold θ

f x

+ G

f y

f x

+ G

f y

> θ (2)

the point is removed from Θ

(t). Figure 6 shows an

example of this procedure: as we can see, the result-

MULTI-CAMERA PEOPLE TRACKING WITH HIERARCHICAL LIKELIHOOD GRIDS

477

(a) (b) (c) (d)

Figure 6: Edge-based background subtraction. (a) Original frame. (b) Learned background model. (c) Unsegmented fore-

ground edge. (d) Segmented foreground edge.

ing edge map robustly preserves the person contours,

while discarding most of the background edges.

4.3 Matching based on the Oriented

Distance Transform

The next step is to match foreground edges with the

model silhouettes. One possibility would be to use

the Chamfer distance transform on the edge map,

that is tolerant to small shape variations, and has al-

ready been applied in several works, such as (Borge-

fors, 1988; Gavrila, 2000). However, in case of im-

ages with considerable clutter, a signiﬁcant rate of

false alarms would be present. This problem can be

reduced by matching not only the location of edge

points, but also their orientation (Olsen and Hutten-

locher, 1997).

Therefore, we propose here another approach, us-

ing the oriented distance transform. We deﬁne the

oriented DT by scanning the edge image along par-

allel lines L

(a) through pixel a = (x, y) for a given

orientation γ, and repeat it for a ﬁnite set of N

di-

rections Γ =

{

}

i=1

. The algorithm is illustrated in

Figure 7.

Figure 7: Scanning single line for one direction. From left

to right: Multiple single line scanning; Distance value to the

nearest edge point on the line; Multiple scanning directions.

In particular, for each direction and each scan

line, the oriented DT is a mono-dimensional function,

looking for the nearest edge point b on either direction

b = DT

(a) = min

b∈L

(a)

a − b

(3)

An example of oriented distance transform is shown

in Figure 8.

Once oriented DTs are computed, template match-

ing simply amounts to compute the likelihood, by

summing up all values over the silhouette pixels, in

the corresponding direction of the normal. To formal-

ize the idea, a projected template s is represented by

a set of pixel positions and normals

{

, y

, g

}

i=1

, ob-

tained by re-projection through a 3 × 4 camera pro-

jection matrix P, where g

selects the nearest γ ∈ Γ,

from which the DT value will be taken. Therefore,

the likelihood for state hypothesis s is given by:

P(z|s) = exp

−

2NR

∑

i=1

min



γ(g

)

, y

)

, D

max



(4)

where γ(g

) denotes the closest available direction to

the normal, and the sum is performed over all val-

ues

{

, y

, g

}

i=1

. R is the measurement standard de-

viation, and an outlier threshold is usually ﬁxed at

max

= 3R, which is our validation gate for a more

robust matching. Also notice that, in order to avoid

problems with different scales, the sum is further nor-

malized by N.

During the computation of likelihood, a coarse-to-

ﬁne search strategy is applied by evaluating it, at each

level, only for locations where the parent cell likeli-

hood is higher than a given threshold, which is usu-

ally obtained as the average likelihood (Stenger et al.,

2006). For those cells where the parent likelihood is

under the threshold, its value is simply inherited, thus

saving a large amount of computation.

4.4 Likelihood Grid Clustering

In order to obtain the object-level measurements, or

target hypotheses, after likelihood computation we

employ a clustering procedure on the high-resolution

grid, where each cluster is a local maximum, poten-

tially corresponding to a person.

This approach is similar to mean-shift, but explic-

itly done on discrete states. First of all, a Gaussian

ﬁltering is applied to the grid, where the isotropic

Gaussian corresponds to the ﬁltering kernel. For each

cell s

within the grid, we take the nearest neighbor

VISAPP 2011 - International Conference on Computer Vision Theory and Applications

478

(a) (b)

(c)

Figure 8: Results of oriented distance transform. (a) Input

image. (b) Foreground edge map. (c) Oriented DT (at 12

discrete orientations).

by looking at the connected states with distance

i, j



− s



up to a validation gate D

max

= 3σ

where σ

is the measurement covariance in state-

space, these neighbors are pre-computed in the off-

line phase. For each neighbor, the Gaussian weight is

also pre-computed by

i, j

= exp(−

i, j

2σ

) (5)

the computed weights are also normalized to 1, so that

the smoothed likelihood for state cell s

is given by

P(z|s)

weighted

(i) =

∑

i, j

· P(z|s)( j) (6)

Subsequently, local maxima are detected (within

the same neighborhood), to obtain the target hypothe-

ses, or measurements. The ﬁnal step will be to asso-

ciate these hypotheses to tracks, as it will be described

in next section.

5 MULTIPLE PEOPLE

TRACKING

In this section we deal with the problem of multi-

target tracking, by associating measurements ob-

tained from our detector to individual tracks, also per-

forming automatic track initiation and termination.

In particular, our track management follows a

strategy indicated in (Bar-Shalom and Li, 1995):

• Track Initiation. In case of new targets entering

into the scene, they will generate measurements

that are too far from the existing targets, and there-

fore can be used to start new tracks. In this case,

they are labeled with a unique ID, and a counter

for the number of consecutive, successful detec-

tions for this target is also initialized to 1.

• Track Maintenance. During tracking, a target is

successfully detected whenever the data associ-

ation algorithm provides one valid measurement

for it, so its counter is increased up to a maxi-

mum value (which can be taken as a conﬁrma-

tion time), while in case of misdetection it will

be decreased. Those targets which are success-

fully detected over the conﬁrmation time, can be

considered as stable targets and maintained by the

algorithm. In this way, if a target is misdetected

for a few frames in case of occlusion, it can still

be recovered until the counter goes to 0.

• Track Termination. When a target exits the scene,

or after occlusion for a too long time, its misdetec-

tion counter goes to 0, and its track is terminated.

A pseudo-code of the whole procedure is shown

in Algorithm 1, where the GNN algorithm is called in

(line 25).

The data association problem consists in decid-

ing which measurement should correspond to which

track. Although our detection algorithm is fairly ro-

bust, it is also not person-speciﬁc, and therefore in a

small indoor environment there are always ambigui-

ties, arising from neighboring targets, as well as from

missing detections and false alarms caused by back-

ground clutter. To this respect we employ the Global

Nearest Neighbor (GNN) approach, that gives a good

solution for this problem (Konstantinova et al., 2003),

while requiring relative low computational cost.

The ﬁrst step of the GNN is to set up a distance

(or cost) matrix: assuming that, at time t, there are M

existing tracks and N measurements, the cost matrix

is given by

D =







··· d







(7)

where d

i j

is the Euclidean distance between track

i and measurement j, and i = 1, 2, . . . , M; j =

1, 2, . . . , N. In particular, d

i j

is set to ∞ if it exceeds

the validation gate, which is a circle with ﬁxed radius

around the predicted position, eliminating unlikely

observation-to-track pairs. Moreover, it is commonly

required that a target can be associated with at most

one measurement (none, in case of misdetection), and

MULTI-CAMERA PEOPLE TRACKING WITH HIERARCHICAL LIKELIHOOD GRIDS

479

a measurement can be associated to at most one target

(none, in case of false alarms).

The GNN solution to this problem is the one that

maximizes the number of valid assignments, while

minimizing the sum of distances of the assigned pairs.

To this aim, we adopt the extended Munkres’ algo-

rithm (Burgeois and Lasalle, 1971), where the input is

the cost matrix D, and output are the indices (row, col)

of assigned track-measurement pairs.

Algorithm 1: Track management with GNN.

1: if nMeasurements = 0 then

2: for i = 0 to nTargets do

3: DecreaseCounter(target[i]);

4: if Counter(target[i]) > 0 then

5: MaintainTarget(target[i]);

6: else

7: TerminateTarget(target[i]);

8: end if

9: end for

10: else

11: if nTargets = 0 then

12: for j = 0 to nMeasurements do

13: newTarget = CreateTarget(meas[ j]);

14: ResetCounter(newTarget);

15: end for

16: else

17: for i = 0 to nTargets do

18: for j = 0 to nMeasurements do

19: D(i, j) = Distance(target[i], meas[ j]);

20: if D(i, j) > ValidGate then

21: D(i, j) = ∞;

22: end if

23: end for

24: end for

25: (i ↔ j) = GNN(D);

26: for i = 0 to nAssocTargets do

27: if D(i, j(i)) ≤ ValidGate then

28: MoveTarget(target[i], meas[ j]);

29: IncreaseCounter(target[i]);

30: if Counter(target[i]) > MaxC then

31: Counter(target[i]) = MaxC;

32: end if

33: else

34: DecreaseCounter(target[i]);

35: if Counter(target[i]) = 0 then

36: TerminateTarget(target[i]);

37: end if

38: end if

39: end for

40: for j = 0 to nUnassocMeas do

41: newTarget = CreateTarget(meas[ j]);

42: ResetCounter(newTarget);

43: end for

44: end if

45: end if

6 EXPERIMENTAL RESULTS

We evaluated the proposed algorithms through pre-

recorded video sequences, with multiple people enter-

ing and leaving the scene, as well as interacting with

each other. The sequences have been simultaneously

recorded from four cameras, as described in Section

3, with a resolution of (752 × 480), and a frame rate

of 25 fps.

Before carrying out detection and tracking, state

grids are set up at all levels, respectively 10 × 10,

20 × 20 and 40 × 40 from the coarsest to the ﬁnest,

resulting in a total of 2100 grid cells, and the same

amount of silhouette templates are sampled off-line.

Since the area of interest is (6m × 4.2m), the corre-

sponding grid on the ﬁnest level has a resolution of

(150mm × 105mm).

Our current implementation of the oriented dis-

tance transform uses 12 discrete orientations, rang-

ing from 0 to π. As it computes each orientation

separately, they overall require about 0.25 sec/frame

for four images, whereas a single, standard distance

transform is computed in 0.1 sec/frame. Therefore,

the speed of our oriented distance transform is ac-

ceptable and reasonable in comparison with standard

distance transform. And the subsequent matching is

done very quickly for each hypothesis.

Figure 9 shows qualitative tracking results in a

multi-camera environment, with a complex back-

ground. In particular, the top row shows foreground

edges after edge-based background subtraction. Here

30 frames have been used for background learning,

where the threshold θ mentioned in Eq. (2) is set to

0.9. The middle row shows likelihood values onto

the ﬁnest grid, and the bottom row shows the corre-

sponding tracking results after data association, with

the projected cylinder silhouettes.

During data association, we keep a conﬁrmation

time of 10 frames (which is the maximum value for

the consecutive detections counter) for keeping or re-

moving tracks. As can be seen from the results, there

are situations with signiﬁcant occlusion from one or

more views. For instance, at frame 345, each two tar-

gets are occluded from some views, however, since

for the same pairs there are no occlusions from an-

other camera view, all targets are successfully de-

tected, thanks to the robustness of multi-camera fu-

sion and oriented DT matching. The system also

successfully handles targets entering and leaving the

scene.

In order to better evaluate the performance of our

system, we manually label the ground truth data for

our sequences, and compare the results of our tracker,

both in terms of position accuracy and robustness of

VISAPP 2011 - International Conference on Computer Vision Theory and Applications

480

(a) Frame 185

(b) Frame 345

Figure 9: Performance of 3D people tracking. Shown are edge-based background subtraction, likelihood grids, and the

corresponding tracking results, on four camera views.

detection. Ground truth trajectories, labeled on the

ﬁnest grid, are depicted in Figure 10, where we can

see the challenges due to targets that keep close most

of the time, with mutual interactions and position ex-

changes.

Figure 11 shows the (X,Y ) position errors of our

tracking system. Because of the above mentioned

occlusions and dynamics, for each target the system

temporarily loses track, and recovers it again shortly

afterwards. That happens about 4-5 times per target

over the 550 frames of sequence, leading to several

sub-tracks with different IDs, as shown in Figure 11

by the green boxes.

Overall these results indicate that, despite the clut-

tered situation, position errors are considerably low

for all people, being most of the time under 100-

150mm, that corresponds to one cell of the high-

resolution grid. This is because of the local edge-

based matching which, despite the simplicity of the

model, is more precise with respect to global statis-

tics such as color histograms (Stillman et al., 1998),

or histograms of oriented gradients (Dalal and Triggs,

2005).

The execution time of the whole tracking proce-

dure is currently 2 FPS, on a desktop PC with In-

tel Core 2 Duo CPU (1.86 GHz), 1GB RAM and an

Nvidia GeForce 8600 GT graphic card.

7 CONCLUSIONS

In this paper, we presented a novel system for multi-

MULTI-CAMERA PEOPLE TRACKING WITH HIERARCHICAL LIKELIHOOD GRIDS

481

−2000 −1000 0 1000 2000

−2000

−1500

−1000

−500

500

1000

1500

2000

Target1

−2000 −1000 0 1000 2000

−2000

−1500

−1000

−500

500

1000

1500

2000

Target2

−2000 −1000 0 1000 2000

−2000

−1500

−1000

−500

500

1000

1500

2000

Target3

Figure 10: Ground-truth trajectories, sampled on the discretized grid (high-resolution).

100 200 300 400 500

−500

−400

−300

−200

−100

100

200

300

400

500

Frame

Position error on X/Y translation[mm]

100 200 300 400 500

−500

−400

−300

−200

−100

100

200

300

400

500

Frame

Position error on X/Y translation[mm]

100 200 300 400 500

−500

−400

−300

−200

−100

100

200

300

400

500

Frame

Position error on X/Y translation[mm]

Figure 11: Position error on X(red) and Y(blue) in world coordinates. From left to right are shown target 1, 2 and 3. The

green boxes correspond to sub-tracks estimated by our system.

ple people tracking in a multi-camera environment,

using a grid-based tracking by detection methodol-

ogy. A template hierarchy is constructed off-line, by

partitioning the state space. And frame-by-frame de-

tection is performed by means of hierarchical like-

lihood grids and clustered on the ﬁnest level, fol-

lowed by data association through the GNN approach.

Moreover, edge-based background subtraction has

been proposed for foreground segmentation, which

is quite robust to illumination changes, together with

an oriented distance transform, matching the silhou-

ette templates by taking gradient orientations into ac-

count, thus signiﬁcantly reducing the rate of false

alarms. Our system initiates, maintains and termi-

nates tracks in a fully automatic way. Experimental

results over the video sequences also show that our

proposed system deals fairly well with mutual occlu-

sions.

As a future work, this system can be easily ex-

tended to include additional features, such as color or

motion, also can be scaled to more camera views, as

well as being used for tracking different objects, for

example 3D indoor tracking of ﬂying quadrotors. In

addition, the individual components can still be fur-

ther optimized, both with respect to speed and per-

formance, graphics hardware is possibly need to be

exploited. Moreover, we plan to address the issue

of heavy occlusions between people, taking place for

longer periods. Re-identiﬁcation after occlusions are

going to be done by using more speciﬁc features, such

as color or texture.

Besides these straightforward improvements, we

also plan to test and extend our system to more chal-

lenging scenarios, such as outdoor tracking with mul-

tiple models (such as people and cars), as well as peo-

ple tracking on mobile robots, with a non-static back-

ground and viewpoint.

REFERENCES

Andriluka, M., Roth, S., and Schiele, B. (2008).

People-tracking-by-detection and people-detection-

by-tracking. In IEEE Conference on Computer Vision

and Pattern Recognition.

Andriluka, M., Roth, S., and Schiele, B. (2010). Monocular

3d pose estimation and tracking by detection. In IEEE

conference on Computer Vision and Pattern Recogni-

tion(CVPR).

Bar-Shalom, Y. and Fortmann, T. E. (1988). Tracking and

data association. Academic Press, San Diego.

Bar-Shalom, Y. and Li, X. (1995). Multitarget-Multisensor

Tracking: Principles and Techniques. YBS Publish-

ing.

Borgefors, G. (1988). Hierarchical chamfer match-

ing: A parametric edge matching algorithm. IEEE

Trans. Pattern Analysis and Machine Intelligence,

10(6):849–865.

Burgeois, F. and Lasalle, J. C. (1971). An extension of

the munkres algorithm for the assignment problem to

rectangular matrices. Communications of the ACM,

14:802–806.

VISAPP 2011 - International Conference on Computer Vision Theory and Applications

482

Cox, I. J. and Hingorani, S. L. (1996). An efﬁcient im-

plementation of reid’s multiple hypothesis tracking al-

gorithm and its evaluation for the purpose of visual

tracking. IEEE Trans. Pattern Anal. Mach. Intell.,

18(2):138–150.

Dalal, N. and Triggs, B. (2005). Histograms of oriented

gradients for human detection. In Proceedings of the

2005 IEEE Computer Society Conference on Com-

puter Vision and Pattern Recognition, volume 1, pages

886–893, Washington, DC, USA. IEEE Computer So-

ciety.

Eng, H., Wang, J., Kam, A., and Yau, W. (2004). A bayesian

framework for robust human detection and occlusion

handling using a human shape model. In International

Conference on Pattern Recognition, volume 2, pages

257–260.

Fieguth, P. and Terzopoulos, D. (1997). Color-based track-

ing of heads and other mobile objects at video frame

rates. In Proceedings IEEE Conf. on Computer Vi-

sion and Pattern Recognition, pages 21–27, San Juan,

Puerto Rico.

Fortmann, T. E., Bar-Shalom, Y., and Scheffe, M. (1983).

Sonar tracking of multiple targets using joint prob-

abilistic data association. IEEE Journal of Oceanic

Engineering, 8(3):173–184.

Gavrila, D. M. (2000). Pedestrian detection from a moving

vehicle. In Proc. of European Conference on Com-

puter Vision, pages 37–49, Dublin, Ireland.

Gavrila, D. M. and Davis, L. S. (1996). 3-d model-based

tracking of humans in action: a multi-view approach.

In Proc. IEEE Computer Vision and Pattern Recogni-

tion, pages 73–80, San Francisco.

Isard, M. and Blake, A. (1996). Contour tracking by

stochastic propagation of conditional density. In Pro-

ceedings of the European Conference on Computer Vi-

sion, pages 343–356, Cambridge, UK.

Khan, S., Javed, O., Rasheed, Z., and Shah, M. (2001). Hu-

man tracking in multiple cameras. In Proceedings of

the 8th IEEE International Conference on Computer

Vision, pages 331–336, Vancouver, Canada.

Konstantinova, P., Udvarev, A., and Semerdjiev, T. (2003).

A study of a target tracking algorithm using global

nearest neighbor approach. In Proceeding of Inter-

national Conference on Computer Systems and Tech-

nologies.

Leibe, B., Schindler, K., and Gool, L. V. (2007). Coupled

detection and trajectory estimation for multi-object

tracking. In International Conference on Computer

Vision.

Li, L., Huang, W., Gu, I. Y. H., and Tian, Q. (2003). Fore-

ground object detection from videos containing com-

plex background. In Proceedings of the 11th ACM

International Conference on Multimedia, pages 2–10.

Nguyen, H. T., Worring, M., van den Boomgaard, R., and

Smeulders, A. W. M. (2002). Tracking nonparame-

terized object contours in video. IEEE Trans. Image

Process, 11(9):1081–1091.

Okuma, K., Taleghani, A., Freitas, N. D., Little, J., and

Lowe, D. (2004). A boosted particle ﬁlter: Multitar-

get detection and tracking. In European Conference

on Computer Vision.

Olsen, C. F. and Huttenlocher, D. P. (1997). Automatic

target recognition by matching oriented edge pixels.

IEEE Transactions on Pattern Analysis and Machine

Intelligence, 6:103–113.

Panin, G. (2011). Model-based visual tracking: the OpenTL

framework. Wiley-Blackwell. (to appear).

Panin, G., Lenz, C., Nair, S., Roth, E., Wojtczyk, M.,

Friedlhuber, T., and Knoll, A. (2008). A unifying

software architecture for model-based visual tracking.

In IS&T/SPIE 20th Annual Symposium of Electronic

Imaging, San Jose, CA.

Reid, D. B. (1979). An algorithm for tracking multi-

ple targets. IEEE Transaction on Automatic Control,

24(6):843–854.

Roh, M. C., Kim, T. Y., Park, J., and Lee, S. W. (2007).

Accurate object contour tracking based on boundary

edge selection. Pattern Recognition, 40(3):931–943.

Stauffer, C. and Grimson, W. E. L. (2000). Learning pat-

terns of activity using real-time tracking. IEEE Trans-

action on Pattern Analysis and Machine Intelligence,

22(8):747–757.

Stenger, B., Thayananthan, A., Torr, P. H. S., and Cipolla,

R. (2006). Model-based hand tracking using a hierar-

chical bayesian ﬁlter. IEEE Transactions on Pattern

Analysis and Machine Intelligence, 28:1372–1384.

Stillman, S., Tanawongsuwan, R., and Essa, I. (1998). A

system for tracking and recognizing multiple people

with multiple cameras. In In Proceedings of Second

International Conference on Audio-Visionbased Per-

son Authentication, pages 96–101.

Wren, C., Azarbayejani, A., Darrel, T., and Pentland, A.

(1997). Pﬁnder, real time tracking of the human body.

IEEE Transaction on Pattern Analysis and Machine

Intelligence, 19(7):780–785.

Wu, B. and Nevatia, R. (2007). Detection and tracking of

multiple, partially occluded humans by bayesian com-

bination of edgelet based part detectors. International

Journal of Computer Vision, 75(2):247–266.

MULTI-CAMERA PEOPLE TRACKING WITH HIERARCHICAL LIKELIHOOD GRIDS

483