MULTI-MODAL PERSON DETECTION AND TRACKING FROM A MOBILE ROBOT IN A CROWDED ENVIRONMENT
A. A. Mekonnen, F. Lerasle and I. Zuriarrain
CNRS, LAAS, 7 Avenue du Colonel Roche, 31077 Toulouse Cedex 4, France
Université de Toulouse, UPS, INSA, INP, ISAE, LAAS, F-31077 Toulouse, France
University of Mondragon, Goi Eskola Politeknikoa, Mondragon, Spain
Keywords:
Multi-person tracking, Multi-modal data fusion, MCMC particle filtering, Interactive robotics.
Abstract:
This paper addresses multi-modal person detection and tracking from a mobile robot in a crowded and cluttered environment, using a 2D SICK Laser Range Finder and a visual camera. We propose a sequential approach in which the laser data is segmented to filter human-leg-like structures and generate person hypotheses, which are then refined by a state-of-the-art parts-based visual person detector for final detection. Based on this detection routine, a Markov Chain Monte Carlo (MCMC) particle filtering strategy is used to track multiple persons around the robot. Integration of the implemented multi-modal person detector and tracker in our robotic platform and the associated experiments are presented. The reported results show that the multi-modal approach outperforms its single-sensor counterparts in terms of detection, subsequent use, computation time, and precision. The work presented here will be used to define navigational control laws for passer-by avoidance during a service robot's person following activity.
1 INTRODUCTION
Currently, there is a growing demand to use robots in everyday life, a demand for their introduction into everyday human environments. For this, robots should be able to interact with humans at a higher level, with more natural and effective interaction. One such interaction, the ability of a mobile robot to automatically follow a person in public areas, is a key issue for effective interaction with the surrounding world. Recently, various researchers have reported successful person following activities from a mobile robot (Germa et al., 2009), (Calisi et al., 2007), (Chen and Birchfield, 2007). A key point in the person following task is safe interaction, as the workspace at any moment is shared by humans and the robot: the robot should be capable of avoiding all passers-by in the environment in a socially acceptable manner while carrying out the activity. Some authors have addressed this as static obstacle avoidance, considering people as static obstacles, e.g. (Calisi et al., 2007). We argue otherwise: effective collision avoidance not only has to circumvent static objects in the environment, it also has to take the dynamics of the persons in the surroundings into account. This calls for perception of the whereabouts and dynamics of the humans sharing the workspace. To the best of our knowledge, an assistant robot capable of following a given person while taking the dynamics of the passers-by into consideration and avoiding them in a socially acceptable way does not yet exist (Fong et al., 2003).
Figure 1: A block diagram of our complete envisaged system: person following with passer-by avoidance in a socially acceptable manner.
A block diagram of our complete envisaged sys-
tem is shown in figure 1. The block diagram rep-
resents a person following activity with passer-by
avoidance in a socially acceptable manner (keeping a
social distance from surrounding persons while at the
same time taking their dynamics into consideration)
by a service robot. The person following activity, the area labeled 'A' in figure 1, has been successfully addressed in (Germa et al., 2009). The work presented here addresses the detection and tracking of multiple people around the robot, the shaded blocks 'B' and 'C' in figure 1, while the control law for passer-by avoidance will be presented in future work. In a nutshell, the objective of the work presented in this paper is the detection and tracking of people in the robot's vicinity, maintaining a correct trajectory for each tracked person. It aims at defining control laws for socially acceptable passer-by avoidance, during a person following activity, in a crowded public environment.
Automated person detection and tracking finds applications in many areas, including robotics, video surveillance, pedestrian protection systems, and automated image and video indexing. Contrary to video surveillance applications, where conventional background subtraction can be used, person detection is more challenging in mobile robotics due to sensor limitations, short fields of view and motion of the embedded sensors, and the computational requirements for a reactive response acceptable to humans. All these challenges make successful person detection and tracking based on a single sensor very difficult. Several works on person detection in the robotics community are based on vision and Laser Range Finders (LRFs) (Schiele et al., 2009). Even though vision-based person detection yields a lot of information, detections are very sensitive to illumination variation, deformations, and partial occlusions, on top of the associated high computational cost. Person detection based on LRFs is computationally cheap and insensitive to illumination, but the information content is not discriminative enough for robust detection unless used in a non-cluttered environment with an a priori learnt environment map, which is not realistic for crowded dynamic scenes. For real-world scenarios, well-established approaches combine inputs from more than one sensory channel, the majority of the works combining vision and laser, e.g. (Zivkovic and Kröse, 2007), (Spinello et al., 2008). In this vein, we propose a multi-modal person detector that uses a 2D SICK Laser Range Finder (LRF) and a visual camera to detect multiple persons around the robot. We adopt a sequential approach in which the laser data is segmented to filter human-leg-like structures and generate person hypotheses, which are then refined by a state-of-the-art parts-based visual person detector for final detection. To enable spatio-temporal analysis of the targets, we also employ tracking based on the detections.
The literature on multi-target tracking contains different approaches, most commonly: the Multiple Hypothesis Tracker (MHT) (Reid, 1979), the Joint Probabilistic Data Association Filter (JPDAF) (Rasmussen and Hager, 2001), centralized (Isard and MacCormick, 2001) and decentralized particle filters (PFs) (Breitenstein et al., 2009), and the MCMC PF (Khan et al., 2005). MHT is computationally expensive, as the number of hypotheses grows exponentially over time, while JPDAF is only applicable to tracking a fixed number of targets. The decentralized particle filtering scheme, based on multiple independent PFs per target, suffers from the "hijacking" problem: whenever targets pass close to one another, the target with the best likelihood score takes over the filters of nearby targets. The centralized PF scheme, a particle filter over the joint state space of all targets, is not viable for more than three or four targets due to the associated computational requirement. A more appealing alternative in terms of performance and computational requirement is the MCMC PF, which replaces the traditional importance sampling step of joint PFs with an MCMC sampling step, overcoming the exponential complexity and leading to a more tractable solution. For a varying number of targets, the RJMCMC PF, an extension of MCMC to variable-dimensional state spaces, has been pioneered to perform successful tracking (Khan et al., 2005). The MCMC PF framework, including the RJMCMC PF, has been validated in the video surveillance context solely on visual data, e.g. (Smith et al., 2005). Inspired by this, we use an RJMCMC PF for multi-person tracking, driven by our multi-modal detector with sensors embedded on a robot. Implementation details along with integration in our robotic platform, associated experiments, and evaluation results are presented, showing that the proposed approach outperforms its single-sensor counterparts in terms of detection, subsequent use, computation time, and precision.
This paper is structured as follows: section 2 dis-
cusses our multi-modal person detector implementa-
tion while section 3 presents our implementation of
the RJMCMC PF tracker. Integration of the devel-
oped functionalities in our robotic platform, associ-
ated experiments, results, and discussions are pre-
sented in section 4. Finally, section 5 summarizes the
presented work and highlights possible future investi-
gations.
2 MULTI-MODAL PERSON
DETECTOR
Our multi-modal person detector is based on a 2D
SICK Laser Range Finder and a visual camera.
2.1 SICK-based Detector
Recently, Laser Range Finders (LRFs) have become attractive tools in robotics for environment perception due to their accuracy and reliability. As an LRF rotates and acquires range data, it produces distinct scan signatures corresponding to the shape of the obstacles in the scan region. Detecting a person from LRF data hence proceeds by trying to detect the shape of a person in the scan at the height the scan is performed. In the context of this work, leg detection is considered, as the laser scanner used is positioned at a height of 38 cm above the ground.

Our robotic platform, Rackham (presented in §4.1), has a SICK LMS200 2D laser range finder that sweeps an arc of 180° measuring the radial distance of obstacles at a set angular resolution of 0.5°. The detection makes use of the geometric properties of leg scans highlighted in (Xavier et al., 2005), with no a priori environment map assumption. If a 2D map of the environment, made of line segments, is available, points lying on the map are filtered out and only the remaining points are considered further. The detection proceeds in three steps:
Blob Segmentation. Sequential candidate scan points that are close to each other are grouped into blobs. The grouping is based on the distance between consecutive points.
Blob Filtering. The blobs formed are filtered using the geometric properties outlined in (Xavier et al., 2005). The filtering criteria used are number of scan points, mid-point distance, mean internal angle, internal angle variance, and sharp structure removal. For details on these criteria, the reader is referred to (Xavier et al., 2005).
Leg Formation. All blobs that survive the above filters are considered to be legs. Each detected leg is then paired with another detected leg in its vicinity (if there is one). The center of the paired legs gives the position of the detected person. A minimal sketch of these steps is given below.
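As an illustration of the first and last steps, the following is a minimal C++ sketch of blob segmentation and leg pairing; the break distance and pairing radius are illustrative assumptions, not the tuned values used on Rackham.

```cpp
#include <cmath>
#include <vector>

struct Point2D { double x, y; };

// Group consecutive scan points into blobs, breaking whenever the gap
// between neighbouring points exceeds a distance threshold.
std::vector<std::vector<Point2D>> segmentBlobs(const std::vector<Point2D>& scan,
                                               double breakDist /* e.g. 0.10 m, assumed */) {
  std::vector<std::vector<Point2D>> blobs;
  for (const Point2D& p : scan) {
    if (blobs.empty() || std::hypot(p.x - blobs.back().back().x,
                                    p.y - blobs.back().back().y) > breakDist)
      blobs.push_back({});               // start a new blob at each large gap
    blobs.back().push_back(p);
  }
  return blobs;
}

// Pair leg candidates that lie close to each other; the midpoint of a pair
// is reported as a person position. Keeping an unpaired leg as a weaker
// single-leg hypothesis is an assumption of this sketch.
std::vector<Point2D> pairLegs(const std::vector<Point2D>& legCenters,
                              double maxPairDist /* e.g. 0.5 m, assumed */) {
  std::vector<Point2D> persons;
  std::vector<bool> used(legCenters.size(), false);
  for (size_t i = 0; i < legCenters.size(); ++i) {
    if (used[i]) continue;
    size_t best = i;                     // pairing a leg with itself = unpaired
    double bestDist = maxPairDist;
    for (size_t j = i + 1; j < legCenters.size(); ++j) {
      double d = std::hypot(legCenters[i].x - legCenters[j].x,
                            legCenters[i].y - legCenters[j].y);
      if (!used[j] && d < bestDist) { best = j; bestDist = d; }
    }
    used[i] = used[best] = true;
    persons.push_back({(legCenters[i].x + legCenters[best].x) / 2.0,
                       (legCenters[i].y + legCenters[best].y) / 2.0});
  }
  return persons;
}
```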
This detection system has some drawbacks, namely false detections of table legs, chair legs, and other narrow objects with circular patterns. People standing with closed legs or wearing long skirts do not yield the leg signatures needed by the detector and are thus classified as negative instances, resulting in false negatives. On top of this, it is not possible to know which leg detections correspond to which person in the presence of multiple people, making the association of legs across consecutive frames difficult. This mode of detection differs from the combined detector presented in §2.3 in that it makes use of all the geometric properties strictly for leg detection.
2.2 Visual Detector
Recently, remarkable advances have been made in automated visual person detection (Dalal and Triggs, 2005), (Laptev, 2006), and recently (Felzenszwalb et al., 2010). For visual person detection, we use our complete C implementation of the state-of-the-art person detector of Felzenszwalb, based on discriminatively trained part-based models. The detector uses mixtures of multi-scale deformable part models, which are able to represent a highly variable object class such as a person. The resulting person detector is efficient, accurate, and has achieved state-of-the-art results in the PASCAL VOC competition and on the INRIA person dataset (see http://pascal.inrialpes.fr/data/human/). Briefly, the detector uses contrast-sensitive and contrast-insensitive Histograms of Oriented Gradients (HOG) with analytically reduced dimension as features. A person is modelled by a star-structured part-based model defined by a root filter and a set of part filters with associated deformation models. Compared to full-body detection approaches (Dalal and Triggs, 2005), (Laptev, 2006), this parts-based detector is more robust to partial occlusions. The person model currently implemented consists of a mixture of two models, each of which has one coarse root filter that approximately covers an entire person and six high-resolution part filters that cover smaller parts of the object. For details on this person detector, the reader is referred to (Felzenszwalb et al., 2010).
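For reference, the score of such a star model at a root location $p_0$ combines the root filter response with the best, deformation-penalized placement of each part (our paraphrase of the formulation in (Felzenszwalb et al., 2010); the notation here is ours):

$$\mathrm{score}(p_0) = F_0 \cdot \phi(H, p_0) + \sum_{i=1}^{6} \max_{p_i} \big( F_i \cdot \phi(H, p_i) - d_i \cdot \phi_d(dx_i, dy_i) \big) + b$$

where $H$ is the feature pyramid, $\phi(H, p)$ the HOG features at position $p$, $F_0$ and $F_i$ the root and part filters, $d_i$ the learned deformation costs with $\phi_d(dx, dy) = (dx, dy, dx^2, dy^2)$ the displacement of part $i$ from its anchor, and $b$ a bias term.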
In this work, a person model trained on the PASCAL VOC 2008 dataset and provided with the open-source Matlab release (Felzenszwalb et al., 2009) is used. Unfortunately, the C implementation of the person detector takes about 4.6 seconds to process a 320x240 image on a PIII 850 MHz computer with 6 levels per octave of the feature pyramid. This computation time is not acceptable for the task at hand, navigation in a crowded environment, and calls for further improvements to speed up the detection process.
2.3 Combined Detector
Person detection from the LRF suffers from false positives due to structures resembling a person's legs, suffers from mis-detections due to closed legs or long skirts, and does not carry enough information to discriminate detections between multiple persons. On the other hand, the visual person detector (Felzenszwalb et al., 2010) is not readily applicable to the objective at hand due to its computation time requirement. To make use of the
two detectors in a complementary fashion, a multi-modal detector is implemented. The block labelled 'B' in figure 1 shows a block diagram of the overall multi-modal detector. Similar to (Cui et al., 2005) and (Spinello et al., 2008), the proposed approach defines regions of interest, henceforth referred to as person hypotheses, using the detections from the laser scanner, and then validates them by running the visual person detector on these regions. First, the geometric criteria used to detect persons from the 2D laser scanner within the camera field of view are relaxed so as to approach 100% person detection, at the price of many false positives. For every hypothesis, a virtual rectangle conforming to an average person height of 1.8 m with an aspect ratio of 4:11 (width:height) is positioned at the precise distance obtained from the laser, assuming a flat world. Each virtual rectangle is then projected onto the image, thanks to a fully calibrated camera system, defining a rectangular search region in the image. The parts-based visual person detector is used to evaluate these regions: all regions confirmed to contain persons are labelled as detections, while hypotheses not confirmed by the visual detector are discarded as false alarms. The main advantage of using these regions of interest is the reduced computation time: neither all levels of the feature pyramid nor model scores at all possible positions in the feature pyramid need be computed. In the region outside the camera field of view, detection is based solely on the laser range finder, as described in §2.1. Note that this mode of operation differs from the laser-only detection of §2.1 within the camera field of view, as loose geometric constraints are used to speed up the visual person detector. A minimal sketch of the hypothesis-to-ROI projection is given below.
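The following sketch illustrates the hypothesis-to-ROI projection under a simple pinhole model with the camera axis parallel to a flat ground; the intrinsic parameters, the y-down image convention, and all names are assumptions of this sketch, not our calibrated system.

```cpp
#include <algorithm>

// Map a laser person hypothesis at lateral offset x and depth z (metres,
// camera frame: z forward, x right, y down) to a rectangular search region
// in the image. Intrinsics (fx, fy, cx, cy) and camera height are assumed.
struct Roi { int u0, v0, u1, v1; };

Roi hypothesisToRoi(double x, double z,
                    double fx, double fy,        // focal lengths in pixels
                    double cx, double cy,        // principal point
                    double camHeight,            // camera height above ground (m)
                    int imgW, int imgH) {
  const double personH = 1.8;                    // average person height (m)
  const double personW = personH * 4.0 / 11.0;   // 4:11 (width:height) aspect ratio
  // Project the left/right edges and the feet/head of the virtual rectangle.
  double uL = fx * (x - personW / 2.0) / z + cx;
  double uR = fx * (x + personW / 2.0) / z + cx;
  double vFeet = fy * camHeight / z + cy;              // ground contact point
  double vHead = fy * (camHeight - personH) / z + cy;  // top of the rectangle
  Roi r;
  r.u0 = std::max(0, (int)uL);
  r.v0 = std::max(0, (int)vHead);
  r.u1 = std::min(imgW - 1, (int)uR);
  r.v1 = std::min(imgH - 1, (int)vFeet);
  return r;
}
```

Restricting the part-based detector to such regions is what removes the need to score the whole feature pyramid.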
Finally, the multi-modal person detector provides the Multi-Person Tracker with a list of detected targets with their precise (x, y) locations in the ground plane with respect to the robot, and a normalized histogram of the image patch (if the detection occurred within the camera field of view). Figure 2 shows a typical instance of the multi-modal detector: figure 2(a) shows the Human-Robot situation, 2(b) the persons detected by the multi-modal detector with bounding boxes projected on the image plane, and 2(c) the raw laser data (in blue) with the corresponding person detections (in red) in the ground plane. The shaded area in figure 2(c) is the camera field of view, and the robot is depicted as the red object at the center of the arc.
Figure 2: An instance of the multi-modal person detector with the Human-Robot situation.
3 MULTI-PERSON TRACKER
The Multi-Person Tracker (MPT) is concerned with the problem of tracking a variable number of persons, possibly interacting. Our aim is to correctly track and obtain the trajectories of multiple persons in the vicinity of the robot and within the field of view of the utilized sensors, based on the detector outputs.
3.1 Formalism
In object tracking in general, the primary goal is to determine the posterior distribution $P(X_t | Z_{1:t})$ of a target state $X_t$ at the current time $t$, given the observation sequence $Z_{1:t} = \{Z_1, Z_2, \ldots, Z_t\}$. Under the Markovian target motion assumption, the Bayes filter offers a concise way to express the tracking problem. Particle filters approximate the Bayes filter by propagating $N$ particles over time, approximating the posterior $P(X_t | Z_{1:t})$ as a sum of Dirac functions:

$$P(X_t | Z_{1:t}) \approx \frac{1}{N} \sum_{n=1}^{N} \delta(X_t - X_t^n),$$

where $X_t^n$ denotes the $n$-th particle. In multi-target tracking, the state encodes the configuration of the tracked targets: $X_t^n = \{I_t^n, x_{(t,i)}^n\}$, $i \in \{1, \ldots, I_t^n\}$, where $I_t^n$ is the number of tracked objects of hypothesis $n$ at time $t$, and $x_{(t,i)}^n$ is a vector encoding the state of object $i$. In the MCMC PF, the inefficient importance sampling of classical particle filters is replaced with a more efficient MCMC sampling step: MCMC methods define a Markov chain over the configuration space of $X_t^n$ such that the stationary distribution of the chain is equal to the desired posterior.
The Reversible Jump Markov Chain Monte Carlo (RJMCMC) PF is an extension of the MCMC PF that accounts for the variability of the tracked targets by defining a variable-dimension state space. In this case, the state space is considered as a union of several subspaces: whenever a new person enters the scene, the state "jumps" to a larger-dimension subspace, and there is a "jump" to a lower-dimension subspace whenever a tracked person leaves the scene. An important point in RJMCMC is the reversibility of the proposals that vary the dimensionality of the state space exploration: any jump between subspaces must have a corresponding reverse jump, to prevent the chain from getting stuck in local minima. The moves that guide the state space exploration are referred to as proposal moves. A common technique that simplifies both the transition to the newly proposed state hypothesis $X'$ from $X$ and the evaluation of the acceptance ratio is to let the state transition model consider only changes to a randomly chosen subset of the state (in the case of multi-target tracking, this translates into changing a single target per iteration). In cases where interaction between different targets is likely to occur, an interaction model should be included to maintain tracked target identities.
3.2 Implementation
Our RJMCMC PF tracker is driven by the multi-modal detector described in §2. To handle the variability of the tracked targets, three proposal moves are used in the RJMCMC PF: {Add, Update, Remove}. A Markov Random Field is also used to model the interactions amongst targets. The complete principle of our tracker is presented in Algorithm 1. Roughly, the algorithm iterates $N + N_B$ times, proposing a new state based on the previous one; $N$ is the number of particles, whereas $N_B$ is the number of burn-in iterations needed to converge to stationary samples. Each subsection below details part of the algorithm.
3.2.1 State Space

The state vector of a single hypothesis $n$ at time $t$ in our tracker is made of the joint state vectors of the tracked persons (it encodes the entire configuration): $X_t^n = \{I_t^n, x_{t,i}^n\}$, $i \in \{1, \ldots, I_t^n\}$, where $I_t^n$ is the number of tracked persons, $N$ is the total number of hypotheses (particles), and $x_{t,i}^n$ is the state vector of an individual person. Since our aim is to outline the trajectories of persons around the robot, tracking is done on the ground plane. Hence, the state vector of an individual person is represented as $(Id, x, y)$ in the ground plane with respect to the robot. Formally, the $i$-th state vector of a single person in hypothesis $n$ at time $t$ is a 2D vector $x_{t,i}^n = \{Id_i, x_{t,i}^n, y_{t,i}^n\}$.
Algorithm 1: RJMCMC Particle Filter.

Input: particle set at time $t-1$: $\{X_{t-1}^n\}_{n=1}^{N}$
Prediction: generate a prediction set at time $t$: $\{\tilde{X}_t^n\}_{n=1}^{N}$ according to the system dynamics $Q(X_t^n | X_{t-1}^n)$.
Init: $X_t^0 = \tilde{X}_t^r$, $r \in \{1, \ldots, N\}$
1. for $i = 1$ to $N + N_B$ do
2.   Choose a move $m \in \{add, update, remove\} \sim q_m$.
3.   if $m$ == 'add' then
4.     $X' = \{X_t^{i-1}, x_{I^{i-1}+1}\}$, with the new target $x_p = x_{I^{i-1}+1}$ and $I^{i-1}$ the number of persons hypothesized by $X_t^{i-1}$
5.   else if $m$ == 'remove' then
6.     $X' = \{X_t^{i-1} \setminus x_p\}$, where $p \in \{1, \ldots, I^{i-1}\}$
7.   else if $m$ == 'update' then
8.     Randomly choose a tracked person from $X_t^{i-1}$.
9.     Replace the person's state in $X_t^{i-1}$ with a randomly chosen state corresponding to this person in the prediction set $\{\tilde{X}_t^n\}_{n=1}^{N}$, proposing $X'$.
10.  end if
11.  Compute the acceptance ratio:
     $\beta = \min\left(1, \dfrac{\pi(X')\, Q_{ind^*}(X_t^{i-1} | X')\, \Psi(X')}{\pi(X_t^{i-1})\, Q_{ind}(X' | X_t^{i-1})\, \Psi(X_t^{i-1})}\right)$
     where $ind \in \{add, update, remove\}$ and $ind^*$ denotes the reverse operation.
12.  if $\beta \geq 1$ then
13.    $X_t^i = X'$
14.  else
15.    Accept $X_t^i = X'$ with probability $\beta$, or reject and set $X_t^i = X_t^{i-1}$
16.  end if
17. end for
18. Discard the first $N_B$ samples of the chain (burn-in).
19. Compute the MAP estimate as the most frequent configuration in the chain: $\hat{X} = \arg\max_{X_t^i} [\mathrm{count}(X_t^i)]$
Output: particle set at time $t$: $\{X_t^n\}_{n=N_B+1}^{N_B+N}$, and MAP estimate $\hat{X}$.
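To connect Algorithm 1 to code, here is a compact sketch of the chain's main loop, with the likelihood $\pi$, the interaction term $\Psi$, and the move-specific proposal-density ratio abstracted as callbacks. The names, the folded proposal ratio, and the MAP shortcut at the end are assumptions of this sketch, not our GenoM module's actual interface.

```cpp
#include <algorithm>
#include <functional>
#include <random>
#include <vector>

struct Person { int id; double x, y; };
using Config = std::vector<Person>;   // one hypothesis X = {x_1, ..., x_I}

// One RJMCMC filtering step (cf. Algorithm 1). 'propose' applies a random
// add/update/remove move; 'proposalRatio' returns Q_rev(X|X') / Q_fwd(X'|X).
Config rjmcmcStep(const Config& init, int N, int NB,
                  std::function<Config(const Config&, std::mt19937&)> propose,
                  std::function<double(const Config&)> pi,   // likelihood
                  std::function<double(const Config&)> psi,  // interaction MRF
                  std::function<double(const Config&, const Config&)> proposalRatio,
                  std::mt19937& rng) {
  std::uniform_real_distribution<double> unif(0.0, 1.0);
  Config current = init;
  std::vector<Config> chain;
  for (int i = 0; i < N + NB; ++i) {
    Config proposed = propose(current, rng);
    // beta = min(1, pi(X') Q(X|X') Psi(X') / (pi(X) Q(X'|X) Psi(X)))
    double num = pi(proposed) * psi(proposed) * proposalRatio(current, proposed);
    double den = pi(current) * psi(current);
    double beta = den > 0.0 ? std::min(1.0, num / den) : 1.0;
    if (unif(rng) < beta) current = proposed;   // accept, else keep current
    if (i >= NB) chain.push_back(current);      // discard burn-in samples
  }
  // MAP shortcut: return the last accepted configuration; the full tracker
  // instead histograms the chain and picks the most frequent configuration.
  return chain.empty() ? current : chain.back();
}
```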
3.2.2 Proposal Moves
At each iteration of the RJMCMC PF, a proposal move on only one randomly chosen dimension is proposed. Recall that three moves are considered, namely $m = \{Add, Update, Remove\}$. The move privileged at each iteration is drawn from $q_m$, the jump move distribution; the probabilities of Add, Update, and Remove are set to 0.2, 0.6, and 0.2, respectively. These proposal moves make use of their associated proposal densities $Q(\cdot)$. The proposal densities rely on two masking maps: a map built from the detected targets, and a map built from the tracked (MAP estimate) targets. Assuming the number of persons detected at time $t$ is $M_t$, each detection can be represented as $p = (x, y)$, giving an associated mask map as a Gaussian mixture with each detection as a Gaussian of mean $p$ and assumed variance $\Sigma$ (equation 1). Similarly, each tracked target in the MAP estimate at $t-1$ is used to build a masking map as a Gaussian mixture with the $\hat{x}_j$ as mean values, where $j \in \{1, \ldots, N_t\}$ and $N_t$ is the total number of tracked targets (equation 2).

$$S_t^d(Z_t) = \sum_{j=0}^{M_t} \mathcal{N}(\cdot\,; p_j, \Sigma) \quad (1)$$

$$S_t^{map}(\hat{X}_{t-1}) = \sum_{j=0}^{N_t} \mathcal{N}(\cdot\,; \hat{x}_j, \Sigma) \quad (2)$$
Add. The add move randomly selects a detected person, $x_p$, from the multi-modal detector and appends its state vector to $X_t^{i-1}$, resulting in a proposed state $X'$. The proposal density driving the add move when computing the acceptance ratio, $Q_{add}(X' | X_t^{i-1})$, is given in equation 3: the mask map from the detected targets is multiplied by a map derived from the tracked targets mask map. This distribution has higher values whenever an add is proposed at locations conforming to detected targets that are not yet being tracked.

$$Q_{add}(X' | X_t^{i-1}) = S_t^d(Z_t) \cdot (1 - S_t^{map}(\hat{X}_{t-1})) \quad (3)$$

Figure 3 illustrates the derivation of $Q_{add}(X' | X_t^{i-1})$ at a certain time $t$. Figure 3(a), bottom, shows an inverted mask derived from the tracked targets at time $t-1$, and figure 3(b) shows the detected targets. Finally, the distribution $Q_{add}(X' | X_t^{i-1})$ is derived by multiplying both (figure 3(c), bottom). The derived distribution shows higher values in the region near the detected target that is not being tracked, favoring its addition. The top image in figure 3(c) shows the effect of the mask on the actual video image and is presented here solely for clarity.
Figure 3: Derivation of $Q_{add}(X' | X_t^{i-1})$ from tracked targets and detections. White represents high values, black low values.
Remove. The remove move randomly selects a tracked person from the particle being considered, $X_t^{i-1}$, and removes it, proposing a new state $X'$. Contrary to the add move, the proposal density used when computing the acceptance ratio, $Q_{remove}(X' | X_t^{i-1})$ (equation 4), is given by the mask map from the tracked targets multiplied by a map derived from the detected targets. This density ensures that targets that are no longer detected but are still being tracked have high values.

$$Q_{remove}(X' | X_t^{i-1}) = (1 - S_t^d(Z_t)) \cdot S_t^{map}(\hat{X}_{t-1}) \quad (4)$$

Figure 4 depicts the derivation of the $Q_{remove}(X' | X_t^{i-1})$ distribution: a tracked target has just left the scene, but the tracker still has the person in its state (figure 4(a), bottom). The detector returns one detection, corresponding to the person still in the scene (figure 4(b)). As illustrated, the final $Q_{remove}(X' | X_t^{i-1})$, figure 4(c) bottom, shows high values for the target which left the scene, favoring its removal. Figure 4(c) top illustrates the effect of the mask on the video feed, all black meaning that no target in the camera field of view should be removed.
Figure 4: Illustration of the derivation of $Q_{remove}(X' | X_t^{i-1})$. White represents high values, black low values.
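To make equations 1-4 concrete, the sketch below evaluates both mask maps and the resulting add/remove proposal densities at a ground-plane query point. The isotropic standard deviation, the clamping of the mixtures to [0, 1], and the names are assumptions of this sketch, not the implementation on Rackham.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Pt { double x, y; };

// Unnormalized isotropic Gaussian mixture evaluated at point q
// (equations 1 and 2); 'sigma' stands in for an isotropic Sigma.
double gaussMixture(const std::vector<Pt>& centers, Pt q, double sigma) {
  double s = 0.0;
  for (const Pt& c : centers) {
    double d2 = (q.x - c.x) * (q.x - c.x) + (q.y - c.y) * (q.y - c.y);
    s += std::exp(-d2 / (2.0 * sigma * sigma));
  }
  return std::min(s, 1.0);  // clamp so (1 - S) remains a valid mask value
}

// Q_add (eq. 3): high where a detection exists but no target is tracked yet.
double qAdd(const std::vector<Pt>& detections, const std::vector<Pt>& tracked,
            Pt q, double sigma) {
  return gaussMixture(detections, q, sigma) * (1.0 - gaussMixture(tracked, q, sigma));
}

// Q_remove (eq. 4): high where a target is tracked but nothing is detected.
double qRemove(const std::vector<Pt>& detections, const std::vector<Pt>& tracked,
               Pt q, double sigma) {
  return (1.0 - gaussMixture(detections, q, sigma)) * gaussMixture(tracked, q, sigma);
}
```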
Update. In the update proposal move, the state vector of a randomly chosen target is perturbed by a zero-mean normal distribution. The update proposal density, $Q_{update}(X' | X_t^{i-1})$, is a normal distribution centered on the position of the newly updated target. Hence, the acceptance ratio is influenced by the likelihood evaluation and the interaction amongst the targets.
3.2.3 Interaction Model

Similar to (Khan et al., 2005) and (Smith et al., 2005), a Markov Random Field (MRF) is adopted to model the interactions between nearby targets. The MRF is defined on an undirected graph, with targets defining the nodes of the graph and links created at each time-step between pairs of proximate targets. A pairwise MRF, where the cliques are restricted to the pairs of nodes that are directly connected in the graph, is implemented as part of our tracker. For a given state $X$, the MRF model is given by equation 5; $\phi(x_i, x_j)$ evaluates to zero if two targets are in the same position, penalizing the fitting of two trackers to the same object during overlap (interaction).

$$\Psi(X) = \prod_{i \neq j} \phi(x_i, x_j), \qquad \phi(x_i, x_j) = 1 - \exp\left(-\left(\frac{d(x_i, x_j)}{\sigma}\right)^2\right) \quad (5)$$

where $d(x_i, x_j)$ is the Euclidean distance, $i, j \in \{1, \ldots, N\}$, and $N$ is the number of targets in $X$.
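A direct transcription of equation 5 follows: a minimal sketch taking each unordered pair of targets once (the product over $i \neq j$ visits each pair twice, which only squares the value); the function and type names are ours.

```cpp
#include <cmath>
#include <vector>

struct Target { double x, y; };

// Pairwise MRF interaction term of equation 5: the product over target
// pairs of phi(x_i, x_j) = 1 - exp(-(d/sigma)^2). The value drops towards
// zero as two targets approach the same position, penalizing configurations
// where two trackers lock onto the same object.
double interactionPsi(const std::vector<Target>& targets, double sigma) {
  double psi = 1.0;
  for (size_t i = 0; i < targets.size(); ++i) {
    for (size_t j = i + 1; j < targets.size(); ++j) {
      double dx = targets[i].x - targets[j].x;
      double dy = targets[i].y - targets[j].y;
      psi *= 1.0 - std::exp(-(dx * dx + dy * dy) / (sigma * sigma));
    }
  }
  return psi;
}
```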
3.2.4 Observation Likelihood

The likelihood measure is derived from the raw 2D laser range data. Every segmented blob is filtered to keep blobs within a range of radii; this filters out laser data pertaining to walls, thin table or chair legs, and other too-narrow or too-wide structures. Every filtered blob is then represented as a Gaussian centered on the centroid of the blob, and the resulting mixture of Gaussians makes up the likelihood map of our tracker. Given a state $X'$, its likelihood is evaluated as the sum of the likelihood values at the position of each target, averaged over the number of targets (equation 6):

$$\pi(X') = \frac{1}{N_t} \sum_{i=0}^{N_t} s_t^{lik}(Z_t)\big|_{(x_i, y_i)}, \qquad s_t^{lik}(Z_t) = \sum_{j=0}^{N_b} \mathcal{N}(\cdot\,; Z_{t,j}, \Sigma) \quad (6)$$

where $N_t$ is the number of targets in $X'$, and $N_b$ is the number of blobs formed from the laser reading $Z_t$.
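As an illustration, here is a sketch of equation 6, assuming an isotropic, unnormalized Gaussian per blob centroid; the blob filtering by width is presumed to have already been done, and the names are ours.

```cpp
#include <cmath>
#include <vector>

struct P { double x, y; };

// Likelihood of a configuration (equation 6): the average, over the targets
// in X', of a Gaussian-mixture likelihood map whose components are centered
// on the centroids of the filtered (non-empty) laser blobs.
double likelihood(const std::vector<P>& targets,
                  const std::vector<std::vector<P>>& blobs, double sigma) {
  // Centroid of each filtered blob defines one mixture component.
  std::vector<P> centroids;
  for (const auto& b : blobs) {
    P c{0.0, 0.0};
    for (const P& p : b) { c.x += p.x; c.y += p.y; }
    c.x /= b.size(); c.y /= b.size();
    centroids.push_back(c);
  }
  if (targets.empty()) return 0.0;
  double sum = 0.0;
  for (const P& t : targets) {
    double s = 0.0;                       // s_t^lik evaluated at this target
    for (const P& c : centroids) {
      double d2 = (t.x - c.x) * (t.x - c.x) + (t.y - c.y) * (t.y - c.y);
      s += std::exp(-d2 / (2.0 * sigma * sigma));
    }
    sum += s;
  }
  return sum / targets.size();            // average over the N_t targets
}
```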
4 EXPERIMENTS
4.1 Robotic Platform
The target robotic platform, Rackham, is an iRobot B21r mobile platform (figure 5). Its standard equipment has been extended with one pan-tilt Sony EVI-D70 camera, one digital camera mounted on a Directed Perception pan-tilt unit (PTU), one ELO touch screen, a pair of loudspeakers, an optical fiber gyroscope, wireless Ethernet, and an RF system for detecting RFID tags. It integrates two PCs (one mono-CPU and one bi-CPU PIII running at 850 MHz). Rackham also has an LMS200 SICK Laser Range Finder as standard equipment. All these devices give Rackham the ability to operate in public areas as a service robot. The digital camera with the Directed Perception PTU is dedicated to the person following activity along with the RF system, whereas the Sony EVI-D70 camera is used for the multi-person (passer-by) detection and tracking.

Figure 5: Rackham with its on-board sensors.

Rackham's software architecture is based on the GenoM architecture for autonomy (Alami et al., 1998). All its functionalities are embedded in modules created by GenoM using a C/C++ interface. Accordingly, the multi-modal person detector and the MCMC PF tracker described above are implemented as GenoM modules.
4.2 Offline Evaluations
The offline evaluation covers both the multi-modal person detector and the RJMCMC PF tracker, evaluated offline on real data acquired with Rackham.
4.2.1 Multi-modal Detector Evaluations
In all the experiments, a 5 meter radius around the robot is considered for detection and tracking. The camera has a 45° field of view, leaving the rest of the laser scanner field of view, 135°, for laser-only detection. To evaluate the multi-modal person detector, a dataset containing a total of 2872 frames is used. To quantify the performance of the multi-modal person detector, two measures are used, namely True Positive Rate (TPR) and False Positives Per Image (FPPI):
True Positive Rate (TPR): the ratio of correctly detected targets to the total number of targets present, averaged over the entire dataset, i.e. $\frac{1}{J_t} \sum_{k,j} \delta_{k,j}$, where $\delta_{k,j} = 1$ if target $j$ is detected in frame $k$ and 0 otherwise, and $J_t$ is the total number of targets present in the entire dataset.

False Positives Per Image (FPPI): the number of false positives per frame, averaged over the entire dataset, i.e. $\frac{1}{K} \sum_{k,j} \delta_{k,j}$, where $\delta_{k,j} = 1$ if a target $j$ is detected when there is actually no target in frame $k$ and 0 for a correct detection, and $K$ is the total number of frames in the entire dataset.
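A minimal sketch of how these two measures can be computed from per-frame detections and hand-labelled ground truth; the greedy nearest-first matching and the 30 cm association radius (the criterion used below) are assumptions of this sketch.

```cpp
#include <cmath>
#include <vector>

struct Det { double x, y; };

// Compute TPR and FPPI over a dataset: detections[k] and truth[k] hold the
// ground-plane positions for frame k (same number of frames assumed). A
// detection within 'radius' (30 cm in the paper) of an unmatched
// ground-truth target counts as a true positive; everything else is a
// false positive.
void evalDetector(const std::vector<std::vector<Det>>& detections,
                  const std::vector<std::vector<Det>>& truth,
                  double radius, double* tpr, double* fppi) {
  int tp = 0, fp = 0, totalTargets = 0;
  for (size_t k = 0; k < detections.size(); ++k) {
    std::vector<bool> matched(truth[k].size(), false);
    totalTargets += (int)truth[k].size();
    for (const Det& d : detections[k]) {
      bool hit = false;
      for (size_t j = 0; j < truth[k].size(); ++j) {
        if (!matched[j] &&
            std::hypot(d.x - truth[k][j].x, d.y - truth[k][j].y) <= radius) {
          matched[j] = true; hit = true; break;   // greedy one-to-one match
        }
      }
      hit ? ++tp : ++fp;
    }
  }
  *tpr = totalTargets ? (double)tp / totalTargets : 0.0;
  *fppi = detections.empty() ? 0.0 : (double)fp / detections.size();
}
```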
All 2872 frames were hand labelled with the (x, y) positions of persons on the ground plane based on the laser data. A true positive occurs whenever a detection is within a 30 cm radius of the ground truth. A Receiver Operating Characteristic (ROC) curve, the TPR vs FPPI graph shown in figure 6, is generated by relaxing and/or tightening the geometric constraints for leg detection. To verify that the multi-modal detector is superior to the laser-only detector, the experiment was also carried out with the laser-only detector. Hence, ROC curves are generated both for the LRF-only person detector (§2.1) and for the multi-modal person detector (§2.3).

Figure 6: ROC curve (TPR vs FPPI) comparing the performance of the LRF-only detector and the multi-modal detector.
Looking at the ROC curve in figure 6, it can be seen that the addition of the visual detector improves the overall detection performance. On top of this performance improvement, rich discriminative information is obtained whenever a target is within the field of view of the camera. Balancing true detections against the false positive rate, the multi-modal person detector is set to operate at the point TPR = 0.72 and FPPI = 1.6. Sample detections obtained at this operating point are shown in figure 7.
Figure 7: Sample person detections with the multi-modal detector.
4.2.2 Multi-person Tracker Evaluations
Similarly, to evaluate the performance of the MCMC PF multi-person tracker, two complete sequences are used:

Sequence I. A sequence of 785 frames containing two moving targets.

Sequence II. A sequence of 507 frames with two moving targets, in which other targets occasionally appear and disappear in the tracking area.
As performance measures, the following three quantities are computed:

Tracking Success Rate (TSR): given by $\frac{1}{J_t} \sum_{k,j} \delta_{k,j}$, where $\delta_{k,j} = 1$ if target $j$ is tracked at frame $k$, else 0; $J_t = \sum_k j_k$, where $j_k$ is the number of persons in the tracking area at frame $k$.

Ghost Rate (GR): the number of candidate targets over no actual target (ghosts), averaged over the total number of targets in the dataset, i.e. $\frac{1}{J_t} \sum_{k,j} \delta_{k,j}$ with $\delta_{k,j} = 1$ if tracked target $j$ is a ghost at frame $k$, else 0.

Precision Error (PE): measures how precisely the targets are tracked, as the sum of the squared errors between the tracker position estimates and the ground truth, averaged over the entire sequence.
For each sequence, a hand-labelled ground truth with the (x, y) position and a unique Id for each person is used. As for detection, a person is considered to be correctly tracked (true success) if the tracked position is within a 30 cm radius of the ground truth. All the Gaussians used to build the associated distributions are constructed in polar form $(\rho, \theta)$ with standard deviations $\sigma_\rho = 30\,\mathrm{cm}$ and $\sigma_\theta = 0.157\,\mathrm{rad}$; these values are set to account for the uncertainty of a single step of an average walking person. Whenever a target is in the camera field of view, a histogram of the region subtended by the target is cached in memory. This histogram is used to overcome discontinuities in tracking when a tracked target is removed and re-initialized as a new target due to subsequent mis-detections. It is also used to disambiguate associations whenever targets come close to each other and some configurations within the state sample another target's state.
For evaluation, each sequence is run ten times to account for the stochastic nature of the filter. Results are reported as mean values with associated standard deviations in table 1. The results show that our multi-modal person tracker performs well on the two evaluation sequences, with a 73.3% tracking success rate on the first sequence. What should be highlighted here is that the detector plus tracker makes no use of a priori knowledge of the environment: the experiments were carried out in a highly cluttered environment containing many artifacts that resemble the legs of a person, and the field of view of the camera is very narrow.
Figure 9: A sequence of frames from the online obstacle avoidance scenario based on the multi-modal person detector. The top row shows the Human-Robot situation, the middle row user tracking for person following, and the bottom row multi-modal person detections on the ground plane.
Table 1: Results of the MCMC PF multi-person tracker.

Seq.  TSR             GR              PE (cm)
I     0.733 ± 0.074   1.221 ± 0.078   7.93 ± 0.68
II    0.62 ± 0.078    1.355 ± 0.297   8.37 ± 1.13
Figure 8: Sample snapshots taken from the multi-person tracking on sequence I at the 35th, 51st, and 259th frames, respectively.
The average position precision of the tracker is below 9 cm. An average of one Id switch per run was observed on the first sequence, and two on the second.

Figure 8 shows sample snapshots² taken from tracking runs of sequence I. The top row shows the tracking on the video feed, the middle row shows the particle swarm, and the bottom row shows the trajectories of the tracked persons.

² A video of the tracking sequence is available at http://homepages.laas.fr/aamekonn/videos.htm
4.3 Online Robotic Evaluations
The online robotic evaluation corresponds to the experiments carried out on Rackham. As mentioned in §4.1, the multi-modal detector and tracker are implemented in C/C++, embedded in GenoM modules. Both the detector and the tracker run on the same computer, while the LRF scan data is acquired through the second computer. The multi-modal detector alone runs at between 1.5 fps and 4.5 fps, depending on the number of hypotheses generated for the visual detector. The rate at which the combined system runs varies with the number of tracked persons and the number of hypotheses generated by the laser for the visual detector; in our experiments, an approximate minimum of 0.7 frames per second was observed.
Recall that the end goal is to realize a person following service robot with passer-by avoidance. The person following activity presented in (Germa et al., 2009) and depicted in the shaded area of figure 1 is based on an RFID system and a visual camera: a user (tagged person) wearing an RFID tag is tracked and followed by the robot, irrespective of out-of-field-of-view situations or occlusions. To check the integration of both systems, an experiment was carried out in which a tagged person is followed while a simple control law with a rotative repulsive potential is used to avoid passers-by, based on the multi-modal detector only. Figure 9 shows a sequence of the video during person following with obstacle avoidance based only on the multi-modal detector.
5 CONCLUSIONS
To conclude, this paper presented multi-modal person detection and tracking from a mobile robot based on an LRF and vision, intended for socially acceptable navigation in crowded scenes during a person following activity. Though a person following scenario is considered, the framework is applicable to any service robot activity in a crowded public environment where perception of the whereabouts and dynamics of the surrounding persons is required. It has been clearly shown that the multi-modal approach outperforms its single-sensor counterparts, taking detection, subsequent use, computation time, and precision all into account. Results obtained from offline and online robotic experiments have been reported, supporting this statement.

Currently, investigations are under way to use a LadyBug2 spherical camera to further improve the detection and tracking by taking advantage of its wide field of view. Preliminary investigations are also under way into navigation schemes that exploit the spatio-temporal information provided by our multi-target tracker.
REFERENCES
Alami, R., Chatila, R., Fleury, S., Ghallab, M., and Ingrand, F. (1998). An architecture for autonomy. Int. Journal of Robotics Research (IJRR'98), 17:315-337.

Breitenstein, M., Reichlin, F., Leibe, B., Koller-Meier, E., and Van Gool, L. (2009). Robust tracking-by-detection using a detector confidence particle filter. In IEEE 12th Int. Conf. on Computer Vision (ICCV'09), pages 1515-1522.

Calisi, D., Iocchi, L., and Leone, G. R. (2007). Person following through appearance models and stereo vision using a mobile robot. In Proceedings of the Int. Workshop on Robot Vision, pages 46-56.

Chen, Z. and Birchfield, S. T. (2007). Person following with a mobile robot using binocular feature-based tracking. In Proceedings of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS'07).

Cui, J., Zha, H., Zhao, H., and Shibasaki, R. (2005). Tracking multiple people using laser and vision. In Proceedings of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS'05), pages 1301-1306.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the Int. Conf. on Computer Vision and Pattern Recognition (CVPR'05), pages 886-893.

Felzenszwalb, P. F., Girshick, R. B., and McAllester, D. (2009). Discriminatively trained deformable part models, release 3. http://people.cs.uchicago.edu/~pff/latent-release3/.

Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI'10), 32(9):1627-1645.

Fong, T. W., Nourbakhsh, I., and Dautenhahn, K. (2003). A survey of socially interactive robots. Robotics and Autonomous Systems.

Germa, T., Lerasle, F., Ouadah, N., Cadenat, V., and Devy, M. (2009). Vision and RFID-based person tracking in crowds from a mobile robot. In Proceedings of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS'09), pages 5591-5596. IEEE Press.

Isard, M. and MacCormick, J. (2001). BraMBLe: a Bayesian multiple-blob tracker. In Proceedings of the 8th IEEE Int. Conf. on Computer Vision (ICCV'01), volume 2, pages 34-41.

Khan, Z., Balch, T., and Dellaert, F. (2005). MCMC-based particle filtering for tracking a variable number of interacting targets. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI'05), 27(11):1805-1819.

Laptev, I. (2006). Improvements of object detection using boosted histograms. In Proceedings of the British Machine Vision Conference (BMVC'06), pages 949-958.

Rasmussen, C. and Hager, G. D. (2001). Probabilistic data association methods for tracking complex visual objects. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI'01), 23(6):560-576.

Reid, D. (1979). An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 24(6):843-854.

Schiele, B., Andriluka, M., Majer, N., Roth, S., and Wojek, C. (2009). Visual people detection: Different models, comparison and discussion. In Proceedings of the IEEE ICRA 2009 Workshop on People Detection and Tracking, pages 1-8.

Smith, K., Gatica-Perez, D., and Odobez, J.-M. (2005). Using particles to track varying numbers of interacting people. In Proceedings of the Int. Conf. on Computer Vision and Pattern Recognition (CVPR'05), pages 962-969, Washington, DC, USA. IEEE Computer Society.

Spinello, L., Triebel, R., and Siegwart, R. (2008). Multimodal people detection and tracking in crowded scenes. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI'08), pages 1409-1414. AAAI Press.

Xavier, J., Pacheco, M., Castro, D., and Ruano, A. (2005). Fast line, arc/circle and leg detection from laser scan data in a player driver. In Proceedings of the Int. Conf. on Robotics and Automation (ICRA'05).

Zivkovic, Z. and Kröse, B. (2007). Part based people detection using 2D range data and images. In Proceedings of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS'07).