Comparison of Multi-shot Models for Short-term Re-identiﬁcation of

People using RGB-D Sensors

Andreas Møgelmose, Chris Bahnsen and Thomas B. Moeslund

Visual Analysis of People Lab, Aalborg University, Rendsburggade 14, 9000, Aalborg, Denmark

Keywords:

Re-identiﬁcation, Systems, Color Features.

Abstract:

This work explores different types of multi-shot descriptors for re-identiﬁcation in an on-the-ﬂy enrolled envir-

onment using RGB-D sensors. We present a full re-identiﬁcation pipeline complete with detection, segmenta-

tion, feature extraction, and re-identiﬁcation, which expands on previous work by using multi-shot descriptors

modeling people over a full camera pass instead of single frames with no temporal linking. We compare two

different multi-shot models, mean histogram and histogram series, and test each of them in 3 different color

spaces. Both histogram descriptors are assisted by a depth-based pruning step where unlikely candidates are

ﬁltered away. Tests are run on 3 sequences captured in different circumstances and lighting situations to ensure

proper generalization and lighting/environment invariance.

1 INTRODUCTION

The task of person re-identiﬁcation is about recogniz-

ing people that have been captured earlier by a camera

in a surveillance network. The network may consist of

one or more cameras, and can be placed in traditional

surveillance contexts or more narrowly scoped areas,

such as keeping track of a single queue of people. The

objective is simple: When a person enters the ﬁeld of

view of a camera in the system, it must be determined

whether or not this person has been seen before. Person re-identification is closely related to person tracking and person recognition. However, it has several extra challenges that make it less straightforward (Møgelmose et al., 2013b):

• There is no fully known gallery dataset. As op-

posed to traditional person recognition, the sys-

tem must enroll new people on-the-ﬂy, without

them taking any action.

• Methods must be robust to pose changes. Since

subjects are not required to participate actively,

there are only weak constraints on pose and view-

ing angles.

• Sensor resolution is a big challenge. People

simply passing by at various distances are to be

re-identiﬁed, so it is not reasonable to use hard

biometrics like ﬁngerprints or face recognition.

• The database of known people must be continu-

ally cleaned up - when a person has not been seen

for some period of time, they have most likely left

the area and should be removed from the database.

There are two fundamentally different approaches

to re-identiﬁcation: Single-shot and multi-shot.

Single-shot performs the re-identiﬁcation on stand-

alone frames. This is useful in situations where only a

single probe image is available. However, very often

the subject has been captured on video, and thus has

several frames describing her. Multi-shot combines a

full pass across the ﬁeld of view into a single model,

which is then used as probe in a gallery of similarly

collected multi-shot models. Multi-shot gives the op-

tion of capturing more information about the subject

than a single frame contains, and has the potential to

make the system more robust to occlusions and sud-

den changes in lighting.

Person re-identiﬁcation has been in active re-

search for a while, but multi-modal systems have

only recently come into play. The reason for this

is twofold: 1) Algorithms have so far mostly been

developed for use in existing surveillance infrastruc-

ture and 2) more advanced sensor capabilities, such

as depth and thermal, have not been readily avail-

able. We believe that as sensor technology pro-

gresses, more modalities will show up in regular sur-

veillance cameras, making the development of new

multi-modal algorithms highly relevant.

This work builds on the method presented in (Mø-

gelmose et al., 2013b) and is a full RGB-D based re-

identiﬁcation system covering all parts of the pipeline


Møgelmose A., Bahnsen C. and Moeslund T..

Comparison of Multi-shot Models for Short-term Re-identiﬁcation of People using RGB-D Sensors.

DOI: 10.5220/0005266402440251

In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISAPP-2015), pages 244-251

ISBN: 978-989-758-090-1

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

from detection through re-identiﬁcation to database

maintenance. The main contributions are:

• While the earlier work was single-shot based,

the method has been updated to a multi-shot ap-

proach. This work compares several different

multi-shot person models.

• The earlier work relied on RGB-color histograms.

This work presents a comparison of three different

color spaces: RGB, HSV, and XYZ.

• More thorough testing. On top of testing on the

original dataset from (Møgelmose et al., 2013b),

two more datasets have been captured to test the

performance in different circumstances.

• The system is now free of arbitrary thresholds in

the re-identiﬁcation stage, as every threshold is

learned from training data in a cross-validation

scheme.

• In the original work, the height of subjects only

had little inﬂuence on the re-id performance. We

introduce a more thorough pruning step based on

depth-adjusted height of subjects which increases

re-id performance signiﬁcantly.

The remainder of this paper is structured as fol-

lows: Section 2 gives an overview of related work

in the ﬁeld of re-identiﬁcation. It also contains a

description of existing datasets, as well as the ones

captured and used in this work. Section 3 explains

the algorithms used and goes through detection and

segmentation, multi-shot person modeling, and re-

identiﬁcation. In section 4 the various methods

presented are evaluated against each other. Section

5 concludes the paper.

2 RELATED WORK

Person re-identiﬁcation as described above has been

an active research area for about a decade and truly

gained speed in the latter half of the 2000s. A relat-

ively recent survey on person re-identiﬁcation can be

found in (Doretto et al., 2011), and in this section we

highlight notable recent papers. As mentioned pre-

viously, re-identiﬁcation approaches can be divided

into single-shot and multi-shot. Furthermore, we dis-

tinguish whether multi-modal methods are used.

Zheng et al. (Zheng et al., 2011) and Zhao et al. (Zhao et al., 2013) both use single-shot algorithms. The former use color and texture histograms, whereas the latter use dense color histograms and SIFT descriptors, with the addition of a saliency map to decide which parts of the person are the most descriptive.

Multi-shot is championed by Bak et al. (Bak et al., 2012) and Demirkus et al. (Demirkus et al., 2010). Bak et al. use a large pool of features, from which the best one to describe a particular person is selected. Demirkus et al. use a set of more directly understandable soft biometrics, such as gender, hair color, and clothing color.

Moving away from the traditional visible light modality, Jüngling and Arens (Jüngling and Arens, 2010) present a full single-shot re-identification pipeline based on infrared images. It detects candidates, then tracks and re-identifies them using SIFT features. In the depth modality, Barbosa et al. (Barbosa et al., 2012) re-identify by comparing various physical body measurements (anthropometrics) obtained from the depth image. Velardo and Dugelay (Velardo and Dugelay, 2012) use manually measured anthropometrics to prune the set of candidates for face recognition.

Finally, two papers combine several modalities. In

(Møgelmose et al., 2013b) RGB is used for detection

and re-identiﬁcation, and depth for segmentation and

pruning of re-id candidates. This is the same basic ap-

proach as in this work. In (Møgelmose et al., 2013a),

thermal images and anthropometric measurements are

added and the re-identiﬁcation is performed in a truly

multi-modal way with a combination of color histo-

grams, SIFT features on thermal images, and anthro-

pometric measurements obtained from depth images.

2.1 Datasets

Several public datasets exist, though mostly sets cap-

tured with traditional visible light sensors.

In other modalities, not many exist. For depth,

the RGB-D Person Re-identiﬁcation Dataset (Barbosa

et al., 2012) is one option. It contains 79 people in 4

different scenarios: Walking slowly with outstretched

arms, two instances of walking from a frontal view-

point, and walking from a rear viewpoint.

For this work, we use our own dataset with a

surveillance-like camera setup. We have three se-

quences: Novi, Basement, and Hallway. They all

contain sequences of persons walking diagonally to-

wards and past the sensor twice. Novi, which was also

used in (Møgelmose et al., 2013b), contains 22 per-

sons over 7800 frames (passes have varying lengths).

Basement contains 35 persons over 7231 frames, and

Hallway contains 10 persons over 4492 frames. Statistics on our three sequences can be seen in table 1. The sequences were captured with

Microsoft Kinect for Xbox. Example pictures from

each sequence can be seen in ﬁg. 1.

ComparisonofMulti-shotModelsforShort-termRe-identificationofPeopleusingRGB-DSensors

245

(a) (b) (c)

Figure 1: Example images from our own (a) Novi, (b) Basement, and (c) Hallway sequences.

Figure 2: Illustration of the flow through the system. Stages shown: Detection, Segmentation, Ground plane estimation, Generate model, Re-identification.

Table 1: Statistics on the three data sequences used in this work.

                           Novi         Basement              Hallway
Number of persons          22           35                    10
Number of frames           7800         7231                  4492
Contains image sequences   Yes          Yes                   Yes
Available modalities       RGB, depth   RGB, depth, thermal   RGB, depth, thermal

3 ALGORITHM OVERVIEW

This paper describes a full re-identiﬁcation system

which takes a raw RGB-D feed as input and outputs

whether or not a passing person has been seen before,

and if so, what the previous ID was. This is different

from many other re-identiﬁcation papers which most

often describe a core algorithm without much focus

on all the other system parts that must be in place to

have an actual working system. The process requires

several steps: Persons must be detected and segmen-

ted, they must be modeled, and ﬁnally re-identiﬁed.

On top of the re-identiﬁcation process comes the pro-

cess of keeping tabs on the person database. A ﬂow-

chart is shown in ﬁg. 2.

3.1 Detection and Segmentation

The detection is done with a standard HOG-detector

as ﬁrst proposed by Dalal and Triggs (Dalal and

Triggs, 2005). The detector is trained on the INRIA

Person Dataset introduced by the same paper. The

detector runs on the RGB images and returns person

bounding boxes.

The detected persons need to be segmented in fur-

ther detail. The bounding box is not sufﬁcient, since

we do not want to capture features from the back-

ground. Segmentation is achieved with a ﬂood ﬁll in

the depth image. Persons not crawling on the ﬂoor

are conveniently separated from the background in the

depth modality, so a ﬂood ﬁll to similar pixels starting

at the points

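\[
X =
\begin{bmatrix}
2/5 & 1/4 \\
2/5 & 1/3 \\
2/5 & 2/5 \\
1/2 & 1/4 \\
1/2 & 1/3 \\
1/2 & 2/5 \\
3/5 & 1/4 \\
3/5 & 1/3 \\
3/5 & 2/5
\end{bmatrix}
\begin{bmatrix}
b_w & 0 \\
0 & b_h
\end{bmatrix}
+
\begin{bmatrix}
b_x & b_y \\
\vdots & \vdots \\
b_x & b_y
\end{bmatrix}_{9 \times 2}
\tag{1}
\]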

where X is a 9x2 matrix containing the x and y co-

ordinates of the ﬂood ﬁll points, b is the bounding

box with subscript x, y, w, and h meaning top-left

x-coordinate, top-left y-coordinate, width, and height

respectively. The ﬂood ﬁll is performed at multiple

positions to ensure that we have a stable object in the

depth modality. A person is classiﬁed as stable if at

least j depth points converge, i.e. the ﬂood ﬁll of these

points ﬁll out the same volume. For this implementa-

tion, j = 4.
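The seed-point computation and the stability test can be sketched as follows. This is a minimal illustration, assuming that equation 1 scales the listed fractions by the bounding-box width and height and then adds the top-left corner. The function names, the rounding to integer pixels, and the idea of passing one region label per seed are assumptions made for the example, not details given in the text.

```python
from collections import Counter
from fractions import Fraction

# Fractional offsets listed in equation 1: three horizontal positions
# combined with three vertical positions inside the bounding box.
X_FRACTIONS = (Fraction(2, 5), Fraction(1, 2), Fraction(3, 5))
Y_FRACTIONS = (Fraction(1, 4), Fraction(1, 3), Fraction(2, 5))


def seed_points(bx, by, bw, bh):
    """Return the 9 (x, y) flood-fill seed pixels for a bounding box.

    bx, by are the top-left corner; bw, bh are width and height.
    Truncation to whole pixels is an assumption of this sketch.
    """
    points = []
    for fx in X_FRACTIONS:
        for fy in Y_FRACTIONS:
            # Scale the fraction by the box size, then shift by the corner.
            points.append((bx + int(fx * bw), by + int(fy * bh)))
    return points


def is_stable(region_labels, j=4):
    """True if at least j seeds flood-filled the same region.

    region_labels holds one hashable label per seed (hypothetical: produced
    by whatever flood-fill routine the caller uses).
    """
    if not region_labels:
        return False
    return max(Counter(region_labels).values()) >= j
```

For a bounding box with top-left corner (10, 20), width 100, and height 200, the first seed lands at (50, 70).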

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

246

Figure 3: The left image illustrates a detection. On the right, the person has been segmented in the depth image, and the blue boxes mark the regions used as the basis for the color histograms.

3.1.1 Ground Plane Estimation

One problem with the ﬂood ﬁll is that at the feet of

the subject, the ﬁll is likely to spill onto the ﬂoor. To

counter this, ground plane pixels on the depth image

are removed. When the system is started initially, a

ground plane is deﬁned in the depth image. This is

done by marking a number of points on the ground

and performing a least squares solution of the bivari-

ate polynomial:

\[ z_{\mathrm{poly}} = a_1 x + a_2 y + a_3 xy \tag{2} \]

Here $z_{\mathrm{poly}}$ is the ground-plane depth predicted at pixel position $(x, y)$. Although the floor is planar, the Kinect depth sensor measures it as a hyperbolic surface, hence the need for a bivariate polynomial. When the coefficients are determined, any pixel in the depth image close to the ground plane is colored black. Those pixels are the ones fulfilling the inequality in equation (3), where $p$ is the depth value of the pixel in question and $t_{\mathrm{depth}}$ defines the distance from the theoretical ground plane that is still considered part of that plane.

\[ |z_{\mathrm{poly}} - p| < t_{\mathrm{depth}} \tag{3} \]
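The ground-plane fit and the removal test can be sketched in plain Python. This is an illustration only: it assumes the three-term polynomial of equation 2 (terms in x, y, and xy, with no constant), solves the least-squares problem through the normal equations, and uses hypothetical function names. The paper does not state which solver was used.

```python
def solve3(matrix, vector):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    aug = [list(matrix[r]) + [vector[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("marked points do not determine the polynomial")
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        rest = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - rest) / aug[r][r]
    return solution


def fit_ground_plane(points):
    """Least-squares fit of depth = a1*x + a2*y + a3*x*y to (x, y, depth) samples."""
    ata = [[0.0] * 3 for _ in range(3)]
    atz = [0.0] * 3
    for x, y, depth in points:
        row = (x, y, x * y)
        for r in range(3):
            atz[r] += row[r] * depth
            for c in range(3):
                ata[r][c] += row[r] * row[c]
    return solve3(ata, atz)


def is_ground(coeffs, x, y, depth, t_depth):
    """Equation 3: the pixel lies within t_depth of the fitted ground surface."""
    a1, a2, a3 = coeffs
    return abs(a1 * x + a2 * y + a3 * x * y - depth) < t_depth
```

Pixels for which `is_ground` returns true would be set to black before the flood fill, so that the fill cannot spill onto the floor.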

3.2 Person Model

One of the objectives of this paper is to compare two

types of multi-shot person models. They are both

based on the two-part color histogram used in (Mø-

gelmose et al., 2013b): After a person is segmented, a

color histogram is computed for the upper part of the

body and the lower part of the body (as illustrated by

the blue boxes in ﬁg. 3). Each color channel is di-

vided into 20 bins, the individual channel histograms

are concatenated, and ﬁnally the two part histograms

are concatenated for a feature vector of 20 ·3 ·2 = 120

dimensions in the case of a 3 channel color space. In

addition to the two modeling paradigms, 3 different

color spaces were tested: RGB, HSV, and XYZ. For

HSV and XYZ the luminance channels were removed

to enhance lighting invariance, so in those cases the

ﬁnal histogram would be 80-dimensional and contain

just the HS- and XZ-channels, respectively.
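The two-part descriptor can be sketched as follows. The sketch assumes that every channel value lies in the range 0 to 255 and that pixels are given as tuples; both are assumptions of the example, as are the function names. Histograms are left as raw counts, since the text does not say whether they are normalized.

```python
BINS = 20  # bins per color channel, as stated in the text


def channel_histogram(values, bins=BINS, max_value=256):
    """Histogram of one channel; values are assumed to lie in [0, max_value)."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v) * bins // max_value, bins - 1)] += 1
    return hist


def part_histogram(pixels, channels):
    """Concatenate one histogram per kept channel for a list of pixel tuples."""
    hist = []
    for c in channels:
        hist.extend(channel_histogram(p[c] for p in pixels))
    return hist


def person_descriptor(upper_pixels, lower_pixels, channels=(0, 1, 2)):
    """Upper-body histogram followed by lower-body histogram."""
    return part_histogram(upper_pixels, channels) + part_histogram(lower_pixels, channels)
```

With all three channels the descriptor has 20 · 3 · 2 = 120 entries. Passing `channels=(0, 1)` for HSV, or `channels=(0, 2)` for XYZ, drops the luminance channel and gives 80 entries.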

Two multi-shot schemes have been tested:

1) Mean histogram of all frames in a pass.

2) All frame-histograms saved individually.

In 1) the mean histogram is computed when a pass

is over. Each bin is simply averaged:

\[ m_i = \frac{1}{n} \sum_{j=0}^{n-1} h_{i,j} \quad \text{for } 0 \le i < k \tag{4} \]

where $m$ is the mean histogram, $n$ is the number of frames in the pass, $k$ is the number of bins in the histograms, and $h_{i,j}$ is the value of bin $i$ in histogram $j$.

In 2) no averaging takes place. Instead a pass is

modeled after each histogram in it. See the follow-

ing section on how each model is matched against the

person database.
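The two multi-shot models differ only in what is stored per pass. A minimal sketch, with hypothetical function names, where each frame histogram is a list of bin values:

```python
def mean_histogram(frame_histograms):
    """Scheme 1: average each bin over all frames of the pass (equation 4)."""
    n = len(frame_histograms)
    if n == 0:
        raise ValueError("a pass needs at least one frame histogram")
    k = len(frame_histograms[0])
    return [sum(h[i] for h in frame_histograms) / n for i in range(k)]


def histogram_series(frame_histograms):
    """Scheme 2: keep every frame histogram of the pass, unaveraged."""
    return [list(h) for h in frame_histograms]
```

The mean histogram stores k values per pass regardless of its length, while the series stores n · k values and defers all comparison work to the matching stage.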

Both of the color-based models are augmented

with a measure of the person’s height. We use normal-

ized height-to-border. This is the distance in pixels

from the top of the person in the image, to the bottom

of the frame, normalized by the depth of the observa-

tion. This reduces noise, as only one of the bounds

of the height is now determined from the noisy depth

sensor. It also allows for clipping.

In ﬁg. 4 height-to-border versus depth is plotted.

Because the surface and field-of-view are the same for

all who pass by the camera, the only change that will

happen to the curve for people of different heights is

a shift in its y-axis intercept. Instead of approximat-

ing the full curve, we go for the less computationally

heavy option of modelling each pass with the mean

of the depth-normalized height-to-border, designated

γ, for all instances in the pass:

ComparisonofMulti-shotModelsforShort-termRe-identificationofPeopleusingRGB-DSensors

247

Figure 4: Curves depicting height-to-border versus distance

for all tracks in a sequence. The curves are colored in pairs,

such that two tracks of the same color are two passes by

the same person. It can be seen that most lines are close

to their partner of the same color, showing that the height

measurement is stable across passes.

\[ \gamma = \frac{1}{n} \sum_{i=0}^{n-1} g_i \cdot d_i \tag{5} \]

where $g_i$ is the height-to-border for observation $i$ in the pass, $d_i$ is the distance to the person in that observation, and $n$ is the number of observations in the pass. While the person is not completely flat, for the purpose of this normalization, we use the depth of the seed point described in equation 1.
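As a small illustration of equation 5, the height measure of a pass can be computed as below. The function name and the input layout, a list of (height-to-border, distance) pairs, are assumptions of this sketch.

```python
def mean_normalized_height(observations):
    """Mean of height-to-border times distance over all observations in a pass.

    observations: list of (g, d) pairs, with g in pixels and d the depth of
    the observation.
    """
    if not observations:
        raise ValueError("a pass needs at least one observation")
    return sum(g * d for g, d in observations) / len(observations)
```

A person seen at 200 pixels height-to-border from distance 2.0 and at 100 pixels from distance 4.0 gives the same product in both frames, which is why the product is roughly constant along a pass.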

3.3 Re-identiﬁcation

A pruning stage based on the height measurement is used before the re-identification. The height of the probe is compared to the gallery by means of the absolute difference in their heights. If the mean normalized height-to-border $\gamma$ of a candidate is more than $t_h$ away from that of the probe, the candidate is not considered a match for this subject. $t_h$ is found by analyzing training data before running the system: it is set to the mean of the height difference between wrong matches in the training set.
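The pruning step and the training of its threshold can be sketched as follows. The sketch interprets "wrong matches" as pairs of training passes that belong to different persons, which is an assumption, and the function names and data layouts are hypothetical.

```python
from itertools import combinations


def train_height_threshold(training_passes):
    """Mean absolute height difference over pairs of different persons.

    training_passes: list of (person_id, gamma) tuples.
    """
    diffs = [abs(ga - gb)
             for (ida, ga), (idb, gb) in combinations(training_passes, 2)
             if ida != idb]
    if not diffs:
        raise ValueError("training set needs at least two different persons")
    return sum(diffs) / len(diffs)


def prune_gallery(probe_gamma, gallery, threshold):
    """Keep only gallery entries whose height is within threshold of the probe.

    gallery: dict mapping person_id to that person's gamma.
    """
    return {pid: g for pid, g in gallery.items()
            if abs(g - probe_gamma) <= threshold}
```

Only the candidates that survive `prune_gallery` would then be compared to the probe with the histogram distance.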

When re-identifying, the model of the current pass

is compared to those of the persons in the database,

which is initially empty, but will be built as time progresses. Both the mean histogram and the histogram series models use the Bhattacharyya distance (Bradski and Kaehler, 2008):

\[ d(H_1, H_2) = \sqrt{1 - \frac{\sum_I \sqrt{H_1(I)\, H_2(I)}}{\sqrt{\sum_I H_1(I) \cdot \sum_I H_2(I)}}} \tag{6} \]

where $d(H_1, H_2)$ is the distance between the histograms $H_1$ and $H_2$, and $H(I)$ is the value of bin $I$ in the histogram $H$. The result is a number between 0 and 1, where 0 is a perfect match.
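The distance in equation 6 can be sketched in plain Python. The function name is hypothetical, and the clamp to zero is an added guard against floating-point round-off rather than part of the formula.

```python
import math


def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two histograms with non-negative bins."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    total1, total2 = sum(h1), sum(h2)
    if total1 <= 0 or total2 <= 0:
        raise ValueError("histograms must not be empty")
    overlap = sum(math.sqrt(a * b) for a, b in zip(h1, h2)) / math.sqrt(total1 * total2)
    # Round-off can push the overlap slightly above 1, so clamp before the root.
    return math.sqrt(max(0.0, 1.0 - overlap))
```

Because the overlap is divided by the bin sums, the distance does not change when one histogram is scaled, so raw counts from passes of different lengths can be compared directly.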

With mean histograms, where only two histograms (probe and gallery) are involved, the distance itself is used, and the subject is either re-identified, ignored, or added to the database. With histogram series, the model comprises a series of histograms. In this case, each histogram in the probe model is compared to each histogram in the database. Each probe histogram then casts a vote for the ID of the gallery model which contains the histogram it is closest to, if that distance is within a separately trained ignore threshold. The gallery model with the most votes is selected as the best candidate, provided it has the majority (more than 50%) of the possible votes.

3.4 Mean Histogram

The re-identification process is governed by two thresholds:

$t_n$: New threshold: Subjects with $d(H_1, H_2) > t_n$ are added as new persons (7)

$t_i$: Ignore threshold: Subjects with $d(H_1, H_2) \le t_i$ are re-identified (8)

This implies that subjects with $t_i < d(H_1, H_2) \le t_n$ are ignored, because they are too similar to other subjects, without being similar enough to trust the identification.

The thresholds are learned beforehand by observing a training set. The distances between all mean histograms in the training set are computed and stored in the set $D$ and divided into two sets $D_s$ and $D_d$, where $D_s$ contains distances between different observations of the same person and $D_d$ contains distances between histograms of different persons:

\[ D_s = \{\, d(H_1, H_2) \in D \mid \mathrm{id}(H_1) = \mathrm{id}(H_2) \,\} \tag{9} \]

\[ D_d = \{\, d(H_1, H_2) \in D \mid \mathrm{id}(H_1) \neq \mathrm{id}(H_2) \,\} \tag{10} \]

where $\mathrm{id}(\cdot)$ is the person id connected with a histogram. The thresholds are then computed as:

\[ t_n = \overline{D_d} - 2 \cdot \sigma(D_d) \tag{11} \]

\[ t_i = \overline{D_s} + \sigma(D_s) \tag{12} \]

where $\overline{\,\cdot\,}$ denotes mean and $\sigma(\cdot)$ denotes standard deviation.
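The threshold training and the three-way decision of the mean histogram model can be sketched as follows. The sketch assumes that the new threshold is derived from the different-person distances and the ignore threshold from the same-person distances, and that the population standard deviation is meant; the text does not specify the latter. Function names are hypothetical.

```python
from statistics import mean, pstdev


def train_thresholds(same_person, different_person):
    """Return (t_new, t_ignore) from two lists of training distances."""
    t_new = mean(different_person) - 2 * pstdev(different_person)
    t_ignore = mean(same_person) + pstdev(same_person)
    return t_new, t_ignore


def decide(distance, t_new, t_ignore):
    """Classify a probe by its distance to the closest gallery entry."""
    if distance <= t_ignore:
        return "re-identified"
    if distance > t_new:
        return "new"
    # Between the two thresholds: too similar to enroll, too different to trust.
    return "ignored"
```

If the trained thresholds happen to cross, the order of the checks in `decide` gives precedence to re-identification. That tie-breaking is a choice made for this sketch.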

3.5 Histogram Series

The re-identiﬁcation for the histogram series model

uses many of the same principles of the mean histo-

gram model, but is adapted to use many more his-

tograms for each subject to encompass variations in

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

248

lighting and pose. A histogram is computed for each frame in the pass of a subject, and each is then compared to all histograms already in the database. When the shortest distance $d_{\min}$ from a probe histogram to any gallery histogram is less than the (separately trained) ignore threshold $t_i$, the associated person id, $p$, receives a vote. Thus, each subject histogram contributes up to 1 vote, for a theoretical total of len(H) votes: the number of histograms in the current pass. If there are no histograms in the pass, the subject is ignored. If any person in the gallery has received more than half the theoretical maximum, the subject is re-identified as that person. If no gallery person satisfies this requirement, the subject is added as a new person.

It is worth noting that this method has no explicit

option of ignoring the subject in case it is uncertain,

other than in the case where no histograms exist.
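The voting scheme of the histogram series model can be sketched as below. The data layout (a gallery dictionary mapping person ids to lists of histograms), the function names, and the returned labels are assumptions of this example. The distance function repeats the Bhattacharyya distance of equation 6 so that the block is self-contained.

```python
import math
from collections import Counter


def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two non-empty histograms."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(h1, h2)) / math.sqrt(sum(h1) * sum(h2))
    return math.sqrt(max(0.0, 1.0 - overlap))


def reidentify_series(probe, gallery, t_ignore):
    """Vote-based matching of a pass (list of histograms) against the gallery.

    Returns ("ignored", None), ("re-identified", person_id), or ("new", None).
    """
    if not probe:
        return ("ignored", None)
    votes = Counter()
    for hist in probe:
        best_id, best_d = None, None
        for pid, gallery_hists in gallery.items():
            for g in gallery_hists:
                d = bhattacharyya_distance(hist, g)
                if best_d is None or d < best_d:
                    best_id, best_d = pid, d
        # A probe histogram only votes if its nearest neighbor is close enough.
        if best_d is not None and best_d < t_ignore:
            votes[best_id] += 1
    if votes:
        pid, count = votes.most_common(1)[0]
        if count > len(probe) / 2:
            return ("re-identified", pid)
    return ("new", None)
```

The majority is taken over the number of probe histograms rather than over the votes actually cast, so a pass where most frames fail the threshold is enrolled as a new person.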

4 EVALUATION

6 permutations of the system have been tested on 3

different sequences (see section 2.1). The 2 different multi-shot models have both been tested in 3 different color spaces: RGB, HSV, and XYZ. HSV and

XYZ have been tested since they both model color

closer to how the human eye sees it, and more spe-

ciﬁcally because they allow for exclusion of the lu-

minance so that differing lighting conditions should

affect performance less. That means that for the fol-

lowing tests all three RGB channels were used, in the

HSV case only HS were used, and with XYZ only XZ

were used.

The performance of the system varies with the or-

der the persons are passing by the camera. If a person

that is very hard to re-identify passes by the camera

in the ﬁrst two passes without any other entries in

the database, odds are that he will be correctly re-

identiﬁed. However, if a similar person enters the

database before the second pass of person 1, they

might be confused with each other and thus lower

the performance. To even out this effect, all res-

ults presented below are averages of 100 runs where

the subjects enter the system in random order. That

should sufﬁciently even out any “lucky” or “unlucky”

orderings and provide accurate results. For each run,

all thresholds have been trained on a random subset of

20% of the sequence, which is then excluded from the

rest of the run. The effect of the training set selection

should also average out.
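The evaluation protocol can be sketched as a small harness. It assumes that both the training split and the enrollment order are drawn per pass from one shuffle, and it takes the actual system as a hypothetical callback `run_once(train, test)` that returns a dictionary of counts. None of these names come from the paper.

```python
import random


def evaluate(passes, run_once, runs=100, train_fraction=0.2, seed=0):
    """Average the counts returned by run_once over randomized runs.

    passes: list of pass records (any type).
    run_once: callable taking (train, test) lists and returning a dict of
        numeric counts, e.g. {"correct_id": 3, "wrong_id": 0}.
    """
    rng = random.Random(seed)
    totals = {}
    for _ in range(runs):
        shuffled = passes[:]
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * train_fraction)
        # The training subset is excluded from the rest of the run.
        train, test = shuffled[:n_train], shuffled[n_train:]
        for key, value in run_once(train, test).items():
            totals[key] = totals.get(key, 0) + value
    return {key: value / runs for key, value in totals.items()}
```

Fixing the seed makes a set of 100 runs repeatable, which helps when comparing configurations on the same orderings.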

The re-identiﬁcation performance can be charac-

terized with 5 parameters:

1. Correct new

2. Wrong new

3. Correct ID

4. Wrong ID

5. Ignored

The first two describe how well the system distinguishes between known persons and new persons. Ideally, there should be no wrong new, as they are persons that are already in the database and should have been re-identified. Correct ID and wrong ID comprise the subjects that are neither ignored, correct new, nor wrong new, but are re-identified. Finally, ignored are the ones that are not handled because they are neither close enough to an existing person to be re-identified, nor different enough from the existing persons to be added to the database.
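The text does not state how the percentage columns of table 3 are derived from these counts. The reported values appear to be consistent, up to rounding, with the share of correct IDs among all re-identified subjects. As a worked example under that assumption, using the RGB mean histogram row of the Hallway sequence:

\[
\%\,\text{correct} = \frac{\text{Correct ID}}{\text{Correct ID} + \text{Wrong ID}} \cdot 100
= \frac{1.15}{1.15 + 0.61} \cdot 100 \approx 65.34
\]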

The results of the tests can be seen in table 3.

Sequence length and detection performance vary

greatly between sequences, as seen in table 2. Note

that the Hallway sequence contains many shorter

tracks, meaning that generalization, as well as the be-

neﬁt from the multi-shot approach, declines heavily.

Generally, the mean histogram and histogram series approaches perform similarly when looking at the percentage rates of the identification. The differences between the two approaches are most pronounced in the Basement and Novi sequences. The histogram series approach contains no ignore category, which leads to a higher number of wrong new identifications than with the mean histograms. However, the method returns a significantly lower number of wrong identifications in both sequences. It is seen from the standard deviations that the mean histogram exhibits a more stable performance than the histogram series on correct identifications, whereas the opposite seems to be the case for wrong identifications. The number of wrong identifications is low across the board, so the weak spots are the wrong new and ignored counts, which are rather high. Most new passes are correctly classified as such, at around 29-32 of 35 in the Basement sequence, and around 8 of 10 and 21 of 22 in the Hallway and Novi sequences, respectively.

The beneﬁt of the ignore-functionality in the mean

histogram model is illustrated in ﬁg. 5. Blue columns

are a histogram of distances between mean histo-

grams of the same person, while red columns are

a histogram of distances between different persons.

The overlap between these shows that it is not possible to achieve perfect classification with a 1D decision boundary in this case. To counter this, an ignore zone is introduced: the space between the green and the yellow lines (the thresholds), which can to some extent mitigate the effects of this overlap. In reality, when training on a subset of the data, the ignore

zones are generally wider than in this example. It is

possible that a classiﬁcation in a higher dimensional

ComparisonofMulti-shotModelsforShort-termRe-identificationofPeopleusingRGB-DSensors

249

Table 2: Statistics on the number of observations of captured persons for each sequence. The numbers are based on the number of times a single person was detected and modeled in a single pass.

                              Basement seq.   Hallway seq.   Novi seq.
Mean observation length       25.5            10.3           40.7
Median observation length     24              11.5           41
Minimum observation length    4               2              5
Maximum observation length    38              25             57

Table 3: Re-identification performance of the 6 system configurations on 3 different sequences. All numbers are averaged over 100 runs with random enrollment order. The standard deviations of the results are shown in parentheses.

Basement sequence
                        Correct new    Wrong new      Correct ID     Wrong ID      Ignored       % correct   % wrong
RGB  Mean histogram     29.31 (2.92)   4.53 (7.10)    11.76 (5.34)   1.22 (2.50)   8.18 (6.82)   90.62 %     9.38 %
RGB  Histogram series   32.61 (1.44)   8.64 (7.37)    13.00 (7.07)   0.74 (2.14)   0.00 (0.00)   94.60 %     5.40 %
HSV  Mean histogram     28.75 (3.23)   3.20 (7.36)    12.57 (5.81)   1.18 (2.83)   9.30 (6.78)   91.43 %     8.57 %
HSV  Histogram series   32.71 (1.37)   8.13 (7.77)    13.49 (7.51)   0.67 (2.27)   0.00 (0.00)   95.24 %     4.76 %
XYZ  Mean histogram     29.02 (3.26)   4.72 (7.12)    11.24 (5.15)   1.68 (3.01)   8.34 (7.35)   86.97 %     13.03 %
XYZ  Histogram series   32.54 (1.45)   8.88 (7.34)    12.59 (6.87)   0.98 (2.33)   0.00 (0.00)   92.78 %     7.22 %

Hallway sequence
                        Correct new    Wrong new      Correct ID     Wrong ID      Ignored       % correct   % wrong
RGB  Mean histogram     8.36 (1.00)    4.07 (2.34)    1.15 (1.37)    0.61 (1.01)   0.81 (1.62)   65.34 %     34.66 %
RGB  Histogram series   8.59 (0.77)    4.45 (2.10)    1.30 (1.57)    0.66 (1.18)   0.00 (0.00)   66.33 %     33.67 %
HSV  Mean histogram     8.15 (1.17)    3.95 (2.41)    1.34 (1.61)    0.39 (0.78)   1.17 (2.13)   77.46 %     22.54 %
HSV  Histogram series   8.61 (0.62)    4.44 (2.14)    1.44 (1.72)    0.51 (0.76)   0.00 (0.00)   73.85 %     26.15 %
XYZ  Mean histogram     8.37 (0.96)    4.11 (2.29)    1.15 (1.37)    0.63 (1.10)   0.74 (1.58)   64.61 %     35.39 %
XYZ  Histogram series   8.57 (0.71)    4.33 (2.14)    1.37 (1.58)    0.73 (1.22)   0.00 (0.00)   65.24 %     34.76 %

Novi sequence
                        Correct new    Wrong new      Correct ID     Wrong ID      Ignored       % correct   % wrong
RGB  Mean histogram     21.15 (1.40)   3.79 (2.98)    9.68 (2.73)    0.25 (1.31)   1.13 (1.89)   97.48 %     2.52 %
RGB  Histogram series   21.51 (0.70)   9.12 (4.21)    5.31 (4.24)    0.06 (0.42)   0.00 (0.00)   98.88 %     1.12 %
HSV  Mean histogram     20.64 (1.86)   2.42 (2.13)    10.52 (2.38)   0.44 (1.45)   1.98 (3.21)   95.99 %     4.01 %
HSV  Histogram series   21.48 (0.77)   9.18 (4.53)    5.23 (4.59)    0.11 (0.91)   0.00 (0.00)   97.94 %     2.06 %
XYZ  Mean histogram     21.14 (1.17)   4.89 (3.48)    8.83 (3.10)    0.34 (1.26)   0.80 (1.74)   96.29 %     3.71 %
XYZ  Histogram series   21.46 (0.87)   10.12 (3.91)   4.31 (3.93)    0.11 (0.91)   0.00 (0.00)   97.51 %     2.49 %

Table 4: Comparison of re-identification performance with and without the height-based candidate pruning step.

                              Without height         With height            Difference
                              % correct   % wrong    % correct   % wrong    % correct   % wrong
Basement  Mean histogram      82.17 %     17.83 %    90.67 %     9.33 %     8.50 %      -8.50 %
Basement  Histogram series    87.28 %     12.72 %    94.21 %     5.79 %     6.93 %      -6.93 %
Hallway   Mean histogram      64.64 %     35.36 %    69.14 %     30.86 %    4.50 %      -4.50 %
Hallway   Histogram series    67.34 %     32.66 %    68.47 %     31.53 %    1.10 %      -1.10 %
Novi      Mean histogram      92.03 %     7.97 %     96.59 %     3.41 %     4.56 %      -4.56 %
Novi      Histogram series    96.50 %     3.51 %     98.11 %     1.89 %     1.61 %      -1.61 %
Average                       81.66 %     18.34 %    86.20 %     13.80 %    4.53 %      -4.53 %

space would work better and allow discarding the ig-

nore zone.

Table 4 shows how the height-based pruning step

improves the re-id rates across all methods. By dis-

carding obviously wrong candidates based on height,

the correct re-id rate goes up by 4.53 percentage

points on average.
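The average can be checked directly from the six difference values in table 4:

\[
\frac{8.50 + 6.93 + 4.50 + 1.10 + 4.56 + 1.61}{6} = \frac{27.20}{6} \approx 4.53
\]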

We have been unable to compare our results to the

work of others, as they do not present full-ﬂow sys-

tems, but rely on tightly pre-cropped images of per-

sons. Furthermore, our system needs depth images as

well as RGB, so no existing dataset has been com-

patible. We also do not present CMC-curves as that

ranking system works poorly for on-the-ﬂy enroll-

ment systems, where, in many cases, there are simply

not enough entries in the database to do a proper rank-

ing.

We can, however, compare some of our results to

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

250

Figure 5: Distribution of distances between histograms in

the full basement sequence. There is a clear overlap of dis-

tances between histograms from the same person and his-

tograms from different persons. When using a distance

threshold to classify, this will result in wrong identifications. The ignore threshold makes it possible to discard the distances that are the most affected by this overlap.

the work previously presented in (Møgelmose et al.,

2013b). Not all stats are directly comparable, but the

correct and wrong ID rates are. In that work, they are

68% and 0%, with an ignore rate of 24%. The system

presented here has a much higher correct ID rate, but

at the cost of a somewhat higher wrong ID rate.

5 CONCLUSION

This work presented a re-identiﬁcation system using

RGB-D data and compared several model and color

space conﬁgurations. It introduces 3 new, different re-

identiﬁcation sequences for testing, and goes through

all stages from candidate detection to identiﬁcation.

Furthermore, it investigates how to handle online enrollment of subjects, a topic few previous works have touched. Future work includes more sophisticated multi-shot models, and enhancing the system to

cope with multiple, co-occluding subjects in crowded

environments.

REFERENCES

Bak, S., Charpiat, G., Corvée, E., Brémond, F., and Thon-

nat, M. (2012). Learning to Match Appearances by

Correlations in a Covariance Metric Space. In ECCV

(3), volume 7574 of LNCS, pages 806–820. Springer.

Barbosa, I. B., Cristani, M., Bue, A. D., Bazzani, L., and

Murino, V. (2012). Re-identiﬁcation with RGB-D

Sensors. In ECCV Workshops (1), volume 7583 of

LNCS, pages 433–442. Springer.

Bradski, G. and Kaehler, A. (2008). Learning OpenCV,

chapter 7, pages 201–202. O’Reilly.

Dalal, N. and Triggs, B. (2005). Histograms of Oriented

Gradients for Human Detection. In CVPR.

Demirkus, M., Garg, K., and Guler, S. (2010). Automated

person categorization for video surveillance using soft

biometrics. pages 76670P–76670P–12.

Doretto, G., Sebastian, T., Tu, P. H., and Rittscher, J. (2011).

Appearance-based person reidentiﬁcation in camera

networks: problem overview and current approaches.

J. Ambient Intelligence and Humanized Computing,

2(2):127–151.

Jüngling, K. and Arens, M. (2010). Local Feature Based

Person Reidentiﬁcation in Infrared Image Sequences.

In AVSS, pages 448–455. IEEE Computer Society.

Møgelmose, A., Clapés, A., Bahnsen, C., Moeslund, T. B.,

and Escalera, S. (2013a). Tri-modal Person Re-

identiﬁcation with RGB, Depth and Thermal Features.

In 9th IEEE Workshop on Perception Beyond the Vis-

ible Spectrum. IEEE.

Møgelmose, A., Moeslund, T. B., and Nasrollahi, K.

(2013b). Multimodal Person Re-Identiﬁcation using

RGB-D Sensors and a Transient Identiﬁcation Data-

base. In International Workshop on Biometrics and

Forensics.

Velardo, C. and Dugelay, J. (2012). Improving Identiﬁca-

tion by Pruning: A Case Study on Face Recognition

and Body Soft Biometric. In WIAMIS, pages 1–4.

IEEE.

Zhao, R., Ouyang, W., and Wang, X. (2013). Unsuper-

vised Salience Learning for Person Re-identiﬁcation.

In CVPR.

Zheng, W., Gong, S., and Xiang, T. (2011). Person re-

identiﬁcation by probabilistic relative distance com-

parison. In CVPR, pages 649–656. IEEE.

ComparisonofMulti-shotModelsforShort-termRe-identificationofPeopleusingRGB-DSensors

251