APPEARANCE-BASED HUMAN GALLERY CONSTRUCTION

FROM VIDEO

Kyongil Yoon

Computer Studies, College of Notre Dame of Maryland, Baltimore, MD, 21210, USA

Yaser Yacoob, David Harwood, Larry Davis

Computer Science Department, University of Maryland, College Park, MD, 20742, USA

Keywords:

People recognition, Gallery construction, Appearance Modeling.

Abstract:

An approach for constructing a dynamic gallery of people observed in a video stream is described. We con-

sider two scenarios that require determining the number and identity of participants: outdoor surveillance and

meeting rooms. In these applications face identiﬁcation is typically not feasible due to the low resolution

across the face. The proposed approach automatically computes an appearance model based on the clothing

of people and employs this model in constructing and matching the gallery of participants. The appearance

model uses color/path-length proﬁle and a robust distance measure based on Kernel Density Estimation (KDE)

and Kullback-Leibler (KL) distance, to evaluate similarity between people and add models to the gallery. A

one-to-one constraint is enforced to correctly match instances to models at each frame. In the meeting room

scenario we exploit the fact that the relative locations of subjects are likely to remain unchanged for the whole

sequence.

1 INTRODUCTION

One aspect of video surveillance of indoor meetings

involves matching a person against a gallery of known

people. Such a gallery is tedious to construct man-

ually; this paper describes an approach to automat-

ically construct a gallery of participants based on

clothing-appearance. The gallery directly supports

the human identiﬁcation task but it can also be used

to answer questions such as how many people were

observed, when each has appeared and how people

interacted in video sequences.

We propose a method for building a gallery from

a video clip based on clothing-appearance of people.

We assume that people do not change clothing, al-

though our method does tolerate localized appearance

changes. We employ well-known approaches for hu-

man detection in video and focus on the modelling

and matching of human appearance.

We consider two application areas: surveillance

and meetings video. Here, it is difﬁcult to employ

faces for identiﬁcation since the resolution across the

face is too small and faces typically appear in off-

frontal poses or proﬁle views. Instead, we model the

clothing of people and acquire quantitative models

that support matching.

2 APPEARANCE MODELS

Over a short period of time, we assumed that the ap-

pearance of the person remains unchanged, except for

small, local changes, for instance due to carried pack-

ages or illumination variation.

In (Nakajima et al., 2003), a full-body recognition

system based on color and shape features has been

suggested. They carried out recognition using sup-

port vector machine classiﬁer on several features such

as color histogram, normalized color histogram, com-

bined histogram of shape and color, and local shape

features. However, they did not combine spatial in-

formation with color as we do.

(Elgammal et al., 2002) segmented the human ﬁg-

ure into three blobs and computed a separate color

distribution for each blob. Speciﬁcally, the head,

torso, and legs were segmented by assuming that the

person appears in an upright pose. Although they sep-

arated the body into parts, most of the spatial informa-

336

Yoon K., Yacoob Y., Harwood D. and Davis L. (2007).

APPEARANCE-BASED HUMAN GALLERY CONSTRUCTION FROM VIDEO.

In Proceedings of the Second International Conference on Signal Processing and Multimedia Applications, pages 332-337

DOI: 10.5220/0002142103320337

 SciTePress

Figure 1: This is a simpliﬁed drawing of human body by

Leonardo da Vinci. The red lines show shortest-path. The

path-length is the distance from the top of the head to a

given point on the path. The path-length to the end of hand

or foot is relatively unchanged by the motion of the arms

and legs.

tion is lost.

To overcome this problem, we introduced a sim-

ple, efﬁcient feature, path-length, which represents

the spatial information of a pixel with respect to a ref-

erence point on the person’s body. The path-length

is robust to changes in human posture and limb posi-

tions.

The path-length of a pixel is deﬁned as the nor-

malized length of the shortest path from the top cen-

ter pixel (usually top of the head) inside a silhouette.

Fig. 1 illustrates the idea of path-length.

In addition to path-length, clothing color informa-

tion is employed to model appearance. The brightness

(Br) deﬁned as the sum of the three color components

in (Alexander and Buxton, 2001), and two color pro-

portions, red and green are used.

red =

RED

,green =

GREEN

3 MATCHING METRIC

The foreground region representing a person is used

to construct an appearance model that is compared

to models in the gallery. The distance between the

current appearance and existing appearance models in

the gallery determines if a new model should be added

to the gallery or not.

3.1 Distance between Models

Our distance measure is computed based on kernel

density estimation (KDE) and Kullback-Leibler (KL)

distance. Kernel density estimation is a general non-

parametric technique to estimate an underlying den-

sity using data points. In KDE, the probability for a

given feature x is estimated as

f(x) =

∑

K(x− x

where K is a kernel function centered at data points x

i = 1...n, and a

are weighting coefﬁcients. Typically,

the Gaussian is used as a kernel function, and uniform

weights are used, i.e., i = 1/n. Theoretically, suit-

able kernel density estimators converge to any density

functions if enough samples are provided (Silverman,

1986) (Duda et al., 2000).

Assume that we are to compute the distance be-

tween Model M = {x

|i = 1,...,N

}, where N

is the

number of data points in the appearance model, and

current instance I = {y

|i = 1,...,N

}, where N

is the

number of data points in the current instance. The

estimated probability distribution of model M is

(x) =

∑

i=1

(x−x

) (1)

and the distribution of the current instance, I is (2).

(x) =

∑

i=1

(x−y

). (2)

The distance between the instance and the model can

be thought of as the distance between two distribu-

tions represented by KDE,

(x) and

(x). The

two most frequently used methods for comparing two

distributions are Chi-Square test and Kolmogorov-

Smirnov test (Press et al., 1988). Neither method

is appropriate for our models. The Kolmogorov-

Smirnov test is not suitable for our four dimensional

model. The Chi-Square test involves dividing the data

points into a number of bins; it is a good approxi-

mation when the number of bins is large (≫ 1), and

number of events in each bin is large (≫ 1). How-

ever, for human appearance, the color distribution is

very skewed, leading in many empty bins.

We instead use the Kullback-Leibler (KL) dis-

tance to compare

(x) and

(x). The KL dis-

tance is deﬁned on two probability distributions in

(Kapur and Kesavan, 1992), (Kullback and Leibler,

1951), (Cover and Thomas, 1991). For any given

point, we can compute a pair of probabilities using

the two estimated densities. Assume that there is a set

of sample points, S = {s

|i = 1,...,n}, where n is the

number of sample points. Then likelihood values for

the sample points can be computed using

= ˆp(s

) =

∑

j=1

− x

= ˆq(s

) =

∑

j=1

− y

APPEARANCE-BASED HUMAN GALLERY CONSTRUCTION FROM VIDEO

337

To compute the KL distance on those values, we nor-

malize p

and q

as following

ˆp

∑

j=1

, ˆq

∑

j=1

The KL distance is deﬁned as

= d( ˆq, ˆp) =

∑

i=1

ˆq

log

ˆq

ˆp

How to select S can be critical. To maximize the dif-

ference between p

’s and q

’s, it is best to use all the

points in I; however, the computational cost can be

prohibitive. Instead, by sampling points from I, we

typically get equivalent results as long as the sam-

pling process is reasonable. We sample points uni-

formly along path-length values. Practically, when

we choose 100 points randomly spaced at 1% seg-

ments of path-length, the results are equivalent to us-

ing all the data points.

By examining the KL distance, we can measure

how different two distributions are. However, because

and q

are normalized, this can be problematical.

For instance, when

(x) and

(x) are uniform dis-

tributions over different ranges, then all the p

’s are

very low, and all the q

’s are very high. Although two

distributions are quite different, after normalization ˆp

and ˆq

form almost identical distributions and d

approximately and misleadingly 0.

To overcome this limitation, we introduce an ad-

ditional distance measure which represents a quanti-

tative difference between p

’s and q

’s as follows:

= |1− (

¯p

¯q

)| (3)

where ¯p = (

∑

)/n and ¯q = (

∑

)/n.

3.2 Robust Distance Measure

Human appearance in video streams varies over time.

In outdoor scenes, lighting, human pose variation and

carried objects may lead to changes in the foreground

region. To cope with such variations we employ a

robust estimation norm that adjusts the weighting of

points within the distance metric based on whether

points are inliers or outliers.

For the robust estimation, we employ the general

M-estimator of (Huber, 1977), which minimizes the

objective function,

∑

i=1

ρ(e

) =

∑

i=1

ρ(y

− x

b) (4)

where x

’s are independent variables, y

’s are data

points, b is a coefﬁcient vector, ρ is the inﬂuence

function, and n is the number of data points.

If we deﬁne the weight function ω(e) = ρ

′

(e)/e,

and let ω

= ω(e

). Then we need to solve the follow-

ing equation to minimize (4)

∑

i=1

− x

b)x

= 0 (5)

In our approach, we deﬁne a new feature, δ

using

and q

for each sample point, s

, :

− p

max(p

)

When the current instance is correctly matched to a

model, most p

’s are similar to q

’s leading the δ

’s to

be close to 0. On the other hand, when the instance

and model are mismatched, most δ

’s will be greater

than 0. The mean of δ

will represent how well the

current instance is matched to the model. We apply

the robust ﬁtting (5) to compute the robust mean of

the δ

’s, µ; it can be written as

∑

i=1

(δ

− µ) = 0

Notice that weights are designed to minimize the in-

ﬂuence of outliers. In other words, the weight of each

data point depends on how far the point is from the

mean. Data points near to the estimated mean get high

weight. Points that are far from the mean have smaller

weights.

We used the iteratively re-weighted least square

(IRLS) method using the bisqaure weight function to

solve the equation to get a robust mean as in (Cole-

man et al., 1980) and (Fox, 2002).

The ﬁnal weights at the last iteration after the es-

timated mean converges were investigated to ﬁnd in-

liers. Only data points with the weight greater than

a certain threshold value are regarded as inliers. The

two distances, d

′

and d

′

, are recomputed using only

inliers. Fig. 2 shows examples of outliers and inliers

as determined using robust ﬁtting method for a sam-

ple region that has been manually altered by changing

its color.

4 SPATIAL ANALYSIS

Sometimes it is possible to improve the accuracy of

the models in the gallery and the matching perfor-

mance by utilizing the relative order of participants.

We perform this as follows.

For each model, M

, we compute an adjacency

matrix, F

that captures the frequency of spatial or-

dering among models. An adjacency matrix, F

m × n, where n is the number of models and m in-

dexes relative positions. For example, if N is the

SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications

338

Figure 2: Detection of outliers. The image in the ﬁrst col-

umn is the model image. Second column images are used as

instances. To synthesize outliers, a 15% size block with red

color pixels is created. In the third column the inliers and

outliers are shown as white and black points, respectively.

maximum number of people in one frame and peo-

ple are arranged in a ”linear” conﬁguration, then m =

2∗ (N − 1).

To build the adjacency matrix F

, all the frames

which have a person matched to model M

are em-

ployed. The ( j,k)-th element of F

is the frequency of

model M

at the relative horizontal position, pos( j).

pos(.) is deﬁned as

pos( j) =



j−

− 1 if j <

j−

otherwise

(6)

The upper half of an adjacency matrix, F

, repre-

sents the frequencies of models to the ”left” of M

; the

bottom the ”right” side.

The difference between adjacency matrices repre-

sents how similar two models are to each other. To

compute the distance between adjacency matrices, the

sum of absolute differences is used. Before comput-

ing d

, each F

is normalized by the max

j,k

((F

)

j,k

so we have

∑

k=1

∑

l=1



)

k,l

− (F

)

k,l



(7)

Fig. 3 shows the adjacency matrices from the ex-

periment described in detail in section 5.2, 15 mod-

els were found after the ﬁrst pass. Distances between

adjacency matrices are computed, and pairs with dis-

tance less than a threshold can be merged into one.

5 EXPERIMENTS

We present two experiments. The ﬁrst was conducted

on four video clips collected at different locations and

under different illumination conditions. The second

experiment analyzes an 18 minute long video clip of

a meeting. In this experiment, a face detection al-

gorithm was used to determine an approximate torso

Model 1 Model 2 Model 3 Model 4

Model 5 Model 6 Model 7 Model 8

Model 9 Model 10 Model 11 Model 12

Model 13 Model 14 Model 15

Figure 3: Adjacency matrices for 15 models in the experi-

ment of section 5.2.

Figure 4: Sample frames of the full body gallery test.

area. In each experiment, we show the ﬁnal gallery

and the matching results based on the gallery.

The gallery construction process consists of two

passes.

1. Construct an initial gallery. From an empty set,

a gallery is built while processing all the frames.

After this pass, the gallery has all the tentative

models.

2. Reﬁne the gallery. In this pass, redundant models

are removed based on frequency and spatial anal-

ysis of the matching result, and a more compact

and accurate gallery is built.

5.1 Full Body Gallery - Experiment 1

For this experiment, 1212 frames were collected from

four different video clips. Three clips were outdoor

video, and one clip was captured in a room monitor-

ing people coming and going. The number of people

in the test set is 12. We employed a background sub-

traction algorithm to detect the foreground regions.

The detected regions are considered as full-body ap-

pearance of human. Fig. 4 shows some images in this

test set.

After the ﬁrst pass, we have 24 models in the

gallery. The second pass uses the static gallery of

the 24 models. In this experiment, most redundancy

APPEARANCE-BASED HUMAN GALLERY CONSTRUCTION FROM VIDEO

339

Table 1: Matching result - Full body.

Num of Correct Incorrect Match

Gallery Models Match Match Rate

Initial 24 1609 291 85.1%

Reﬁned 16 1583 307 83.7%

Figure 5: The ﬁnal gallery built with a test set of Fig. 4. The

number of models in the gallery is 16.

comes from the inaccuracies of human silhouettes

created by background subtraction. After the second

pass, we have a ﬁnal gallery of 16 models as shown in

Fig. 5. All 12 people have models. 2 people have two

models ((M

, M

) and (M

, M

)) and 1 person has 3

models respectively (M

, M

In this data set, 1890 foreground areas are detected

from the 1212 frames. Using the the ﬁnal gallery with

16 models, we could match 1583 regions correctly,

while 307 are mismatched (83.7% success). When we

use the 24 model gallery before removing redundant

models, the number of correct matches is 1609 and

291 regions are not matched correctly (85.1% suc-

cess). The representation power of the gallery is de-

pendent on data set and foreground segmentation re-

sults. When using the same segmentation results, the

ﬁnal gallery has similar representation power com-

pared to the gallery before redundant model removal

(Table 1).

5.2 Upper Body Gallery - Experiment 2

An 18 minute long video clip which has 8 people is

used for this experiment. Although the number of to-

tal frames is 32400, only one frame out of every ﬁve

Figure 6: Some frames showing matching results with the

ﬁnal gallery.

Figure 7: Sample frames from the video clip used in upper

body gallery test.

Table 2: Frequency of each model.

Model M

Freq. 910 703 370 277 1945

Model M

Freq. 426 1892 2997 9 3359

Model M

Freq. 97 221 16 18 7

frames were processed. This video clip was captured

in a meeting room, and people remain seated with-

out position changes. The cameras pan and tilt as the

meeting progresses, so that at any one time we see a

different subset of the participants. Only the upper

bodies of people are seen.

We employ a face detection algorithm to locate

people (Viola and Jones, 2001). Based on the detected

faces, the torso areas were computed and appearance

matching was conducted. Since the relative positions

between people remain unchanged for the entire clip,

we perform the spatial analysis described in section 4.

Several frames are shown in Fig. 7. The ﬁrst pass

constructed a 15 model gallery excluding false alarms

from the face detector.

In the second pass, the spatial analysis of relative

horizontal positions was carried out. The adjacency

matrices of the 15 models were shown in Fig. 3.

Before calculating the differences between adja-

cency matrices, the total frequency for each model

is used to eliminate some models. The total number

of face occurrences is 13709, and some models have

very low frequency. Table 2 shows the frequency of

each model.

As seen in Table 2, M

, M

can be

eliminated since their frequencies are very low. Next,

by thresholding the differences between adjacency

matrices, we select pairs of models, which can be

merged into one.

The ﬁnal gallery has 8 models. In the video clip,

although there are nine people appearing, the ninth

person shows only side view and she was not detected

by the face recognition algorithm. The eighth person

was not included in the gallery, and two models were

found for the ﬁrst person. Table 3 shows the gallery

we constructed. The merged models are shown in

parentheses. Fig. 9 shows some of the matching re-

SIGMAP 2007 - International Conference on Signal Processing and Multimedia Applications

340

Table 3: Final Gallery.

Person Model

)

NONE

Figure 8: 8 models in the ﬁnal gallery after the spatial anal-

ysis.

sults using the ﬁnal gallery. To investigate the iden-

tiﬁcation accuracy of matching, we randomly chose

100 frames which were found to have 210 face areas.

Table 4 summarized the result. Just like in the exper-

iment in Sec. 5.1, even with the smaller number of

models the gallery shows the similar performance.

6 CONCLUSION AND FUTURE

WORK

We proposed an approach for constructing a dynamic

gallery of people from a video clip or a set of frame

images based on appearance model using color/path-

length proﬁle. Kullback-Leibler distance is used to

robustly compare models and a one-to-one constraint

is enforced when more than one instance is present

and matched in a frame. When the order of people

rarely changes, the relative spatial order is analyzed

and used to reduce the redundant models from the

gallery.

There is trade-off between representation power

and compactness of gallery. Using multiple key-

frames to build a model can help to give more rep-

resentation power to the models. One of our future

Table 4: Matching result - Upper body.

Num of Correct Incorrect Match

Gallery Models Match Match Rate

Initial 15 198 12 94.3%

Reﬁned 8 194 16 92.4%

Figure 9: Some frames showing matching results with the

ﬁnal gallery in the second experiment.

work is to ﬁnd an effective method to accumulate the

information of multiple frames into one model.

REFERENCES

Alexander, D. C. and Buxton, B. F. (2001). Statistical mod-

eling of colour data. International Journal of Com-

puter Vision, 44(2):87–109.

Coleman, D., Holland, P., Kaden, N., Klema, V., and Peters,

S. C. (1980). A system of subroutines for iteratively

reweighted least squares computations. ACM Trans.

Math. Softw., 6(3):327–336.

Cover, T. M. and Thomas, J. A. (1991). Elements of Infor-

mation Theory. New York: Wiley.

Duda, R. O., Stork, D., and Hart, P. E. (2000). Pattern Clas-

siﬁcation. John Wiley and Sons Inc.

Elgammal, A., Duraiswami, R., Harwood, D., and Davis,

L. S. (2002). Background and foreground model-

ing using non-parametric kernel density estimation

for visual surveillance. Proceedings of the IEEE,

90(7):1151–1163.

Fox, J. (2002). Robust regression: Appendix to an r and

s-plus companion to applied regression.

Huber, P. J. (1977). Robust Statistical Procedures. Society

for Industrial and Applied Mathematics.

Kapur, J. N. and Kesavan, H. K. (1992). Entropy Optimiza-

tion Principles with Applications. Academic Press.

Kullback, S. and Leibler, R. A. (1951). On information and

sufﬁciency. Annals of Mathematical Statistics, 22:79–

86.

Nakajima, C., Pontil, M., Heisele, B., and Poggio, T.

(2003). Full-body person recognition system. Pattern

Recognition, 36(9):1997–2006.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flan-

nery, B. P. (1988). Numerical Recipes in C: The Art of

Scientiﬁc Computing. Cambridge University Press.

Silverman, B. W. (1986). Density estimation for statistics

and data analysis. Chapman & Hall, New York.

Viola, P. A. and Jones, M. J. (2001). Rapid object detection

using a boosted cascade of simple features. In CVPR

(1), pages 511–518.

APPEARANCE-BASED HUMAN GALLERY CONSTRUCTION FROM VIDEO

341