ON THE HUMAN POSE RECOVERY BASED ON A SINGLE VIEW

Sébastien Piérard and Marc Van Droogenbroeck

INTELSIG Laboratory, Montefiore Institute, University of Liège, Liège, Belgium

Keywords:

Human, Silhouette, Pose recovery, Projection ambiguities.

Abstract:

Estimating the pose of an observed person is crucial for a large variety of applications, including home entertainment, man-machine interaction, and video surveillance. Often, only a single side view is available, but authors claim that it is possible to derive the pose even though humans evolve in a 3D environment. In addition, to decrease the sensitivity to color and texture, it is preferable to rely only on the silhouette to recover the pose. Under these conditions, we show that there is an intrinsic limitation: at least two poses correspond to the observed silhouette. We discuss this intrinsic limitation in detail in this short paper. To our knowledge, this issue has been overlooked by authors in the past. We observe that this limitation has an impact on the way previously reported results should be interpreted, and that it clearly has to be taken into account when designing new methods.

1 INTRODUCTION

The interpretation of video scenes is a crucial task for a large variety of applications, including home entertainment, man-machine interaction, and video surveillance. As most scenes of interest contain people, understanding their behavior is essential. This is a challenge because of the wide range of poses and appearances humans can take. In this paper, we discuss the feasibility of recovering the pose from a single view. Moreover, we assume that the decision is based only on the observed person’s silhouette.

In most places, ceilings are not high enough to place a camera above the scene and observe a wide area. Moreover, in the context of home entertainment, most existing applications (such as games) require the camera to be located on top of or at the bottom of the screen. Therefore, we focus on a side view in the following.

Our second assumption is that the decision is

based on the silhouette. As a matter of fact, most

authors consider that it is preferable to decrease the

sensitivity to appearance, and therefore to rely on

shapes instead of colors or textures. Silhouettes can

be extracted reliably from videos by background sub-

traction techniques (e.g. ViBe (Barnich and Van

Droogenbroeck, 2011)). Moreover, as stated by

(Agarwal and Triggs, 2006), silhouettes “encode a

great deal of useful information about 3D pose”.

It is well established that pose recovery is a dif-

ﬁcult problem since the relation between silhouettes

and poses is multivalued. As stressed by Poppe et

al. (Poppe and Poel, 2006), variations in morpholo-

gies and clothing result in a family of silhouettes cor-

responding to the same pose, and due to the self-

occlusions and to the uncertainty on the location of

the silhouette boundaries (limited visual accuracy,

noise, clothing, etc) similar input silhouettes can cor-

respond to a range of poses.

But, we have noticed that a more fundamental lim-

itation exists. Even if we knew precisely the morphol-

ogy of the person present in front of the camera, and

if we were able to acquire noise-free silhouettes, at

least two poses correspond to the observed silhouette.

It does not matter if there are self-occlusions or not.

This intrinsic limitation is the cornerstone of the dis-

cussions of this paper.

1.1 Various Approaches for Pose-recovery

Let us first briefly recall the various approaches that exist for estimating the pose. A few surveys covering pose recovery have been published. For example, 125 references related to pose recovery, published between 2000 and 2006, are given in (Moeslund et al., 2006). The methods that have been proposed in the literature to recover the pose can be classified into three main families: example-based, model-based, and learning-based.


Piérard S. and Van Droogenbroeck M. (2012).

ON THE HUMAN POSE RECOVERY BASED ON A SINGLE VIEW.

In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, pages 310-315

DOI: 10.5220/0003731203100315

 SciTePress

Example-based methods use a database of silhouettes stored together with the related pose parameters. They look in the database for the silhouette that is the closest to the input silhouette, and the predicted pose is the pose stored with the selected silhouette. To some extent, it is a nearest neighbor method. The three main challenges are: (1) how to handle a very large database (the high-dimensional space of poses has to be sampled sufficiently densely); (2) how to define a distance between two silhouettes that is robust to noise, to viewpoint variations, and to the various appearances of humans (morphologies, clothing, hairstyles, ...); and (3) how to pre-compute a signature for each silhouette in the database in order to speed up the nearest neighbor search or to quickly discard a large part of the database, as in (Shakhnarovich et al., 2003).
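As an illustration of the lookup performed by example-based methods, the following minimal sketch implements a brute-force nearest neighbor search. The signature vectors, the Euclidean distance, the database entries, and the names `Entry` and `predict_pose` are assumptions made for this example; they are not taken from any of the cited methods.

```python
import math
from typing import NamedTuple, Sequence


class Entry(NamedTuple):
    signature: Sequence[float]  # hypothetical shape descriptor of a stored silhouette
    pose: dict                  # pose parameters stored together with that silhouette


def distance(a, b):
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_pose(database, query_signature):
    """Return the pose stored with the silhouette closest to the query."""
    best = min(database, key=lambda e: distance(e.signature, query_signature))
    return best.pose


# Toy database with made-up signatures and poses (angles in degrees).
database = [
    Entry((0.9, 0.1, 0.3), {"left_elbow": 10.0, "right_elbow": 80.0}),
    Entry((0.2, 0.7, 0.5), {"left_elbow": 45.0, "right_elbow": 45.0}),
    Entry((0.4, 0.4, 0.9), {"left_elbow": 90.0, "right_elbow": 5.0}),
]

print(predict_pose(database, (0.25, 0.65, 0.5)))
```

A linear scan such as this one costs one distance evaluation per stored silhouette, which is why pre-computed signatures and fast pruning matter for large databases.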

Model-based methods (also named generative

methods) maximize the similarity between the ob-

served silhouette and a synthetic rendered silhouette

corresponding to the guessed conﬁguration (pose and

orientation). The two main issues here are (1) that the

problem always has many local maxima of the likeli-

hood, and (2) that it supposes that a good initialization

conﬁguration can be guessed.

Learning-based approaches (also named discrim-

inative methods) try to learn, from samples, the func-

tion that associates the value of a pose parameter (i.e.

kinematic angle) to the input silhouette. This func-

tion is known as being the “model”. Usually, there

is one model per pose parameter. The goal of ma-

chine learning is twofold. Firstly, it aims at building

models that correctly generalize the learning samples.

Secondly, it aims at pre-computing a model instead of

using the whole set of learning samples at runtime, for

efﬁciency reasons. This can also be seen as a smooth

interpolation problem in a high dimensionality space,

robust to irrelevant dimensions. The main problem

with the learning-based approach is that the typical

behavior of regression methods (such as the ExtRa-

Trees (Geurts et al., 2006), etc) is to achieve a com-

promise between all the possible solutions (this fact

has been underlined by (Agarwal and Triggs, 2006),

and we have also observed this fact on several occa-

sions by ourselves). Thus, the function that associates

the value of a pose parameter to the input silhouette

must be single-valued to avoid irrelevant estimates.

Unfortunately, this is in general not the case.
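The compromise behavior can be reproduced with a toy example. The sketch below assumes a hypothetical pose parameter that takes the values +40° and −40° equally often for one and the same silhouette, and it uses the constant predictor that minimizes the squared error (the mean) as a stand-in for a regression method. It is not an implementation of any of the cited regression methods.

```python
from statistics import fmean

# Hypothetical learning samples: the same silhouette labelled with two
# equally frequent values of one pose parameter (in degrees).
targets = [40.0, -40.0] * 50

# For a single input, the least-squares estimate is the mean of the targets.
estimate = fmean(targets)


def squared_error(value):
    """Sum of squared errors of a constant prediction over the samples."""
    return sum((value - t) ** 2 for t in targets)


print("estimate:", estimate)
print("error of the estimate:", squared_error(estimate))
print("error of one valid answer:", squared_error(40.0))
```

The estimate has a lower squared error than either valid answer, yet it matches neither of them, which is the kind of irrelevant estimate a multivalued relation produces.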

In general, methods that assume a one-to-one mapping between silhouettes and poses are forced to choose one possible pose arbitrarily or randomly, or to compromise. In this paper, we show that at least two poses correspond to the observed silhouette, and that these two poses are equally likely. Therefore, estimating the underlying probabilities does not help to choose the right pose. Also, note that the methods able to deal with multiple estimates do not completely solve the problem, as they sometimes still compromise between multiple poses (this has been observed in (Agarwal and Triggs, 2005)), even when the corresponding 3D shapes are very different.

1.2 Known Ambiguities

The fact that several poses may correspond to a silhouette motivated the recent development of methods predicting a set of 3D poses, but the phenomenon responsible for the ambiguities was only partially understood. Before explaining the intrinsic limitation in Section 2, we give an overview of the current knowledge.

Firstly, the depth-direction of the limbs (forward/backward) is unobservable with silhouettes. Therefore, using the algorithm of Taylor (Taylor, 2000), there are 2^n 3D skeletons corresponding to a 2D skeleton (stick-figure) with n links. It is however easy to prune this large number of possibilities, since most of the solutions found by Taylor’s method are physically impossible (e.g. self-intersections, knee or elbow bent the wrong way, etc.).
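The exponential count comes from one binary choice per link. The sketch below is written in the spirit of Taylor's reconstruction and is not a reproduction of it: it assumes a scaled orthographic projection, known 3D link lengths, and links listed from the root outwards. The function name `candidate_depths` and the toy stick figure are hypothetical.

```python
import itertools
import math


def candidate_depths(points_2d, links, lengths, scale=1.0):
    """Enumerate the relative depths of all joints, one list per sign choice.

    points_2d: list of (u, v) image coordinates, one per joint.
    links:     list of (parent, child) joint indices, ordered so that the
               parent of each link already has a depth (root = links[0][0]).
    lengths:   known 3D length of each link.
    """
    magnitudes = []
    for (a, b), length in zip(links, lengths):
        du = (points_2d[a][0] - points_2d[b][0]) / scale
        dv = (points_2d[a][1] - points_2d[b][1]) / scale
        # Foreshortening fixes the depth offset of the link up to its sign.
        magnitudes.append(math.sqrt(max(length ** 2 - du ** 2 - dv ** 2, 0.0)))

    candidates = []
    for signs in itertools.product((+1, -1), repeat=len(links)):
        depth = {links[0][0]: 0.0}  # the root depth is an arbitrary reference
        for (a, b), magnitude, sign in zip(links, magnitudes, signs):
            depth[b] = depth[a] + sign * magnitude
        candidates.append([depth[i] for i in range(len(points_2d))])
    return candidates


# Made-up stick figure with n = 3 links of length 0.5.
points_2d = [(0.0, 0.0), (0.3, 0.0), (0.3, -0.4), (0.6, -0.4)]
links = [(0, 1), (1, 2), (2, 3)]
lengths = [0.5, 0.5, 0.5]

candidates = candidate_depths(points_2d, links, lengths)
print(len(candidates), "candidate skeletons for", len(links), "links")
```

When a link is parallel to the image plane its depth offset is zero, so the two sign choices coincide and the count of 2^n includes duplicates in that case. Pruning physically impossible candidates would be a separate step that is not shown here.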

Secondly, with silhouettes, information related to the scene layers is missing. For example, with a frontal view, it may be impossible to know whether the arm is behind or in front of the torso. This is due to occlusions, and therefore this problem does not arise systematically.

A more annoying problem is the following. The silhouettes of a person observed laterally are identical for mirror poses. Also, front and back views can lead to similar silhouettes. This means that the ambiguities may be related to the pose or to the orientation. With these observations, one might think that a temporal disambiguation is possible, since ambiguities appear in two very particular orientations of the observed person. We show in this paper that this is not possible because there are many more ambiguities than these two. In fact, we derive a more general rule that establishes the link between the two sources of ambiguities (pose and orientation).

2 THE INTRINSIC LIMITATION

In this paper, we assume that the rotation axis of the observed person is parallel to the image plane. This means that the camera is looking horizontally (i.e. we see a side view). In the following, we adopt the following convention (without any loss of generality): the orientation θ = 0° corresponds to the person facing the camera (so, a translation of the person in space implies a change of orientation).

Figure 1: The poses p_1 and p_2 are mirror poses.

Figure 2: Two configurations, (p_1, θ) and (p_2, 180° − θ), leading approximately to the same silhouettes. In this figure, θ = 30°.

Under this assumption there is an intrinsic limita-

tion of estimating the pose from a single silhouette.

In most of the cases, two different poses correspond

to the same silhouette. This remains true even when

the precise morphology of the person present in front

of the camera is known, when the silhouette is noise-

free, and when there are no occlusions.

Let us consider two mirror poses p_1 and p_2, such as the ones depicted in Figure 1. Obviously, they have the same probability density of being observed in any application. Note that p_1 and p_2 are neither symmetrical nor planar, and therefore our observations are valid for all poses. As shown in Figure 2, the configurations (p_1, θ) and (p_2, 180° − θ) give rise to approximately the same silhouettes.
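The equivalence between a pose seen at orientation θ and its mirror pose seen at 180° − θ can be checked numerically in the orthographic limit. The sketch below is an illustration under assumptions that are not in the paper: the body is reduced to a handful of made-up 3D points, the body frame has x lateral, y vertical, and z pointing forward, mirroring negates x, and the orientation is a rotation about the vertical axis. Since every body point undergoes the same transformation, identical projections of all points imply identical silhouettes.

```python
import math


def rotate_y(point, theta_deg):
    """Rotate a 3D point about the vertical (y) axis by theta degrees."""
    x, y, z = point
    t = math.radians(theta_deg)
    return (x * math.cos(t) + z * math.sin(t), y, -x * math.sin(t) + z * math.cos(t))


def mirror(point):
    """Reflect a body-frame point across the sagittal plane (left <-> right)."""
    x, y, z = point
    return (-x, y, z)


def orthographic(point):
    """Orthographic projection along the viewing axis: the depth is dropped."""
    x, y, _ = point
    return (x, y)


def image_points(points, theta_deg):
    return [orthographic(rotate_y(p, theta_deg)) for p in points]


def max_gap(a, b):
    """Largest distance between corresponding image points."""
    return max(math.hypot(u1 - u2, v1 - v2) for (u1, v1), (u2, v2) in zip(a, b))


# Made-up, deliberately asymmetric and non-planar body points (metres).
pose = [(0.0, 1.7, 0.0), (0.25, 1.4, 0.1), (0.6, 1.2, 0.4),
        (-0.2, 1.0, -0.3), (0.15, 0.1, 0.35)]
mirror_pose = [mirror(p) for p in pose]

theta = 30.0
gap_mirror = max_gap(image_points(pose, theta), image_points(mirror_pose, 180.0 - theta))
gap_same = max_gap(image_points(pose, theta), image_points(pose, 180.0 - theta))
print("mirror pose at 180 - theta:", gap_mirror)
print("same pose at 180 - theta:  ", gap_same)
```

Algebraically, the horizontal image coordinate equals x cos θ + z sin θ in both configurations, while the depth changes sign, which a binary silhouette cannot record. Turning the same pose to 180° − θ without mirroring it does change the image.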

A closer look at the silhouettes of Figure 2 shows that there are small differences between them. These are due to perspective effects and become negligible for large distances between the camera and the observed person. When the distance tends to infinity, the pinhole camera becomes an orthographic camera. In this limit case, the two silhouettes are, strictly speaking, identical. In practice, however, and despite the presence of differences, it is not possible to robustly differentiate the two silhouettes even when the camera is very close to the observed person. This is illustrated in Figure 3. Perspective effects lead to small details located on the boundaries of the silhouettes. But in a practical application, noise alters the boundaries of the observed silhouettes. Therefore, the information relative to the pose that is potentially present because of perspective effects is unusable when there is noise.

Figure 3: Magnitude of the perspective effects for camera-to-person distances of 3 m, 10 m, and 40 m. For each distance, this figure shows the silhouette corresponding to the configuration (p_1, 30°) on the first row, and the silhouette corresponding to the configuration (p_2, 150°) on the second row. The difference between the two silhouettes is depicted on the third row.
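How quickly the perspective differences fade with distance can be illustrated with a pinhole model. The sketch below is a rough illustration only: the body points, the camera height of 1 m, and the choice of a focal length equal to the distance (so that the plane through the body axis keeps its metric size) are assumptions made for this example, and the values it prints are not those of Figure 3.

```python
import math


def rotate_y(point, theta_deg):
    """Rotate a 3D point about the vertical (y) axis by theta degrees."""
    x, y, z = point
    t = math.radians(theta_deg)
    return (x * math.cos(t) + z * math.sin(t), y, -x * math.sin(t) + z * math.cos(t))


def pinhole(point, distance, camera_height=1.0):
    """Pinhole projection for a camera placed `distance` metres in front of the person."""
    x, y, z = point
    depth = distance - z  # z points from the person towards the camera
    return (distance * x / depth, distance * (y - camera_height) / depth)


# Made-up, asymmetric and non-planar body points (metres) and their mirror.
pose = [(0.0, 1.7, 0.0), (0.25, 1.4, 0.1), (0.6, 1.2, 0.4),
        (-0.2, 1.0, -0.3), (0.15, 0.1, 0.35)]
mirror_pose = [(-x, y, z) for x, y, z in pose]


def silhouette_gap(distance, theta_deg=30.0):
    """Largest displacement between a pose at theta and its mirror at 180 - theta."""
    a = [pinhole(rotate_y(p, theta_deg), distance) for p in pose]
    b = [pinhole(rotate_y(p, 180.0 - theta_deg), distance) for p in mirror_pose]
    return max(math.hypot(u1 - u2, v1 - v2) for (u1, v1), (u2, v2) in zip(a, b))


gaps = {d: silhouette_gap(d) for d in (3.0, 10.0, 40.0)}
for d, g in gaps.items():
    print(f"{d:5.1f} m -> largest point displacement {g:.4f} m")
```

In this model the displacement of a point at lateral offset r and depth offset z is 2 d r |z| / (d² − z²), so it decreases roughly as 1/d and vanishes in the orthographic limit.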

In summary, at least two poses must always be

taken into account for each silhouette observed. This

intrinsic limitation is not due to noise, nor to occlu-

sions (e.g. there are no signiﬁcant occlusions in Fig-

ure 2). It should also be stressed that the two poses

are equally likely. The only rare cases in which only

one pose has to be considered arise when the pose of

the observed person is itself symmetrical.

3 DISCUSSION

3.1 Practical Implications

The ﬁrst practical implication of the intrinsic limi-

tation for pose recovery systems is that once a pose

has been estimated, both the pose itself and its mir-

rored version have to be considered. This is true

for example-based, model-based, and learning-based

methods.

Another implication of the intrinsic limitation concerns the learning set used in learning-based approaches. As already noted in Section 1.1, there should be only one pose corresponding to a silhouette. Otherwise, the regression methods would produce irrelevant estimates of the pose parameters. It is therefore necessary to discard half of the configurations in the learning set (Section 3.3 provides some indications on how to do this correctly).
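One possible way to discard half of the configurations is to keep only the orientations strictly between −90° and 90°. The sketch below assumes that each learning sample carries its orientation in degrees under an `orientation` key; this sample layout, the function names, and the choice of which half to keep are assumptions made for illustration.

```python
def wrap_deg(theta):
    """Map an angle in degrees to the interval [-180, 180)."""
    return (theta + 180.0) % 360.0 - 180.0


def keep_frontal_half(samples):
    """Keep samples whose orientation lies strictly between -90 and 90 degrees."""
    return [s for s in samples if abs(wrap_deg(s["orientation"])) < 90.0]


# Hypothetical learning set: one orientation per sample, poses left abstract.
learning_set = [
    {"orientation": o, "pose": f"pose_{i}"}
    for i, o in enumerate((30.0, 150.0, -80.0, 90.0, 270.0, 330.0))
]

kept = keep_frontal_half(learning_set)
print([s["orientation"] for s in kept])
```

The bounds are excluded on purpose: at exactly 90° or −90° the two ambiguous configurations share the same orientation, so an orientation range cannot separate them.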

3.2 Overcoming the Intrinsic Limitation

It is clear that we must rely as little as possible on information related to the appearance (colors and textures may at best be considered as weak cues because of their high variability and their sensitivity to lighting conditions), and that it is advisable to base the decisions on geometric information. However, silhouettes are not the only geometric information that can be obtained from a single camera.

With a color camera, one can also extract con-

tours, but in a less reliable way than silhouettes. Un-

fortunately, the internal contours do not always pro-

vide enough information to overcome the intrinsic

limitation. For example, if we consider the situation

depicted in Figure 2, the indetermination persists be-

cause no contour can be detected inside the silhou-

ettes. This is because there is no signiﬁcant occlusion

in that case.

With a range camera, it is possible to acquire silhouettes annotated with depth. Nowadays, cheap range cameras exist (for example Microsoft’s Kinect). Therefore, range cameras are a viable alternative to color cameras. It is not surprising that it is possible to estimate human poses from silhouettes annotated with depth (Shotton et al., 2011; Girshick et al., 2011), since the intrinsic limitation presented in this paper does not apply to that kind of data.

3.3 Does it Help to Know the Orientation?

If we take another quick look at Figure 2, we see that among the two configurations leading to the same silhouettes, the orientation of one pose lies in the [−90°, 90°] range, and the other one in the [90°, 270°] range. Naively, one may think that knowing the orientation helps to choose the right configuration, and thus the right pose. But this is only true if the orientation is not close to −90° or 90°, because in those cases there still exist two possibilities and the knowledge of the orientation does not help to choose the right one (see Figure 4). In conclusion, knowing the orientation may help to overcome the intrinsic limitation in most cases, but not always.

The orientation and the pose are two independent notions. We can evaluate them simultaneously, or independently. An example of a pose recovery method estimating the orientation in a first step has been proposed by Gond et al. (Gond et al., 2008). Another interesting procedure is the one adopted by Peng et al. (Peng and Qian, 2008). They start by choosing an initial pose estimate and an initial orientation estimate. Then, alternately, they improve the estimated pose considering that the orientation is known, and they improve the estimated orientation considering that the pose is known, until convergence. Both the methods proposed by Gond et al. and Peng et al. tend to demonstrate that, as expected, knowing the orientation helps to recover the pose.

Figure 4: Even when we know the orientation of the person, the intrinsic limitation still implies that there may be two different poses, as in the configurations (p_1, 90°) and (p_2, 90°), giving rise to the same silhouette. This happens when the orientation is close to 90° or −90°.

A simple, but fast and effective, method to estimate the orientation has been proposed in (Piérard et al., 2011). In that paper, we explain that a silhouette annotated with depth is necessary to estimate the orientation. Because the same kind of intrinsic limitation exists for orientation estimation as for pose recovery, it is impossible to estimate the orientation of the person in front of the camera from a binary silhouette, but it is possible with a silhouette annotated with depth.

3.4 A Few Notes About the Previous Methods Described in the Literature

In this subsection, we would like to discuss results on

pose recovery that were reported in the literature.

Poppe et al. (Poppe and Poel, 2006) compared three shape descriptors to recover the pose from a single silhouette in an example-based approach, with a nearly horizontal camera (the elevation was chosen to be 10° or 20°). From their results, it seems possible to retrieve the pose from a single silhouette. However, they only used silhouettes corresponding to orientations in the range [−80°, 80°]. The success of their method is thus consistent with the intrinsic limitation described in this paper. Since the orientation is known to be between −90° and 90°, these bounds being excluded, it is possible to select the right pose out of the two possibilities.

Another pose recognition system based on a single silhouette has been proposed by Agarwal et al. (Agarwal and Triggs, 2006). They announced mean angular errors of about 5° on synthetic (noise-free) data. These seem to be good results, but the problem is that they estimate the pose based on a sole silhouette, for orientations in a 360° range, and we have proved in Section 2 that this is impossible. Because their data (poses and orientations) are taken from real human motion capture sequences, three hypotheses about their learning set and their test set could explain their unusually optimistic results: (1) that the orientation is not uniformly distributed over 360°, (2) that the pose is statistically linked to the orientation, and most likely (3) that their method takes into account the small details due to perspective effects. Nevertheless, we suppose that their method should work as expected if the allowed orientations are restricted to the ]−90°, 90°[ range.

In the description of their model-based method, Sminchisescu and Telea (Sminchisescu and Telea, 2002) explain that extracting pose from silhouettes is an under-constrained problem, and suggest using temporal disambiguation. Similarly, Howe (Howe, 2004) predicted a set of 3D poses for each silhouette, and tried to select the right one using temporal coherence. This led to tracking failures. Following the explanation given in this paper, it is now clear that at least two motions may correspond to a sequence of silhouettes. Therefore, temporal information does not suffice to resolve ambiguities.

In their work, Elgammal et al. (Elgammal and Lee, 2009) assume that the motion is known (e.g. walking, running, performing a golf swing or kicking). They estimate the 3D pose and the viewpoint from a single silhouette (or edges). This is done by learning both the visual manifold (i.e. the manifold related to the viewpoint) and the kinematic manifold (i.e. the manifold related to the pose), and using a particle filter for tracking. The joint manifold is topologically equivalent to a torus, and each point on this torus corresponds to a pose and a viewpoint. This methodology makes it possible to recover the multiple poses and viewpoints corresponding to the input silhouette, since the particles may converge to multiple areas on the torus. The illustration in their paper clearly shows that the particles may converge towards two areas on the torus. Note that, given a motion, not every pose necessarily admits a symmetrical one, especially when the learning set is only populated with right-handed persons (walking is an exception).

In summary, it is very important to consider the intrinsic limitation when building learning and testing sets. Also, it is useful to consider this limitation to better understand the results reported in the literature until now and to correctly understand the limitations of the corresponding methods.

Footnote (on the perspective effects mentioned in the discussion of (Agarwal and Triggs, 2006)): We showed in (Piérard et al., 2011) how easy it is, even unintentionally, to learn the small details due to perspective effects when working with synthetic silhouettes.

3.5 On the Evaluation Methods

In recent years, a few methods predicting a set of 3D poses (“modes”) from the silhouette have emerged (Rosales and Sclaroff, 2001; Howe, 2004; Sminchisescu et al., 2005; Agarwal and Triggs, 2005; Elgammal and Lee, 2009). In theory, these methods are better suited for monocular human pose recovery because they are able to properly handle the ambiguities. However, when evaluating such methods, authors fall back on a deterministic method (Sminchisescu et al., 2005; Rosales and Sclaroff, 2001), for example by only measuring the error with respect to the most probable mode. In doing so, they lose the potential of their methods (the correct solution is not always the one predicted as the most probable). To avoid pessimistic results, the authors sometimes limit the ambiguity by using a learning set specific to the test sequence (Sminchisescu et al., 2005), or by using test sequences of symmetrical poses (e.g. the “jumping in the air while rotating” sequence in (Elgammal and Lee, 2009)). Unfortunately, this gives rise to highly optimistic results. A better solution would be, for example, to consider the most probable mode and its symmetrical pose, and to report the minimum error obtained with these two poses.
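The evaluation rule suggested above, reporting the minimum error over the most probable mode and its symmetrical pose, can be sketched as follows. The pose representation (a dictionary mapping joint names to 3D positions in a body frame whose x axis is lateral), the `left_`/`right_` naming scheme, and the mean joint distance used as error measure are assumptions made for this example.

```python
import math


def mirror_pose(pose):
    """Swap left/right joints and negate the lateral (x) coordinate."""
    def swap(name):
        if name.startswith("left_"):
            return "right_" + name[len("left_"):]
        if name.startswith("right_"):
            return "left_" + name[len("right_"):]
        return name
    return {swap(name): (-x, y, z) for name, (x, y, z) in pose.items()}


def mean_joint_error(a, b):
    """Mean Euclidean distance between corresponding joints of two poses."""
    return sum(math.dist(a[name], b[name]) for name in a) / len(a)


def mirror_aware_error(predicted, truth):
    """Minimum error over the predicted pose and its mirrored version."""
    return min(mean_joint_error(predicted, truth),
               mean_joint_error(mirror_pose(predicted), truth))


# Made-up ground truth and a prediction that picked the mirrored mode.
truth = {
    "head": (0.0, 1.7, 0.0),
    "left_hand": (0.4, 1.0, 0.3),
    "right_hand": (-0.3, 1.2, -0.1),
}
predicted = mirror_pose(truth)

print("plain error:       ", mean_joint_error(predicted, truth))
print("mirror-aware error:", mirror_aware_error(predicted, truth))
```

With this measure, a method is not penalized for returning the mirrored mode, which is indistinguishable from the true pose on the basis of the silhouette alone.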

4 CONCLUSIONS

Pose recovery from a single silhouette acquired by a camera placed horizontally is an under-determined problem: at least two poses may correspond to an input silhouette. Surprisingly, it seems that this intrinsic limitation has never been discussed completely by previous authors (to our knowledge). From our point of view, it is necessary either to assume that the orientation lies between −90° and 90° or, equivalently, between 90° and 270° (the bounds being excluded, which is important), or to use a range sensor in order to annotate the silhouettes with depth, or to use a method able to predict a set of 3D poses instead of a unique pose.

Note that human pose recovery has often been compared to hand pose recovery in the literature. The intrinsic limitation presented in this paper originates from the symmetry of the human body. A hand does not have such a symmetry, and therefore human pose recovery is much more complex than hand pose recovery.


The intrinsic limitation presented in this paper changes the way we should interpret some of the previously reported results, and has to be taken into account when designing new methods. It has an impact at many levels: on the choice of the sensors best suited for pose recovery, on the choice of the approach to follow (model-based, example-based, or learning-based), as well as on the choice of the attributes to use for learning-based methods.

REFERENCES

Agarwal, A. and Triggs, B. (2005). Monocular human mo-

tion capture with a mixture of regressors. In IEEE

Computer Society Conference on Computer Vision

and Pattern Recognition Workshops, volume 3, San

Diego, USA.

Agarwal, A. and Triggs, B. (2006). Recovering 3D human

pose from monocular images. IEEE Transactions on

Pattern Analysis and Machine Intelligence, 28(1):44–

58.

Barnich, O. and Van Droogenbroeck, M. (2011). ViBe: A

universal background subtraction algorithm for video

sequences. IEEE Transactions on Image Processing,

20(6):1709–1724.

Elgammal, A. and Lee, C. (2009). Tracking people on a

torus. IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence, 31(3):520–538.

Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely

randomized trees. Machine Learning, 63(1):3–42.

Girshick, R., Shotton, J., Kohli, P., Criminisi, A., and

Fitzgibbon, A. (2011). Efﬁcient regression of general-

activity human poses from depth images. In In-

ternational Conference on Computer Vision (ICCV),

Barcelona, Spain.

Gond, L., Sayd, P., Chateau, T., and Dhome, M. (2008).

A 3D shape descriptor for human pose recovery. In

Perales, F. and Fisher, R., editors, Articulated Mo-

tion and Deformable Objects, volume 5098 of Lecture

Notes in Computer Science, pages 370–379. Springer.

Howe, N. (2004). Silhouette lookup for automatic pose

tracking. In IEEE Computer Society Conference on

Computer Vision and Pattern Recognition Workshops,

volume 1, pages 15–22, Washington, USA.

Moeslund, T., Hilton, A., and Krüger, V. (2006). A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, 104:90–126.

Peng, B. and Qian, G. (2008). Binocular dance pose recog-

nition and body orientation estimation via multilinear

analysis. In IEEE Computer Society Conference on

Computer Vision and Pattern Recognition Workshops,

Anchorage, USA.

Piérard, S., Leroy, D., Hansen, J.-F., and Van Droogenbroeck, M. (2011). Estimation of human orientation in images captured with a range camera. In Advanced Concepts for Intelligent Vision Systems (ACIVS), volume 6915 of Lecture Notes in Computer Science, pages 519–530. Springer.

Poppe, R. and Poel, M. (2006). Comparison of sil-

houette shape descriptors for example-based human

pose recovery. In International Conference on Auto-

matic Face and Gesture Recognition, pages 541–546,

Southampton, UK.

Rosales, R. and Sclaroff, S. (2001). Learning body pose via

specialized maps. In Proceedings of Neural Informa-

tion Processing Systems, pages 1263–1270, Vancou-

ver, Canada.

Shakhnarovich, G., Viola, P., and Darrell, T. (2003). Fast

pose estimation with parameter-sensitive hashing. In

International Conference on Computer Vision (ICCV),

volume 2, pages 750–757, Nice, France.

Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio,

M., Moore, R., Kipman, A., and Blake, A. (2011).

Real-time human pose recognition in parts from sin-

gle depth images. In IEEE International Conference

on Computer Vision and Pattern Recognition (CVPR),

Colorado Springs.

Sminchisescu, C., Kanaujia, A., Li, Z., and Metaxas, D.

(2005). Discriminative density propagation for 3d hu-

man motion estimation. In IEEE International Con-

ference on Computer Vision and Pattern Recognition

(CVPR), volume 1, pages 390–397, San Diego, USA.

Sminchisescu, C. and Telea, A. (2002). Human pose estimation from silhouettes - a consistent approach using distance level sets. In International Conference for Computer Graphics, Visualization and Computer Vision, volume 10, pages 413–420, Plzeň, Czech Republic.

Taylor, C. (2000). Reconstruction of articulated objects

from point correspondences in a single uncalibrated

image. Computer Vision and Image Understanding,

80(3):349–363.
