Upper Body Detection and Feature Set Evaluation for Body Pose
Classification
Laurent Fitte-Duval, Alhayat Ali Mekonnen and Frédéric Lerasle
CNRS, LAAS, 7, Avenue du Colonel Roche, F-31400 Toulouse, France
Université de Toulouse, UPS, LAAS, F-31400 Toulouse, France
Keywords:
Upper Body Detection, Body Pose Classification, Fast Feature Pyramid, Sparse Classification, Aggregated
Channel Features, Feature Evaluation
Abstract:
This work investigates some visual functionalities required in Human-Robot Interaction (HRI) to evaluate
the intention of a person to interact with another agent (robot or human). Analyzing the upper part of the
human body which includes the head and the shoulders, we obtain essential cues on the person’s intention.
We propose a fast and efficient upper body detector and an approach to estimate the upper body pose in 2D
images. The upper body detector derived from a state-of-the-art pedestrian detector identifies people using
Aggregated Channel Features (ACF) and fast feature pyramid whereas the upper body pose classifier uses a
sparse representation technique to recognize their shoulder orientation. The proposed detector exhibits state-of-the-art results on a public dataset in terms of both detection performance and frame rate. We also present
an evaluation of different feature set combinations for pose classification using upper body images and report
promising results despite the associated challenges.
1 INTRODUCTION
In Human Robot Interaction (HRI), one of the fun-
damental requirements is the correct detection and
localization of human agents in the vicinity of the
robot. The robot should be able to perceive the where-
abouts of human agents in order to coordinate with
them seamlessly. Depending on the exact applica-
tion, the interaction space can vary from a few centime-
ters to several meters. This proximity, in turn, intro-
duces constraints about the field of view of the cam-
eras mounted on the robot for instance. Usually, in
close interaction the majority of cameras mounted on
a robot can only see parts of the person, specifically
the part above the thigh (figure 1), and any person de-
tection mechanism adopted should take this into con-
sideration.
In HRI, person detection relies on either classical
RGB cameras, e.g., (Mekonnen et al., 2011), or RGB-
D cameras that can provide 3D data like the Kinect
sensor (Jafari et al., 2014). Due to physical, economi-
cal, and design constraints, classical RGB cameras are
predominantly found on robots. Hence, in this pa-
per we will focus on perceptions based on 2D RGB
images. The most popular approach for human de-
tection in HRI is using a pedestrian detector that has
been trained on full body annotated people dataset,
e.g., (Dollár et al., 2012). Unfortunately, these detectors fail to detect people in the presence of partial occlusions, specifically partial occlusions of the legs. But,
by focusing on the upper part of the body, principally
the head and the shoulders, it is possible to identify
the presence of humans in the image under these cir-
cumstances. This approach, referred to as upper body detection, is similar to pedestrian detection but focuses on a smaller area that is less variable than the
complete human body and is less exposed to the prob-
lem of occlusion (Li et al., 2009; Zeng and Ma, 2010).
Figure 1 illustrates this point: given a typical Human-
Robot (H/R) situation depicted on the left, the best
pedestrian detector fails to correctly detect the two
persons in front of the robot (bottom right) whereas
our proposed upper body detector handles it perfectly
(top right).
After identifying the human agents, we need to
characterize their global behavior and their degree of
intentionality to interact with the robot or with an-
other human agent. Generally, the analysis of these
cues is related to the head cues (Katzenmaier et al.,
2004; Sheikhi and Odobez, 2012). The head cue in-
dicates the direction of the person’s visual point of
interest and eventually the recipient of the person’s
Figure 1: Output of our proposed detector (top right) and a state-of-the-art pedestrian detector (Dollár et al., 2014), bottom right, in an HRI context.
speech if there is a discussion (Bazzani et al., 2013).
But estimating the orientation from both indicators, the head and the body, reveals the person's posture and direction of movement in addition to their visual point of interest (Chen et al., 2011).
In this vein, we propose to estimate the orienta-
tion (body pose) of people using upper body cues.
We present an extensive evaluation of different fea-
ture sets to identify the best feature set that can cap-
ture required discriminative cues for upper body pose
classification. Given scenarios such as those pictured in Fig. 2, the upper body pose makes it possible to differentiate a situation where two agents interact with each other without paying attention to the robot (left) from a situation where an agent faces the robot for a possible interaction (right).
Figure 2: Output of our pose classifier in several interaction
scenarios. The protruding arrows indicate pose orientation.
Related Works. Many researchers have investi-
gated 2D upper body detection for human detection,
e.g., (Li et al., 2009; Zeng and Ma, 2010; Eichner
et al., 2012). The most frequent features used in
these works are the Histogram of Oriented Gradient
(HOG) (Dalal and Triggs, 2005) features which cap-
ture gradient distribution in the image. To date, these
are the most discriminant features and the best results
are achieved by approaches that use some variants of
HOG (Dollár et al., 2012). Some works further im-
prove the detection performance by considering het-
erogeneous pool of features, for example, combining
Local Binary Pattern (LBP) and HOG features (Zeng
and Ma, 2010; Hu et al., 2014).
Some recent advances on pedestrian detection focus on feature representation, introducing the notion of Integral Channel Features (ICF) (Dollár et al., 2012). This representation takes advantage of fast computation using integral images and combines different types of heterogeneous features to obtain detection results that outperform state-of-the-art detectors based on HOG features. Dollár et al. (2014) propose an alternative representation of the channel features called Aggregated Channel Features (ACF) which slightly improves these performances. Both ICF and ACF are quite appealing as
they have recorded outstanding performance both in
terms of detection and computation time. Combined
with a cascaded classifier configuration (Bourdev and Brandt, 2005) and an efficient multiscale representation using approximations of rescaled features in the detection process, this approach produces one of the fastest state-of-the-art pedestrian detectors (Dollár et al., 2014).
Upper body pose estimation has also been inves-
tigated by various researchers, e.g, (Eichner et al.,
2012; Weinrich et al., 2012). These works use different methods to fit an articulated body model to a single image or to a sequence of images. To avoid the complexity associated with
articulated models some works have investigated us-
ing data obtained from global pedestrian detection for
pose classification in video surveillance context (An-
driluka et al., 2010; Chen et al., 2011; Baltieri et al.,
2012). These works use the same HOG features, computed at several scales with classifiers similar to those used in the detection process, to attribute one of the possible directions to the observed pedestrian, for example SVMs (Andriluka et al., 2010) or random trees (Baltieri et al., 2012). The original approach of Chen et al. (2011), which is adopted in this work, uses a sparse representation technique. This approach has proven robust to occlusion and data corruption.
Contributions. This paper makes two core contri-
butions: (1) it presents an upper body detector based
on 2D RGB images using Aggregated Channel Features
(ACF) and soft cascade classifier that leads to state-
of-the-art results in terms of both detection perfor-
mance and frame rate; and (2) it presents a detailed
comparative evaluation of various feature sets for up-
per body based pose classification using a sparse rep-
resentation technique.
This paper is structured as follows: It begins with
an overview of the framework in section 2. Sections 3
and 4 detail the approach used for upper body detec-
tion and pose classification respectively. All experi-
ments carried out and associated results are presented
in section 5. Finally, the paper ends with concluding
remarks in section 6.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
440
Figure 3: Adopted complete perceptual framework (upper body detector followed by body pose classification).
2 FRAMEWORK OVERVIEW
Figure 3 depicts a concise summary of the framework
adopted in this work that highlights the two core com-
ponents: upper body detector and body pose classifi-
cation. As stated in section 1, given a sequence of
RGB images, we are interested in detecting people in the
image stream and correctly classify their body poses.
The upper body detector finds people in the image by
using a sliding window approach which exhaustively
scans the image at all possible locations and scales
applying the trained model. This step is computation-
ally intensive and is the main bottleneck that inhibits
achieving high frame rates. Taking this into account, for detection we use the ACF feature set combined with a soft cascade classifier, a combination proven to deliver superior detection performance and speed. This block
passes all detected upper bodies to the upper body
pose classification block.
The body pose classification block determines the
orientation of the provided upper body data. For this
we have also adopted a proven technique based on
sparse representation. We evaluate a plethora of fea-
ture sets including classical and multiscale ACF and
HOG features and their combinations but only retain
the feature set that leads to the best result for actual
applications. Finally, this block provides orientation
information for the detected upper bodies.
3 UPPER BODY DETECTOR
We present the key processes used in the detection
framework: the feature representation, the classifica-
tion algorithm structure and the algorithm for multi-
scale feature representation.
3.1 Aggregated Channel Features
Aggregated Channel Features (ACF) differs from In-
tegral Channel Features (ICF) by using pixel lookups
as features instead of sums over rectangular re-
gions (Dollár et al., 2014). A channel $C$ is a representation of an image $I$ whose pixels are obtained by applying a feature generation function $\Omega$, so that several channels $C = \Omega(I)$ can be defined using different feature transformations. The channel features used in this work are:
Gray and Color. The gray-scale image is the simplest channel, $C = I$. For our detection module, we use the three LUV color channels, which have proved informative for person detection (Dollár et al., 2012).
Gradient Magnitude. A non-linear transformation
which captures edge strength.
Gradient Oriented Histogram. It consists of a
histogram indexed by six gradient orientations (one
channel per orientation bin) and weighted by the gra-
dient magnitude. Normalizing the histogram by the gradient magnitude makes it possible to approximate the HOG features.
Local Binary Pattern. We introduce this channel feature in addition to those already used in (Dollár et al., 2014). The Local Binary Pattern (LBP) is one of the best texture descriptors in the literature. We use an efficient implementation of the gray-scale and rotation-invariant LBP descriptor inspired by the work of Ojala et al. (2002).
Once all the channels $C = \Omega(I)$ have been computed, the channels are further divided into blocks. The pixels in each block are summed and the resulting lower-resolution channels are smoothed to make up the ACF. The ACF obtained are then used to train a soft cascaded boosted tree (section 3.2). Figure 4 illustrates the different channels used in our work and the steps to generate the aggregated channels.
3.2 Soft Cascade
Boosting is a classification method that consists
of combining weighted weak classifiers to create a
strong classifier. The soft cascade classifier is a boosted classifier variant proposed by Bourdev and Brandt (2005) and used successfully for pedestrian detection in (Dollár et al., 2014). Unlike a classical cascade, which has a predefined number of distinct stages, the soft cascade has a single stage with many weak classifiers. The rejection thresholds that make the cascade "soft" are calculated for each weak classifier in a calibration process that allows reordering of the "soft" stages to improve classification accuracy. Its main strength is that it considers the information of each stage in the global process, contrary to the classical attentional cascade, which trains each stage individually. It also allows joint optimization of detection rate and execution time.
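The early-rejection logic can be summarized in a few lines. The sketch below assumes a hypothetical representation of the trained model as a list of scoring functions plus the per-position calibrated thresholds; it illustrates the rejection idea rather than the authors' implementation.

```python
def soft_cascade_score(trees, thresholds, x):
    """Evaluate a soft cascade on feature vector x.

    `trees` is a list of weak classifiers (each maps x to a real-valued
    vote); `thresholds[t]` is the calibrated rejection threshold after the
    t-th weak classifier. Unlike a staged cascade, the cumulative score is
    checked after every tree, so a window can be rejected early.
    """
    score = 0.0
    for t, tree in enumerate(trees):
        score += tree(x)              # weighted vote of a depth-two tree
        if score < thresholds[t]:     # early rejection: most negative
            return None               # windows exit after a few trees
    return score                      # survived all trees: detection
```

Since the vast majority of sliding windows are background, the average cost per window is far below the cost of evaluating all 2048 trees.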
3.3 Fast Feature Pyramids
Figure 4: Aggregated Channel Features computation.

A feature pyramid is a multiscale representation of an image where channels $C_s = \Omega(I_s)$ are computed at every scale $s$ for the corresponding rescaled image $I_s$. The scales are organized in octaves of 8 scales evenly sampled in log-space between a given scale and the next scale at half its value. Instead of generating each scale by computing $C_s = \Omega(I_s)$, a feature channel scaling approximation is used:

$$C_s = R(C, s) \cdot s^{\lambda} \quad (1)$$

where $R(C, s)$ denotes $C$ re-sampled by $s$, and $\lambda$ is a coefficient specific to the channel transformation, used to scale the feature channel $C$ according to a power law. An iterative method can be deduced to efficiently generate a feature pyramid, approximating the channel features at an intermediate scale $s$ from the computed channel features at the closest real scale $s'$ as $C_s = R(C_{s'}, s/s') \cdot (s/s')^{\lambda}$. The best compromise consists in computing one scale per octave and approximating the 7 others.
This approach, pioneered by Dollár et al. (2014), combined with the presented ACF features and soft cascade classifier, enables an efficient sliding-window detection implementation. This leads to a fast and highly accurate detector, as will be shown in section 5.
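As an illustration of Eq. (1), the following sketch approximates a single channel at an intermediate scale from a channel computed at the nearest real scale. The function name is ours, and the example $\lambda$ values are indicative only (Dollár et al. (2014) estimate $\lambda \approx 0$ for color channels and roughly $0.1$ for gradient-based channels).

```python
import cv2

def approx_channel(C, s, lam):
    """Approximate a single 2D channel at relative scale s from a channel C
    computed at the nearest real scale, per Eq. (1): C_s = R(C, s) * s**lam.
    `lam` is the power-law coefficient of the channel type."""
    h, w = C.shape[:2]
    resized = cv2.resize(C, (max(1, round(w * s)), max(1, round(h * s))))
    return resized * (s ** lam)

# One real channel computation per octave; the 7 intermediate scales of the
# octave are approximated from it instead of being recomputed from I_s.
```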
4 UPPER BODY POSE
CLASSIFICATION
The objective of this modality is to estimate the ori-
entation of a person’s upper body with respect to the
camera. The adopted body pose classification approach was initially applied to pedestrian direction estimation (Chen et al., 2011).
4.1 Upper Body Pose Representation
Tackling upper body pose determination as a classification problem (rather than a regression problem, owing to the limited training data) and in accordance with the literature, we consider eight evenly spaced directions in steps of 45 degrees (N, NE, E, SE, S, SW, W, NW) as the possible orientations (figure 5).

Figure 5: The eight upper body pose classes.
In the state-of-the-art work of (Chen et al., 2011),
a multi-level HOG feature is generated to extract in-
formation from the human bounding box. Three dif-
ferent levels are generated using different cell sizes
(respectively 8×8, 16×16 and 32×32) and the gradient orientation is quantized into 9 bins. In this work, we propose to use the ACF feature representation within a sparse representation framework for pose classification. We also take advantage of the fast feature pyramid framework introduced in section 3.3 to generate multiscale ACF at three different levels, similar to those used in the multi-level HOG representation. We therefore generate aggregated channels at three scales, without approximating intermediate scales in the octave.
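A sketch of how such a three-level descriptor could be assembled, reusing the hypothetical acf_channels helper from the sketch in section 3.1; the octave-spaced scale factors are our assumption.

```python
import cv2
import numpy as np

def multiscale_descriptor(window_bgr, scales=(1.0, 0.5, 0.25), block=4):
    """Hypothetical multiscale ACF descriptor for a 64x64 upper-body window:
    compute the aggregated channels at each scale (no intra-octave
    approximation here) and concatenate them into one feature vector."""
    feats = []
    for s in scales:
        img = cv2.resize(window_bgr, None, fx=s, fy=s)
        feats.append(acf_channels(img, block=block).ravel())
    return np.concatenate(feats)   # (256 + 64 + 16) cells x channels
```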
4.2 Classification by Sparse Representation

The upper body pose classification method adopted is the same as in (Chen et al., 2011), itself inspired by work on face recognition (Wright et al., 2009). We build the training set matrix $F = [F_1, F_2, \ldots, F_k]$, where $F_i \in \mathbb{R}^{m \times n_i}$ ($1 \le i \le k$) is the matrix associated with pose label $i$, composed of $n_i$ feature vectors of size $m$. A new feature vector $y$ can be expressed as a linear combination of the training features:

$$y = F_1 a_1 + F_2 a_2 + \ldots + F_k a_k = F a_0 \quad (2)$$

where $a_0 = [a_1, a_2, \ldots, a_k]^T$ is the concatenation of the coefficient vectors associated with each class. Among the coefficients of $a_0$ obtained by solving the equation $F a = y$, only those related to the actual pose of the new feature vector should be non-zero, demonstrating the sparsity of this decomposition. An easy way to solve this problem is the $\ell_1$-minimization approach:

$$a^* = \arg\min_a \|a\|_1 \ \text{ subject to } \ F a = y, \quad (3)$$

which can be solved using the pseudoinverse of $F$. Given this decomposition, for each class $i$ we can calculate its pose probability $\rho_i(y)$ as:

$$\rho_i(y) = \|a^*_i\|_1 / \|a^*\|_1, \quad (4)$$

where $a^*_i$ is the block of $a^*$ associated with class $i$. Classifying $y$ then consists of finding the class with the maximal pose probability:

$$\mathrm{class}(y) = \arg\max_i \rho_i(y) \quad (5)$$
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
442
5 EXPERIMENTS AND RESULTS
In this section, the different experimental considerations and the evaluations carried out, along with the obtained results, are presented in detail.
5.1 Feature Set Considerations
The ACF features are primarily used in both the de-
tection and body pose classification modules. Utiliz-
ing the same family of features is advantageous as it
avoids further computational overhead in the global
process. To see the effects of different channels on the
detector and pose classifier modules, we evaluate dif-
ferent constituent channel feature combinations listed
below:
Magnitude gradient and six channels of histogram
of oriented gradients (GM+HOG or a),
Color channels, magnitude gradient, and his-
togram of oriented gradients (Clr+GM+HOG or
b),
Color channels, magnitude gradient, histogram
of oriented gradients, and local binary pattern
(Clr+GM+HOG+LBP or c),
Magnitude gradient, histogram of oriented gradients, and local binary pattern (GM+HOG+LBP or d).
In the body pose classifier, the gray-scale channel is
used instead of the 3 LUV color channels because cur-
rent publicly available training sets for pose classifi-
cation are mainly composed of gray-scale images.
5.2 Dataset and Implementation
Specifications
In both modules, the upper body windows considered are square windows cropped from standard pedestrian windows: the same width and the top third of their height (Fig. 5). Hence, a 64×64 base window di-
mension is adopted. It has been observed that using a square window covering a third of the pedestrian window, rather than half, leads to a marginal loss in overall detection performance on public datasets, but it allows detecting the head and shoulders in situations where the person is close to the robot.
5.2.1 Upper Body Detector
Dataset. To train the upper body detector, we use
the INRIA Dataset (Dalal and Triggs, 2005). The
dataset contains 614 positive images containing 1237
annotated pedestrians in various situations including
crowds. The negative samples are randomly selected
from the 1218 people-free images. As in (Ferrari et al., 2008), the training set is augmented by perturbing the positive samples with slight rotations (3 rotations in steps of 3°) and mirror reflections. The positive training set is hence augmented 6 times, amounting in total to around 7000 samples. Introducing these variations allows better generalization of the classifier.
To test the trained detector, we use the InriaLite
dataset, a subset of the INRIA person dataset con-
taining 145 outdoor images with 219 persons in to-
tal, most of them entirely visible and viewed approx-
imately from the front or from the back.
Detector Training. The training of the soft cas-
cade used 2048 depth-two trees in four bootstrapping
rounds in which misclassified negative samples are reused in the training process.
5.2.2 Upper Body Pose Classifier
Dataset. To train and test the body pose classifier,
we use the TUD Multiview Pedestrians dataset (An-
driluka et al., 2010). The dataset contains 400 to 749
annotated pedestrians in each class. It also contains
a total of 248 validation and 248 test annotated images, both of which are used here for testing, resulting in a combined 496 test instances.
Pose Classification Training. As presented in section 4.2, the training matrix $F$ depends on the number of training samples per class, $n_i$, and on the dimensionality $m$ of the feature vector, which varies throughout our feature evaluation process. The parameters affecting $m$ are specific to the features used. For the ACF, it depends on the number of channels, which varies between 7 and 9 in the classification module, and on the block size used to divide the channels during the summation step. Here, we use a fixed block size of 4×4 pixels. To make sure that the body pose classifier does not over-fit, we test classifiers trained with differing numbers of samples $n_i$, equal for each class and varied between 200 and 400. Finally, the classifier is trained using the whole training set (variable number of samples per class).
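As a concrete check on the dimensionality $m$ (assuming the three scales are spaced one octave apart): a 64×64 window aggregated in 4×4 blocks gives 16×16 = 256 cells per channel, so the 7-channel combination (a) yields 256 × 7 = 1792 features; the two coarser scales contribute a further 8×8 = 64 and 4×4 = 16 cells per channel, giving (256 + 64 + 16) × 7 = 2352 features in the multiscale case. These figures are consistent with the dimensionalities reported in Table 1 (section 5.4).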
5.3 Evaluation Metrics
Detection Evaluation. The detection evaluation pro-
tocol used is the same as in the PASCAL Visual Ob-
ject Classes Challenge (Everingham et al., 2010). The
detection system returns a series of detected windows
after analyzing the image. These windows are obtained after multiscale detection and non-maximum suppression, which avoids the accumulation of nearby windows around a person. Given a detected bounding box ($BB_{dt}$) and a ground truth bounding box ($BB_{gt}$), we consider the detection as correct if the overlap between the two windows exceeds 50%:

$$\text{overlap} = \frac{|BB_{dt} \cap BB_{gt}|}{|BB_{dt} \cup BB_{gt}|} > 0.5 \quad (6)$$

The unassigned $BB_{dt}$ and $BB_{gt}$ are counted respectively as false positives and false negatives. The comparison of the detectors is realized as in (Dollár et al., 2012), plotting the log-average miss rate (MR) against the number of false positives per image (FPPI).
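For reference, the overlap test of Eq. (6) amounts to a few lines; this sketch assumes boxes given as (x1, y1, x2, y2) corner coordinates.

```python
def pascal_overlap(dt, gt):
    """PASCAL overlap criterion of Eq. (6); boxes are (x1, y1, x2, y2).
    Returns True when intersection-over-union exceeds 0.5."""
    ix = max(0.0, min(dt[2], gt[2]) - max(dt[0], gt[0]))
    iy = max(0.0, min(dt[3], gt[3]) - max(dt[1], gt[1]))
    inter = ix * iy
    union = ((dt[2] - dt[0]) * (dt[3] - dt[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union > 0.5
```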
Pose Classification Evaluation. The classifica-
tion is evaluated using confusion matrices where the columns correspond to the predicted classes and the rows to the ground truth classes. Detections concentrated along the diagonal indicate good performance. We extract the classification accuracy per class as the correctly classified instances of the class normalized by all the instances classified as that class, then average over all classes; this is our first performance criterion, accuracy 1 (acc.1). We consider a second criterion, accuracy 2 (acc.2), where predictions falling in one of the two adjacent classes are also considered correct, as in (Andriluka et al., 2010).
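Both criteria can be computed directly from the confusion matrix. The sketch below reflects one plausible reading of the definitions above (acc.1 normalizes the diagonal by column sums, acc.2 counts the two cyclically adjacent orientations as correct per ground-truth row); the names are ours.

```python
import numpy as np

def pose_accuracies(conf):
    """acc.1 and acc.2 from an 8x8 confusion matrix `conf` (rows: ground
    truth, columns: predictions). Assumes every class is predicted at
    least once; adjacency wraps around since the 8 orientations are cyclic."""
    conf = np.asarray(conf, dtype=float)
    k = conf.shape[0]
    acc1 = np.mean(np.diag(conf) / conf.sum(axis=0))
    correct2 = [conf[i, i] + conf[i, (i - 1) % k] + conf[i, (i + 1) % k]
                for i in range(k)]
    acc2 = np.mean(np.array(correct2) / conf.sum(axis=1))
    return acc1, acc2
```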
5.4 Results
Upper Body Detection. To compare the perfor-
mance of the proposed upper body detector variants
(using different constituent channel features) with no-
table approaches in the literature, we evaluated three other upper body detectors: the OpenCV implementation of the Viola and Jones detector (Viola and Jones, 2001), our implementation of the Dalal and Triggs (2005) HOG-SVM detector, and the Deformable Parts Model (DPM) based detector proposed by the Calvin group (Eichner et al., 2012). All detectors are trained with an identical training set.
Figure 6: Log-average miss rate on the InriaLite dataset.

As can be seen from the results depicted in figure 6, all the ACF based detectors significantly outperform the HOG-SVM and Viola and Jones based upper body detectors.
The best results are obtained by the original ACF combination of the color, gradient magnitude, and histogram of oriented gradient channels, which also outperforms the Calvin detector (the best upper body detector in the literature), achieving a 30.34% log-average miss rate. This detector records a 0.18 miss rate at an average of 1 FPPI, which could be further improved by a filtering mechanism. The combination of gradient magnitude and histogram of oriented gradients is the simplest channel combination used, and it illustrates how informative the gradient orientation features are. Adding only the LBP channel decreases the performance of the detector, whereas using all the presented channel features improves it slightly, though it tends to over-fit the detector due to the increased number of features. Computationally, with unoptimized Matlab based code, all the ACF based detectors run at approximately 12.5 fps on 640×480 images on an Intel Core i7 machine using a single thread. This is sufficient for most real-time robotic application requirements and is much higher than the 0.63 fps achieved by the Calvin detector.
Upper Body Pose Classification. The upper body
pose classifier is evaluated using the combined 496
annotated samples from the TUD multiview pedes-
trian dataset. We generate classification matrices for
each type of features (11 in total) and consider the two
accuracies: acc.1 and acc.2. We evaluate the perfor-
mance of the proposed approach based on ACF along
with their multiscale variants and compare it with
the approach using multi-level HOG features (Chen
et al., 2011). Corresponding results are shown in fig-
ures 7(a) and 7(b). Figure 7(a) shows the variations in the performance of the classifiers on the test set as the data used for training is varied from 200 to 400 samples per class. The results confirm that, as the training dataset is increased, there is no observed over-fitting.
And indeed, the best results are obtained when using
all the data in the training set (which corresponds to
extreme data points in the plot). The confusion matrix
depicted in figure 7(b) corresponds to the best classi-
fier which is the multiscale ACF: GM+HOG+LBP.
Figure 7: (a) Classification accuracy for upper bodies, acc.1 (lower curves) and acc.2 (upper curves), plotted as a function of the number of samples per class; the extreme data points correspond to using the whole training set. (b) Confusion matrix for upper body pose estimation using multiscale ACF: GM+HOG+LBP.

Table 1: Upper body pose classification (upper body, 64×64).

Approach                             dim.    acc.1   acc.2
Multi-level HOG                       756    0.30    0.67
Aggregated Channel Features (a)      1792    0.30    0.65
Aggregated Channel Features (b)      2048    0.27    0.65
Aggregated Channel Features (c)      2304    0.36    0.70
Aggregated Channel Features (d)      2048    0.37    0.69
Multiscale ACF (a)                   2352    0.28    0.63
Multiscale ACF (b)                   2688    0.28    0.66
Multiscale ACF (c)                   3024    0.36    0.70
Multiscale ACF (d)                   2688    0.40    0.70

Table 1 shows the dimensionality and the accuracies obtained for upper body pose classification with the different combinations of ACF, with and without multiscale computation, in addition to the multi-level HOG features. The results for the combination of gradient magnitude and histogram of oriented gradients are close to those using the multi-level HOG features, as expected, since they use similar information. But, contrary to the detection module, the improvement here comes from the texture information of the LBP channel, whereas the addition of the grayscale color channel decreases the results. The use of multiscale information further improves the accuracies. The best exact precision is also obtained by the multiscale ACF: GM+HOG+LBP.
The confusion matrix is well filled, with the scores concentrated on the diagonal. Some errors come from estimations misclassified into adjacent classes: the North-East and North-West orientations are often confused with the North orientation, for example. These errors are accounted for by the second accuracy criterion, which stays around 66 percent whatever the kind of features used. We also note that the best pose estimation is for the North orientation, where the face is not visible, whereas the scores are lower for the other orientations where the face is visible. Even the symmetric confusions are less visible because of the weak estimations along each orientation. These estimations give a first idea of the context in the image but would need to be improved. Sample correct and incorrect classifications and pose estimates are shown in figure 8.
6 CONCLUSIONS
In this work, we presented two important percep-
tual components using 2D images, upper body detection and upper body pose classification, that have pertinent applications in machine perception of humans. The presented upper body detector based on ACF features achieves state-of-the-art detection results, improving on the previous best detector by 2% in log-average miss rate while being roughly 20x faster. We also presented upper body pose classification based on sparse representation using single-scale and multiscale ACF features. Generally, the pose classification results showed accuracy comparable to the best approach in the literature. The results were further improved by the addition of the LBP channel features in the ACF framework, leading to a 70% accuracy (acc.2), outperforming the best approach in the literature. Hence, the implemented perceptual functionalities based on ACF lead to state-of-the-art performance taking both accuracy and speed into account. In future work, the presented functionalities will be coupled with a stochastic filtering approach to further improve the results, and will be ported to our mobile robot platforms for human intention detection.
ACKNOWLEDGEMENTS
This work has been funded by the ROMEO2 (http://www.projetromeo.com/) and RIDDLE projects: by BPIFrance in the framework of the Structuring Projects of Competitiveness Clusters (PSPC), and by a grant from the French National Research Agency under grant number ANR-12-CORD-0003, respectively.

Figure 8: Some illustrations of upper body detections from the InriaLite dataset (a) and pose estimations from the TUD Multiview Pedestrians set (b).
REFERENCES
Andriluka, M., Roth, S., and Schiele, B. (2010). Monocular 3D pose estimation and tracking by detection. In IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 623–630.
Baltieri, D., Vezzani, R., and Cucchiara, R. (2012). People
orientation recognition by mixtures of wrapped distri-
butions on random trees. In European Conference in
Computer Vision (ECCV), pages 270–283.
Bazzani, L., Cristani, M., Tosato, D., Farenzena, M.,
Paggetti, G., Menegaz, G., and Murino, V. (2013).
Social interactions by visual focus of attention in
a three-dimensional environment. Expert Systems,
30(2):115–127.
Bourdev, L. and Brandt, J. (2005). Robust object detection
via soft cascade. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 236–
243.
Chen, C., Heili, A., and Odobez, J.-M. (2011). Combined
estimation of location and body pose in surveillance
video. In IEEE Advanced Video and Signal-Based
Surveillance (AVSS), pages 5–10.
Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
volume 1, pages 886–893.
Dollár, P., Appel, R., Belongie, S., and Perona, P. (2014).
Fast feature pyramids for object detection. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 36(8):1532–1545.
Dollár, P., Wojek, C., Schiele, B., and Perona, P. (2012).
Pedestrian detection: An evaluation of the state of the
art. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence, 34(4):743–761.
Eichner, M., Marin-Jimenez, M., Zisserman, A., and Fer-
rari, V. (2012). 2D articulated human pose estimation
and retrieval in (almost) unconstrained still images.
International Journal of Computer Vision, 99(2):190–214.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J.,
and Zisserman, A. (2010). The pascal visual object
classes (voc) challenge. International journal of com-
puter vision, 88(2):303–338.
Ferrari, V., Marin-Jimenez, M., and Zisserman, A. (2008).
Progressive search space reduction for human pose es-
timation. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), pages 1–8.
Hu, R., Wang, R., Shan, S., and Chen, X. (2014). Ro-
bust head-shoulder detection using a two-stage cas-
cade framework. In International Conference on Pat-
tern Recognition (ICPR).
Jafari, O. H., Mitzel, D., and Leibe, B. (2014). Real-time
RGB-D based people detection and tracking for mo-
bile robots and head-worn cameras. In International
Conference on Robotics and Automation (ICRA’14).
Katzenmaier, M., Stiefelhagen, R., and Schultz, T. (2004).
Identifying the addressee in human-human-robot in-
teractions based on head pose and speech. In Interna-
tional Conference on Multimodal Interfaces (ICMI),
pages 144–151.
Li, M., Zhang, Z., Huang, K., and Tan, T. (2009). Rapid and
robust human detection and tracking based on omega-
shape features. In International Conference on Image
Processing (ICIP), pages 2545–2548.
Mekonnen, A. A., Lerasle, F., and Zuriarrain, I. (2011).
Multi-modal person detection and tracking from a mo-
bile robot in a crowded environment. In International
Conference on Computer Vision Theory and Applica-
tions (VISAPP’11), pages 511–520.
Sheikhi, S. and Odobez, J.-M. (2012). Recognizing the vi-
sual focus of attention for human robot interaction.
In Human Behavior Understanding, pages 99–112.
Springer.
Viola, P. and Jones, M. (2001). Rapid object detection using
a boosted cascade of simple features. In IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), volume 1, pages I–511.
Weinrich, C., Vollmer, C., and Gross, H.-M. (2012). Es-
timation of human upper body orientation for mobile
robotics using an svm decision tree on monocular im-
ages. In IEEE/RSJ International Conference on Intel-
ligent Robots and Systems (IROS), pages 2147–2152.
Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., and Ma,
Y. (2009). Robust face recognition via sparse repre-
sentation. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 31(2):210–227.
Zeng, C. and Ma, H. (2010). Robust head-shoulder detec-
tion by pca-based multilevel HOG-LBP detector for
people counting. In International Conference on Pat-
tern Recognition (ICPR), pages 2069–2072.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
446