3D Point Cloud Descriptor for Posture Recognition
Margarita Khokhlova, Cyrille Migniot and Albert Dipanda
Le2i, FRE CNRS 2005, Univ. Bourgogne Franche-Comté, France
Keywords:
3D Descriptor, Point Cloud Structure, 3D Posture.
Abstract:
This paper introduces a simple yet powerful algorithm for global human posture description based on 3D Point
Cloud data. The proposed algorithm preserves spatial contextual information about a 3D object in a video
sequence and can be used as an intermediate step in human-motion-related Computer Vision applications
such as action recognition, gait analysis and human-computer interaction. The proposed descriptor captures a
point cloud structure by means of a modified 3D regular grid and the corresponding cells' space occupancy
information. The performance of our method was evaluated on the tasks of posture recognition and automatic
action segmentation.
1 INTRODUCTION
3D pose estimation is a common task in Computer Vision applications. In the case of a rigid object, pose
estimation seeks to capture the appearance of an object under certain viewing conditions. This task is
challenging for natural images due to the ambiguity of an object representation in 2D, poor texture and
varying view-points. With the introduction of consumer 3D sensors, this problem has been revisited by
researchers developing a broad range of new descriptors. They may be either handcrafted (Hinterstoisser et al.,
2012) or automatic (Wohlhart and Lepetit, 2015), and capture information at both global and local scales.
Non-rigid object pose estimation is inherently more complicated. A human body is an articulated
object, and its motion can be built up from rigid and non-rigid motion parts. Articulated pose estimation
seeks to estimate the configuration of a human body in a given image or video sequence. Recognition of
body postures is an important step towards the fully automatic classification of human motion.
A canonical work on human posture estimation using RGBD camera data is by Shotton et al. (Shotton
et al., 2013), which proposes a real-time algorithm that segments a human body from a corresponding
depth map and locates skeleton joints. This algorithm shows good results and its variations are widely used
today. However, it has certain limitations: in the presence of severe occlusions and noise, the positions of the
joints cannot be estimated correctly; it gives approximate joint positions and therefore coarse pose estimation
and is not able to capture very subtle variations between postures. For this reason, joint-based posture
estimation methods, although simple and powerful, will fail if the initial joints were estimated wrongly,
which gives way to methods based on low-level attributes.
This paper proposes a simple yet effective descriptor for pose recognition based directly on point cloud
data. The algorithm takes a holistic pose estimation approach, capturing the slightest posture changes
using accumulated point cloud features. Our descriptor is based on the space occupancy of the cells of
a modified 3D regular grid super-imposed on a point cloud. It is translation, scale, and rotation invariant.
Originally, we aimed at a descriptor that can be used for gait analysis. The proposed design should
be able to reliably detect different postures in human gait, where the precision of skeleton data is not
sufficient (the Kinect reliability is evaluated by (Cippitelli et al., 2015) for the side view and by (Mentiplay
et al., 2015) for the front view). The second problem addressed is the symmetry of the gait, which should be
evaluated based on the point cloud data. However, the resulting descriptor is very general and can be used as an
intermediate step in a great number of computer vision applications such as action recognition, gait analysis,
smart homes, assessing the quality of sports actions, human-computer interaction and others, where posture
estimation is an essential intermediate step. This work presents the descriptor in the context of action
recognition, and postures are estimated from frames of video sequences from the MSR Action3D database.
The paper is organized as follows. Section 2 overviews existing methods for human posture recognition.
Section 3 introduces the descriptor and its parameters. Section 4 describes the data used in the experiments
presented in Section 5. Section 6 summarizes the results, proposes possible applications and outlines future work.
2 RELATED WORK
Most methods for human pose estimation are based on variations of a so-called pictorial structures
model, which represents the human body configuration as a collection of connected rigid parts (Chen
and Yuille, 2014)(Jhuang et al., 2013)(Agarwal and Triggs, 2006)(Pishchulin et al., 2013). To model an
articulation, parts of the structure are parameterized by their spatial location and orientation.
Holistic approaches (Agarwal and Triggs, 2006)(Pishchulin et al., 2013)(Vieira et al., 2012)
and middle-part based methods (Yang and Ramanan, 2011) form the other research direction in posture
recognition. Holistic approaches aim to directly predict positions of body parts from image features
without relying on an intermediate part-based representation. Part-based approaches first detect
intermediate parts independently or with some constraints on the spatial relations of body joints.
Recently, researchers have significantly advanced posture recognition from natural images with the increasing
popularity of machine learning based approaches (Tompson et al., 2014)(Chéron et al., 2015)(Chen and
Ramanan, 2017). Chéron et al. (Chéron et al., 2015) proposed a new Pose-based Convolutional Neural
Network descriptor (P-CNN) for 2D action recognition. A pre-trained CNN learns the features corresponding
to 5 pre-selected body parts based on quantized motion flow data for each frame. Chen and Ramanan
(Chen and Ramanan, 2017) extend an estimated 2D model, using a neural network, to 3D using a
simple Nearest Neighbor pose matching algorithm. A good review of recent advances in 3D articulated pose
estimation is given by Sarafianos et al. (Sarafianos et al., 2016). Posture recognition is a part of action
recognition, since actions can be modeled as an evolution of postures in time. Recent works on action
recognition are based on CNNs (Han et al., 2017)(Lan et al., 2017) and learn the features automatically,
which leads to state-of-the-art results on available datasets.
Despite the significant progress made, full-body pose estimation from natural images remains a difficult
and largely unsolved problem due to numerous difficulties in real-life applications: the many degrees
of freedom of the human body model, the variance in appearance, the changes in viewpoints, and lastly, the
absence of data about an object's shape. 3D data provide important new information which allows for
improving posture recognition results. Depth-based pose estimation can be categorized into two classes.
Generative approaches (Ye and Yang, 2014)(Ganapathi et al., 2012) use a geometric or probabilistic
human body model and estimate a pose by minimizing the distance between the human model and the
input depth map. Human pose estimation is performed by optimizing the objective function for geometric
model fitting by means of variants of iterative closest point (Ganapathi et al., 2012), graphical
models (Li et al., 2014) or pictorial structures (Charles and Everingham, 2011). A recent method by
Wang et al. (Wang et al., 2016) uses several handcrafted descriptors to recognize 5 distinct postures
from the data obtained by a Kinect camera. Their algorithm is based on a simple 3D-2D projection method
and the star skeleton technique. The final posture descriptor is composed of skeleton feature points
together with a center of gravity. A pre-trained Learning Vector Quantization (LVQ) neural network is used for
classification.
Discriminative approaches (Shotton et al., 2013)(Yub Jung et al., 2015) perform classification
on a pixel level and attempt to detect instances of body parts. Shotton et al. (Shotton et al., 2013)
trained a random forest classifier for body part segmentation from a single depth image and used
Mean Shift (Comaniciu and Meer, 2002) to estimate joint locations. Chang et al. (Chang and Nam, 2013)
propose a fast random-forest-based human pose estimation method, where the classifier is applied directly to
pixels of the segmented human depth image. Jung et al. (Yub Jung et al., 2015) used randomized regression
trees and made their algorithm even faster by estimating the relative direction to each joint, avoiding
the computationally demanding aggregation of pixel-wise tree evaluations. The obtained skeleton data can later
be used as the basis for action recognition in videos, as in the recent Log-COV-Net method (Cavazza et al.,
2017).
Most of the work on 3D pose estimation uses a single depth camera. The most successful examples
of single view pose estimation are (Shotton et al., 2013)(Ye and Yang, 2014)(Yub Jung et al.,
2015)(Chang and Nam, 2013), and most of them use randomized trees and shape context features for pixel-wise
classification, which leads to real-time solutions.
Lately, multi-view depth image based posture recognition approaches have acquired the attention of
researchers (Shafaei and Little, 2016)(Peng and Luo, 2016). The recent framework proposed by (Shafaei
and Little, 2016) uses several Kinect sensors and a deep CNN architecture. Multi-view scenarios allow
reconstructing 3D point clouds in the reference space. The authors use curriculum learning (Bengio et al.,
2009) to train the system on purely synthetic data.
Curriculum learning modifies the order of the training procedure, gradually increasing the complexity of the
instances, which hypothetically improves the convergence speed and the quality of the final local minima.
It is clear that the currently prevailing strategy is to use Machine Learning methods, specifically randomized
trees (Shotton et al., 2013)(Yub Jung et al., 2015)(Tang et al., 2014), and a huge amount of
training data. Modern posture recognition methods (Shotton et al., 2013)(Shafaei and Little, 2016) have
been shown to be both effective and efficient in real-time posture estimation. Similarly, for the subsequent action
recognition from videos, hand-crafted methods were overshadowed by deep learning based methods (Lan
et al., 2017).
This work introduces a new descriptor that estimates 3D human pose from a single point cloud. We are
not attempting to outperform machine-learning based algorithms (Shotton et al., 2013)(Yub Jung et al.,
2015), but rather propose a simple alternative which does not require an a priori human body model. In
contrast to (Wang et al., 2016), we do not use a descriptor tailored to a given posture but aim to capture the
general 3D point cloud structure. Unlike other popular descriptors (Shotton et al., 2013)(Yub Jung et al., 2015)
which use depth image features, our descriptor is based on a 3D structure and therefore can be used in a
multi-camera scenario.
3 DESCRIPTOR
We propose a handcrafted, compact and discriminative descriptor for a single point cloud. The most similar
descriptor to ours is the Space-Time Occupancy Patterns method proposed by Vieira et al. (Vieira et al.,
2012) for the task of action recognition. Similarly to this work, we propose to divide the 3D space by a
regular grid and base our descriptor on spatial occupancy information. However, in (Vieira et al., 2012)
the researchers compute the final descriptor vector by re-assigning weights based on the cells where motion
occurred. We concentrate on the description of each static frame in order to recognize the posture in it. Other
differences include the method of 3D space partitioning and descriptor cell initialization. Our partitioning
is inspired by the 3D partitioning for human recognition from 3D point clouds proposed in (Essmaeel
et al., 2016). Vieira et al. specifically design their method for video sequences, taking the time dimension
into account. We assume that every initial frame posture is more important and that temporal information
can be encoded later in the process, depending on the specific application. For gait analysis and action
recognition, a Hidden Markov Model can be coupled with the descriptor to capture the temporal information.
To construct our descriptor for each depth map video frame, we perform the following steps. First, the
2D-3D transformation is done to obtain a point cloud in 3D space from a depth map. We use a standard
equation for basic geometric transformations:
X = Z (j - c_x) / f_x ;    Y = Z (i - c_y) / f_y ;    Z = z        (1)
where X, Y, Z are the point coordinates in 3D, j and i are the pixel coordinates, and c_x, c_y, f_x and f_y are the
intrinsic matrix parameters obtained by a calibration of the Kinect camera.
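For illustration, a minimal sketch of this back-projection in Python, assuming a dense depth map in metres and purely hypothetical intrinsic values (the function name and the numbers are placeholders, not the calibration actually used):

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, metres) to an N x 3 point cloud
    using the pinhole model of Eq. (1); zero-depth pixels are discarded."""
    h, w = depth.shape
    i, j = np.indices((h, w))          # i: row index, j: column index
    z = depth
    x = z * (j - cx) / fx
    y = z * (i - cy) / fy
    points = np.stack((x, y, z), axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]    # keep only valid (non-zero) depths

# hypothetical Kinect-like intrinsics, for illustration only
cloud = depth_to_point_cloud(np.random.rand(240, 320),
                             fx=285.0, fy=285.0, cx=160.0, cy=120.0)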
Then the 3D spatial partitioning is performed. The center of gravity in 3D is calculated and projected onto the ground plane:
C(X, Y, Z) = (1/n) Σ_{k=1}^{n} (X_k, Y_k, Z_k)        (2)
where n is the total number of points in the point cloud. A 3D cylinder of varying dimensions, with its base
center at the computed centroid projection, defines the space partitioning limits. The height and radius of the
cylinder vary to adjust for the height of a person, using data about human body proportion ratios. The height
of a person is estimated simply via the minimum and maximum values calculated for the first static point cloud
of a video sequence. To have an equal grid for all frames of a video sequence, the normal is fixed based on the
viewing point. The partitioning in sectors starts from the same position for each video frame.
Figure 1: 3D spatial partitioning in 12 sections. The projected center of gravity is shown in red; the fixed viewpoint direction is shown by a green arrow.
Figure 2: Descriptor spatial partitioning: 3 circles, 8 sectors, 3 sections. The projected center of gravity is shown in red.
Figure 1 shows an example of 3D partitioning for one of the frames from the MSR Action 3D dataset. The only
parameters of the descriptor are the number of sections, the number of sectors and the number of circles. A
visualization of the descriptor parameters is shown in Figure 2. For this work, we use only a uniform space
subdividing scheme, and the cylinder volume partitioning is then performed as:
V = 2πRH (3)
r_n = R / n_c ;    h_n = H / n_h ;    s_angle = 360 / n_s        (4)
where R and H are the fixed radius and height, r_n, h_n and s_angle are the circle and height intervals and the
angle corresponding to each sector.
The final descriptor is obtained by calculating the number of points in each formed 3D cell, i.e. the cell
occupancy. The descriptor is normalized by the total number of points in the point cloud in order to
compensate for possible noise or shape differences.
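As a summary of the whole construction, the sketch below follows Eqs. (2)-(4) under several simplifying assumptions that are ours, not necessarily the authors': Y is taken as the vertical axis, the height and radius are estimated from the given cloud (in the method they are fixed from the first frame of a sequence), and the cell ordering is arbitrary.

import numpy as np

def cylindrical_occupancy_descriptor(points, n_sections=12, n_circles=10,
                                     n_sectors=10, ref_angle=0.0):
    """Normalized cell-occupancy descriptor on a cylindrical grid centred on
    the projected centre of gravity (Eq. 2); cell sizes follow Eq. (4)."""
    c = points.mean(axis=0)                       # centre of gravity, Eq. (2)
    # assumption: Y is vertical; the cylinder base is the centroid projected to the ground
    y_min, y_max = points[:, 1].min(), points[:, 1].max()
    H = y_max - y_min                             # in the method: fixed from the first frame
    dx, dz = points[:, 0] - c[0], points[:, 2] - c[2]
    rad = np.sqrt(dx ** 2 + dz ** 2)
    R = rad.max()                                 # in the method: fixed from the first frame

    # cell indices: vertical section, concentric circle, angular sector
    sec = np.clip(((points[:, 1] - y_min) / H * n_sections).astype(int), 0, n_sections - 1)
    cir = np.clip((rad / R * n_circles).astype(int), 0, n_circles - 1)
    ang = (np.degrees(np.arctan2(dz, dx)) - ref_angle) % 360.0   # sectors start from a fixed view direction
    sect = np.clip((ang / 360.0 * n_sectors).astype(int), 0, n_sectors - 1)

    hist = np.zeros((n_sections, n_circles, n_sectors))
    np.add.at(hist, (sec, cir, sect), 1)
    return hist.ravel() / len(points)             # normalize by the total number of points

desc = cylindrical_occupancy_descriptor(np.random.rand(5000, 3))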
The OpenNI framework (Consortium et al.) is used by many 3D cameras and provides the user with
automatic body recognition and skeleton joint extraction functionality. Therefore, we do not address
the task of background subtraction in our work and assume that it is a prior step. For this paper, data
from an RGBD camera, where the human is located and the background is subtracted, were used to test the
proposed descriptor.
Our descriptor design allows it to be used in a multiple camera view scenario to grant a more reliable
and accurate pose description. For example, such partitioning was successfully employed earlier for human
recognition from complete point clouds (Essmaeel et al., 2016) based on histograms of normal orientations.
4 TRAINING AND TESTING DATA
The MSR Action3D Dataset (Li et al., 2010) was selected to perform the experiments and evaluate the proposed
descriptor. This is one of the most used RGBD human action-detection and recognition datasets. It is
also one of the first RGBD datasets capturing motions (dated 2010) and it contains a large number of
different actions performed by different persons. It consists of 20 action types performed by 10 subjects 2 or
3 times. The actions are: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high
throw, draw an x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side
kick, jogging, tennis swing, tennis serve, golf swing, pick up & throw. The resolution of the video is not
very high, namely 320x240, and so is the frame rate, namely 15 fps. The data were recorded with a depth
sensor similar to the Kinect device and contain color and depth video sequences. The sequences are
pre-segmented into background and foreground. An example of superimposed point clouds corresponding
to 3 actions from the MSR Action 3D dataset is shown in Figure 3. Skeleton joint data are also provided,
with a higher framerate than the depth maps. However, many joints are wrongly estimated, as can be
seen in Figure 4.
Figure 3: Three actions from the MSR Action 3D dataset shown as point clouds: high arm wave, horizontal wave, golf swing.
For our experiments, we had to further manually segment the dataset into key postures in 3D. There is no
accurate, publicly available database with full-body human poses as depth maps; despite several works where
the features which represent the posture are learned from real and synthetic examples (Shotton et al.,
2013)(Ganapathi et al., 2010), neither the data nor the implementation of these methods are available.
Recently a new multi-Kinect posture dataset was published (Shafaei and Little, 2016); however, it is huge
and is dedicated not to global pose estimation but to body part segmentation. Since we are not using any
deep learning and propose a handcrafted descriptor, we considered that the well-known and widely used MSR
Action 3D dataset would be sufficient to perform the training and tests and show the capabilities and
limitations of our method.
Figure 4: Examples of wrong skeleton estimation for the MSR Action 3D dataset, actions 'High Arm Wave', 'Horizontal Arm Wave', 'Hammer'. The person is always facing the camera straight on and his legs are not crossed.
Table 1: Postures selected from the MSR3D dataset.

   Posture                             Training  Test
1  Staying relaxed                     160       54
2  Forward Kick                        102       50
3  Hand lifted 45                      29        13
4  Right hand up                       137       64
5  Right hand to the left              80        71
6  Clap                                59        25
7  Hands wide open                     35        13
8  Pick from the ground                33        34
9  Half bend                           80        53
10 Full bend                           65        69
11 Right leg kick                      60        40
12 Right leg kick on side              49        34
13 Throw from the back                 134       78
14 Right hand up                       42        28
15 Both hands left half bend           62        30
16 Both hands to the left half bend    62        30
17 Both hands to the right half bend   88        37
18 Throw from the front                54        33
In this work, we aim to perform pose recognition without skeleton alignment or human body part
segmentation. The number of sequences for each action in the MSR Action 3D dataset is between 27 and 30.
We separated the data into a training and a testing set, and selected between 3 and 7 key poses for each
action. For the posture recognition test, 18 well distinguishable poses were selected. The resulting dataset
structure is given in Table 1. All the data acquired from person 3 were excluded from the dataset because
half of the depth information was missing. Subjects 1-7 from the dataset are used for training and 8-10 for
testing. The resulting dataset is not very big but corresponds to our goal of evaluating the descriptive
capacities of the proposed solution.
5 EXPERIMENTS
We perform 3 series of experiments: unsupervised clustering of frames into k postures in a video
sequence, posture recognition in one action sequence, and posture recognition for a set of postures. The
average estimated time for the descriptor calculation (with the 2D-3D transformation performed beforehand)
is 0.2 µs on an Intel C602 machine, which is comparable to the time of the extraction of feature vectors
in (Wang et al., 2016).
5.1 Unsupervised K-means Clustering
A simplistic way to compare any two pose descriptors is to calculate the Euclidean distance between them. At
first, we observed the dynamics of the distance changes over all the frames of a single video sequence from
the dataset. Figure 5 visualizes the distance computations for the sequence 'Horizontal Wave' of the MSR
Action 3D dataset. There are 5 distinctive postures in this action. The result shows that there is a small
distance between similar postures (i.e. consecutive frames, and frames at the beginning and the end of the
sequence corresponding to the same 'neutral' posture).
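A minimal sketch of this pairwise comparison, assuming the per-frame descriptors of one sequence are stacked row-wise in a matrix (all names are illustrative):

import numpy as np

def pairwise_descriptor_distances(descriptors):
    """Euclidean distance matrix between all per-frame descriptors (F x D array)."""
    diff = descriptors[:, None, :] - descriptors[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# distances between consecutive frames stay small; peaks suggest posture changes
D = pairwise_descriptor_distances(np.random.rand(60, 1200))
consecutive = np.diag(D, k=1)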
To exploit this trend further, a simple test with K-means was performed, which shows that the descriptor
captures the posture differences well. Automatic key positions were obtained by performing K-means
clustering on 3 video sequences in which one person performs an action 3 times. The optimal number of
clusters K was estimated using the elbow method. Figure 6 shows the results of this experiment. Qualitative
visual analysis shows that the automatically detected poses correspond well with the 5 most different poses
in the action 'Horizontal Wave' selected manually. These tests work well for each person performing a
single action multiple times, but the test on the whole data gives worse results, probably due to the
fact that the neutral posture is dominant in the dataset and people tend to perform similar actions differently.
Hence, we obtain more intermediate clusters which do not correspond precisely to key postures. Nevertheless,
the obtained results are interesting enough to continue the tests and to evaluate complete posture
recognition based on the proposed descriptor.
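A possible sketch of this clustering step using scikit-learn, where the 10% inertia-drop threshold is our own crude stand-in for the elbow criterion:

import numpy as np
from sklearn.cluster import KMeans

def cluster_key_postures(descriptors, k_max=10):
    """Cluster per-frame descriptors and pick K with a simple elbow heuristic."""
    inertias, models = [], {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)
        inertias.append(km.inertia_)
        models[k] = km
    # crude elbow: first K where the relative drop in inertia falls below 10%
    drops = -np.diff(inertias) / np.array(inertias[:-1])
    k_best = int(np.argmax(drops < 0.10)) + 1 if np.any(drops < 0.10) else k_max
    return models[k_best].cluster_centers_, models[k_best].labels_

centers, labels = cluster_key_postures(np.random.rand(200, 1200))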
Figure 5: Pairwise descriptor distance. The video sequence starts and finishes with the same posture. The distance between consecutive frames is smaller and distinct 'key' positions can be viewed as peaks of the graph.
Table 2: Classification results for 5 postures of the action 'Horizontal Wave' show good results in terms of precision.

Posture     initial  arm 45  kick  arm front left  arm right
Precision   0.94     0.81    1     1               1
Recall      1        0.8     0.76  1               0.71
F-measure   0.97     0.82    0.86  1               0.83
5.2 Single Performance Action
A Support Vector Machine (SVM) classifier was trained in a One-vs-All scheme in order to classify the postures, followed by 3-fold cross validation.
Figure 6: a) Three video sequences shown as a succession of cluster centers. In the first sequence the person starts to perform the action faster than in sequences 2 and 3; b) 5 key postures of the action 'Horizontal Wave' (selected manually); c) 5 clusters obtained automatically. Pixel values are averaged: the darker the color, the higher the occurrence.
The F-measure, recall and precision were used to evaluate the performance of the classifier. The main
criterion for our task is precision, but we included recall and the F-measure in order to evaluate the
possibility of using the descriptor in a scenario where accurate retrieval of all postures is essential.
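A sketch of this training and evaluation protocol; the linear kernel and the scikit-learn implementation are our assumptions, not necessarily the setup used in the paper:

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

def evaluate_posture_classifier(X, y):
    """One-vs-all SVM over posture descriptors, scored with 3-fold cross validation."""
    clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000))
    y_pred = cross_val_predict(clf, X, y, cv=3)
    # per-class precision, recall, F-measure and support
    return precision_recall_fscore_support(y, y_pred, average=None)

# descriptors as rows, posture labels as integers (synthetic placeholders)
prec, rec, f1, support = evaluate_posture_classifier(np.random.rand(300, 1200),
                                                     np.random.randint(0, 5, 300))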
The results for the recognition of each posture of the action 'Horizontal Wave' are summarized in Table 2.
Train and test data for this sequence were segmented manually according to the scheme introduced in the
previous section. This simple test shows excellent results in terms of precision for all but one posture.
Figure 7: Confusion matrix for the SVM-based classification shows good results for all postures but one.
5.3 Set Retrieval Performance
The test for single action posture estimation shows good results, hence we conducted an extended version
of this test containing a bigger number of various postures. A full test for 18 postures was performed with
an SVM. Feature vectors of the selected postures were used for training and testing. Figure 7 shows the
confusion matrix for the classes obtained by the SVM. The descriptor parameters (number of sections,
circles and sectors) were tuned for the best performance. We obtained the best results with 12 sections, 10
circles and 10 sectors; the corresponding average precision is 0.94. The parameter tuning is straightforward and
shows that the different parameter combinations do not have much of an effect on performance. The main
observation is that, for the selected postures, the most important parameter is the number of sections, which
helps to separate the volume by vertical planes. Different combinations of parameters can give slightly
better or worse results in terms of precision, recall and F-measure. The corresponding curves obtained for
different parameters are shown in Figure 8.
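A sketch of how such a parameter sweep might look; build_desc and train_and_score are placeholders for the descriptor and classifier sketches given earlier, and the value ranges are purely illustrative:

import itertools
import numpy as np

def tune_grid_parameters(clouds_train, labels_train, clouds_test, labels_test,
                         build_desc, train_and_score):
    """Exhaustive sweep over (sections, circles, sectors); returns the best setting."""
    best = (None, -1.0)
    for n_sec, n_cir, n_sct in itertools.product([6, 12, 18], [5, 10], [5, 10]):
        X_tr = np.array([build_desc(c, n_sec, n_cir, n_sct) for c in clouds_train])
        X_te = np.array([build_desc(c, n_sec, n_cir, n_sct) for c in clouds_test])
        score = train_and_score(X_tr, labels_train, X_te, labels_test)  # e.g. mean precision
        if score > best[1]:
            best = ((n_sec, n_cir, n_sct), score)
    return best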
The results show good performance in terms of precision, which is excellent for simple postures. Our
results are comparable with the results of (Wang et al., 2016), where the authors use only 5 distinct postures:
standing, sitting, stooping, kneeling and lying. Of these, several postures are similar to ours, and we
are aiming at more complex and varied postures. The original dataset of (Wang et al., 2016) is not available,
but we also performed a test with just 3 very different postures and a similar amount of training and
testing data. As before, the training data and test data are formed from different subjects. Our postures
are: staying, right hand up, bending. The corresponding numbers of training and testing images are:
384/125, 246/125, and 98/103. With this small dataset we obtain excellent results in terms of precision
and recall; all the test samples are assigned correctly. Our results and the results from (Wang et al., 2016)
cannot be directly compared, but this test gives an idea about the descriptor capabilities.
Figure 8: Tuning of the parameters. Precision, recall and F-measure curves for: a) the number of sections varies, sectors and circles fixed to 10; b) the number of sectors varies, sections and circles fixed to 10; c) the number of circles varies, sections and sectors fixed to 10.
Wang et al. test their posture recognition method on 80-100 depth images taken for each of 8 persons. The
recognition rate is also very high, with some minor errors (for example, for the first person the recognition
rates are: 79/80, 99/100, 80/80, 80/80, 79/80). It should be mentioned that (Wang et al., 2016) use the
same subjects for testing and training, which is probably easier, as we have shown in our tests from the
previous section.
6 CONCLUSIONS
This paper shows that body pose may be adequately represented without joint estimation. The proposed
descriptor can be used exclusively, or as an advantageous addition to traditional skeleton-joint estimation
methods.
The introduced descriptor works well for capturing the 3D spatial arrangement of a point cloud structure.
Experiments show that our method achieves competitive results compared to current handcrafted
state-of-the-art descriptors. Learned or trained descriptors may give superior performance but critically
depend on the availability of large amounts of labeled data. Secondly, these architectures do not generalize
outside their initial domain. Our algorithm is a simple and elegant solution when joint information is not
available or unreliable.
Example applications include action recognition and gait analysis. For the latter, the descriptor may be
deployed for gait cycle event or symmetry detection and evaluation (Auvinet et al., 2015). Another possibility
is to divide a video along the time axis using posture information in the case of misalignment. Detected
postures can be used to temporally align the data or as key-words describing the action.
There are a number of open issues. The descriptor is noise-sensitive, which becomes more apparent if part
of the depth data is missing. Secondly, the Euclidean distance metric between two descriptor vectors
currently excludes 3D spatial information. Semantically different postures can thus result in descriptor vectors
that are very similar.
Future work will address these issues, alongside developing an application for real-time gait cycle event
recognition.
REFERENCES
Agarwal, A. and Triggs, B. (2006). Recovering 3d human
pose from monocular images. IEEE transactions on
pattern analysis and machine intelligence, 28(1):44–
58.
Auvinet, E., Multon, F., and Meunier, J. (2015). New lower-
limb gait asymmetry indices based on a depth camera.
Sensors, 15(3):4605–4623.
Bengio, Y., Louradour, J., Collobert, R., and Weston, J.
(2009). Curriculum learning. In Proceedings of the
26th annual international conference on machine le-
arning, pages 41–48. ACM.
Cavazza, J., Morerio, P., and Murino, V. (2017). When kernel methods meet feature learning: Log-covariance
network for action recognition from skeletal data. arXiv preprint arXiv:1708.01022.
Chang, J. Y. and Nam, S. W. (2013). Fast random-forest-
based human pose estimation using a multi-scale and
cascade approach. ETRI Journal, 35(6):949–959.
Charles, J. and Everingham, M. (2011). Learning shape models for monocular human pose estimation from the
microsoft xbox kinect. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops),
pages 1202–1208.
Chen, C.-H. and Ramanan, D. (2017). 3d human pose estimation = 2d pose estimation + matching. Computer
Vision and Pattern Recognition (CVPR).
Chen, X. and Yuille, A. L. (2014). Articulated pose esti-
mation by a graphical model with image dependent
pairwise relations. In Advances in Neural Information
Processing Systems, pages 1736–1744.
Chéron, G., Laptev, I., and Schmid, C. (2015). P-CNN: Pose-based cnn features for action recognition. In
Proceedings of the IEEE international conference on computer vision, pages 3218–3226.
Cippitelli, E., Gasparrini, S., Spinsante, S., and Gambi, E.
(2015). Kinect as a tool for gait analysis: validation of
a real-time joint extraction algorithm working in side
view. Sensors, 15(1):1417–1434.
Comaniciu, D. and Meer, P. (2002). Mean shift: A robust
approach toward feature space analysis. IEEE Tran-
sactions on pattern analysis and machine intelligence,
24(5):603–619.
Consortium, O. et al. OpenNI, the standard framework for 3d sensing. URL accessed on 2017-09-30.
Essmaeel, K., Migniot, C., and Dipanda, A. (2016). 3d descriptor for an oriented-human classification from
complete point cloud. In VISIGRAPP (4: VISAPP), pages 353–360.
Ganapathi, V., Plagemann, C., Koller, D., and Thrun, S.
(2010). Real time motion capture using a single time-
of-flight camera. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 755–
762.
Ganapathi, V., Plagemann, C., Koller, D., and Thrun, S.
(2012). Real-time human pose tracking from range
data. In European conference on computer vision, pa-
ges 738–751. Springer.
Han, Y., Zhang, P., Zhuo, T., Huang, W., and Zhang, Y.
(2017). Video action recognition based on deeper con-
volution networks with pair-wise frame motion conca-
tenation. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition Workshops,
pages 8–17.
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G. R., Konolige, K., and Navab, N. (2012). Model
based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In ACCV
(1), pages 548–562.
Jhuang, H., Gall, J., Zuffi, S., Schmid, C., and Black, M. J. (2013). Towards understanding action recognition. In
Proceedings of the IEEE international conference on computer vision, pages 3192–3199.
Lan, Z., Zhu, Y., Hauptmann, A. G., and Newsam, S.
(2017). Deep local video feature for action recog-
nition. In IEEE Conference on Computer Vision
and Pattern Recognition Workshops (CVPRW), pages
1219–1225.
Li, S., Liu, Z.-Q., and Chan, A. B. (2014). Heterogeneous multi-task learning for human pose estimation with
deep convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, pages 482–489.
Li, W., Zhang, Z., and Liu, Z. (2010). Action recognition based on a bag of 3d points. In IEEE Computer
Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 9–14.
Mentiplay, B. F., Perraton, L. G., Bower, K. J., Pua, Y.-H.,
McGaw, R., Heywood, S., and Clark, R. A. (2015).
Gait assessment using the microsoft xbox one kinect:
Concurrent validity and inter-day reliability of spatio-
temporal and kinematic variables. Journal of biome-
chanics, 48(10):2166–2170.
Peng, B. and Luo, Z. (2016). Multi-view 3d pose estimation from single depth images. Technical report,
Stanford University, USA. Course CS231n: Convolutional Neural Networks for Visual Recognition.
Pishchulin, L., Andriluka, M., Gehler, P., and Schiele, B.
(2013). Poselet conditioned pictorial structures. In
Proceedings of the IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 588–595.
Sarafianos, N., Boteanu, B., Ionescu, B., and Kakadiaris,
I. A. (2016). 3d human pose estimation: A review
of the literature and analysis of covariates. Computer
Vision and Image Understanding, 152:1–20.
Shafaei, A. and Little, J. J. (2016). Real-time human motion
capture with multiple depth cameras. In IEEE 13th
Conference on Computer and Robot Vision (CRV), pa-
ges 24–31.
Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., and Moore, R.
(2013). Real-time human pose recognition in parts from single depth images. Communications of the
ACM, 56(1):116–124.
Tang, D., Jin Chang, H., Tejani, A., and Kim, T.-K. (2014).
Latent regression forest: Structured estimation of 3d
articulated hand posture. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 3786–3793.
Tompson, J. J., Jain, A., LeCun, Y., and Bregler, C. (2014). Joint training of a convolutional network and a
graphical model for human pose estimation. In Advances in neural information processing systems, pages
1799–1807.
Vieira, A., Nascimento, E., Oliveira, G., Liu, Z., and Campos, M. (2012). Stop: Space-time occupancy
patterns for 3d action recognition from depth map sequences. Progress in Pattern Recognition, Image
Analysis, Computer Vision, and Applications, pages 252–259.
Wang, W.-J., Chang, J.-W., Haung, S.-F., and Wang, R.-J. (2016). Human posture recognition based on images
captured by the kinect sensor. International Journal of Advanced Robotic Systems, 13(2):54.
Wohlhart, P. and Lepetit, V. (2015). Learning descriptors
for object recognition and 3d pose estimation. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3109–3118.
Yang, Y. and Ramanan, D. (2011). Articulated pose esti-
mation with flexible mixtures-of-parts. In IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 1385–1392.
Ye, M. and Yang, R. (2014). Real-time simultaneous pose
and shape estimation for articulated objects using a
single depth camera. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition,
pages 2345–2352.
Yub Jung, H., Lee, S., Seok Heo, Y., and Dong Yun, I.
(2015). Random tree walk toward instantaneous 3d
human pose estimation. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 2467–2474.