Human Skeleton Detection from Semi-constrained Environment Video
Palwasha Afsar, Paulo Cortez and Henrique Santos
ALGORITMI Centre, Department of Information Systems, University of Minho, 4804-533 Guimaraes, Portugal
palo_afsar77@yahoo.com, {pcortez, hsantos}@dsi.uminho.pt
Keywords:
Human Action, Action Recognition, Video Data, Skeleton Detection.
Abstract:
The correct detection of the human skeleton from video is a key issue for the recognition of human actions and behavior. In this paper, we present a computational system for the passive detection of a human star skeleton from raw video. The overall system is based on two main modules: segmentation and star skeleton detection. For each module, several computer vision methods were adjusted and tested under a comparative analysis that used a challenging video dataset (e.g., with different daylight and weather conditions). The obtained results show that our system is capable of detecting human skeletons in most situations.
1 INTRODUCTION
Due to the widespread use of digital cameras, human
activity recognition from video is becoming a trendy
topic in the area of computer vision (Poppe, 2007;
Turaga et al., 2008; Enzweiler and Gavrila, 2009;
Geronimo et al., 2010; Afsar et al., 2015a). The passive and automatic recognition of human actions (e.g., walking, sitting, exhibiting interest, aggressive behavior) from monocular images and videos is a valuable tool in several computer applications, such as human-computer interaction, video content retrieval, virtual reality, analysis of sports events, video indexing and surveillance. For instance, in (Cortez et al., 2016) a computer vision system was used to monitor and forecast the entrance of customers into a commercial store, allowing an enhanced store management. Unlike other well-defined computer vision tasks (e.g., motion deblurring, edge detection), human behavior recognition (e.g., detecting interest or aggressive behavior) does not have a clear algorithmic definition and thus this task can be challenging.
Human star skeleton recognition from video is a
crucial element of several human action recognition
systems (Orrite-Urunuela et al., 2004; Fujiyoshi et al.,
2004; Chen et al., 2006; Yang and Tian, 2014; Vemu-
lapalli et al., 2016). However, most of these systems
were proposed to work in controlled environments (e.g., with humans close to the camera and wearing clothing colors that are distinctive when compared with the background). In this position paper, we present a computational system for the passive detection of a human star skeleton from raw video, built to work in a semi-constrained but more realistic environment. The final and future goal of our system is to use the detected human skeletons to create human movement features (e.g., speed or acceleration) that will feed a machine learning classifier, which will be trained to detect interesting human behaviors (e.g., walking, sitting, making a cellular call).
The proposed computer vision system is com-
posed of two main modules (Figure 1): segmentation
(including background subtraction and shadow and
highlight removal) and star skeleton detection. For
each module, we experimented with, adjusted and compared several computer vision methods (e.g., an adaptive background mixture model based on a Gaussian mixture model versus a simpler background subtraction; a thinning algorithm versus a simpler zero-crossing). We report here the results achieved so far using the video data collected in (Afsar et al., 2015b), which corresponds to a semi-constrained but realistic university campus environment. The digital videos were recorded by two cameras that captured two particular examples of interior and exterior human walking areas from our campus. Since we adopted a real environment, the recorded data includes several restrictions that pose challenges: the cameras were set in front of a glass window (thus some reflection is captured) and far away from the human walking environment (thus some humans are captured with a low pixel definition); there are different weather conditions in the exterior campus area (e.g., rain and wind) and varying illumination in both interior and exterior areas (due to different daytime recordings); there are cluttered scenes in both areas due to the presence of
trees and bushes; often, the human clothing includes colors that are very similar to the background; and there are other uncontrolled conditions. After tuning the two system modules, interesting human star skeleton results were achieved (shown in Section 3).
Figure 1: The overall framework of the system.
2 RELATED WORK
A key issue within the context of human action recognition is the automatic detection of the human star skeleton, typically consisting of a 5-point star (with the head, hands and legs as the human body extremes) and the body center of mass. In effect, several works have proposed and adopted methods based on human star skeletons for action recognition. Two-dimensional images were used in:
(Orrite-Urunuela et al., 2004) – where manual effort was used to define the skeleton points; and
(Chen et al., 2006) and (Fujiyoshi et al., 2004) – where preprocessing (e.g., segmentation, dilation) and zero-crossing methods were used to build the 5-point human stars.
More recently, three-dimensional video has also been modeled using similar human skeleton approaches (Yang and Tian, 2014; Vemulapalli et al., 2016). However, most of these human skeleton detection works were proposed to work in clean, controlled environments where: the camera is positioned close to the human subjects; the background is clean and has a distinctive color when compared with the humans; most environments are interior and thus immune to weather conditions; etc. While such 2D controlled environments facilitate the development of the computer vision system, they are less realistic and thus of less practical usage. Moreover, depth (3D) information lacks precision and/or requires very expensive equipment when human subjects are far away from the camera. In this position paper, we use a realistic 2D environment from our university campus that contains several challenges (detailed in Section 3.1).
As such, we tune and test several computer vision
methods under this semi-constrained video environ-
ment, resulting in a computational system that is ca-
pable of detecting human skeletons.
3 PROPOSED APPROACH
The task of finding the extreme points of a human silhouette in realistic environments is challenging due to self-occlusion, the articulated nature of the human body, missing depth information, and appearance variations due to camera viewpoints, illumination and loose clothing. In
this paper, and similarly to (Chen et al., 2006), we
assume that a human skeleton is developed by de-
tecting extreme points such as head, hands and feet.
These extreme points can be used for the recognition
of many human actions (e.g., walking). The over-
all computational system is based on two main mod-
ules (Figure 1): segmentation (including background
subtraction and shadow and highlight removal); and
star skeleton detection. All experiments were con-
ducted using the Matlab computational environment
(https://www.mathworks.com). In each module, we experimented with, adapted and compared several computer vision methods, as detailed in the next subsections.
3.1 Video Data
For this research, we adopt the video dataset that was presented in (Afsar et al., 2015b). To capture the data, two cameras (HIK Vision and IR Network) were installed. Both cameras were “hidden” behind a window glass (Figure 2) and set to capture real interior and exterior human walking campus areas (Figure 3). The whole dataset includes hundreds of small videos (from a few seconds to a few minutes each), corresponding to 32 GB, which include non-controlled captures of human (students, researchers and other staff) actions (e.g., walking, running, drinking coffee).
Figure 2: Camera placement for recording the dataset.
The dataset is challenging due to several uncontrolled factors: there is some reflection in the video (since the cameras were set behind glass); wide scenes are captured and some humans are far away from the camera, thus appearing with a low resolution; there are different weather conditions (e.g., sun, rain and wind) and varying illumination due to different daytime recordings; there are cluttered scenes due to the presence of trees and bushes (in both interior and exterior environments); and often, humans wear clothes with colors very similar to the background. Several of these elements can be seen in Figure 3.
Figure 3: Examples of the exterior (top frames) and interior (bottom frames) areas.
3.2 Segmentation
3.2.1 Background Subtraction
The video is processed frame by frame and blob analysis is performed to look for active blobs. In order to select only humans and to remove noise, we defined a minimum pixel area for blob selection, which was set to 2000 pixels (e.g., 45×45 or 60×34 bounding boxes) after some preliminary experiments.
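As an illustration, the blob selection step can be sketched in Matlab as follows (a minimal sketch, assuming a binary foreground mask fg produced by the segmentation module; variable names are illustrative):

% Keep only connected components with at least 2000 pixels (the
% minimum blob area defined above) and measure the surviving blobs.
fg = bwareaopen(fg, 2000);                  % remove small, noisy blobs
stats = regionprops(fg, 'Area', 'BoundingBox', 'Centroid');
boxes = cat(1, stats.BoundingBox);          % one bounding box per human blob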
For background subtraction, we first tested the Gaussian mixture model proposed in (Kaewtrakulpong and Bowden, 2002) for separating the foreground pixels from the background. This detector works on data collected by a stationary camera and compares a color or grayscale video frame to a background model to determine whether each pixel is part of the background or the foreground. It then computes a foreground mask based on Gaussian Mixture Models (GMM). This algorithm has the ability to adapt itself slowly to a changing environment but, for our dataset, the results obtained were not satisfactory. As shown in Figure 4, the algorithm was not able to detect the legs, even though there was a large space between them. Note that in Figures 4 and 5 we hide the human face with a red rectangle to preserve anonymity. Moreover, the GMM algorithm is computationally expensive.
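For illustration, a possible Matlab sketch of this GMM-based detection uses the Computer Vision Toolbox vision.ForegroundDetector object (the video file name and parameter values are illustrative assumptions, not the tuned settings of our experiments):

% Adaptive background mixture model (GMM) applied frame by frame.
detector = vision.ForegroundDetector('NumGaussians', 3, ...
                                     'NumTrainingFrames', 50);
reader = VideoReader('campus_camera1.avi'); % hypothetical input video
while hasFrame(reader)
    frame = readFrame(reader);
    mask = step(detector, frame);           % binary foreground mask
end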
We then tested a much simpler and lightweight approach, based on the absolute subtraction of the input and background images.
Figure 4: Segmentation results: first column, a), the original input frame; second column, b), the Gaussian mixture model result; and third column, c), the background subtraction result.
The background was updated using the same position of the bounding box obtained through blob detection, since both frames are of the same size. As an initial step, we subtracted the input image I from the background image B, both in the RGB color space:
$$M = |B - I| \quad (1)$$
The obtained mask M was converted to grayscale and finally to binary using Otsu's thresholding (Otsu, 1975). This method has the drawback that when the individual's clothing was light colored, the results were not good even after applying morphological operations. There was some information loss in the human silhouette, which generated several smaller blobs rather than a single blob per individual (Figure 5).
Figure 5: a) Original input image, b) result of absolute sub-
traction, c) binary image and d) result with shadow removal
and highlight detection.
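In Matlab, this simple pipeline can be sketched as follows (assuming the input frame I and the background image B are RGB images of the same size):

M = imabsdiff(I, B);                        % Eq. (1): per-channel |B - I|
Mgray = rgb2gray(M);                        % collapse the mask to grayscale
bw = imbinarize(Mgray, graythresh(Mgray));  % Otsu's threshold (Otsu, 1975)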
Due to light-colored clothing, illumination and distance from the camera, there is some information loss.
In the resultant image (in RGB space), we then applied the following threshold:

$$M'(x, y) = \begin{cases} 1 & \text{if } M(x, y) \geq \tau \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

where $M'$ is the final mask, $M(x, y)$ is the pixel value of the mask, and $\tau$ is a threshold. We tested several $\tau$ values (e.g., $\tau \in \{10, 20, 30\}$). The best results were achieved with $\tau = 30$ (as shown in Figure 6) and thus this was the selected threshold value.
Figure 6: a) Original image, b) mask for τ=20, c) mask for
τ=30 and d) grayscale mask for τ=30.
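A sketch of this thresholding step is shown below (the text leaves open whether the comparison is made per channel or on a combined mask; this sketch assumes a pixel is kept if any RGB channel reaches the threshold):

tau = 30;                                   % selected threshold (8-bit values)
Mprime = any(M >= tau, 3);                  % Eq. (2): logical OR over channels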
Then, we experimented with computing $M'$ using the HSV color space instead of RGB, since this is the color space used by the shadow and highlight removal method (Section 3.2.2). The results obtained are shown in Figure 7. The human silhouette still has some information loss.
Figure 7: a) Original image, b) HSV image, c) Value com-
ponent, d) binary image and e) result after shadow and high-
light removal.
3.2.2 Shadow Removal and Highlight Detection
We tested the method proposed in (Duque et al., 2005), where both images, the background B and the input image I, are transformed into the HSV color space. A shadow mask SM and a highlight mask LM are generated from these images, defining the areas where shadows or highlights are present. In these masks, each pixel has the value “1” if it is marked as shadow or highlight, and the value “0” otherwise. The following equations define the process of computing the shadow and highlight masks:
$$SM(x, y) = \begin{cases} 1 & \text{if } \alpha \leq \frac{I_V(x, y)}{B_V(x, y)} \leq \beta \;\wedge\; |I_S(x, y) - B_S(x, y)| \leq \tau_S \;\wedge\; |I_H(x, y) - B_H(x, y)| \leq \tau_H \\ 0 & \text{otherwise} \end{cases} \quad (3)$$

$$LM(x, y) = \begin{cases} 1 & \text{if } \frac{1}{\beta} \leq \frac{I_V(x, y)}{B_V(x, y)} \leq \frac{1}{\alpha} \;\wedge\; |I_S(x, y) - B_S(x, y)| \leq \tau_S \;\wedge\; |I_H(x, y) - B_H(x, y)| \leq \tau_H \\ 0 & \text{otherwise} \end{cases} \quad (4)$$
$I_H(x, y)$, $I_S(x, y)$ and $I_V(x, y)$ represent, respectively, the hue, saturation and value components at coordinate $(x, y)$ of the input image $I$. The same notation is applied to the background image $B$.
The α is the main parameter and its value depends on the light source and on the radiance and reflectance properties of the objects in the scene. Highly reflective surfaces, high intensity light sources or irradiant objects can have low α values. For our dataset, α varies from 0.60 to 0.90. If we decrease the value below 0.60, there is information loss. Similarly, if the value goes above 0.90, the final image contains noise. We tested the algorithm with different α values, within the range {0.65, 0.66, 0.67, ..., 0.90}, and the best results were achieved for α = 0.70. After setting α, we experimented with distinct values for β, which prevents misclassification and varies in the data from 0.75 to 0.79. Since this is a less relevant parameter, we experimented with distinct values but achieved the same results, thus this parameter was fixed to β = 0.75.
The parameters $\tau_S$ and $\tau_H$ are the maximal variations allowed for the saturation and hue components. We define $\tau_S$ as 15% of the digitizer's saturation range. The variation of hue should not surpass 60 degrees, a value obtained by dividing the hue range (360°) by the six colors (red, yellow, green, cyan, blue and magenta). The results obtained were satisfactory, as shown in Figures 5 d), 7 e) and 8.
Figure 8: Example of human silhouettes using background
subtraction combined with shadow and highlight removal.
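A possible Matlab sketch of Equations (3) and (4) is given below (assuming RGB images I and B; rgb2hsv returns H, S and V in [0, 1], so the thresholds are expressed as fractions of their ranges, and the circular nature of hue is ignored for simplicity):

Ihsv = rgb2hsv(I);  Bhsv = rgb2hsv(B);      % HSV components in [0, 1]
alpha = 0.70;  beta = 0.75;                 % tuned values reported above
tauS = 0.15;  tauH = 60/360;                % 15% of S range, 60 degrees of H
ratio = Ihsv(:,:,3) ./ max(Bhsv(:,:,3), eps);  % value (V) ratio
dS = abs(Ihsv(:,:,2) - Bhsv(:,:,2));        % saturation difference
dH = abs(Ihsv(:,:,1) - Bhsv(:,:,1));        % hue difference
SM = ratio >= alpha & ratio <= beta & dS <= tauS & dH <= tauH;     % Eq. (3)
LM = ratio >= 1/beta & ratio <= 1/alpha & dS <= tauS & dH <= tauH; % Eq. (4)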
3.3 Star Skeleton
The main idea of a star skeleton is to connect the extreme points (head, legs, hands) with the centroid (the body's center of mass). In this paper, the human contour is used as the main feature for the construction of a 5-point star skeleton. Depending on the posture of the human, the number of points in the skeleton can be 5 or fewer. These points can represent a human posture effectively and speed up the execution of the final human action detection. As an initial step, the
contour is extracted from the human silhouettes. To remove noise from the human contour, a convolution was applied, which removes noise and smooths the function. A value of 12 was used for the smoothing parameter in order to achieve the desired level of smoothing. To calculate the distance of the individual boundary points from the centroid, the Euclidean distance was used. The whole human contour was processed in clockwise order.
In a function, the extreme points (or local maxima) are the high peaks, i.e., the points where a zero crossing is detected when analyzing the smoothed distance differences. For the construction of the 5-point star skeleton, when the number of points is greater than 5, a threshold value of 40 was used for the boundary distance between those points, i.e., to detect which boundary points are close to each other. Then, the median of such points is used as the representative of the extreme human part. An example of this is shown in the third row of Figure 10, where two points are replaced by the median for the right hand. The last step is to connect all of the points to the centroid. Figure 9 depicts the overall procedure for the construction of the star skeleton. The points A, B, C, D, E and F represent the extreme points of the smoothed distance function. Since points A and F are close in terms of the boundary space, we take the median of these two points to represent the left leg when defining the star skeleton. More examples of the obtained star skeletons are shown in Figure 10.
Figure 9: Process flow for the construction of a star skele-
ton.
The full steps for the construction of the star skeleton algorithm are:

1. Calculate the centroid $(x_c, y_c)$ of the contour of the input image:

$$x_c = \frac{1}{N_b} \sum_{i=1}^{N_b} x_i, \qquad y_c = \frac{1}{N_b} \sum_{i=1}^{N_b} y_i \quad (5)$$

where $N_b$ is the number of boundary points and $(x_c, y_c)$ denotes the centroid of the input contour.

2. Determine the distance $d_i$ from each boundary point $(x_i, y_i)$ to the centroid $(x_c, y_c)$:

$$d_i = \sqrt{(x_i - x_c)^2 + (y_i - y_c)^2} \quad (6)$$

3. To remove noise or unwanted peaks, smooth the distance function using convolution.

4. Find the local maxima by detecting zero crossings of the distance function differences $(d_{i+1} - d_i)$.
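These four steps can be sketched in Matlab as follows (a minimal sketch, assuming a clean binary silhouette bw; the smoothing window size of 12 follows the value reported above, and the element-wise pts - c requires Matlab R2016b or later):

bndry = bwboundaries(bw);                   % trace the silhouette contour
pts = bndry{1};                             % Nb x 2 boundary points
c = mean(pts, 1);                           % step 1: centroid, Eq. (5)
d = sqrt(sum((pts - c).^2, 2));             % step 2: distances di, Eq. (6)
ds = conv(d, ones(12, 1)/12, 'same');       % step 3: smoothing by convolution
dd = diff(ds);                              % slope of the distance function
imax = find(dd(1:end-1) > 0 & dd(2:end) <= 0);  % step 4: zero crossings
extremes = pts(imax + 1, :);                % candidate head/hand/foot points

Following the procedure described above, extreme points whose boundary distance falls below the threshold of 40 would then be merged via their median before connecting all points to the centroid.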
For a comparative analysis, we compared the simpler zero-crossing method with a thinning method (based on the Matlab “bwmorph” function). Figure 10 shows some results obtained by our star skeleton method and by “bwmorph”. It is clear that our approach performs better at estimating the posture of the human when compared with the “bwmorph” function. Moreover, our approach requires less computation than the thinning method.
Figure 10: Comparative analysis of the star skeleton with the “bwmorph” function: a) binary image; b) modified bwmorph result; c) our skeleton algorithm.
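The thinning baseline used in this comparison can be reproduced directly from the binary silhouette with Matlab's bwmorph function:

skel = bwmorph(bw, 'thin', Inf);            % iterative thinning to a skeleton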
4 CONCLUSIONS
In this paper, we propose a computational system for the passive detection of a 5-point star human skeleton from raw video. The system includes two main modules, segmentation and star skeleton detection, and it was adjusted and evaluated using a semi-constrained realistic environment. The realistic videos correspond to interior and exterior human walking areas with varying illumination, clutter, and other uncontrolled conditions (e.g., weather). Several computer vision methods were explored for the segmentation and star skeleton modules. The best results were achieved using the simpler approaches: background subtraction with shadow and highlight removal using the HSV color space; and a smoothed Euclidean distance to the centroid with zero crossings of the distance differences to detect the human extremes. In future work, we intend to use motion and memory to estimate the position of human parts (e.g., hands) that might be temporarily hidden. Also, we plan to create motion skeleton features (e.g., velocity, acceleration) in order to train a machine learning classifier such that it can learn to detect human actions (e.g., walking, making a cellular call).
ACKNOWLEDGEMENTS
This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT - Fundação para a Ciência e Tecnologia within the Project Scope: UID/CEC/00319/2013 and research grant FCT SFRH/BD/84939/2012.
REFERENCES
Afsar, P., Cortez, P., and Santos, H. (2015a). Automatic human action recognition from video using hidden Markov model. In Computational Science and Engineering (CSE), 2015 IEEE 18th International Conference on, pages 105–109. IEEE.
Afsar, P., Cortez, P., and Santos, H. (2015b). Automatic
visual detection of human behavior: A review from
2000 to 2014. Expert Systems with Applications,
42(20):6935–6956.
Chen, H.-S., Chen, H.-T., Chen, Y.-W., and Lee, S.-Y.
(2006). Human action recognition using star skele-
ton. In Proceedings of the 4th ACM international
workshop on Video surveillance and sensor networks,
pages 171–178. ACM.
Cortez, P., Matos, L. M., Pereira, P. J., Santos, N., and
Duque, D. (2016). Forecasting store foot traffic us-
ing facial recognition, time series and support vector
machines. In Soft Computing Models in Industrial and
Environmental Applications (SOCO), Advances in In-
telligent and Soft Computing (AISC) Vol. 527, pages
267–276, San Sebastian, Spain. Springer.
Duque, D., Santos, H., and Cortez, P. (2005). Moving object
detection unaffected by cast shadows, highlights and
ghosts. In IEEE International Conference on Image
Processing 2005, volume 3, pages III–413. IEEE.
Enzweiler, M. and Gavrila, D. M. (2009). Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2179–2195.
Fujiyoshi, H., Lipton, A. J., and Kanade, T. (2004). Real-time human motion analysis by image skeletonization. IEICE Transactions on Information and Systems, 87(1):113–120.
Geronimo, D., Lopez, A. M., Sappa, A. D., and Graf, T. (2010). Survey of pedestrian detection for advanced driver assistance systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7):1239–1258.
Kaewtrakulpong, P. and Bowden, R. (2002). An im-
proved adaptive background mixture model for real-
time tracking with shadow detection. In Video-based
surveillance systems, pages 135–144. Springer.
Orrite-Urunuela, C., del Rincon, J. M., Herrero-Jaraba, J. E., and Rogez, G. (2004). 2D silhouette and 3D skeletal models for human detection and tracking. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 4, pages 244–247. IEEE.
Otsu, N. (1975). A threshold selection method from gray-
level histograms. Automatica, 11(285-296):23–27.
Poppe, R. (2007). Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1):4–18.
Turaga, P., Chellappa, R., Subrahmanian, V. S., and Udrea,
O. (2008). Machine recognition of human activities:
A survey. IEEE Transactions on Circuits and Systems
for Video Technology, 18(11):1473–1488.
Vemulapalli, R., Arrate, F., and Chellappa, R. (2016). R3DG features: Relative 3D geometry-based skeletal representations for human action recognition. Computer Vision and Image Understanding.
Yang, X. and Tian, Y. (2014). Effective 3D action recognition using EigenJoints. Journal of Visual Communication and Image Representation, 25(1):2–11.