AN ACTION-TUNED NEURAL NETWORK ARCHITECTURE
FOR HAND POSE ESTIMATION
Giovanni Tessitore, Francesco Donnarumma and Roberto Prevete
Department of Physical Sciences, University of Naples Federico II, Naples, Italy
Keywords:
Neural networks, Grasping action, Hand pose estimation, Mixture density networks.
Abstract:
There is a growing interest in developing computational models of grasping action recognition. This interest
is increasingly motivated by a wide range of applications in robotics, neuroscience, HCI, motion capture and
other research areas. In many cases, a vision-based approach to grasping action recognition appears particularly
promising. For example, in HCI and robotic applications, such an approach often allows for simpler and more
natural interaction. However, a vision-based approach to grasping action recognition is a challenging problem
due to the large number of hand self-occlusions which make the mapping from hand visual appearance to
the hand pose an inverse ill-posed problem. The approach proposed here builds on the work of Santello
and co-workers, which demonstrates a reduction in hand variability within a given class of grasping actions.
The proposed neural network architecture introduces specialized modules for each class of grasping actions
and viewpoints, allowing for a more robust hand pose estimation. A quantitative analysis of the proposed
architecture obtained by working on a synthetic data set is presented and discussed as a basis for further work.
1 INTRODUCTION
Over the last few years, there has been a keen in-
terest in developing computational models for action
recognition. Notably, grasping actions are of par-
ticular interest for various research areas including
robotics, neuroscience, motion capture, telemanipu-
lation, and human–computer interaction (HCI). Sev-
eral works have been proposed in the literature which
address the problem of recognizing grasping actions
(Palm et al., 2009; Ju et al., 2008; Aleotti and Caselli,
2006). More specifically, since the direct use of the hand
as an input source is attractive, most of these
works make use of wired gloves to acquire the input
data. Moreover, the only technology
that currently satisfies the advanced requirements of
hand-based input for HCI is glove-based sensing. For
example, recognition can be based on wired glove
kinematic information (Ju et al., 2008), or hybrid ap-
proaches in which glove information is put together
with tactile sensors information (Palm et al., 2009;
Keni et al., 2003; Aleotti and Caselli, 2006).
This technology has several drawbacks including
the fact that it hinders the natural user interactions
with the computer-controlled environment. Further-
more, it requires time-consuming calibration and se-
tup procedures.
In contrast with this, vision-based approaches
have the potential to provide more natural, non-
contact solutions, allowing for simpler and more natural
interactions between the user and the computer-controlled
environment in HCI, as well as in robotics, where
grasping action recognition is mainly framed in the
context of programming by demonstration (PbD).
There are approaches which make use of ad hoc
solutions, like markers, in order to simplify this task
(Chang et al., 2007). In general, as reported in (Wein-
land et al., 2010; Poppe, 2007), markerless vision-based
action recognition is acknowledged to be a challeng-
ing task. In particular, a major problem with grasping
actions is the occlusion problem: hand-pose estima-
tion from an acquired image can be extremely hard
because of possible occlusions among fingers or be-
tween fingers and object being grasped.
For this reason, a body model approach seems
more appropriate in this context. A body model ap-
proach usually consists of two steps: in a first step
one estimates a 3D model of the human body (in the
case of grasping actions this step coincides with the
estimation of hand pose), in a second step recognition
is made on the basis of joint trajectories.
Vision-based hand pose estimation is itself a chal-
lenging problem (see (Erol et al., 2007) for a review).
More specifically, addressing such a problem without
any constraints on the hand pose makes the mapping
from visual hand appearance to hand configuration
very difficult to estimate. In this work, we present
a neural network architecture composed of a series of
specialized modules, each one implementing a map-
ping from visual hand appearance to hand configura-
tion only when the hand belongs to a particular grasp-
ing action and is observed from a specific viewpoint.
The paper is organized as follows: in Section 2,
we present the functional architecture and its actual
implementation by means of Mixture Density Net-
works. In Section 3, we present the experimental set-up
in which the tests of Sections 4 and 5 are carried out.
Section 6 is devoted to conclusions and future work.
2 MODEL ARCHITECTURE AND
IMPLEMENTATION
The main idea behind the present approach is to de-
velop a set of specialized functional mappings tuned
to predefined classes of grasping actions, and to use
these specialized mappings to estimate hand poses of
visually presented grasping actions. This approach is
based on the assumption that grasping actions can be
subdivided into different classes of grasping actions,
and that coordinated movements of hand fingers re-
sult, during grasping actions, in a reduced number of
physically possible hand shapes (see (Santello et al.,
2002; Prevete et al., 2008)).
Moreover, view-independent recognition is obtained by developing view-dependent functional mappings,
and by combining them in an appropriate way. Thus, the system is essentially based on a set of
specialized functional mappings and a selection mechanism.
The specialized mapping functions perform a map-
ping from visual image features to likely hand pose
candidates represented in terms of joint angles. Each
functional mapping is tuned to a predefined viewpoint
and a predefined class of grasping actions. The selec-
tion mechanism selects an element of the set of can-
didate hand poses. In the next two subsections we
will first provide an overall functional description of
the system, followed by a detailed description of the
system neural network structure.
2.1 Functional Model Architecture
The system proposed here can functionally be subdi-
vided into three different modules: Feature Extraction
(FE) module, Candidate Hand Configuration (CHC)
module, and Hand Configuration Selection (HCS)
module. The whole functional architecture of the sys-
tem is shown in Figure 1. The functional roles of each
module can be described as follows:
FE module. This module receives an R × S gray-level image as input. The output of the module is a
1 × D vector x. The FE module implements a PCA linear dimensional reduction.
CHC module. The CHC module receives the output x of FE as input. This module is composed of a
bank of N × M sub-modules CHC_ij, with i = 1, 2, ..., N and j = 1, 2, ..., M. N is the number of predefined
classes of grasping actions, and M is the number of predefined viewpoints. Given the input x, each
sub-module CHC_ij provides the most likely hand configuration, in terms of a 1 × N_dof vector t_ij.
Moreover, each t_ij is associated with an estimation error err_ij, assuming that the input x is obtained
during the observation of a grasping action belonging to the i-th class from the j-th viewpoint. Thus each
sub-module CHC_ij, for the i-th grasping action and j-th viewpoint, performs a specialized mapping from
visual image features to hand poses. The basic functional unit of each sub-module CHC_ij is an
inverse-forward model pair. The inverse model extracts the most likely hand configuration t_ij, given x,
while the forward model gives as output an image feature vector x_ij, given t_ij. The error err_ij is
computed on the basis of x and x_ij.
HCS module. The HCS module receives the output of CHC as input. It extracts a hand pose estimation
on the basis of the estimation errors err_ij, by selecting the hand pose associated with the minimum error
value.
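To make the interplay among the three modules concrete, the following Python sketch (not taken from the
paper; the dictionaries of callables keyed by action class and viewpoint are hypothetical placeholders)
shows how a bank of CHC_ij inverse-forward pairs and the HCS minimum-error selection could be wired
together:

```python
import numpy as np

def estimate_hand_pose(x, inverse_models, forward_models):
    """Sketch of the CHC/HCS interplay: each CHC_ij proposes a candidate pose
    t_ij and re-predicts the features x_ij; HCS keeps the pose with the
    smallest reconstruction error err_ij."""
    best_pose, best_key, best_err = None, None, np.inf
    for key, inverse in inverse_models.items():
        t_ij = inverse(x)                        # candidate pose t_ij
        x_ij = forward_models[key](t_ij)         # expected feature vector x_ij
        err_ij = float(np.sum((x - x_ij) ** 2))  # sum-of-squares error err_ij
        if err_ij < best_err:
            best_pose, best_key, best_err = t_ij, key, err_ij
    return best_pose, best_key, best_err
```

In the actual architecture each inverse callable corresponds to the MDN and each forward callable to the
multi-layer perceptron described in Section 2.2.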
Figure 1: The system is functionally composed of three dif-
ferent modules: Feature Extraction (FE) module, Candi-
date Hand Configuration (CHC) module, and Hand Config-
uration Selection (HCS) module.
2.2 Neural Network Implementation
The FE module implements a PCA linear dimen-
sional reduction by means of an autoassociative neu-
ral network (Bishop, 1995). This is a multilayer per-
ceptron composed of R × S input nodes, D hidden
nodes, and R×S output nodes. Both hidden nodes and
output nodes have linear activation and output func-
tion. The input vectors are obtained by linearizing
the gray-level input images of size R× S into single
vectors of size 1 × R · S. The network is trained to
associate input vectors with themselves by a standard
back-propagation algorithm. Once trained, the vec-
tors x are obtained as the output of the hidden units.
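As an illustration of this step, here is a minimal NumPy sketch of a linear autoassociative network (a
linear autoencoder) trained to reproduce its input; the learning rate, epoch count and initialization are
illustrative assumptions rather than the authors' settings:

```python
import numpy as np

def train_linear_autoencoder(X, D, lr=1e-3, epochs=500, seed=0):
    """X: (n_images, R*S) matrix of linearized gray-level images (ideally
    rescaled / zero-mean). The D linear hidden units span the same subspace
    as the first D principal components, hence a PCA-like reduction."""
    rng = np.random.default_rng(seed)
    n, d_in = X.shape
    W_enc = rng.normal(scale=0.01, size=(d_in, D))
    W_dec = rng.normal(scale=0.01, size=(D, d_in))
    for _ in range(epochs):
        H = X @ W_enc                     # hidden activations = feature vectors x
        G = 2.0 * (H @ W_dec - X) / n     # gradient of the mean reconstruction error
        grad_dec = H.T @ G
        grad_enc = X.T @ (G @ W_dec.T)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec

# Once trained, the 1 x D feature vector of a new image is:
# x = image.reshape(1, -1) @ W_enc
```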
As the CHC module is solely composed of CHC_ij modules, we focus now on a generic CHC_ij module,
and describe how it can be developed by means of a neural network. As described above, a CHC_ij
module is composed of an inverse-forward model pair. Let us first consider the inverse model. This
provides a hand configuration t, given a visual image feature vector x. A major mathematical problem
arising in this context concerns the ill-posed character of the required transformation from x to hand
configurations, insofar as the same visually presented hand can be associated with various hand
configurations. Therefore the mapping from x to hand configurations t is not a functional mapping, and
assumes the form of an inverse ill-posed problem (Friston, 2005; Kilner et al., 2007). According to
(Bishop, 1995), one can cope with this problem by estimating p(t|x) in terms of a Mixture Density
Network (MDN) approach:

p(t|x) = \sum_{i=1}^{K} c_i(x) \, \phi_i(t|x).

The φ_i(t|x) are kernel functions, which are usually Gaussian functions of the form:

\phi_i(t|x) = \frac{1}{(2\pi)^{D/2} \sigma_i^{D}(x)} \exp\left\{ -\frac{\| t - \mu_i(x) \|^2}{2 \sigma_i^{2}(x)} \right\}.
The parameters c_i(x) can be regarded as prior probabilities of t being generated from the i-th component
of the mixture. The coefficients of the mixture, c_i(x), and the parameters of the kernel functions φ_i(t|x)
(µ_i(x) and σ_i(x) for a Gaussian kernel), depend on the sensory input x. A two-layer, feed-forward
neural network can be used to model the relationship between the visual inputs x and the corresponding
mixture parameters c_i(x), µ_i(x) and σ_i(x). Accordingly, the problem of estimating the conditional
probability distribution p(t|x) can be approached in terms of neural networks by combining a multi-layer
perceptron and a Radial Basis Function (RBF)-like network. The RBF network is composed of N_dof
input nodes, K hidden nodes and one output node. The form of the basis functions is the same as the
Gaussian functions expressed above, with the Gaussian parameters of the first layer set to µ_i(x) and
σ_i(x), and the second-layer weights set to c_i(x). Thus, given a previously unseen hand visual
description x, one can obtain an estimate of the hand configuration t as the central value of the most
probable branch of p(t|x).
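The following NumPy sketch illustrates the standard MDN parameterization described above (a softmax
for the mixing coefficients, an exponential to keep the kernel widths positive); it is a hedged illustration of
the formulas, not the authors' implementation:

```python
import numpy as np

def mdn_parameters(z, K, n_dof):
    """Split the raw MLP outputs z (length K + K + K*n_dof) into mixture
    parameters: softmax for the priors c_i(x), exp for the widths sigma_i(x),
    identity for the centres mu_i(x) -- the usual Bishop-style mapping."""
    c = np.exp(z[:K] - np.max(z[:K]))
    c /= c.sum()
    sigma = np.exp(z[K:2 * K])
    mu = z[2 * K:].reshape(K, n_dof)
    return c, sigma, mu

def gaussian_kernel(t, mu_i, sigma_i):
    """Isotropic Gaussian phi_i(t|x); d is the dimensionality of the pose vector t."""
    d = t.shape[-1]
    norm = (2.0 * np.pi) ** (d / 2.0) * sigma_i ** d
    return float(np.exp(-np.sum((t - mu_i) ** 2) / (2.0 * sigma_i ** 2)) / norm)

def mixture_density(t, c, sigma, mu):
    """p(t|x) = sum_i c_i(x) * phi_i(t|x) for a single input x."""
    return sum(c_i * gaussian_kernel(t, mu_i, s_i)
               for c_i, mu_i, s_i in zip(c, mu, sigma))
```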
The network was trained using a dataset composed of visual feature vectors x_n and hand configurations
t_n collected during the observation of grasping actions belonging to the C_i-th class from the j-th
viewpoint, and by using the negative log-likelihood

E = -\sum_{n} \ln \left\{ \sum_{j} c_j(x_n) \, \phi_j(t_n | x_n) \right\}

as error function. Once trained, the network output, i.e., the candidate hand configuration t_ij, is obtained
as the vector µ_h(x) associated with the highest c_h(x) value. This operation is achieved by the module
Extract best candidate.
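A compact sketch of this error function and of the Extract best candidate step is given below; it assumes
that the mixture parameters for each training pattern are already available (for instance from the helpers
sketched above) and only mirrors the equations, not the authors' training code:

```python
import numpy as np

def mdn_error(ts, mixture_params, density_fn):
    """E = -sum_n ln( sum_j c_j(x_n) phi_j(t_n|x_n) ). mixture_params[n] holds
    (c, sigma, mu) for pattern n; density_fn evaluates the mixture p(t|x)."""
    E = 0.0
    for t_n, (c, sigma, mu) in zip(ts, mixture_params):
        E -= np.log(density_fn(t_n, c, sigma, mu) + 1e-300)  # guard against log(0)
    return E

def extract_best_candidate(c, mu):
    """Return mu_h(x) for the component h with the highest coefficient c_h(x)."""
    return mu[int(np.argmax(c))]
```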
The design of the forward model involves a multi-layer perceptron composed of N_dof input and D output
units. The neural network receives a hand configuration t as input, and computes an expected image
feature vector x_ij as output. The vector x_ij is the expected image feature vector corresponding to the
hand configuration t when the hand is observed during a grasping action belonging to the C_i-th class
from the j-th viewpoint. The network was trained using the RProp learning algorithm, again exploiting a
dataset composed of hand configurations and visual feature vectors collected during the observation of
grasping actions belonging to the C_i-th class from the
j-th viewpoint.
The error err_ij associated with the candidate hand configuration t_ij is computed as the sum-of-squares
error between the vectors x and x_ij. Finally, given the set of candidate hand configurations t_ij and
associated errors err_ij, with i = 1, 2, ..., N and j = 1, 2, ..., M, the candidate hand configuration with the
lowest associated error is identified as the output of the whole
system.
3 EXPERIMENTAL SETTING
We performed two main types of experiments to assess
the performance of the proposed architecture. In
the first type of experiments we used different action
classes (DA-TEST); in the second type we used
different viewpoints (DV-TEST).
A major problem in testing the performance of a
hand pose estimation system is to obtain a ground
truth. In fact, in the case of a quantitative analysis,
one needs to know the actual hand configuration cor-
responding to the hand picture which is fed as input to
the system. This is difficult to achieve with real data.
For this reason, we decided to work with a synthetic
dataset constructed by means of a dataglove and a 3D
rendering software package. The dataglove used for these
experiments is the HumanGlove (Humanware S.r.l., Pontedera (Pisa), Italy), endowed with 16
sensors. This dataglove feeds data into the 3D rendering
software, which reads sensor values and constantly
updates a 3D human hand model. Thus, this
experimental setting enables us to collect hand joint
configuration-hand image pairs.
In the DA-TEST two different types of grasps
were used in accordance with the treatment in
(Napier, 1956): precision-grasp (PG) and power-
grasp, the latter being also known as whole hand
grasp (WH). In performing a power grasp, the ob-
ject is held in a clamp formed by fingers and palm;
in performing a precision grasp, the object is pinched
between the flexor aspects of the fingers and the op-
posing thumb. Two different objects were used: a
tennis ball was used for the WH actions, and a pen
cup for the PG actions. We collected 20 actions for
each class of actions. In the DV-TEST we rendered
the 3D hand model of the PG grasping actions data,
from 9 different viewpoints as reported in Table 2. In
both DA-TEST and DV-TEST, the capability of the
proposed architecture in recovering a hand pose was
measured in terms of Euclidean distance between the
actual hand pose t and the estimated hand pose \hat{t}, that is,

E_{NORM} = \| t - \hat{t} \|.

This measure was reported in other works such as (Romero et al., 2009). However, due to the differences
in the experimental settings it is difficult to make a clear comparison between results. For this reason, we
decided to compute a further term, the Root-Mean-Square (RMS) error, in order to obtain a more
complete interpretation of our results. The RMS error over a set of actual hand poses t_n and estimated
hand poses \hat{t}_n is computed as:

E_{RMS} = \frac{\sum_{n=1}^{N} \| \hat{t}_n - t_n \|^2}{\sum_{n=1}^{N} \| t_n - \bar{t} \|^2},

where \bar{t} is the average of the actual hand pose vectors, that is, \bar{t} = (1/N) \sum_{n=1}^{N} t_n.
In this way the RMS error approaches 1 when the model simply predicts the mean value of the test data,
and approaches 0 when the model's prediction captures the actual hand poses. Thus, we expect a good
performance when the RMS error of our
system is very close to zero.
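For concreteness, these two measures can be computed as in the following NumPy sketch (pose vectors
stored row-wise; an illustrative reading of the formulas above):

```python
import numpy as np

def e_norm(t, t_hat):
    """E_NORM = ||t - t_hat||: Euclidean distance between actual and estimated pose."""
    return float(np.linalg.norm(np.asarray(t, float) - np.asarray(t_hat, float)))

def e_rms(T, T_hat):
    """Normalized RMS error over a test set: rows of T are actual poses t_n,
    rows of T_hat the estimates. Close to 1 if the model merely predicts the
    mean pose, close to 0 if the predictions capture the actual poses."""
    T, T_hat = np.asarray(T, float), np.asarray(T_hat, float)
    t_bar = T.mean(axis=0)
    return float(np.sum((T_hat - T) ** 2) / np.sum((T - t_bar) ** 2))
```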
Moreover, we have computed the “selection error”
(E_SEL), which measures how many times the input image belonging to the i-th action class and j-th
viewpoint does not result in the lowest error for the module CHC_ij. The E_SEL error is expressed as the
percentage of frames (belonging to action class i and viewpoint j) which do not give rise to the lowest
error for the module CHC_ij.
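A minimal sketch of this selection-error computation, assuming one selected (action class, viewpoint) key
per frame as produced by the selection step:

```python
def e_sel(selected_modules, true_module):
    """Percentage of frames from a given action class i and viewpoint j for which
    the module with the lowest error was NOT CHC_ij (i.e., true_module)."""
    wrong = sum(1 for m in selected_modules if m != true_module)
    return 100.0 * wrong / len(selected_modules)

# Example: frames of class PG seen from viewpoint 0 should all select ("PG", 0)
# e_sel([("PG", 0), ("PG", 0), ("WH", 0)], ("PG", 0))  # -> 33.3...
```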
The architecture of the model used for these tests is composed of a number of CHC_ij modules depending
on the number of different action classes and viewpoints. For each CHC_ij module the MDN component,
implementing the inverse model, was trained using different values of the number H of hidden units and
of the number K of kernels. The feed-forward neural network (FNN), implementing the forward model,
was instead trained with different values of the number L of hidden units. For both the MDN and the
FNN, only the configuration was retained which gave rise to the highest likelihood, for the MDN, and to
the lowest error for the FNN, computed
on a validation set.
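A grid-search sketch of this model-selection step follows; the train_mdn and validation_likelihood
callables are hypothetical placeholders for the training and validation routines:

```python
import itertools

def select_mdn_configuration(train_mdn, validation_likelihood,
                             hidden_grid=(5, 7, 9), kernel_grid=range(3, 11)):
    """Try every (H, K) pair from the grids of Table 1 and keep the MDN that
    reaches the highest likelihood on the validation set."""
    best_cfg, best_model, best_score = None, None, float("-inf")
    for H, K in itertools.product(hidden_grid, kernel_grid):
        model = train_mdn(H, K)                  # train on the training actions
        score = validation_likelihood(model)     # evaluate on the validation actions
        if score > best_score:
            best_cfg, best_model, best_score = (H, K), model, score
    return best_cfg, best_model
```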
Table 1 summarizes the parameters used for both
DA-TEST and DV-TEST.
Table 1: Parameters used for both DA-TEST and DV-TEST.

                              DA-TEST                 DV-TEST
  Action classes (N)          2                       1
  Viewpoints (M)              1                       9
  Hidden nodes (H)            from 5 to 10, step 2
  Kernels (K)                 from 3 to 10, step 1
  Hidden nodes (L)            from 5 to 10, step 2
  Num inputs (D)              30
  Input image dim (R, S)      135 × 98
Table 2: Sample input images from the 9 different viewpoints used for DV-TEST (viewpoints at -80°,
-60°, -40°, -20°, 0°, 20°, 40°, 60°, and 80°).
4 DA-TEST RESULTS
In this test we used 20 PG grasping actions and
20 WH grasping actions. The architecture consists
of two CHC modules. Each of these modules was
trained on 5 of the corresponding actions (PG or WH),
whereas 5 actions were used for validation and
the remaining 10 actions for testing.
In Table 3 the mean and the standard deviation of the E_NORM error over all hand poses contained in the
10 test actions are reported. Moreover, the RMS error is reported there, insofar as it provides more
meaningful error information. Finally, we report the selection error E_SEL.
As one can see from the E_SEL error, the system is almost always able to retrieve the correct CHC
module for processing the current input image. Moreover, the system gives a good hand pose estimation,
since the RMS error is close to zero for both PG actions and WH actions. The E_NORM error is also
reported for comparison with other works. In Table 4, sample input images together with their hand
estimations(1) and the corresponding E_NORM errors are reported. One can see that if the E_NORM
error is close to the mean error (reported in Table 3), then the input and estimated hand images are very
similar.

(1) The picture of the estimated hand configuration is obtained by means of the 3D simulator.
Table 3: RMS error, mean and standard deviation of the E_NORM error for test actions belonging to
classes PG and WH, together with the E_SEL error.

                        PG              WH
  RMS                   0.24            0.12
  E_NORM (µ ± σ)        20.1 ± 27       11.7 ± 19
  E_SEL                 3%              9%
Table 4: Sample input images together with the corresponding estimated hand images drawn from both
PG and WH test actions.

  Action    E_NORM error
  PG        118.3
  PG        19.7
  WH        74.9
  WH        12.3
5 DV-TEST RESULTS
The DV-TEST is divided into two phases. In a first
phase, we test the system on the same viewpoints on which it was trained. Note that although the test
viewpoints are the same as the training viewpoints, when an image frame of a test action is fed as input,
the system does not “know” which viewpoint that frame corresponds to, and must retrieve such information
from the input. In a second phase, we test the system on viewpoints different from the ones it was
trained on.
In the first phase, we used 20 PG grasping actions
rendered from 9 different viewpoints. The model ar-
chitecture consists of 9 CHC_ij modules, one for each viewpoint, with i = 1 and j = 1, ..., 9. The forward
and inverse models of the CHC_ij module were trained
on data related to the j-th viewpoint only. In partic-
ular, 5 actions were used for training and 5 other ac-
tions for validation. Once trained, the remaining 10
actions for each viewpoint were fed as input to the
system.
Table 5 shows the selection error E_SEL together with the two estimation errors, E_NORM and RMS, for
the DV-TEST. One can see that the system is able to retrieve the right viewpoint for the hand input image
(E_SEL is almost 0 for all viewpoints) and is able to give a reasonable hand pose estimation, as confirmed
by the RMS error, which is close enough to zero.
Table 5: Selection error E_SEL, estimation error E_NORM, and RMS error for the test viewpoints used in
the DV-TEST first phase.

  viewpoint        0°            20°            40°
  RMS error        0.02          0.04           0.2
  E_NORM (µ ± σ)   9.1 ± 9       12.4 ± 15      25.2 ± 38
  E_SEL            0%            0%             1%

  viewpoint        60°           80°            -20°
  RMS error        0.08          0.13           0.06
  E_NORM (µ ± σ)   15.9 ± 24     21.2 ± 30      14.9 ± 21
  E_SEL            0%            1%             0%

  viewpoint        -40°          -60°           -80°
  RMS error        0.19          0.08           0.03
  E_NORM (µ ± σ)   23.6 ± 37     16.5 ± 23      12.1 ± 14
  E_SEL            0%            0%             0%
In the second phase of the DV-TEST we used five CHC modules only, corresponding to the viewpoints
at 0, 40, 80, -40, and -80 degrees. The system was tested on the remaining viewpoints at 20, 60, -20,
and -60 degrees.
Table 6 reports the errors in recovering the hand pose from viewpoints that the system was not trained on:
only for the viewpoint at -20 degrees is the error acceptable; in the other cases, the error is high.
Table 6: Selection error E_SEL together with estimation error E_NORM and RMS error for all test
viewpoints used in the DV-TEST second phase.

  viewpoint        20°            60°
  RMS error        1.69           2.22
  E_NORM (µ ± σ)   119 ± 58       137.5 ± 65

  viewpoint        -20°           -60°
  RMS error        0.42           1.47
  E_NORM (µ ± σ)   59.8 ± 28      114.5 ± 46
6 CONCLUSIONS
The neural architecture described in this paper was
deployed to address the problem of vision-based hand
pose estimation during the execution of grasping ac-
tions observed from different viewpoints. As stated
in the introduction, vision-based hand pose estima-
tion is, in general, a challenging problem due to
the large number of self-occlusions between fingers
which makes this problem an inverse ill-posed prob-
lem. Even though the number of degrees of free-
dom is quite large, it has been shown (Santello et al.,
2002) that a hand, during a grasping action, can effec-
tively assume a reduced number of hand shapes. For
this reason it is reasonable to conjecture that vision-
based hand pose estimation becomes simpler if one
“knows” which kind of action is going to be executed.
This is the main rationale behind our system. The re-
sults of the DA-TEST show that this system is able
to give a good estimation of hand pose in the case of
different grasping actions.
In the first phase of the DV-TEST comparable re-
sults with respect to the DA-TEST have been ob-
tained. It must be emphasized, moreover, that al-
though the system has been tested on the same view-
points it was trained on, the system does not know in
advance which viewpoint a frame drawn from a test
action belongs to.
In the second phase of the DV-TEST an acceptable error was obtained from one viewpoint only. This
negative outcome is likely to depend on the excessively large angular separation between consecutive
training viewpoints. Thus a more precise investiga-
tion must be performed with a more comprehensive
set of viewpoints. A linear combination of the out-
puts of the CHC modules, on the basis of the produced
errors, can be investigated too. Furthermore, the FE
module can be replaced with more sophisticated mod-
ules, in order to extract more significant features such
as Histograms of Oriented Gradients (HOGs) (Dalal
and Triggs, 2005). A comparison with other ap-
proaches must be performed. In this regard, however, the lack of benchmark datasets makes meaningful
comparisons between different systems difficult to
produce. Finally, an extension of this model might
profitably take into account graspable object proper-
ties (Prevete et al., 2010) in addition to hand visual
features.
ACKNOWLEDGEMENTS
This work was partly supported by the project Dexmart (contract n. ICT-216293) funded by the EC under
the VII Framework Programme, by the Italian Ministry of University (MIUR), grant n. 2007MNH7K2
003, and by the project Action Representations and their Impairment (2010-2012) funded by Fondazione
San Paolo (Torino) under the Neuroscience Programme.
REFERENCES
Aleotti, J. and Caselli, S. (2006). Grasp recognition in vir-
tual reality for robot pregrasp planning by demonstra-
tion. In ICRA 2006, pages 2801–2806.
Bishop, C. M. (1995). Neural Networks for Pattern Recog-
nition. Oxford University Press.
Chang, L. Y., Pollard, N., Mitchell, T., and Xing, E. P.
(2007). Feature selection for grasp recognition from
optical markers. In IROS 2007, pages 2944 – 2950.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-
dients for human detection. In CVPR’05 - Volume 1,
pages 886–893, Washington, DC, USA. IEEE Com-
puter Society.
Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., and
Twombly, X. (2007). Vision-based hand pose estima-
tion: A review. Computer Vision and Image Under-
standing, 108(1-2):52–73.
Friston, K. (2005). A theory of cortical responses. Philos
Trans R Soc Lond B Biol Sci, 360(1456):815–836.
Ju, Z., Liu, H., Zhu, X., and Xiong, Y. (2008). Dynamic
grasp recognition using time clustering, gaussian mix-
ture models and hidden markov models. In ICIRA ’08,
pages 669–678, Berlin, Heidelberg. Springer-Verlag.
Keni, B., Koichi, O., Katsushi, I., and Ruediger, D. (2003).
A hidden markov model based sensor fusion ap-
proach for recognizing continuous human grasping
sequences. In Third IEEE Int. Conf. on Humanoid
Robots.
Kilner, J. M., Friston, K. J., and Frith, C. D. (2007). Predictive coding: an account of the mirror
neuron system. Cognitive Processing, 8(3):159–166.
Napier, J. R. (1956). The prehensile movements of the hu-
man hand. The Journal of Bone and Joint Surgery,
38B:902–913.
Palm, R., Iliev, B., and Kadmiry, B. (2009). Recognition of
human grasps by time-clustering and fuzzy modeling.
Robot. Auton. Syst., 57(5):484–495.
Poppe, R. (2007). Vision-based human motion analysis: An overview. Computer Vision and Image
Understanding, 108(1-2):4–18. Special Issue on Vision for Human-Computer Interaction.
Prevete, R., Tessitore, G., Catanzariti, E., and Tamburrini,
G. (2010). Perceiving affordances: a computational
investigation of grasping affordances. Accepted for
publication in Cognitive Systems Research.
Prevete, R., Tessitore, G., Santoro, M., and Catanzariti,
E. (2008). A connectionist architecture for view-
independent grip-aperture computation. Brain Re-
search, 1225:133–145.
Romero, J., Kjellstrom, H., and Kragic, D. (2009). Monoc-
ular real-time 3d articulated hand pose estimation.
In IEEE-RAS International Conference on Humanoid
Robots (Humanoids09).
Santello, M., Flanders, M., and Soechting, J. F. (2002). Pat-
terns of hand motion during grasping and the influ-
ence of sensory guidance. Journal of Neuroscience,
22(4):1426–1435.
Weinland, D., Ronfard, R., and Boyer, E. (2010). A Survey
of Vision-Based Methods for Action Representation,
Segmentation and Recognition. Technical report, IN-
RIA.