Efficient Multi-stream Temporal Learning and Post-fusion Strategy for
3D Skeleton-based Hand Activity Recognition
Yasser Boutaleb¹,², Catherine Soladie², Nam-Duong Duong¹, Amine Kacete¹, Jérôme Royan¹ and Renaud Seguier²
¹IRT b-com, 1219 Avenue des Champs Blancs, 35510 Cesson-Sévigné, France
²IETR/CentraleSupelec, Avenue de la Boulaie, 35510 Cesson-Sévigné, France
{catherine.soladie, renaud.seguier}@centralesupelec.fr
Keywords:
First-person Hand Activity, Multi-stream Learning, 3D Hand Skeleton, Hand-crafted Features, Temporal
Learning.
Abstract:
Recognizing first-person hand activity is a challenging task, especially when not enough data are available.
In this paper, we tackle this challenge by proposing a new hybrid learning pipeline for skeleton-based hand
activity recognition, which is composed of three blocks. First, for a given sequence of the hand's joint positions,
the spatial features are extracted using a dedicated combination of local and global spatial hand-crafted features.
Then, the temporal dependencies are learned using a multi-stream learning strategy. Finally, a hand
activity sequence classifier is learned, via our Post-fusion strategy, applied to the previously learned temporal
dependencies. The experiments, evaluated on two real-world data sets, show that our approach outperforms
the state-of-the-art approaches. As an ablation study, we compared our Post-fusion strategy with three
traditional fusion baselines and observed an accuracy improvement of more than 2.4%.
1 INTRODUCTION
Understanding human activity from a first-person
(egocentric) perspective by focusing on the hands is
gaining interest from both the computer vision and
robotics communities. Indeed, the range of potential
applications includes, among others, Human Com-
puter Interaction (Sridhar et al., 2015), Humanoid
Robotics (Ramirez-Amaro et al., 2017), and Vir-
tual/Augmented Reality (Surie et al., 2007). In partic-
ular, with the development of effective and low cost
depth camera sensors over the last years (e.g., Mi-
crosoft Kinect or Intel RealSense), 3D skeletal data
acquisition becomes possible with a sufficient accu-
racy (Yuan et al., 2018; Moon et al., 2018). This has
promoted the problem of human activity recognition.
The 3D skeletal data provides a robust high-level
description that sidesteps common difficulties of RGB
imaging, such as the need for background subtraction and the sensitivity to lighting
variation. To this end, many skeleton-based approaches have been proposed. Most of them are based
on end-to-end Deep Learning (DL) (Du et al., 2015;
Wang and Wang, 2017), which has been proven to
be effective when a large amount of data is available.
Figure 1: Our proposed learning pipeline for 3D skeleton-based first-person hand activity recognition. For a given
3D hand skeleton activity sequence, in the first block, the spatial features are extracted using existing and new hand-crafted
methods. Then, in the second block, the temporal dependencies are learned. In the last block, a hand activity
sequence classifier is learned, using a Post-fusion strategy, which is applied to the previously learned temporal
dependencies. For the test predictions, only the first and the last blocks are involved.
Hence, for some industrial applications, providing large-scale labeled data sets is still hard and expensive to achieve due to the manual data annotation process.
On the other hand, pure hand-crafted (HC) features based approaches (Smedt et al., 2016; Devanne
et al., 2015; Kacem et al., 2017; Zhang et al., 2016;
Hussein et al., 2013) can deal with a limited amount
of data. Yet, they still struggle to learn temporal
dependencies along the sequence time-steps. As a
trade-off between performance and data acquisition cost, hybrid methods combine DL and pure
HC methods (Avola et al., 2019; Chen et al., 2017;
Liu et al., 2019; Zhang et al., 2019).
Motivated by all these observations, we introduce
in this paper, a new hybrid approach for 3D skeleton-
based first-person hand activity recognition. We high-
light the contributions as follows:
A novel hybrid learning pipeline for first-person
hand activity recognition that consists of three se-
quential blocks as illustrated in Figure 1: In the
first block, we extract the spatial features using
our proposed selection of existing and new HC
feature extraction methods. Then, in the second block, we differ from existing methods
by learning the temporal dependencies independently for each HC feature, using a simple
separate Neural Network (NN) to avoid the over-fitting problem. Finally, we exploit the knowledge
from the previous block to classify activities using a tuning strategy that we call Post-fusion.
Once the learning is completed, only the first and the last blocks are involved in prediction.
This multi-step learning pipeline allows
training on a limited number of samples while ensuring good accuracy.
A combination of three local and global HC
feature extraction methods for 3D skeleton-based
first-person hand activity recognition, that we
summarize as follows. (1) Inspired by (Smedt
et al., 2016), we use a Shape of Connected Joints
(SoCJ), which characterizes the activity sequence
by the variation of the physical hand shape at
each time-step. (2) Our proposed Intra/Inter Finger Relative Distances (IIFRD), which also relevantly characterize the activity sequence at each
time-step, through the inter-finger relative distances between physically adjacent finger pairs
and the intra-finger relative distances, i.e., the distances between the two opposite joints of each pair
of directly connected finger segments. (3) Since the SoCJ and
IIFRD only focus on the local features of the hand
at each time-step, we propose a complementary
HC method called Global Relative Translations
(GRT), which focuses on the global features of the
entire sequence by exploiting the displacement of
the hand during the activity.
Figure 2: A categorization of 3D Skeleton-based activity
recognition methods. (a) Deep Learning based methods
(DL). (b) Hand-Crafted methods (HC). (c) Hybrid methods
that combine both DL and HC methods.
The remainder of this paper is organised as fol-
lows. After giving a review on the related work in
Section 2, we describe our proposed approach for 3D
skeleton-based hand activity recognition in Section 3.
Then, we show the benefit of the proposed approach
by presenting and discussing the experimental results
in Section 4. Section 5 concludes the paper.
2 RELATED WORK
This section presents related work for 3D skeleton-
based first-person hand activity recognition. We
also introduce other similar 3D skeleton-based ap-
proaches, namely, dynamic hand gesture recognition
and human activity recognition. We categorize these
approaches as DL based (Sec 2.1), HC based (Sec
2.2), and Hybrid methods (Sec 2.3) as illustrated in
Figure 2.
2.1 Deep Learning based Methods
Recently, end-to-end DL approaches (Figure 2 (a))
have shown great success in different applications, including skeleton-based activity recognition (Wang et al.,
2019). The great performance of Convolutional Neu-
ral Networks (CNNs) has motivated (Caetano et al.,
2019a; Caetano et al., 2019b) to formulate the recog-
nition problem as an image classification problem by
representing a sequence of 3D skeleton joints as a 2D
image input for a deep CNN. Also, attracted by the
success of CNNs, (Li et al., 2017) proposed a two-
stream feature convolutional learning to manage both
spatial and temporal domains. On the other hand,
other recent works (Yan et al., 2018; Tang et al., 2018;
Shi et al., 2018; Li et al., 2019; Si et al., 2019) fo-
cused on the Graph Convolutional Networks (GCNs)
to exploit the connections between the skeleton joints
formed by the physical structure. Furthermore, to
better exploit information in the temporal dimension,
many other works focused on Recurrent Neural Net-
works (RNNs) equipped with Long Short Term Mem-
ory (LSTMs) cells (Du et al., 2015; Liu et al., 2016) or
Gated Recurrent Units (GRUs) (Maghoumi and LaVi-
ola, 2018) for their capabilities of reasoning along the
temporal dimension to learn the temporal dependen-
cies. More recently, (Nguyen et al., 2019) proposed
a new NN layer that uses a Symmetric Positive Definite (SPD) matrix as a low-dimensional discriminative
descriptor for classification, which has been proven
to be effective. However, most of these methods still
have difficulty extracting relevant spatial features due
to the limited number of training samples and the
sparseness of the 3D skeleton data.
2.2 Hand-crafted based Methods
In spite of the success of DL approaches in the last
few years, classical methods that mainly use HC techniques to perform recognition (Figure 2 (b)) still receive
attention. Most of these approaches focus on repre-
senting the data in a non-Euclidean domain. A refer-
ence work was proposed by (Vemulapalli et al., 2014)
where they represented the 3D skeleton data as points
on a Lie group. Also, in (Devanne et al., 2015), a
Riemannian manifold is used as a non-Euclidean do-
main to formulate the recognition problem as a prob-
lem of computing the similarity between the shape
of trajectories. Similarly, (Zhang et al.,
2016; Kacem et al., 2017) exploited the Gram Matrix to handle the temporal dimension and distance
measurement for classification.
On the other hand, closer to our HC feature extraction methods, some recent works focused
on the exploitation of the 3D geometrical information. (Smedt et al., 2016) proposed a set of HC geometrical features based on the connections between
the hand joints and on rotations, which they represented as
histogram input vectors for an SVM to classify hand
gestures. Similarly to (Evangelidis et al., 2014; Zhang
et al., 2013), (Smedt et al., 2016) used a Temporal
Pyramid representation to manage the temporal dimension. In (Ohn-Bar and Trivedi, 2013), joint angle similarities and a Histogram of Oriented Gradients (HOG) are used as input to an SVM to classify
activities. This category of methods has been proven
to be very effective in providing relevant spatial fea-
tures, but most of them are still struggling to learn
long-term temporal dependencies.
2.3 Hybrid Methods
This category (Figure 2 (c)) combines the two previously introduced approaches (Sec 2.1 and Sec 2.2)
to overcome their limitations. Seeking to exploit their
advantages, our proposed approach falls into this category, since DL methods have been proven powerful in learning temporal dependencies, while HC-features-based methods give very discriminating spatial
features. (Avola et al., 2019) concatenated a set of
HC features as input to deep LSTMs to classify American Sign Language (ASL) and semaphoric
hand gestures. Similarly, aiming at classifying human 3D gaits, (Liu et al., 2019) concatenated relative
distances and angles as a feature input vector to an
LSTM to manage the temporal dimension, while in
parallel a CNN is exploited to learn spatial features
from 2D Gait Energy Images. Yet, the early fusion
of different feature spaces increases the input complexity and the learning noise (Ying, 2019), especially
when only few training samples are provided. To prevent
this problem, (Chen et al., 2017) proposed an end-to-end slow-fusion-based architecture, but this type
of complex architecture requires a large amount of data.
More recently, (Zhang et al., 2019) proposed a Quaternion Product Unit (QPU) that describes the 3D skeleton
with rotation-invariant and rotation-equivariant features, which can be fed into a Deep Neural Network
(DNN) to recognize activities. Yet, their experiments showed relatively low accuracy.
3 PROPOSED METHOD
This section details our proposed hybrid approach following the illustration of Figure 1. In the first block,
we extract the HC features (Sec 3.1). Then, in the second block, we learn the temporal dependencies (Sec
3.2). Once the temporal learning is complete, in the last
block, we transfer and exploit the knowledge from the
previous block to learn to classify activities (Sec 3.3).
3.1 Hand-crafted Features Extraction
A 3D hand skeleton activity sequence is the only input, which we denote $S(t)$. At each time-step $t$, the
hand is represented by a configuration of $n$ physically
connected joints $\{J_j^t = (x_j^t, y_j^t, z_j^t)\}_{j=1:n}$. Each joint
is represented by its 3D Cartesian coordinates, and the joints form
a set of segments corresponding to the hand bones, the
phalanges and the metacarpals (Figure 3). We define and
formulate the activity sequence as follows:
$$S(t) = \{\{J_j^t\}_{j=1:n}\}_{t=0:T} \qquad (1)$$

where $T$ is the maximum length of the sequence.
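To fix notation for the sketches that follow, a sequence $S(t)$ can conveniently be stored as a single array with one row of 21 joints per time-step; this array layout is our own assumption, not imposed by the data sets.

```python
# Illustrative only: S(t) held as a NumPy array of shape (T, n, 3), i.e. T
# time-steps, n = 21 joints, 3 Cartesian coordinates per joint.
import numpy as np

T, n = 120, 21
S = np.zeros((T, n, 3), dtype=np.float32)  # S[t, j] = (x_j^t, y_j^t, z_j^t)
```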
In order to exploit the 3D geometrical information, in the first block we use three HC feature extraction methods that provide relevant features for first-person hand activity recognition.
Figure 3: The proposed selection of hand-crafted features. (a) Shape of Connected Joints (SoCJ): the hand shape is represented by 3D vectors between physically connected joints (Smedt et al., 2016). (b) Our proposed Intra/Inter Finger Relative
Distances (IIFRD), which characterize the activities by the high variation of intra-finger distances $RD_a$ (in black) and inter-finger distances $RD_e$ (in red). (c) Our proposed Global Relative Translations (GRT), which characterize the activity sequence
by the translation of the joints' centroid at each time-step.
Note that before proceeding to feature extraction, the 3D hand
skeleton data are normalized, such that the hands of
all the subjects are adjusted to the same average size
while keeping the angles intact. In the following descriptions, we model the hand skeleton by fixing the
number of joints to $n = 21$.
Shape of Connected Joints (SoCJ). Inspired by
(Smedt et al., 2016), we use the SoCJ to represent
the variation of the hand shape during the activity.
For each finger, we compute the 3D vectors between
physically connected joints, from the wrist up to the
fingertips (Figure 3 (a)). Let $F_1 = \{J_j\}_{j=1:5}$ be a set
of joints ordered such that they represent the
physical connections of the thumb and the wrist
joint $J_1$, as shown in Figure 3 (a). The $SoCJ(F_1)$ can
be computed as follows:

$$SoCJ(F_1) = \{J_j - J_{j-1}\}_{j=5:2} \qquad (2)$$

By applying the SoCJ to all the fingers, for each time-step $t$, we obtain a feature descriptor
$\{SoCJ(F_l^t)\}_{l=1:5} \in \mathbb{R}^{4\times5\times3}$, where $F_l$ is the $l$-th finger.
Let $\Psi_1(\cdot)$ denote the SoCJ method applied to the entire
activity sequence $S(t)$, defined as follows:

$$\Psi_1(S(t)) = \{\{SoCJ(F_l^t)\}_{l=1:5}\}_{t=0:T} \qquad (3)$$
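As a rough illustration of Eq. 2-3, the following sketch computes a per-time-step SoCJ descriptor from a sequence stored as a (T, 21, 3) array. The finger chains below (wrist at index 0 followed by four joints per finger) are hypothetical joint indices, not the official joint ordering of the FPHA or DHG data sets.

```python
import numpy as np

# Hypothetical joint ordering: index 0 is the wrist, each chain then lists one
# finger's joints from palm to fingertip. Adapt to the data set's actual layout.
FINGER_CHAINS = [
    [0, 1, 2, 3, 4],      # thumb
    [0, 5, 6, 7, 8],      # index
    [0, 9, 10, 11, 12],   # middle
    [0, 13, 14, 15, 16],  # ring
    [0, 17, 18, 19, 20],  # pinky
]

def socj(sequence):
    """Shape of Connected Joints: 3D vectors between physically connected
    joints, per finger and per time-step. (T, 21, 3) -> (T, 5*4*3)."""
    T = sequence.shape[0]
    desc = np.zeros((T, len(FINGER_CHAINS), 4, 3), dtype=sequence.dtype)
    for l, chain in enumerate(FINGER_CHAINS):
        joints = sequence[:, chain, :]                 # (T, 5, 3)
        desc[:, l] = joints[:, 1:] - joints[:, :-1]    # J_j - J_{j-1}, Eq. 2
    return desc.reshape(T, -1)                         # 60 values per time-step
```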
Intra/Inter Finger Relative Distances (IIFRD). We
exploit the periodic variation of the intra-finger and
inter-finger relative distances, which relevantly characterizes the activity sequence (Figure 3 (b)).

The intra-finger relative distances, which we denote $RD_a$, give strong internal dependencies between
a finger's connected segments. They represent the distance between the two opposite joints of a pair of
directly connected segments, from each fingertip
down to the wrist (Figure 3 (b), in black). Let us
take $F_1$ as described previously. The $RD_a(F_1)$ can
be computed as follows:

$$RD_a(F_1) = \{d(J_j, J_{j-2})\}_{j=5:3} \qquad (4)$$

where $d$ is the Euclidean distance. By applying
the $RD_a$ to all fingers, for each time-step $t$, we get
a feature descriptor $a(t) = \{RD_a(F_l^t)\}_{l=1:5} \in \mathbb{R}^{15}$.

The inter-finger relative distances, which we denote
by $RD_e$ (Figure 3 (b), in red), give external dependencies between adjacent finger pairs. For
instance, let us take $F_1$ as described previously and
$F_2 = \{J_j\}_{j=7:9}$, two sets of connected joints referring to the thumb and the index finger respectively. The $RD_e(F_1, F_2)$ is computed as follows:

$$RD_e(F_1, F_2) = \{d(J_j, J_{j+4})\}_{j=3:5} \qquad (5)$$

By applying the $RD_e$ to the four pairs of adjacent
fingers, for each time-step $t$, we obtain a feature
descriptor $e(t) = \{RD_e(F_l^t, F_{l+1}^t)\}_{l=1:4} \in \mathbb{R}^{12}$.

Finally, by concatenating the two descriptors $a(t)$ and
$e(t)$, for each time-step $t$, we obtain a final feature descriptor $\{a(t), e(t)\} \in \mathbb{R}^{15+12}$. We denote by $\Psi_2(\cdot)$
the IIFRD method applied to the entire activity sequence $S(t)$, defined as follows:

$$\Psi_2(S(t)) = \{a(t), e(t)\}_{t=0:T} \qquad (6)$$
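A corresponding sketch of Eq. 4-6, under the same array layout and finger-chain convention as the SoCJ sketch above; which joints of adjacent fingers are paired for $RD_e$ is our own reading of Eq. 5.

```python
import numpy as np

def iifrd(sequence, finger_chains):
    """Intra/Inter Finger Relative Distances. (T, 21, 3) -> (T, 15 + 12)."""
    fingers = [sequence[:, chain, :] for chain in finger_chains]      # each (T, 5, 3)
    intra, inter = [], []
    # RD_a (Eq. 4): distance between the opposite joints of each pair of
    # directly connected segments, i.e. d(J_j, J_{j-2}), 3 per finger.
    for f in fingers:
        intra.append(np.linalg.norm(f[:, 2:] - f[:, :-2], axis=-1))   # (T, 3)
    # RD_e (Eq. 5): distances between corresponding joints of the 4 physically
    # adjacent finger pairs, 3 per pair (the joint pairing is an assumption).
    for f1, f2 in zip(fingers[:-1], fingers[1:]):
        inter.append(np.linalg.norm(f1[:, 2:] - f2[:, 2:], axis=-1))  # (T, 3)
    return np.concatenate(intra + inter, axis=-1)                     # (T, 27)
```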
Global Relative Translations (GRT). Unlike the
IIFRD and SoCJ descriptors, which only consider the
local features related to the finger motions at each
time-step, the GRT characterizes the activity sequence
by computing the relative displacement of all the hand
joints along the sequence time-steps (Figure 3 (c)). To
this end, for each sequence, we fix the wrist joint $J_1^0$
of the first time-step $t = 0$ as the origin.
Figure 4: Illustration of the first and second blocks of the proposed learning pipeline. For each hand-crafted feature descriptor
(SoCJ, IIFRD, and GRT) seen in Figure 3, a Neural Network composed of stacked LSTM layers and a softmax layer is trained
independently to learn temporal dependencies.
Then, we transform all the remaining joints of the sequence to this
new coordinate system as follows:

$$\acute{J}_j^t = J_j^t - J_1^0 \qquad (7)$$
where $\acute{J}_j^t$ is the transformed $j$-th joint at
time-step $t$. Once the transformation is done, at each
time-step we compute the centroid of the transformed
joints $C^t = \frac{1}{21}\sum_{j=1}^{21}\acute{J}_j^t$. We denote by $\Psi_3(\cdot)$ the application of the GRT to the entire sequence $S(t)$, defined as follows:

$$\Psi_3(S(t)) = \{C^t\}_{t=0:T} \qquad (8)$$
The GRT gives discriminative information complementary to the IIFRD and the SoCJ by considering
the global trajectory of the hand along the activity. In
Section 4.3 we quantitatively show the benefit of this
complementary information.
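Eq. 7-8 translate directly into a few array operations; the sketch below again assumes the wrist is stored at joint index 0, which is a convention of ours.

```python
def grt(sequence):
    """Global Relative Translations. (T, 21, 3) -> (T, 3): the centroid
    trajectory relative to the wrist position at t = 0 (wrist assumed at index 0)."""
    origin = sequence[0, 0]            # wrist joint of the first time-step, J_1^0
    translated = sequence - origin     # Eq. 7
    return translated.mean(axis=1)     # Eq. 8: centroid C^t of the 21 joints
```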
3.2 Temporal Dependencies Learning
Learning long and complex activities requires considering the temporal dimension to make use of the long-term dependencies between sequence time-steps. To
this end, we use LSTM cells for their proven success
and capability to learn these long/short-term dependencies. Moreover, in contrast to traditional RNNs,
LSTMs overcome the vanishing gradient problem
by using a specific circuit of gates (Hochreiter and
Schmidhuber, 1997).
(Avola et al., 2019; Liu et al., 2019) concatenate
different types of feature spaces into one input vector,
which may complicate the input and confuse the NN.
In contrast, for each HC feature descriptor (seen in
Sec 3.1), we separately train a simple NN consisting
of stacked LSTM layers followed by a softmax layer to
classify activities. Therefore, in total, we train three
NNs separately, as shown in Figure 4.
More formally, let $\{\Psi_k(S)\}_{k=1:3}$ be the set of the
three feature descriptors corresponding to Eq. 3, Eq. 6
and Eq. 8 defined in Section 3.1, where $S$ is the input activity
sequence. For each feature descriptor $\Psi_k(S)$, we
model the temporal dependencies with a composite
function $g_{\theta_k}(\Psi_k(S))$, where $g_{\theta_k}(\cdot)$ is the $k$-th LSTM
sub-network with learnable parameters $\theta_k$, and
the output of $g_{\theta_k}(\cdot)$ is the last hidden state of
the last LSTM unit. For each network we define a
cross-entropy loss function $L_k$ as follows:
$$L_k = -\sum_{c=1}^{N} y_c \log(\hat{y}_k^c) \qquad (9)$$
where $N$ is the number of classes, $y_c$ is the target label and $\hat{y}_k^c$ is the softmax output that refers to the predicted label. The temporal learning parameters are
optimized by minimizing over a labeled data set:
$$\theta_k^* = \arg\min_{\theta_k} L_k(y, \hat{y}_k) \qquad (10)$$
At the end of the training, we obtain a
set of three trained LSTM sub-networks with optimized parameters $\theta_k^*$:

$$\{g_{\theta_k^*}(\Psi_k(S))\}_{k=1:3} \qquad (11)$$
We note that the purpose of this second block is to
learn the temporal dependencies, and all the classification results $\hat{y}_k$ are ignored. Only the results shown
in Eq. 11 are needed for the next block.
This pre-training strategy of multiple networks
avoids the fusion of different feature spaces, which reduces the input complexity and the learning noise. It
also allows each LSTM to focus on learning over
one specific feature input independently, which also
helps to avoid the over-fitting problem (Ying, 2019).
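A minimal Keras sketch of one stream of this second block, assuming descriptors are zero-padded to 300 time-steps (Sec 4.2) and using the FPHA configuration of a single LSTM of 100 units with dropout 0.5; the masking layer is our own addition to skip the padded steps.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_stream(feature_dim, num_classes, units=100):
    """One per-descriptor temporal network: LSTM followed by a softmax,
    pre-trained independently (block 2)."""
    inputs = layers.Input(shape=(300, feature_dim))
    x = layers.Masking(mask_value=0.0)(inputs)                      # skip zero-padded steps (assumption)
    x = layers.LSTM(units, dropout=0.5, recurrent_dropout=0.5)(x)   # last hidden state = g_{theta_k}
    outputs = layers.Dense(num_classes, activation="softmax")(x)    # y_hat_k, used only for pre-training
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# One network per hand-crafted descriptor; dimensions follow Sec 3.1
# (SoCJ: 4x5x3 = 60, IIFRD: 15 + 12 = 27, GRT: 3) and FPHA has 45 classes.
streams = {name: build_stream(dim, num_classes=45)
           for name, dim in [("socj", 60), ("iifrd", 27), ("grt", 3)]}
```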
3.3 Post-fusion Strategy and
Classification
Once the temporal dependencies are learned in the
second block (Sec 3.2), we proceed to the final classification. To this end, we train another multi-input
NN that exploits the three resulting pre-trained LSTM
layers introduced in Sec 3.2, which we transfer with
fixed optimized parameters $\theta^*$, as illustrated in Figure 5.

Seeking to ensure the best classification accuracy,
the three parallel output branches of the transferred
LSTMs are concatenated, then fed into a Multi-Layer Perceptron (MLP) that consists of two Fully Connected (FC) layers, followed by a softmax layer (Figure 5).
Figure 5: Illustration of the third block of the proposed pipeline. Once the temporal dependencies are learned in the second
block, the LSTM layers are transferred to the third block with fixed parameters. Their outputs are fused and fed into an MLP
followed by a softmax layer for the final classification.
We model this network as shown in Eq. 12,
where $f_\phi$ is an MLP+softmax with learnable parameters $\phi$, and $h$ is the concatenation function:

$$f_\phi(h(\{g_{\theta_k^*}(\Psi_k(S))\}_{k=1:3})) \qquad (12)$$
The learnable parameters $\phi$ are optimized using
the same loss function as in the previous block (Sec
3.2) by minimizing over the same training data set.
Note that for the test predictions, only this network is
involved, using the three HC feature descriptors introduced in Sec 3.1 as a multi-input.

The proposed Post-fusion strategy aims at ensuring a good accuracy score by tuning between
the outputs of the pre-trained networks. In Section 4.4,
we quantitatively show the efficiency of this strategy
compared to other traditional fusion and classification
methods.
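The third block can then be sketched as follows, reusing the pre-trained LSTM layers of the `streams` models above with frozen weights; this is only an interpretation of Eq. 12 and of the layer sizes reported in Sec 4.2, not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_post_fusion(streams, num_classes):
    """Post-fusion network f_phi: frozen pre-trained LSTMs -> concatenation ->
    MLP (256, 128, ReLU) -> softmax."""
    inputs, branches = [], []
    for name, stream in streams.items():
        inp = layers.Input(shape=stream.input_shape[1:], name=f"{name}_in")
        lstm = stream.layers[-2]              # pre-trained LSTM (softmax discarded)
        lstm.trainable = False                # transfer with fixed parameters theta*
        x = layers.Masking(mask_value=0.0)(inp)
        branches.append(lstm(x))
        inputs.append(inp)
    h = layers.Concatenate()(branches)        # fusion h(.)
    h = layers.Dense(256, activation="relu")(h)
    h = layers.Dense(128, activation="relu")(h)
    out = layers.Dense(num_classes, activation="softmax")(h)
    model = models.Model(inputs, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```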
4 EXPERIMENTS
4.1 Data Sets
FPHA Data Set. Proposed by (Garcia-Hernando
et al., 2018), it is the only publicly available data set
for first-person hand activity recognition. This data
set provides RGB and depth images with the 3D annotations of the 21 hand joints, the 6-DoF object poses,
and the activity classes. It is a diverse data set that includes 1175 activity videos belonging to 45 different
activity categories, in 3 different scenarios performed
by 6 actors with high inter-subject and intra-subject
variability of style, speed, scale, and viewpoint. It
represents a real challenge for activity recognition algorithms. We note that our proposed method only
needs the 3D coordinates of the hand joints. For
all the experiments, we used the setting proposed in
(Garcia-Hernando et al., 2018), with exactly the same
distribution of data: 600 activity sequences for training and 575 for testing.
Dynamic Hand Gesture (DHG) 14/28 Data Set.
Proposed by (Smedt et al., 2016), this data set is devoted to a related domain, namely hand gesture recognition. We use it to further validate
our proposed approach. The data set contains 14 gestures, each performed in two ways: using one finger or
the whole hand. Each gesture is performed 5 times
by 20 participants in the 2 ways, resulting in 2800 sequences. Sequences are labelled following their gesture, the number of fingers used, the performer and the
trial. Each frame contains a depth image and the coordinates of 22 joints, both in the 2D depth image space
and in the 3D world space, forming a full hand skeleton. Note that for our experiments we only need the
3D hand skeleton joints. We ignored the palm center
and considered only the remaining 21 hand joints.
4.2 Implementation Details
The Learning of Temporal Dependencies. For each
extracted HC feature, we trained different configurations of separate NNs consisting of 1, 2, 3 or 4
stacked LSTM layers followed by a softmax. We selected the configuration that gives the best accuracy score: only one LSTM layer of 100 units for the
FPHA data set, and two stacked LSTMs of 200 units
for the DHG 14/28 data set. We set the dropout probability to 0.5 (outside and inside the LSTM
gates). We use Adam with a learning rate of 0.001 for
the optimization. All the networks are trained with a
batch size of 128 for 2000 to 3000 epochs. We also
padded all the sequences to 300 time-steps per sequence.
Post-fusion and Classification. Once all the temporal dependencies are learned (end of block 2), in the
Post-fusion step, we recover the pre-trained LSTM
networks, fix all their weights, and discard the
softmax layers. Then, the three output branches from
the three parallel transferred LSTMs are concatenated
and followed by an MLP consisting of two dense
layers of 256 and 128 neurons respectively, equipped
with a ReLU activation function. At the end of the
network, a softmax layer is used for the final classification. This network is trained for up to 100 epochs,
with the same batch size and optimization parameters
as the previous networks. Our implementations are
based on the Keras framework.
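Putting the two stages together with the sketches above, a hypothetical training run would look like the following, where `X_train` is assumed to be a dict of per-descriptor arrays of shape (num_sequences, 300, d_k) and `y_train` the integer activity labels.

```python
# Block 2: pre-train one temporal network per descriptor.
for name, model in streams.items():
    model.fit(X_train[name], y_train, batch_size=128, epochs=2000)

# Block 3: train the Post-fusion classifier on top of the frozen streams.
post_fusion = build_post_fusion(streams, num_classes=45)
post_fusion.fit([X_train[name] for name in streams],
                y_train, batch_size=128, epochs=100)
```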
4.3 Hand-crafted Features Analysis
Table 1: Test accuracy results on the FPHA data set. The
selected hand-crafted features evaluated independently, and their combinations using our proposed approach.
Hand-crafted Feature Acc.(%)
Shape of Connected Joints (SoCJ) 89.91
Intra/Inter Finger Relative Distances (IIFRD) 88.17
Global Relative Translations (GRT) 58.26
IIFRD + GRT 93.73
SoCJ + GRT 92.17
SoCJ + IIFRD 93.91
SoCJ + IIFRD + GRT 96.17
In order to analyse the effectiveness of the selected
HC features, we evaluated each one independently
using a simplified end-to-end NN architecture com-
posed of one LSTM layer of 100 units with a dropout
of 0.5 (outside and inside the LSTM gates) and a soft-
max layer. We also evaluated possible HC features
combinations by using our approach with the same
configuration introduced in Sec 4.2. The results in
Table 1 show that the SoCJ and IIFRD alone are
capable of achieving good accuracies of 89.91% and
88.17% respectively. As expected, the GRT alone is
unable to classify activities, achieving only 58.26%
accuracy. However, it boosts the performance when combined with the SoCJ, the IIFRD, or both, reaching
the best accuracy of 96.17%. This can be explained
by the fact that the SoCJ and the IIFRD focus on
local features based on the motion of the fingers,
ignoring translations between the activity sequence
time-steps, while the GRT focuses on the global feature based on the displacement of the hand during the
activity, which provides important complementary
information. The combination of the three selected
HC features allowed us to overcome the commonly
confused classes "open wallet" and "use calculator",
whose hand poses differ only subtly (Garcia-Hernando et al., 2018). Nevertheless, we
still get confusion between the "open wallet" and "flip
sponge" classes, due to the limited displacement of
the hand, the shortness of the activities, and the limited number of samples in the data set compared to the
other classes (Garcia-Hernando et al., 2018). Please
refer to the appendix for more details.
4.4 Post-fusion Strategy and
Classification Analysis
We compared our Post-fusion strategy with three tra-
ditional baselines, the early, the slow, and the late fu-
sion, that we define as follows:
Figure 6: Hybrid fusion and classification baselines. (a)
Early fusion: an end-to-end architecture where the extracted hand-crafted features are concatenated and fed into
a temporal learning network followed by a classifier. (b)
Slow fusion: an end-to-end architecture where, for each
extracted hand-crafted feature, the temporal dependencies
are learned separately, then concatenated and fed into a classifier. (c) Late fusion: a multi-stream learning scheme where, for
each extracted feature, an end-to-end temporal network is
trained separately, and at the end a majority vote is applied
to their classifier outputs.
Early Fusion. (Figure 6 (a)) As in (Avola et al.,
2019; Liu et al., 2019), we concatenate our extracted
HC feature descriptors into one unified vector that we
feed into deep stacked LSTM layers of 200 units followed by a softmax layer. We evaluated the baseline
with configurations of 2, 3 and 4 stacked layers, and
the best accuracy results are selected for the comparison.
Slow Fusion. (Figure 6 (b)) As in (Chen et al.,
2017), for each extracted HC feature descriptor, we
used 2 stacked LSTM layers of 200 units, followed
by a Fully Connected (FC) layer of 128 neurons. The
outputs from the three parallel FC branches are concatenated and followed by 2 sequential FC layers of
256 and 128 neurons respectively. At the end of the
network, a softmax layer is used for the classification.
All the layers are followed by a dropout layer, and
all the FC layers are equipped with a ReLU activation
function.
Late Fusion. (Figure 6 (c)) In contrast to the previously introduced end-to-end baselines, in this architecture, for each HC feature, a NN composed of an
LSTM layer of 100 units and a softmax layer is trained
separately. At the end of the training, a majority vote is
applied by summing the softmax output scores.
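For completeness, the late-fusion vote amounts to summing the per-stream softmax scores; a small sketch, assuming `streams` holds the three independently trained models from the earlier sketch and `X_test` the corresponding per-descriptor test arrays:

```python
import numpy as np

def late_fusion_predict(streams, X_test):
    """Majority vote by adding the softmax output scores of the streams."""
    scores = sum(model.predict(X_test[name]) for name, model in streams.items())
    return np.argmax(scores, axis=-1)
```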
Table 2: Accuracy results on 50% and 100% of the 600
FPHA data set training samples. For the test, all the 575
testing samples are kept. We compare our proposed approach with three traditional fusion and classification baselines.
Architecture 300 samples Acc.(%) 600 samples Acc.(%)
Early fusion 75.65 90.95
Slow fusion 63.47 86.43
Late fusion 76.26 93.73
Our 79.78 96.17
We trained our network architecture and the selected baselines with 100%, and then with only 50%, of the 600
training samples of the FPHA data set, which belong to
subjects 1, 3, and 6. For the test, we kept all the
575 testing samples.

Our proposed approach outperforms the baselines
by 3.52% in accuracy using 50% and by 2.44% using 100% of the 600 training samples,
which confirms the effectiveness of our fusion strategy. Moreover, by using only 300 samples (half of
the FPHA training samples), our approach
already achieves state-of-the-art performance. Thanks
to its simpler architecture, the early fusion outperforms the slow fusion, which is more complex and
prone to over-fitting.
Table 3: Test accuracy comparison of our proposed approach and the state-of-the-art approaches on the FPHA
data set. The best results are marked in bold.
Method Color Depth Pose Acc.(%)
(Feichtenhofer et al., 2016) ✓ ✗ ✗ 61.56
(Feichtenhofer et al., 2016) ✓ ✗ ✗ 69.91
(Feichtenhofer et al., 2016) ✓ ✗ ✗ 75.30
(Ohn-Bar and Trivedi, 2014) ✗ ✓ ✗ 59.83
(Ohn-Bar and Trivedi, 2014) ✗ ✓ ✓ 66.78
(Oreifej and Liu, 2013) ✗ ✓ ✗ 70.61
(Rahmani and Mian, 2016) ✗ ✓ ✗ 69.21
(Garcia-Hernando et al., 2018) ✗ ✗ ✓ 78.73
(Garcia-Hernando et al., 2018) ✗ ✗ ✓ 80.14
(Zanfir et al., 2013) ✗ ✗ ✓ 56.34
(Vemulapalli et al., 2014) ✗ ✗ ✓ 82.69
(Du et al., 2015) ✗ ✗ ✓ 77.40
(Zhang et al., 2016) ✗ ✗ ✓ 85.39
(Garcia-Hernando et al., 2018) ✗ ✗ ✓ 80.69
(fang Hu et al., 2015) ✓ ✗ ✗ 66.78
(fang Hu et al., 2015) ✗ ✓ ✗ 60.17
(fang Hu et al., 2015) ✗ ✗ ✓ 74.60
(fang Hu et al., 2015) ✓ ✓ ✓ 78.78
(Huang and Gool, 2016) ✗ ✗ ✓ 84.35
(Huang et al., 2016) ✗ ✗ ✓ 77.57
(Tekin et al., 2019) ✓ ✗ ✗ 82.26
(Zhang et al., 2019) ✗ ✗ ✓ 82.26
(Lohit et al., 2019) ✗ ✗ ✓ 82.75
(Nguyen et al., 2019) ✗ ✗ ✓ 93.22
(Rastgoo et al., 2020) ✓ ✗ ✓ 91.12
Our ✗ ✗ ✓ 96.17
The late fusion outperforms both, thanks to its simple NNs trained independently, which helps to overcome the over-fitting
problem. It performs well, but its naive
fusion cannot ensure good tuning between the NN
outputs. Please refer to the appendix for more comparisons.
4.5 State-of-the-Art Comparison
Table 3 shows the accuracy of our approach compared
with the state-of-the-art approaches on the FPHA data
set. We note that the accuracy results of (Feichtenhofer et al., 2016; Ohn-Bar and Trivedi, 2014; Oreifej and Liu, 2013; Rahmani and Mian, 2016; Zanfir
et al., 2013; Vemulapalli et al., 2014; Du et al., 2015;
Zhang et al., 2016; fang Hu et al., 2015) and (Huang
and Gool, 2016; Huang et al., 2016) are reported by
(Garcia-Hernando et al., 2018) and (Nguyen et al.,
2019) respectively; some of these methods may rely on
full-body joints rather than hand joints and might not
be tailored for hand activities.
The best performing approaches among the state-of-the-art methods are the NN based on SPD manifold
learning (Nguyen et al., 2019) and the multi-modal
approach proposed by (Rastgoo et al., 2020), which achieve 93.22% and 91.12% accuracy respectively, both inferior to our proposed approach (96.17%). The remaining methods are outperformed by
our approach by more than 10% in accuracy.
Table 4: Accuracy comparison of our proposed approach
and the state-of-the-art approaches on the DHG-14/28 data set.
The best results are marked in bold.
Method Color Depth Pose 14 gestures Acc.(%) 28 gestures Acc.(%)
(Oreifej and Liu, 2013) ✗ ✓ ✗ 78.53 74.03
(Devanne et al., 2015) ✗ ✗ ✓ 79.61 62.00
(Huang and Gool, 2016) ✗ ✗ ✓ 75.24 69.64
(Ohn-Bar and Trivedi, 2014) ✗ ✗ ✓ 83.85 76.53
(Chen et al., 2017) ✗ ✗ ✓ 84.68 80.32
(Smedt et al., 2016) ✗ ✗ ✓ 88.24 81.90
(Devineau et al., 2018) ✗ ✗ ✓ 91.28 84.35
(Nguyen et al., 2019) ✗ ✗ ✓ 94.29 89.40
(Maghoumi and LaViola, 2018) ✗ ✗ ✓ 94.50 91.40
(Avola et al., 2019) ✗ ✗ ✓ 97.62 91.43
Our ✗ ✗ ✓ 95.21 90.10
Table 4 shows that our proposed approach
achieves state-of-the-art results on the DHG-14/28
data set, even though our selected HC feature methods
are tailored to first-person hand activity recognition rather than to the hand gesture recognition problem.
The approach proposed by (Avola et al., 2019) outperforms all the state-of-the-art approaches including
ours, thanks to their proposed HC features, which
are well adapted to American Sign Language (ASL)
and semaphoric hand gestures. Furthermore, the time
sampling strategy used in (Avola et al., 2019) allows
a better classification of the least dynamic and shortest gestures, unlike our proposed approach, which is designed
for hand activities where the hand is supposed to be more dynamic.
5 CONCLUSIONS
In this paper, a novel learning pipeline for first-person
hand activity recognition is presented. The proposed
pipeline is composed of three blocks. The first block
is a new combination of HC feature extraction methods. The second block is our multi-stream temporal
dependencies learning strategy. In the last block, we
introduced our proposed Post-fusion strategy, which
has been proven to be more efficient than other existing traditional fusion methods. The proposed approach was evaluated on two real-world data sets and
showed good accuracy results.
As future improvements, we plan to exploit the
color and object pose information in addition to the
skeletal data, in order to avoid ambiguous cases
where the manipulated objects in different activities
may have the same dimensions but different colors.
REFERENCES
Avola, D., Bernardi, M., Cinque, L., Foresti, G. L., and
Massaroni, C. (2019). Exploiting recurrent neural net-
works and leap motion controller for the recognition
of sign language and semaphoric hand gestures. IEEE
Transactions on Multimedia, 21:234–245.
Caetano, C., Brémond, F., and Schwartz, W. R. (2019a).
Skeleton image representation for 3d action recognition based on tree structure and reference joints. 2019
32nd SIBGRAPI Conference on Graphics, Patterns
and Images (SIBGRAPI), pages 16–23.
Caetano, C., de Souza, J. S., Brémond, F., dos Santos, J. A.,
and Schwartz, W. R. (2019b). Skelemotion: A new
representation of skeleton joint sequences based on
motion information for 3d action recognition. 2019
16th IEEE International Conference on Advanced
Video and Signal Based Surveillance (AVSS), pages 1–8.
Chen, X., Guo, H., Wang, G., and Zhang, L. (2017).
Motion feature augmented recurrent neural network
for skeleton-based dynamic hand gesture recognition.
2017 IEEE International Conference on Image Pro-
cessing (ICIP), pages 2881–2885.
Devanne, M., Wannous, H., Berretti, S., Pala, P., Daoudi,
M., and Bimbo, A. D. (2015). 3-d human action
recognition by shape analysis of motion trajectories
on riemannian manifold. IEEE Transactions on Cy-
bernetics, 45:1340–1352.
Devineau, G., Moutarde, F., Xi, W., and Yang, J. (2018).
Deep learning for hand gesture recognition on skele-
tal data. 2018 13th IEEE International Conference
on Automatic Face & Gesture Recognition (FG 2018),
pages 106–113.
Du, Y., Wang, W., and Wang, L. (2015). Hierarchical recur-
rent neural network for skeleton based action recogni-
tion. 2015 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 1110–1118.
Evangelidis, G. D., Singh, G., and Horaud, R. (2014).
Skeletal quads: Human action recognition using joint
quadruples. 2014 22nd International Conference on
Pattern Recognition, pages 4513–4518.
fang Hu, J., Zheng, W.-S., Lai, J.-H., and Zhang, J. (2015).
Jointly learning heterogeneous features for rgb-d ac-
tivity recognition. In CVPR.
Feichtenhofer, C., Pinz, A., and Zisserman, A. (2016). Con-
volutional two-stream network fusion for video action
recognition. 2016 IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 1933–
1941.
Garcia-Hernando, G., Yuan, S., Baek, S., and Kim, T.-K.
(2018). First-person hand action benchmark with rgb-
d videos and 3d hand pose annotations.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Computation, 9:1735–1780.
Huang, Z. and Gool, L. V. (2016). A riemannian network
for spd matrix learning. ArXiv, abs/1608.04233.
Huang, Z., Wu, J., and Gool, L. V. (2016). Building deep
networks on grassmann manifolds. In AAAI.
Hussein, M. E., Torki, M., Gowayyed, M. A., and El-Saban,
M. (2013). Human action recognition using a tem-
poral hierarchy of covariance descriptors on 3d joint
locations. In IJCAI.
Kacem, A., Daoudi, M., Amor, B. B., and Paiva, J. C. Á.
(2017). A novel space-time representation on the positive semidefinite cone for facial expression recognition. 2017 IEEE International Conference on Computer Vision (ICCV), pages 3199–3208.
Li, C., Zhong, Q., Xie, D., and Pu, S. (2017). Skeleton-
based action recognition with convolutional neural
networks. 2017 IEEE International Conference on
Multimedia Expo Workshops (ICMEW), pages 597–
600.
Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., and Tian,
Q. (2019). Actional-structural graph convolutional
networks for skeleton-based action recognition. In
CVPR.
Liu, J., Shahroudy, A., Xu, D., and Wang, G. (2016).
Spatio-temporal lstm with trust gates for 3d human ac-
tion recognition. In ECCV.
Liu, Y., Jiang, X., Sun, T., and Xu, K. (2019). 3d gait
recognition based on a cnn-lstm network with the fu-
sion of skegei and da features. 2019 16th IEEE Inter-
national Conference on Advanced Video and Signal
Based Surveillance (AVSS), pages 1–8.
Lohit, S., Wang, Q., and Turaga, P. K. (2019). Temporal
transformer networks: Joint learning of invariant and
discriminative time warping. 2019 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 12418–12427.
Maghoumi, M. and LaViola, J. J. (2018). Deepgru: Deep
gesture recognition utility. In ISVC.
Moon, G., Chang, J., and Lee, K. M. (2018). V2v-
posenet: Voxel-to-voxel prediction network for accu-
rate 3d hand and human pose estimation from a single
depth map. In The IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR).
Nguyen, X. S., Brun, L., Lézoray, O., and Bougleux, S.
(2019). A neural network based on spd manifold
learning for skeleton-based hand gesture recognition.
2019 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), pages 12028–12037.
Ohn-Bar, E. and Trivedi, M. M. (2013). Joint angles sim-
ilarities and hog2 for action recognition. 2013 IEEE
Conference on Computer Vision and Pattern Recogni-
tion Workshops.
Ohn-Bar, E. and Trivedi, M. M. (2014). Hand gesture recog-
nition in real time for automotive interfaces: A multi-
modal vision-based approach and evaluations. IEEE
Transactions on Intelligent Transportation Systems,
15:2368–2377.
Oreifej, O. and Liu, Z. (2013). Hon4d: Histogram of ori-
ented 4d normals for activity recognition from depth
sequences. 2013 IEEE Conference on Computer Vi-
sion and Pattern Recognition, pages 716–723.
Rahmani, H. and Mian, A. S. (2016). 3d action recogni-
tion from novel viewpoints. 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 1506–1515.
Ramirez-Amaro, K., Beetz, M., and Cheng, G. (2017).
Transferring skills to humanoid robots by extracting
semantic representations from observations of human
activities. Artif. Intell., 247:95–118.
Rastgoo, R., Kiani, K., and Escalera, S. (2020). Hand sign
language recognition using multi-view hand skeleton.
Shi, L., Zhang, Y., Cheng, J., and Lu, H. (2018). Non-
local graph convolutional networks for skeleton-based
action recognition.
Si, C., Chen, W., Wang, W., Wang, L., and Tan, T. (2019).
An attention enhanced graph convolutional lstm net-
work for skeleton-based action recognition. In CVPR.
Smedt, Q. D., Wannous, H., and Vandeborre, J.-P. (2016).
Skeleton-based dynamic hand gesture recognition.
2016 IEEE Conference on Computer Vision and Pat-
tern Recognition Workshops (CVPRW), pages 1206–
1214.
Sridhar, S., Feit, A. M., Theobalt, C., and Oulasvirta, A.
(2015). Investigating the dexterity of multi-finger in-
put for mid-air text entry. In CHI ’15.
Surie, D., Pederson, T., Lagriffoul, F., Janlert, L.-E., and
Sjölie, D. (2007). Activity recognition using an egocentric perspective of everyday objects. In UIC.
Tang, Y., Tian, Y., Lu, J., Li, P., and Zhou, J. (2018).
Deep progressive reinforcement learning for skeleton-
based action recognition. 2018 IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
5323–5332.
Tekin, B., Bogo, F., and Pollefeys, M. (2019). H+o: Unified
egocentric recognition of 3d hand-object poses and in-
teractions. 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 4506–
4515.
Vemulapalli, R., Arrate, F., and Chellappa, R. (2014). Hu-
man action recognition by representing 3d skeletons
as points in a lie group. 2014 IEEE Conference on
Computer Vision and Pattern Recognition, pages 588–
595.
Wang, H. and Wang, L. (2017). Modeling temporal dynam-
ics and spatial configurations of actions using two-
stream recurrent neural networks. 2017 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 3633–3642.
Wang, L., Huynh, D. Q., and Koniusz, P. (2019). A compar-
ative review of recent kinect-based action recognition
algorithms. IEEE Transactions on Image Processing,
29:15–28.
Yan, S., Xiong, Y., and Lin, D. (2018). Spatial temporal
graph convolutional networks for skeleton-based ac-
tion recognition. In AAAI.
Ying, X. (2019). An overview of overfitting and its solu-
tions.
Yuan, S., Garcia-Hernando, G., Stenger, B., Moon, G.,
Yong Chang, J., Mu Lee, K., Molchanov, P., Kautz,
J., Honari, S., Ge, L., Yuan, J., Chen, X., Wang, G.,
Yang, F., Akiyama, K., Wu, Y., Wan, Q., Madadi, M.,
Escalera, S., Li, S., Lee, D., Oikonomidis, I., Argy-
ros, A., and Kim, T.-K. (2018). Depth-based 3d hand
pose estimation: From current achievements to future
goals. In The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR).
Zanfir, M., Leordeanu, M., and Sminchisescu, C. (2013).
The moving pose: An efficient 3d kinematics descrip-
tor for low-latency action recognition and detection.
2013 IEEE International Conference on Computer Vi-
sion, pages 2752–2759.
Zhang, C., Yang, X., and Tian, Y. (2013). Histogram of
3d facets: A characteristic descriptor for hand gesture
recognition. 2013 10th IEEE International Confer-
ence and Workshops on Automatic Face and Gesture
Recognition (FG), pages 1–8.
Zhang, X., Qin, S., Xu, Y., and Xu, H. (2019). Quaternion
product units for deep learning on 3d rotation groups.
ArXiv, abs/1912.07791.
Zhang, X., Wang, Y., Gou, M., Sznaier, M., and Camps, O.
(2016). Efficient temporal sequence comparison and
classification using gram matrix embeddings on a rie-
mannian manifold. 2016 IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).