Using a 1D Pose-Descriptor on the Finger-Level to Reduce the

Dimensions in Hand Posture Estimation

Amin Dadgar and Guido Brunnett

Computer Science, Chemnitz University of Technology, Straße der Nationen 62, 09111, Chemnitz, Germany

Keywords:

One-D Finger Pose-Descriptor, Distance-Based Descriptor, Anatomy-Based Dimensionality Reduction,

Temporal a-Priori, Hand Posture Estimation, Single RGB Camera, Virtual Hand Models, Computer Vision.

Abstract:

We claim there is a simple measure that characterizes all postures of every finger of the human hand, each with a single and unique value. To that end, we show that the sum of distances of a finger's (movable) joints/nodes (or of the finger's tip) to a locally fixed reference point on that hand (e.g., the wrist joint) yields a unique value for each finger's posture. We support our hypothesis by presenting numerical justification based on the kinematic

skeleton of a human hand for four ﬁngers and by providing evidence on two virtual hand models (which

closely resemble the structure of human hands) for thumbs. The employment of this descriptor reduces the

dimensionality of the ﬁnger’s space from 16 to 5 (e.g., one degree of freedom for each ﬁnger). To demonstrate

the advantages of employing this measure for ﬁnger pose estimation, we utilize it as a temporal a-priori in the

analysis-by-synthesis framework to constrain the posture space in searching and estimating the optimum pose

of ﬁngers more efﬁciently. In a set of experiments, we show the beneﬁts of employing this descriptor in time

complexity, latency, and accuracy of the pose estimation of our virtual hand.

1 INTRODUCTION

Three-D hand pose estimation systems aim at detect-

ing the joint conﬁguration of human hands in 3D

space. These systems are essential requirements for

disciplines such as human behavior understanding,

human-computer interaction, and augmented reality.

However, the high degrees of freedom of ﬁngers (16

out of total 28 DoF of hands) is cumbersome for a

fast and/or accurate performance. Therefore, it is ad-

vantageous to discover the feasibility of reducing this

high dimensionality by exploiting the hands’ inherent

kinematic/anatomic properties.

The main idea of our work is to investigate

whether a ﬁnger pose could be uniquely described

as a distance between the keypoints on ﬁngers and a

locally ﬁxed reference point on the hands (e.g., the

wrist joint, palm center, etc.). Such a relation would

simplify the representation of the ﬁngers’ postures to

merely ﬁve numbers and thus drastically reduce the

dimensionality of the problem.

Similar approaches have been suggested in two-

D (Liao et al., 2012) though the authors failed to

generalize them to three-D. There were endeavors on

three-D (Shimizume et al., 2019) but their generalizations to all postures and orientations remained challenging. Thus, in those works, simplifying the problem by modeling the camera-hand distance, assuming outstretched fingers, and strictly fixing the hand orientation seems to be the expected trend.

In our investigation, by employing two artiﬁcial

hand models, we were able to observe two different

yet highly related pose-descriptors (Eq1, 2). First,

the distance of the ﬁngers or thumb’s tip to a refer-

ence point (wrist) is always a unique value for differ-

ent poses, Fig1-middle. Second, the sum of distances

of ﬁngers/thumb (movable) joints/nodes to that refer-

ence point is also a unique value, Fig1-right.

After validating the uniqueness of the pose-

descriptors for both models, we extended our investi-

gation to justify our ﬁndings based on the kinematic

relations of the human hand. In Section 3, we numerically justify that, given a range of allowable rotation for the joints, the uniqueness of these distances is maintained if one merely selects the location of the reference point carefully. We label the suitable spot of the reference point in Fig3 as Ref.

In brief, we require the following components to compute and assess the uniqueness of our pose-descriptors: A) Positions of the moving nodes of the fingers (e.g., two inter-phalangeal joints and the tip) and of the thumb (i.e., two phalangeal joints and the tip), as seen in Fig1-top. B) An appropriate reference point that always remains locally static with respect to the fingers (e.g., the wrist joint). C) The plausible rotation range of each movable node. This descriptor is independent of the camera distance to the hands, the hand orientation (or camera view-point), and the fingers' pose.

Dadgar, A. and Brunnett, G.
Using a 1D Pose-Descriptor on the Finger-Level to Reduce the Dimensions in Hand Posture Estimation.
DOI: 10.5220/0011781000003411
In Proceedings of the 12th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2023), pages 501-508
ISBN: 978-989-758-626-2; ISSN: 2184-4313
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

We employ this pose-descriptor to reduce the high

degrees of freedom of ﬁngers to ﬁve (one-D for each

ﬁnger). Additionally, we will incorporate this as a-

priori in ﬁve one-D temporal models (one model for

each finger) and achieve a real-time estimation of the finger poses of a virtual hand in the costly analysis-by-synthesis framework on CPUs.

2 LITERATURE REVIEW

For accurate and real-time estimation of the hand’s

three-D postures, researchers consider a wide range

of approaches (Zhang et al., 2020; Zhao et al., 2013).

To review the previous works regarding global relationships between the fingers and a point on the hands,

however, we focus on a speciﬁc type of posture es-

timation that detects ﬁngertips position. As one of

the earliest works in ﬁnding some global features on

ﬁngers, (Hardenberg, 2001) demonstrated a circular-

diametric relationship between the ﬁngers on two-D.

Then he tried to ﬁnd the ﬁngertip’s position based on

those relationships.

There are several distance-based approaches to

detecting ﬁngertips in two-D. The work by (Dung and

Mizukawa, 2010) suggested a distance-based method

in connected component labeling (Paralic, 2012) fash-

ion to extract the two-D ﬁngertips, hand region, and

palm center on images. However, their approach

works only if the ﬁngers are wide open. A more

advanced strategy proposed by (Liao et al., 2012)

addresses more challenging hand postures such as

closed-finger poses. They employed a distance transformation to filter out the fingers and retain merely the palm area. However, they used a simplified two-D

hand model with strongly local geometric constraints.

These approaches have two main drawbacks that

are intrinsic to their two-D characteristics. First, they

rely on local properties that are assumed to remain unchanged. However, these two-D properties are inherently at odds with the innate three-D global invariability (unless the hand faces forward, with a specified distance to the camera and a fixed orientation). Second, their extensions to other scenarios and applications are not straightforward.

To alleviate the problems above, several three-D distance-based approaches were proposed. For instance, amid the depth sensors' era, (Raheja et al., 2011) proposed a method for detecting fingertips and palm centers using KINECT. However, they assumed that fingertips are closer to the camera than the other parts of the hand, which imposes strong constraints on the orientation of hands.

Figure 1: Left: CMC carpal-metacarpal joint, MCP metacarpal-phalangeal joint, PIP proximal inter-phalangeal joint, DIP distal inter-phalangeal joint, IP inter-phalangeal joint. Middle: the distance of a finger's tip to a local reference point (e.g., the wrist joint). Right: the sum of the distances of the 3 movable finger nodes to that reference point. Both of these values give a unique value for each finger posture.

By incorporating monocular sensors and propos-

ing a novel ﬁnger constraint, (Yamamoto et al., 2012;

Shimizume et al., 2019) estimated hand postures us-

ing detected ﬁngertip positions with inverse kinemat-

ics (IK). However, there is a major issue in their ap-

proach. In IK, there are relations from ﬁngertips to

the joints’ conﬁguration (not the reverse). Therefore,

in practice, one can not meaningfully reduce the high

complexity (or DoF) of ﬁngers. For example, an efﬁ-

cient temporal model can not be introduced on the IK.

Thus the system has to estimate hand postures from

ﬁngertip positions by solving IK for a high DoF prob-

lem. That novel constraint was employed to identify

the touch state of the ﬁngers’ tips. However, the ex-

tension of the approach for every posture of the hand

seems infeasible.

To the best of our knowledge, we propose a novel

distance-based measure to describe the poses of each

ﬁnger. Our descriptor is in three-D and is view-point

and camera-distance independent. Therefore, it can

assist one in designing ﬁnger pose estimation systems

with merely ﬁve DoF.

3 METHODOLOGY

Our pose-descriptor for each finger is a one-dimensional number that characterizes the postures of that finger. To compute this descriptor we require the distance of fingertips (and movable finger joints) to a locally fixed point on the hands (e.g., the wrist joint). In this section, we demonstrate how we calculate this descriptor and justify its uniqueness within plausible ranges of finger kinematics (degrees). Additionally, we show how to incorporate it in a temporal model as a-priori and design a search engine to estimate the poses of a virtual hand in real-time.

Figure 2: The sorted pose-descriptors Type 1 (Eq1) and Type 2 (Eq2), for the middle finger and thumb. The four fingers have the same number of DoF and a similar descriptor pattern. We illustrate the Mid-Res database consisting of ≈1400 poses for the finger and 3600 poses for the thumb.

Calculation of a Novel Pose-Descriptor. The basis

of our calculations is a data structure that comprises

the positions of all ﬁnger joints and tips. Considering

21 joints in hands, we define 7 limbs Lt, Rn, Md, Id, Th, Wr, and Fr which stand for little, ring, middle,

index, thumb, wrist, and forearm, respectively. Ad-

ditionally, each ﬁnger has a tip t node and u, m, l, p

joints which stand for upper, middle, lower, and palm

(note the thumb has no middle joint).

By choosing the wrist as the Ref point, we calculate the fingertips' distances to the wrist as follows and call the result the pose-descriptor Type 1:

$$\mathrm{Descriptor}_{1}(Fin_i) = D\big(\mathrm{Tip}(Fin_i),\, Wr\big), \quad \text{where } Fin_i \in \{Lt, Rn, Md, Id, Th\} \tag{1}$$

We also define the Type 2 descriptor as the sum of the movable nodes' distances to that Ref point:

$$\mathrm{Descriptor}_{2}(Fin_i) = \sum_{Node \in \{t,u,m\}} D\big(Node(Fin_i),\, Wr\big), \quad \text{where } Fin_i \in \{Lt, Rn, Md, Id\}$$

$$\mathrm{Descriptor}_{2}(Th) = \sum_{Node \in \{t,u,l\}} D\big(Node(Th),\, Wr\big) \tag{2}$$

In this equation, all fingers have the joint indices t, u, m, and the thumb has the joint indices t, u, l.

In our experiments, we investigate the advantages

of the second type of this pose-descriptor. However,

as the following paragraphs will clarify, both descrip-

tors are unique. Thus in some applications, where the

position of other joints is not known, the ﬁrst type

would work just as ﬁne.
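As an illustration of Eq1 and Eq2, the following minimal Python sketch computes both descriptor types from 3D node positions. The dictionary layout, the function names, and the toy coordinates are hypothetical and serve only this example.

```python
import math

# Hypothetical layout: hand[finger][node] is a 3D position (x, y, z).
# Movable nodes per Eq2: t, u, m for the four fingers and t, u, l for the thumb.
MOVABLE_NODES = {"Lt": "tum", "Rn": "tum", "Md": "tum", "Id": "tum", "Th": "tul"}


def descriptor_type1(hand, wrist, finger):
    """Eq1: distance from the tip node 't' of a finger to the wrist."""
    return math.dist(hand[finger]["t"], wrist)


def descriptor_type2(hand, wrist, finger):
    """Eq2: sum of the distances of the movable nodes to the wrist."""
    return sum(math.dist(hand[finger][node], wrist) for node in MOVABLE_NODES[finger])


# Toy example: a straight middle finger along the z-axis (arbitrary units).
wrist = (0.0, 0.0, 0.0)
hand = {"Md": {"m": (0.0, 0.0, 3.0), "u": (0.0, 0.0, 4.0), "t": (0.0, 0.0, 5.0)}}
print(descriptor_type1(hand, wrist, "Md"), descriptor_type2(hand, wrist, "Md"))
```

When only the fingertip position is known, `descriptor_type1` alone suffices, which mirrors the remark that the first type works in applications where the other joints are not available.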

Pose-Descriptor Uniqueness: Justification & Evidence. To employ this metric as a pose-descriptor, we justify that it has a unique value for most (or ideally all) poses of each finger over an excessively large set of finger postures (e.g., a high-resolution pose database). To simplify the investigation, as seen in Fig3, we begin with the hand kinematics for pose-descriptor Type 2. Also, we initially consider the distance of the fingers' moving nodes to the palm joint of each finger, PalmJ (not the wrist joint), to eliminate the calculations on the Y-axis (using Eq3). Next, we set the ranges for the finger joints' angles. For the four angles of the fingers, $\theta_1$, $\theta_2$, $\theta_3$, and $\theta_4$, we observe the ranges of $[0^\circ, 0^\circ]$ (e.g., $L_1$ is a fixed part), $[0^\circ, 90^\circ]$, $[0^\circ, 120^\circ]$, and $[0^\circ, 45^\circ]$, respectively.

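$$\begin{aligned}
Z &= Z_m + Z_u + Z_t = (z_1 + z_2) + (z_1 + z_2 + z_3) + (z_1 + z_2 + z_3 + z_4) = 3z_1 + 3z_2 + 2z_3 + z_4 \\
  &= 3L_1\cos(\theta_1) + 3L_2\cos(\theta_2) + 2L_3\cos(\theta_3) + L_4\cos(\theta_4) \\
X &= X_m + X_u + X_t = (x_1 + x_2) + (x_1 + x_2 + x_3) + (x_1 + x_2 + x_3 + x_4) = 3x_1 + 3x_2 + 2x_3 + x_4 \\
  &= 3L_1\sin(\theta_1) + 3L_2\sin(\theta_2) + 2L_3\sin(\theta_3) + L_4\sin(\theta_4)
\end{aligned} \tag{3}$$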

Then, considering the $\theta_i$s, we have a function on the hyperspace in which we can form a parametric line (with parameter t) that lies on the intersection of this function and an arbitrary plane. Now, for every value of t and $\theta_i$, if the derivative of this line is non-zero, we can conclude Eq3 is injective (see the appendix for the development of this derivative). To perform a thorough numerical justification, we also investigate the influence of the bones' lengths and calculate the derivative for different length values. To that end, we normalize the $L_i$s w.r.t. $L_2$ (the lowest 'moving' limb on each finger) and alter the values of the others ($L_i > 0$). Finally, we calculate this derivative for any two points on the high-resolution database with a 1-degree step for each joint (e.g., 486K poses).

In our investigations, the smallest value the derivative attains is mostly on the order of $10^{-9}$ and rarely $10^{-16}$. Considering the precision of Python as $10^{-28}$, we can conclude these smallest values are non-zero

(even with $L_3 = L_4 = 1$, only if $L_1 \neq L_2$). Thus, Eq3 is unique for all poses of fingers in our excessively high-resolution database, and the fixed part's length has the most significant role in the uniqueness property of the descriptor. One can select PalmJ at a spot where the condition $L_1 \neq L_2$ is satisfied.
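To make the structure of Eq3 concrete, the sketch below evaluates its Z and X components over a coarse grid of joint angles. The coefficients 3, 3, 2, 1 follow Eq3; the limb lengths, the 15-degree step (chosen only to keep the grid small), and all function names are assumptions for illustration, and the sketch does not reproduce the derivative test from the appendix.

```python
import math
from itertools import product

# Coefficients of Eq3: limbs 1 and 2 lie on the chain to all three movable
# nodes, limb 3 on the chain to two of them, limb 4 only to the tip.
WEIGHTS = (3, 3, 2, 1)
# Angle ranges in degrees as stated in the text (the first limb is fixed).
RANGES = ((0, 0), (0, 90), (0, 120), (0, 45))


def eq3_components(thetas_deg, lengths):
    """Return (Z, X) of Eq3 for four limb angles (degrees) and limb lengths."""
    rad = [math.radians(t) for t in thetas_deg]
    z = sum(w * l * math.cos(t) for w, l, t in zip(WEIGHTS, lengths, rad))
    x = sum(w * l * math.sin(t) for w, l, t in zip(WEIGHTS, lengths, rad))
    return z, x


def pose_grid(step_deg):
    """Enumerate all angle tuples on a regular grid over RANGES."""
    axes = [range(lo, hi + 1, step_deg) for lo, hi in RANGES]
    return list(product(*axes))


lengths = (0.5, 1.0, 0.8, 0.6)  # illustrative values only, L1 != L2
poses = pose_grid(15)
values = [eq3_components(p, lengths) for p in poses]
print(len(poses), values[0])
```

With a 1-degree step the same grid grows to roughly the 486K poses mentioned above (90 x 120 x 45 = 486,000 angle combinations).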

The next step is to incorporate the Y-axis distance into the computation and to show that the pose-descriptor stays unique if the reference point is not on the z-axis of the fingers. As demonstrated in Fig3-right, $Y_m = Z_m + ConstY$, $Y_u = Z_u + ConstY$, and $Y_t = Z_t + ConstY$. Therefore, the total distance on the Y-axis is $Y = Z_m + Z_u + Z_t + 3 \times ConstY$. However, one term of this equation is constant, and the Z terms are the previously calculated Z-distances. Therefore, we conclude that our justification is extendable to an arbitrary RefJ on the Y-axis. The same conclusion is derivable for the Type 1 descriptor by conducting a similar analysis because, on a closer look at Eq3, Type 1 is a special case of Type 2 where the coefficients of $z_i$ and $x_i$ ($i \in \{1, 2, 3, 4\}$) equal one.

Figure 3: The kinematics formulation to calculate pose-descriptor Type 2: The left image shows the side view of a finger, to find the Z and X values. The middle image demonstrates the case when Y = 0 and $D_{PalmJoint} + Z$. The right image illustrates the incorporation of the Y-distance ($D_{PalmJoint} + Z$) into the formula for pose-descriptor Type 2 and how no new variable enters the equation.

Figure 4: The actual and normalized pose-descriptor values of four fingers. The thumb has a different number of poses; thus its visualization with the other fingers is not possible.

Finally, we extend the uniqueness assessment to the thumbs. Thumbs, similar to fingers, have three moving nodes. However, unlike other fingers, there is no inherent fixed part in the thumbs' bone kinematics. Nevertheless, by defining the RefJ somewhere below its PalmJ, we can assume a fixed limb for it. Therefore, a hierarchical structure similar to that of the fingers can be imitated for thumbs. However, a more significant discrepancy here is that fingers have three degrees of freedom while thumbs have four, with entirely distinct ranges of degrees ([−20, 90], [−10, 10], [−30, 10], [−20, 20]) and a complex motion structure. Arguing a similar kinematic formulation for thumbs (in the appendix) exceeds the limits of this paper.

Figure 5: Various poses of all five fingers can be related using five one-dimensional temporal models. The thumb has a different number of poses; thus we could not visualize its sorted pose-descriptor with the other fingers.

Therefore, to justify the uniqueness for thumbs, we employ two synthetic hand models with appropriate rigging and skinning. After using the defined pose representation and setting the virtual camera parameters, we compute the joints' three-D positions. Then, we select a $Low_{Res}$, $Mid_{Res}$, and $High_{Res}$ database for each joint (e.g., less than 10°-step, more than < 10°-step, 4°-step, respectively). These resolutions lead to a database of 300+, 3400+, and > 32500 postures for the thumb on one of the hand models (for the second model, these numbers are slightly smaller as the degree ranges for that model are different). Then, we calculate the pose-descriptors Type 1 and Type 2, as explained in the previous subsection, and sort the values. Similar to the fingers, we witness unique descriptor values for all different resolutions of thumb postures for both types of descriptors and on both models (see Fig2 for Type 1 and for Type 2 on the first model).
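The sort-and-compare check described here can be sketched in a few lines of Python. The function names and the tolerance `tol` are assumptions for illustration; the text does not state which threshold was used to call two descriptor values distinct.

```python
import math


def min_gap(values):
    """Smallest difference between neighbouring values after sorting."""
    ordered = sorted(values)
    return min((b - a for a, b in zip(ordered, ordered[1:])), default=math.inf)


def is_unique(values, tol=1e-9):
    """True if no two descriptor values are closer than tol (assumed threshold)."""
    return min_gap(values) > tol


# Toy descriptor values for three poses of one finger (arbitrary units).
print(is_unique([3.0, 1.0, 2.5]), min_gap([3.0, 1.0, 2.5]))
```

Sorting first means only adjacent values need to be compared, so the check costs O(n log n) rather than comparing every pair of poses.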

In real-life applications (and our experiments), the number of poses involved is usually much smaller than in these databases. Therefore, those large posture sets provide a high level of confidence about the uniqueness of our proposed descriptor.

One-D Temporal Model. Initially, we sought to il-

lustrate the advantages of the pose-descriptor in the

mere estimation of the ﬁnger poses (without consider-

ing the time complexity). Conducting a set of experi-

ments, we achieved accurate pose estimations for one

ﬁnger where the gradient descent utterly failed. How-

ever, the time complexity of the approach was high,

thus we altered the roadmap and considered real-time

performance as a crucial criterion in our evaluations.

To that, we utilize our descriptor in a one-D tem-

poral Markovian model. As seen in Fig4, the model

uses the previous and current states to estimate the

next state. The ﬁgure visualizes all four ﬁngers’ one-

dimensional pose-descriptor. Thumb has a different

number of poses thus, simultaneous visualization of

its descriptor in that ﬁgure is not possible.

Hands, as high-DoF objects, require high-dimensional (and complex) temporal models. However, with our pose-descriptor we can employ five 1D models to improve the time complexity considerably (Fig5). If n is the number of states and h is the number of searched fingers, the search algorithm will have $O(n^h)$ complexity. In a mid-resolution pose database, without the temporal model and considering merely the generate-and-test search strategy (GAT), n will be 1500 (for four fingers) or 3500 (for thumbs). However, if we employ the temporal model, n will be 3, which leads to considerable time improvements.
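The effect of the temporal model on the size of the search space can be worked out directly. The sketch below multiplies the per-finger state counts quoted above; the function name is hypothetical, and the numbers are the approximate mid-resolution figures from the text.

```python
import math


def search_space(states_per_finger):
    """Number of joint hypotheses: product of per-finger state counts (n^h if equal)."""
    return math.prod(states_per_finger)


# Plain generate-and-test on the mid-resolution database:
# about 1500 states for each of the four fingers and 3500 for the thumb.
without_prior = search_space([1500, 1500, 1500, 1500, 3500])
# With the one-D temporal model each finger has only 3 candidate states.
with_prior = search_space([3] * 5)
print(without_prior, with_prior)
```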


Figure 6: $GAT_{Dir}^{2st}$ (left) and $GAT_{Dir}^{1st}$ (right) to constrain the temporal model of five fingers based on the previous movement direction (opening/closing) of the fingers' motion.

Further Improvement of Search Time. To find the optimal pose, our approach is to compare the contours of the projected 3D model (projected into the 2D plane, S) to the contour of the input image (I) using the Chamfer Distance (CD). However, CD is a costly procedure because, for each point on I, CD computes its distance to all points in S to find the minimum (Eq4):

$$CD = \frac{1}{|I|} \sum_{i \in (0, I)} \min_{s \in (0, S)} \big[ d(I_i, S_s) \big] \tag{4}$$

However, our S and I are the 'sorted' (ordered) contour points' coordinates, so we can modify and speed up the Chamfer computation (Eq5) by solely calculating the distance within a neighborhood (nn) of the previously found point. That reduces the time complexity from $O(n^2)$ to $O(h * n)$. For the first point on I, we consider all points of S (initialization). In our experiments, nn is usually 5.

$$CD = \frac{1}{|I|} \sum_{i \in (0, I)} \min_{s \in (s - nn,\, s + nn)} \big[ d(I_i, S_s) \big] \tag{5}$$
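A minimal Python sketch of both variants is given below, assuming contours are ordered lists of 2D points. The function names are hypothetical, and the wrap-around of indices (treating S as a closed contour) is an assumption that the text does not state.

```python
import math


def chamfer_full(I, S):
    """Eq4: mean over the points of I of the distance to the nearest point of S."""
    return sum(min(math.dist(p, q) for q in S) for p in I) / len(I)


def chamfer_windowed(I, S, nn=5):
    """Eq5 sketch: full search for the first point, then only nn indices
    on either side of the previously matched index of S."""
    total, last = 0.0, None
    for p in I:
        if last is None:
            candidates = range(len(S))  # initialization: all points of S
        else:
            # Assumption: S is a closed contour, so indices wrap around.
            candidates = [(last + k) % len(S) for k in range(-nn, nn + 1)]
        last = min(candidates, key=lambda j: math.dist(p, S[j]))
        total += math.dist(p, S[last])
    return total / len(I)


# Toy contours: a unit circle and the same circle scaled by 1.1.
S = [(math.cos(2 * math.pi * k / 40), math.sin(2 * math.pi * k / 40)) for k in range(40)]
I = [(1.1 * x, 1.1 * y) for x, y in S]
print(chamfer_full(I, S), chamfer_windowed(I, S))
```

Because the windowed version minimizes over a subset of S, its result can never be smaller than the full Chamfer distance; the two agree whenever the true nearest point stays within the window, as in the toy example.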

Real-Time Performance. Though these improvements considerably reduced the time complexity, the performance is not yet real-time. In the temporal model, the GAT algorithm has $O(n^h)$ complexity where, thanks to our one-D temporal model, n = 3. That is, at each instance of time ($S_t$), there are forward, backward, and self-transition states that each temporal model could undergo (Fig5). By considering five fingers, the overall complexity amounts to $n^h = 3^5 = 243$ poses to find the solution for one frame.

If we constrain them only to the left-to-right direction (as in Fig6-left, for example), at $S_t$, the transition cannot go backward. So coming from $S_{t-1}$, it can either remain on $S_t$ or go forward. That is, there are only two possible states each finger can undertake ($GAT_{Dir}^{2st}$), making the total complexity $n^h = 2^5 = 32$. A similar strategy exists for Fig6-right. If merely the forward transition is possible ($GAT_{Dir}^{1st}$), the number of poses the algorithm would search equals one: $n^h = 1^5 = 1$.
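The three search modes can be illustrated by enumerating the joint candidate states of the five one-D models. In this sketch a state is an index into a finger's sorted pose-descriptor list; the mode names, the index representation, and the clipping at the ends of the range are assumptions made for the example.

```python
from itertools import product

# Allowed index moves per finger for each search mode (hypothetical names).
MOVES = {
    "GAT": (-1, 0, 1),      # backward, self, and forward transitions
    "GAT_Dir_2st": (0, 1),  # stay or move forward
    "GAT_Dir_1st": (1,),    # forward only
}


def candidates(current, n_states, mode):
    """All joint next-state hypotheses for the given per-finger current states."""
    per_finger = [
        sorted({min(max(s + m, 0), n_states - 1) for m in MOVES[mode]})
        for s in current
    ]
    return list(product(*per_finger))


current = [10] * 5  # five fingers, each in the middle of a 21-pose path
for mode in MOVES:
    print(mode, len(candidates(current, 21, mode)))
```

At the end of a cycle the clipping removes the backward move, so even the unconstrained mode yields only two states per finger there.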

Three Motion Patterns. With these considerations, real-time estimation even on CPUs is feasible. Now, we need application-specific scenarios to allow the search, at each time step, to eliminate one or two states (out of three possibilities). To that end, we introduce three motion patterns we observed in the fingers/thumb, which assist us in designing our scenarios.

Figure 7: Three motion patterns of the little finger. We use them to design application-dependent scenarios. These patterns can also be very useful to analyze and model the time series of fingers' motion.

The four fingers have similar movement patterns, which differ from those of the thumb. As shown in Fig7 (using our descriptor for the little finger), there are three possible paths the four fingers can undertake to move between the open and closed states. First is the freestyle motion, denoted as $path_1$, in which all the finger's joints move together. Second is the hook-like motion ($path_2$), in which, during closing, the other joints first close until their limits and the lowest joint of the finger moves in the final phase. Third is the pinch-like motion ($path_3$), in which, during closing, the lowest joint closes first until its limit, and then the other joints close together.
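The three finger paths can be written down as joint-angle sequences. The following sketch is only an illustration: the function name, the linear interpolation, the number of steps, and the assignment of the limits 90, 120, and 45 degrees to the lowest, middle, and upper joints are all assumptions.

```python
def closing_path(kind, limits=(90, 120, 45), steps=10):
    """Return (lowest, middle, upper) joint angles in degrees from open to closed."""
    lo, mid, up = limits
    frac = [i / steps for i in range(steps + 1)]
    if kind == "freestyle":  # all joints move together
        return [(lo * f, mid * f, up * f) for f in frac]
    if kind == "hook":  # other joints close first, the lowest joint last
        first = [(0.0, mid * f, up * f) for f in frac]
        second = [(lo * f, float(mid), float(up)) for f in frac[1:]]
        return first + second
    if kind == "pinch":  # the lowest joint closes first, then the others
        first = [(lo * f, 0.0, 0.0) for f in frac]
        second = [(float(lo), mid * f, up * f) for f in frac[1:]]
        return first + second
    raise ValueError(f"unknown path kind: {kind}")


for kind in ("freestyle", "hook", "pinch"):
    path = closing_path(kind)
    print(kind, len(path), path[0], path[-1])
```

All three paths share the same open and closed end poses and differ only in the intermediate poses, which is what lets a scenario restrict the search to the poses of a single path.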

For the thumb, with upper, lower, and palm joints, $path_1$ is the freestyle, in which all the joints close together. For $path_2$ (stretching motion), the thumb initially approaches the index finger by moving the lower joint. Then, by moving the palm joint, the tip approaches the other fingers' parent joint. Next, by moving the upper joint, it closes the upper part, and finally, it ends the motion by returning the palm joint to its resting position. $path_3$ (wiping motion) starts by moving the lower joint to approach the index finger (similar to the second path). Then, it closes the upper joint while staying close to the index finger. Finally, it brings the lower joint to its resting position.

4 EXPERIMENT

We test the efﬁcacy of our descriptor in ﬁnding the op-

timal solution in three experiments on a virtual hand

as input. We use the path

to design application-

speciﬁc scenarios with considerations from Table 1.

Table 1: Eleven variables that could be considered in designing the scenarios that contain fingers' motion.

Exp Variables
1. Do the fingers move A/Synchronously?
2. Do they move orderly: Adjacent/Apart?
3. Is the order Right-to-Left?
4. If failed, do we Srch Adj Fings Only?
5. Is fingers' Motion Full-Cycle?
6. If not, do they Conserve their Direction?
7. Is there Hand-lvl Open-Closing?
8. Do we consider Previous Direction?
9. If yes, is it Initialized?
10. No. of Prtl-Cycled Finger's Dynamics? 1, 2, or 3?
11. No. of End-Cycled Fing's Dynamics? 1, 2, or 3?

We employ four metrics to evaluate the performance. First is the time complexity, indicated by fps. Second is the initial and total latency caused by the initialization or unexpected costs in the middle of the estimation. Third is the average 3D Joint Position error (JPose3D), calculated as the normalized sum of the estimated joints' distances from the ground truth. Last is the Accuracy (Acc). We attend to highly constrained cases. Therefore, we indicate the accuracy only if it is not 100%. The search algorithm knows the previous state of the fingers in the sequence.
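The two error metrics can be sketched as follows. The exact normalization of the joint position error and the definition of accuracy are not spelled out in the text, so this example assumes a mean over joints and the percentage of frames whose estimated state matches the ground truth; the function names are hypothetical.

```python
import math


def joint_position_error(estimated, ground_truth):
    """Mean Euclidean distance between corresponding 3D joints (assumed normalization)."""
    distances = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    return sum(distances) / len(ground_truth)


def accuracy(estimated_states, true_states):
    """Percentage of frames whose estimated state equals the true state (assumed definition)."""
    hits = sum(e == t for e, t in zip(estimated_states, true_states))
    return 100.0 * hits / len(true_states)


# Toy example: two joints, one of them displaced by 2 units along z.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(joint_position_error(est, gt), accuracy([1, 2, 3, 4], [1, 2, 0, 4]))
```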

To animate the fingers, we employ a hierarchical hand database similar to the one proposed by (Dadgar and Brunnett, 2018). They define their hierarchical database on various layers of complexity. That enables us to animate hand limbs individually with a specific emphasis on the layer of interest (e.g., fingers). Its finger layer (e.g., L) is further partitionable into five sublayers (one for each finger) with a modifiable step degree (e.g., resolution). These properties make the database a suitable choice to examine the uniqueness of our pose-descriptor at different resolutions, refine it with different paths, and consider various variables to design specific scenarios.

We create a sequence of postures for each finger based on the temporal evolution of each finger specified for every experiment using a Viterbi-like algorithm (Viterbi, 1967). This returns $S = \{S_n \mid n = 1, 2, 3, 4, 5\}$ sequences (where $1 \equiv Little$, $2 \equiv Ring$, $3 \equiv Middle$, $4 \equiv Index$, $5 \equiv Thumb$) of $Q = \{q_1, q_2, \ldots, q_m\}$ states, where $m \approx 2000$ is usual in this work. After selecting a specific global orientation, we retrieve the input sequences. Finally, OpenCV's contour extraction method (Bradski, 2000) is applied to the input frames to extract the contours when searching for the optimal posture. All experiments employ the previous direction for the search. Elaborated versions of the definitions and their evaluations are in the following subsections:

Experiment1. In this experiment, we consider all different digit combinations of fingers, including their transitions (see Fig8). The fingers start at the closed pose, and one by one (thus asynchronously), from the little finger, each of them opens and stays in the open pose (thus full-cycled on the finger level and having hand-level cycles) until all fingers open from right to left (therefore orderly or adjacent). After one-by-one closing, the next opening cycle starts from the next (e.g., ring) finger. For this scenario, considering the path where each finger has 21 possible poses from close to open, we have 2100 poses. For the first frame, we do not use the initialization (so the number of dynamics is $2^5$). Nevertheless, for the end poses, at open or close cycles, the re-initialization information is known (thus, the number of possible dynamics for each finger is 1).

Figure 8: A sample of inputs for experiment one.

Figure 9: A sample of inputs for experiment two.

Experiment2. For this experiment, the collective free

motion of ﬁngers from close-posture to open (and re-

verse) is under investigation (see Fig9). Here, other

ﬁngers do not have to wait at their end states for one

ﬁnger to reach its closed or opened states (thus there is

no hand-level cycle). Identical to the previous exper-

iment, the hand starts in the closed pose and not one-

by-one (yet still asynchronously) opens from the little

ﬁnger. Each ﬁnger opens till the end (thus full-cycled)

until all ﬁngers get opened from right to left (therefore

adjacent). Considering the 21 poses of the path, we have 1840 frames. Similarly, for the first frame we do not employ the initialization (thus the number of dynamics is $2^5$). For the end poses, at each cycle, the re-initialization is also known.

Experiment3. Here, we consider collective free mo-

tion (so there is no hand-level cycle) from close-

posture to open (and reverse). Analogous to the previ-

ous experiment, the hand starts in the closed pose and

asynchronously opens its ﬁngers from the little one.

However, unlike in two other experiments, each ﬁnger

does not fully open until the end (thus partial-cycled)

until all ﬁngers reach their end-pose from right to left

(therefore adjacent). For this scenario, considering

the path

, we have 2230 poses. We do not use the

initialization for the first frame. However, the number of dynamics for that initial frame is $3^5$ since the motion does not start at the beginning of the cycle. Here, for

the end cycles, at open or close postures, the initial-

ization information is available (thus, the number of

possible dynamics for each ﬁnger is 1).

Evaluation. Using our pose-descriptor in the one-D temporal model enables us to achieve a real-time estimation, as shown in Table 2. In all experiments, average output frame rates are at least 31 fps. These output frame rates are suitable to estimate the poses when the input videos have an fps of 25 or lower. Therefore, we can conclude our pose-descriptor provides an appropriate tool for real-time applications. The initialization of the first frame is the primary cause of the increase in the total latency ($L_{Ttl}$). The latency increases slightly during the rest of the ≈ 2000 frames. That is usually the result of cost variations in contour extraction and Chamfer comparison caused by the changing shape of the hand. However, because of the much higher average output frame rates (compared to the input fps), such increases in latency have marginal effects on the overall performance. The third experiment has a worse initial latency ($L_{Init}$) because its motions do not start at the full-cycled closed postures. At those partial-cycled poses, there are three (previous, current, and next) states for each finger to search ($3^5 = 243$). Whereas at full-cycled postures, there are solely two states (current and next). That leads to $2^5 = 32$ searches and results in experiments one and two having better latency.

Table 2: The results of our three experiments indicate that our pose-descriptor can assist one to estimate the poses of all five fingers in real-time. The latency of the search is also within tolerance if the frame rate of the input video is less than or equal to 31.

                 Descriptor Type 1       Descriptor Type 2
Exp  Input fps   fps    Int    Ttl       fps    Int    Ttl
1    25          32.5   62     142       32.5   50     156
2    25          32.4   52     173       31.1   44     302
3    24          31     440    515       32     294    359
1    32          32.5   632    1825      32.5   857    1791
2    31          32.4   237    971       31.1   1130   1734
3    31          31     2194   2202      32     1208   1474

In rows four, five, and six of the table, we evaluate these experiments again, this time with a higher input frame rate. As expected, the average output frame rates of the searches remain unchanged compared to the first set of experiments. However, the latency is different, and that affects the performance. Though $L_{Init}$ is worse, it is $L_{Ttl}$ that experiences the highest degradation and affects the performance by large margins. When we increase the frame rate of the input video, the search algorithm finds the correct answer for each frame as before. However, a slight variation in the shapes of the contours and, thus, in the estimation cost causes the search to stay behind the video's current frames, and $L_{Ttl}$ drastically worsens. The primary purpose of the experiments was to demonstrate the suitability of our descriptor for estimating the finger poses in real-time applications, even when merely CPU resources are available. The conclusion section elaborates on the JPose3D and Acc metrics we achieved during the experimentation.

5 CONCLUSION

We proposed a simple pose-descriptor that characterizes the postures at the finger level. We showed that this descriptor has unique values for different finger poses, reduces the fingers' DoF to 5, and eliminates the necessity of constraining the problem (as needed in related works). We incorporated this pose-descriptor into a temporal model and, with further modifications, could achieve real-time performance on CPUs. To

share more insights about the JPose3D and Acc, we

brieﬂy touch on other conducted experiments using

the GAT paradigm, with various image scales and ﬁn-

gers’ combinations.

To start systematically, we defined five categories of finger combinations. $Cat_1$ means that only the little finger is searched. $Cat_2$ means that only the little and ring fingers are the subjects of estimation. $Cat_3$ means we estimate the poses of the little, ring, and middle fingers. $Cat_4$ means we search the poses of the little, ring, middle, and index fingers. Finally, $Cat_5$ means we search all five fingers.

Beginning with the GAT search at 100% scale, we achieved Acc = 100% and JPose3D = 0 on $Cat_1$. As we proceeded to $Cat_5$, the results showed a slight decrease in accuracy: $Cat_2$ (Acc = 100%, JPose3D = 0), $Cat_3$ (Acc ≈ 100%, JPose3D = 0.0003), $Cat_4$ (Acc ≈ 96%, JPose3D = 0.001), $Cat_5$ (Acc ≈ 95%, JPose3D = 0.0008).

However, the low average fps of the search was a significant obstacle, since we were aiming for real-time results: $Cat_1$ ($Out_{fps}$ = 1.23), $Cat_2$ ($Out_{fps}$ = 0.41), $Cat_3$ ($Out_{fps}$ = 0.14), $Cat_4$ ($Out_{fps}$ = 0.045), and $Cat_5$ ($Out_{fps}$ = 0.015).

Continuing with the GAT search, we downscaled the input and searched images. The lowest scale that led to fast yet accurate (stable) results was the 10% scale. For example, on $Cat_1$ with that scale, we faced a slight decrease in accuracy, $Cat_1$ (Acc = 97%, JPose3D = 0.003). However, the gain in average fps was considerable ($Out_{fps}$ = 25). A similar pattern was observable in all categories, to the extent that for $Cat_5$ we achieved an accuracy of $Cat_5$ (Acc ≈ 80%, JPose3D = 0.02) and an average $Out_{fps}$ = 0.3. Though the accuracy was acceptable, the time complexity for $Cat_5$ was still far from real-time. However, the improved time complexity motivated us to modify the Chamfer distance (CD) computation.
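For orientation, the unmodified symmetric Chamfer distance between two contour point sets, and the effect of downscaling on the number of points to compare, can be sketched as follows. This is the textbook brute-force form, not the modified computation; the helper names, the square test contour, and the rounding-based `downscale` are assumptions made purely for illustration.

```python
import math


def directed_chamfer(src, dst):
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: average of the two directed terms."""
    return 0.5 * (directed_chamfer(a, b) + directed_chamfer(b, a))


def downscale(points, factor):
    """Scale contour coordinates and merge points that land on the same pixel."""
    return sorted({(round(x * factor), round(y * factor)) for x, y in points})


def square_contour(side, offset=(0, 0)):
    """Hypothetical test contour: border pixels of an axis-aligned square."""
    ox, oy = offset
    pts = set()
    for k in range(side):
        pts.update({(ox + k, oy), (ox + k, oy + side - 1),
                    (ox, oy + k), (ox + side - 1, oy + k)})
    return sorted(pts)


observed = square_contour(100)
hypothesis = square_contour(100, offset=(5, 0))  # same shape, shifted by 5 px

full_cost = chamfer_distance(observed, hypothesis)
small_cost = chamfer_distance(downscale(observed, 0.1), downscale(hypothesis, 0.1))

# Fewer contour points means fewer nearest-neighbour lookups per comparison.
print(len(observed), len(downscale(observed, 0.1)))
print(full_cost, small_cost)
```

The brute-force cost grows with the product of the two point counts, which is why shrinking the contours pays off in frame rate while the coarser pixel grid costs some accuracy.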

Proceeding to the $GAT^{Dir}_{1st}$ search, we once again considered various scales, coupled them with our modified Chamfer distance computation, and reached frame rates as high as $Cat_5$ ($Out_{fps}$ = 42) at 10% scale. However, without scaling down, the average frame rate was around 32 for that specific experiment (not much difference). Thus, we did not include downscaled images in our experiment section, to avoid an overload of information.


Applications. The applications of our pose-descriptor can branch in several directions. First, as a consequence of reducing the finger's DoF, one could build a new motion capture system with fewer sensors (e.g., markers, haptics), wires, and circuitry. Second, our descriptor assists in constructing a training set that is as diverse as possible (e.g., of images in machine learning) to let the nets generalize better. Finally, one can benefit from our one-dimensional descriptor for studying and modeling sign languages. We touched on this briefly and experienced the convenience of designing synthetic sign gestures with our descriptor.

ACKNOWLEDGMENTS

This project was funded by the Deutsche Forschungsgemeinschaft CRC 1410 / project ID 416228727.


APPENDIX: UNIQUENESS

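According to Fig. 3, we have $z_0 = L_0 \times \cos(\theta_0)$, $z_1 = L_1 \times \cos(\theta_1)$, $z_2 = L_2 \times \cos(\theta_2)$, $z_3 = L_3 \times \cos(\theta_3)$, $x_0 = L_0 \times \sin(\theta_0)$, $x_1 = L_1 \times \sin(\theta_1)$, $x_2 = L_2 \times \sin(\theta_2)$, $x_3 = L_3 \times \sin(\theta_3)$. Therefore, this descriptor type will be:

$$
\begin{aligned}
F(\theta_1, \theta_2, \theta_3) &= X^2 + Z^2 = (x_0 + x_1 + x_2 + x_3)^2 + (z_0 + z_1 + z_2 + z_3)^2 \\
&= L_0^2\sin^2(\theta_0) + L_1^2\sin^2(\theta_1) + L_2^2\sin^2(\theta_2) + L_3^2\sin^2(\theta_3) \\
&\quad + L_0^2\cos^2(\theta_0) + L_1^2\cos^2(\theta_1) + L_2^2\cos^2(\theta_2) + L_3^2\cos^2(\theta_3) \\
&\quad + 2 \times [\, L_0L_1\sin(\theta_0)\sin(\theta_1) + L_0L_2\sin(\theta_0)\sin(\theta_2) + L_0L_3\sin(\theta_0)\sin(\theta_3) \\
&\qquad + L_1L_2\sin(\theta_1)\sin(\theta_2) + L_1L_3\sin(\theta_1)\sin(\theta_3) + L_2L_3\sin(\theta_2)\sin(\theta_3) \\
&\qquad + L_0L_1\cos(\theta_0)\cos(\theta_1) + L_0L_2\cos(\theta_0)\cos(\theta_2) + L_0L_3\cos(\theta_0)\cos(\theta_3) \\
&\qquad + L_1L_2\cos(\theta_1)\cos(\theta_2) + L_1L_3\cos(\theta_1)\cos(\theta_3) + L_2L_3\cos(\theta_2)\cos(\theta_3) \,]
\end{aligned}
$$

Using $\cos(x - y) = \cos x \cos y + \sin x \sin y$ and considering that $\theta_0$ is always zero, we can simplify the kinematic function $F(\theta_1, \theta_2, \theta_3)$ as follows:

$$
\begin{aligned}
F &= L_0^2 + L_1^2 + L_2^2 + L_3^2 + 2 \times [\, L_0L_1\cos(\theta_1) + L_0L_2\cos(\theta_2) + L_0L_3\cos(\theta_3) \\
&\qquad + L_1L_2\cos(\theta_2 - \theta_1) + L_1L_3\cos(\theta_3 - \theta_1) + L_2L_3\cos(\theta_3 - \theta_2) \,]
\end{aligned}
$$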
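The simplification can be checked numerically. The sketch below assumes a planar chain of four segments $L_0, \dots, L_3$ with absolute angles $\theta_0, \dots, \theta_3$ and $\theta_0 = 0$; the segment lengths and test angles are hypothetical values chosen only to exercise the identity, and the function names are illustrative.

```python
import math
from itertools import combinations


def descriptor_from_coordinates(lengths, thetas):
    """X^2 + Z^2 with X = sum(L_i * sin(theta_i)) and Z = sum(L_i * cos(theta_i))."""
    x = sum(L * math.sin(t) for L, t in zip(lengths, thetas))
    z = sum(L * math.cos(t) for L, t in zip(lengths, thetas))
    return x * x + z * z


def descriptor_closed_form(lengths, thetas):
    """Sum of L_i^2 plus 2 * L_i * L_j * cos(theta_j - theta_i) over all pairs i < j."""
    squares = sum(L * L for L in lengths)
    cross = sum(
        lengths[i] * lengths[j] * math.cos(thetas[j] - thetas[i])
        for i, j in combinations(range(len(lengths)), 2)
    )
    return squares + 2.0 * cross


# Hypothetical segment lengths (cm) and absolute angles; theta_0 is fixed at zero.
lengths = (8.0, 4.5, 2.5, 2.0)
thetas = tuple(math.radians(a) for a in (0.0, 30.0, 100.0, 130.0))

direct = descriptor_from_coordinates(lengths, thetas)
closed = descriptor_closed_form(lengths, thetas)
print(direct, closed)  # the two values agree up to floating-point rounding
```

With $\theta_0 = 0$, the pairs involving index 0 reduce to the single-angle terms $\cos(\theta_j)$, and the remaining pairs give the angle-difference terms.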

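A Line Connecting Two Points. To show that $F(\theta)$ is unique/injective in a given interval (e.g., $(\theta_1^{(1)}, \theta_2^{(1)}, \theta_3^{(1)}) \neq (\theta_1^{(2)}, \theta_2^{(2)}, \theta_3^{(2)})$ with $F(\theta_1^{(1)}, \theta_2^{(1)}, \theta_3^{(1)}) \neq F(\theta_1^{(2)}, \theta_2^{(2)}, \theta_3^{(2)})$), we connect these two points, distinguished by the superscripts (1) and (2), with a line and represent it in vector form:

$$
l(t) = \left\langle \theta_1^{(1)}, \theta_2^{(1)}, \theta_3^{(1)} \right\rangle + t \left\langle \theta_1^{(2)} - \theta_1^{(1)},\; \theta_2^{(2)} - \theta_2^{(1)},\; \theta_3^{(2)} - \theta_3^{(1)} \right\rangle, \quad t \in \Re
$$

In terms of $F(\theta)$, this line has the following form:

$$
F(t) = F\left( \theta_1^{(1)} + (\theta_1^{(2)} - \theta_1^{(1)})t,\; \theta_2^{(1)} + (\theta_2^{(2)} - \theta_2^{(1)})t,\; \theta_3^{(1)} + (\theta_3^{(2)} - \theta_3^{(1)})t \right)
$$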

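With the following components:

$$
\begin{aligned}
F_1(t) = {} & L_0^2 + L_1^2 + 2 \times L_0L_1\cos(\theta_1^{(1)} + (\theta_1^{(2)} - \theta_1^{(1)})t) + {} \\
& 2 \times L_1L_2\cos(\theta_2^{(1)} - \theta_1^{(1)} + (\theta_2^{(2)} - \theta_2^{(1)})t - (\theta_1^{(2)} - \theta_1^{(1)})t) + {} \\
& 2 \times L_1L_3\cos(\theta_3^{(1)} - \theta_1^{(1)} + (\theta_3^{(2)} - \theta_3^{(1)})t - (\theta_1^{(2)} - \theta_1^{(1)})t) \\
F_2(t) = {} & L_0^2 + L_2^2 + 2 \times L_0L_2\cos(\theta_2^{(1)} + (\theta_2^{(2)} - \theta_2^{(1)})t) + {} \\
& 2 \times L_1L_2\cos(\theta_2^{(1)} - \theta_1^{(1)} + (\theta_2^{(2)} - \theta_2^{(1)})t - (\theta_1^{(2)} - \theta_1^{(1)})t) + {} \\
& 2 \times L_2L_3\cos(\theta_3^{(1)} - \theta_2^{(1)} + (\theta_3^{(2)} - \theta_3^{(1)})t - (\theta_2^{(2)} - \theta_2^{(1)})t) \\
F_3(t) = {} & L_0^2 + L_3^2 + 2 \times L_0L_3\cos(\theta_3^{(1)} + (\theta_3^{(2)} - \theta_3^{(1)})t) + {} \\
& 2 \times L_1L_3\cos(\theta_3^{(1)} - \theta_1^{(1)} + (\theta_3^{(2)} - \theta_3^{(1)})t - (\theta_1^{(2)} - \theta_1^{(1)})t) + {} \\
& 2 \times L_2L_3\cos(\theta_3^{(1)} - \theta_2^{(1)} + (\theta_3^{(2)} - \theta_3^{(1)})t - (\theta_2^{(2)} - \theta_2^{(1)})t)
\end{aligned}
\tag{6}
$$

If the derivative of this function is non-zero for each $t$ and any pair of points, then $F$ is injective. We have

$$
F'(t) = (\theta_1^{(2)} - \theta_1^{(1)})F_1(t) + (\theta_2^{(2)} - \theta_2^{(1)})F_2(t) + (\theta_3^{(2)} - \theta_3^{(1)})F_3(t)
\tag{7}
$$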

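Also, for human hands, one can realize the following:

$$
\theta_1 \in [0^\circ, 90^\circ], \quad \theta_2 = \theta_1 + [0^\circ, 120^\circ] \subseteq [0^\circ, 210^\circ] \implies \theta_2 - \theta_1 \in [0^\circ, 120^\circ]
$$

$$
\theta_3 = \theta_2 + [0^\circ, 45^\circ] = \theta_1 + [0^\circ, 120^\circ] + [0^\circ, 45^\circ] = \theta_1 + [0^\circ, 165^\circ] \subseteq [0^\circ, 255^\circ] \implies \theta_3 - \theta_1 \in [0^\circ, 165^\circ], \quad \theta_3 - \theta_2 \in [0^\circ, 45^\circ]
$$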

Now, by numerically populating the $\theta$s (every combination of 2 from 486K, $C^{486K}_{2}$), the $L$s, and $t$, we check whether $F(\theta)$ is injective.
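One way such a numerical check might be organized is sketched below. It samples $F$ along the line between two poses and tests whether the sampled values are strictly monotone, which is sufficient (though not necessary) for the two endpoint values to differ. Everything here is an assumption for illustration: the segment lengths, the coarse grid step, the sample count, and the function names. The figure 486K coincides with $90 \times 120 \times 45$, which would correspond to a 1° grid over the three joint ranges; that reading is also an assumption.

```python
import math
from itertools import combinations

LENGTHS = (8.0, 4.5, 2.5, 2.0)  # hypothetical L0..L3 in cm


def descriptor(thetas_deg, lengths=LENGTHS):
    """F(theta1, theta2, theta3) with theta0 fixed at zero; angles in degrees."""
    angles = [0.0] + [math.radians(a) for a in thetas_deg]
    x = sum(L * math.sin(a) for L, a in zip(lengths, angles))
    z = sum(L * math.cos(a) for L, a in zip(lengths, angles))
    return x * x + z * z


def point_on_line(p, q, t):
    """Pose at parameter t on the straight line from pose p to pose q."""
    return tuple(a + (b - a) * t for a, b in zip(p, q))


def is_strictly_monotone(p, q, samples=100, lengths=LENGTHS):
    """True if F, sampled at samples + 1 points for t in [0, 1], only rises or only falls."""
    values = [descriptor(point_on_line(p, q, k / samples), lengths)
              for k in range(samples + 1)]
    steps = [b - a for a, b in zip(values, values[1:])]
    return all(s > 0 for s in steps) or all(s < 0 for s in steps)


def coarse_poses(step_deg=30):
    """Absolute angles built from relative joint ranges [0, 90], [0, 120], [0, 45]."""
    poses = []
    for mcp in range(0, 91, step_deg):
        for pip in range(0, 121, step_deg):
            for dip in range(0, 46, step_deg):
                poses.append((mcp, mcp + pip, mcp + pip + dip))
    return poses


def count_non_monotone_pairs(poses, samples=50):
    """Number of pose pairs whose connecting line is not strictly monotone in F."""
    return sum(1 for p, q in combinations(poses, 2)
               if not is_strictly_monotone(p, q, samples))


poses = coarse_poses(step_deg=30)
flagged = count_non_monotone_pairs(poses)
print(len(poses), "poses,", flagged, "pairs flagged for closer inspection")
```

A flagged pair is not by itself a collision: it only means the sampled values along that line are not monotone, so the endpoint values of $F$ would need to be compared directly. Sampling is a heuristic and can miss behaviour between sample points.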

ICPRAM 2023 - 12th International Conference on Pattern Recognition Applications and Methods
