3D Hand and Object Pose Estimation for Real-time Human-robot
Interaction
Chaitanya Bandi, Hannes Kisner and Ulrike Thomas
Robotics and Human-Machine Interaction Lab.,
Technical University of Chemnitz, Reichenhainer Str. 70, Chemnitz, Germany
Keywords: Pose, Keypoints, Hand, Object.

Abstract: Estimating 3D hand pose and object pose in real-time is essential for human-robot interaction scenarios like the handover of objects. Particularly in handover scenarios, many challenges need to be faced, such as mutual hand-object occlusions and the inference speed required to enhance the reactiveness of robots. In this paper, we present an approach to estimate 3D hand pose and object pose in real-time using a low-cost consumer RGB-D camera for human-robot interaction scenarios. We propose a cascade-of-networks strategy to regress 2D and 3D pose features. The first network detects the objects and hands in images. The second network is an end-to-end model with independent weights that regresses 2D keypoints of hand joints and object corners, followed by a wrist-centric 3D hand and object pose regression using a novel residual graph regression network and finally a perspective-n-point approach to solve for the 6D pose of detected in-hand objects. To train and evaluate our model, we also propose a small-scale 3D hand pose dataset with a new semi-automated annotation approach using a robot arm and demonstrate the generalizability of our model on state-of-the-art benchmarks.
1 INTRODUCTION
Hand and object pose estimation is an active research
field for applications like robotics, augmented real-
ity, and manipulation. 3D hand pose estimation and
6D object pose estimation have been addressed inde-
pendently. Nevertheless, combined hand and object pose estimation has to cope with mutual occlusions, and this is yet to be solved for real-time applications. In
this work, we aim to introduce a pipeline to estimate
both hand pose and object pose for interaction sce-
narios. In robotics applications like bidirectional han-
dover of objects, reactiveness, reliability, and safety
are highly significant. This can be achieved by precise
estimation of fingertips and object pose in real-time.
The state-of-the-art works rely heavily on deep learn-
ing architectures for 3D hand pose estimation (Zim-
mermann and Brox, 2017; Mueller et al., 2018; Iqbal
et al., 2018; Ge et al., 2019), 6D object pose estima-
tion (Tekin et al., 2017; Peng et al., 2018; Li et al.,
2018; Wang et al., 2019; Park et al., 2019; Labbé
et al., 2020; Tremblay et al., 2018), and unified hand
and object pose estimation (Doosti et al., 2020; Has-
son et al., 2019; Hasson et al., 2020; Tekin et al.,
2019).
The works (Yang et al., 2020; Rosenberger et al.,
2020) propose unique solutions for applications like
the handover of objects. These works rely on seg-
mentation networks to obtain the region of hands and
objects and later forward the region to the respective
grasp pose refinement model. Although the segmen-
tation networks are reliable, the inference speed is
slow without proper hardware resources.
In this paper, we present an approach to regress
3D hand pose using deep learning architecture and
compute 6D object pose estimation as illustrated in
Figure 1. The first network is an independent object
detection model to recognize the region of hands and
objects. The second network consists of two different
deep learning models that can either be trained end-
to-end or independently to infer 2D hand pose, 3D
hand pose, 2D object corners, 3D object corners and
a perspective-n-point solver for 6D object pose.
In this work, we introduce a two-stream hourglass
network for 2D pose and a novel network for 3D
hand pose regression using graph convolutional net-
works. To train deep learning architectures, datasets
are highly significant and there exist quite a few
benchmarks for hand object pose estimation. Most of
the benchmarks rely on a manual annotation process, which is quite tedious, time-consuming, and costly.
We also introduce a new semi-automatic labeling pro-
cess for 3D hand pose estimation.
Figure 1: The overview of the complete pipeline. The input is an RGB image and the output consists of the 2D hand pose, 2D object corners, 3D object corners, and 3D hand pose, with an optional module to obtain the 6D object pose using the perspective-n-point algorithm.
2 RELATED WORK
In this section, we review the deep learning-based
state-of-the-art techniques for 3D hand pose estima-
tion, 6D object pose estimation, unified reconstruc-
tion of a hand-object pose, and well-known bench-
marks.
2.1 3D Hand Pose Estimation and 6D
Object Pose Estimation
The earliest work to estimate 3D hand pose using
deep learning (Zimmermann and Brox, 2017) proposes a cascaded architecture consisting of a segmentation network, a pose network, and a pose-prior network. The segmentation network localizes the region of the hand, the segmented region of interest is forwarded to the pose network to obtain 2D score maps of hand joints, and finally the 2D keypoints are lifted to 3D using a pose-prior network with viewpoints. The architecture is
evaluated on the rendered hand pose dataset which
is purely synthetic and suffers from generalization is-
sues on real images. To obtain better generalization,
the work (Mueller et al., 2018) proposes a technique
that transforms the synthetic dataset such that it re-
sembles real-world images. Later, the images are for-
warded to the regression network to regress both 2D
heatmaps and 3D joint coordinates. For better gener-
alization, 3D hand kinematics are combined with con-
volutional architecture. The work (Iqbal et al., 2018)
introduces an approach that leverages depth information in addition to the RGB image. In the first stage, an encoder-decoder network produces 2D heatmaps and latent depth maps, and in the next stage, the 2D pose information from the heatmaps is combined with the depth maps to obtain normalized 3D coordinates, which are finally reconstructed to recover the 3D pose. Direct 2D or 3D keypoints cannot express the shape of the hand, so the work in (Ge et al., 2019) proposes a graph convolutional neural network-based (Graph-CNN) architecture to recover the 3D mesh of the hand besides the 2D
and 3D hand pose. The architecture is tested on both
real and synthetic datasets. The idea of using the 2D hand pose to obtain the 3D hand pose is further exploited in (Bandi and Thomas, 2020;
Zhang et al., 2020).
6D Object Pose Estimation. Most of the state-of-
the-art works follow two-staged processes for 6D ob-
ject pose estimation. In the first stage, a CNN ar-
chitecture is utilized to detect the 2D keypoints of
3D projected corners and then the perspective-n-point
(PnP) (Lepetit et al., 2009) solver to compute 6D pose
features. For the first stage, (Tekin et al., 2017) proposes an architecture inspired by the YOLO model for 2D keypoint detection, and (Peng et al., 2018) presents a RANSAC-like pixel-wise voting scheme to further enhance the accuracy of 2D keypoint detection. It
is possible that the detected keypoints are not com-
pletely accurate due to occlusions, which is further
refined by deep iterative matching (Li et al., 2018).
The 6D object poses for semantic grasping (Tremblay
et al., 2018) works in a similar two-staged process of
2D keypoint detection using a CNN architecture and
PnP-based object pose estimation. The model is com-
pletely trained on a synthetic dataset, generalizes well to real-world images, and is quite simple and applicable in real time. Further improvement is
shown in (Wang et al., 2019) by fusing both RGB and
depth information. Pixel-wise 3D coordinate predic-
tion of objects without textured models is presented
in (Park et al., 2019), followed by a PnP algorithm with RANSAC. The multi-view multi-object pose estimation of (Labbé et al., 2020) is very robust and it is han-
dled in three stages. In the first stage, 2D regions and
respective initial 6D object poses are obtained for all
views, and then these objects are matched to recover a single consistent scene, and the object and camera poses are refined globally.
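As a concrete illustration of this two-stage recipe, the sketch below recovers a 6D pose from eight predicted 2D corner keypoints with OpenCV's PnP solver. The corner ordering, camera intrinsics, and box dimensions are placeholder assumptions, not values taken from the cited works.

```python
import numpy as np
import cv2

def pose_from_corners(corners_2d, box_size, K):
    """Recover a 6D pose from 8 predicted 2D corner keypoints via PnP.

    corners_2d: (8, 2) array of predicted pixel coordinates.
    box_size:   (w, h, d) of the object's 3D bounding box in meters.
    K:          (3, 3) camera intrinsic matrix.
    """
    w, h, d = box_size
    # 3D corners of the object bounding box in the object frame; the ordering
    # must match the convention used by the 2D keypoint detector.
    corners_3d = np.array([[sx * w / 2, sy * h / 2, sz * d / 2]
                           for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                          dtype=np.float32)
    ok, rvec, tvec = cv2.solvePnP(corners_3d, corners_2d.astype(np.float32),
                                  K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)          # rotation matrix from the rotation vector
    return ok, R, tvec                  # 6D pose: rotation + translation

# Example with placeholder intrinsics and a stand-in for the network output:
K = np.array([[615.0, 0, 320.0], [0, 615.0, 240.0], [0, 0, 1]], dtype=np.float32)
corners_2d = np.random.rand(8, 2) * [640, 480]
ok, R, t = pose_from_corners(corners_2d, (0.1, 0.1, 0.1), K)
```

When some predicted corners are unreliable, cv2.solvePnPRansac can be substituted to tolerate outlier keypoints, which mirrors the RANSAC-style refinement mentioned above.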
Unified Hand + Object Pose Estimation.
Figure 2: The pipeline of the training dataset generator. A camera is attached to the tool center point (TCP) of a robot arm
and captures the scene. The current environment is included as a collision map. New camera viewpoints can be calculated
with the captured information. There are reachable viewpoints (green) and unreachable viewpoints (red) as the robot motion
is restricted. The selected center point (blue) is used for viewpoint calculation. From each reachable viewpoint, point clouds,
as well as images, can be captured and used for the semi-automatic labeling process.
The single-shot neural network (Tekin et al., 2019) recognizes the 3D hand pose and 3D object pose. In
addition to 3D pose features, the network recognizes
the action performed. This network drops the idea of
computing 2D and 3D correspondences to compute
6D pose using PnP and instead directly regresses the 3D coordinates of the bounding box around the 3D object. For applications like the handover of objects or grasping, the 3D keypoint representation might be insufficient. For such applications,
hand and object meshes are reconstructed (Hasson
et al., 2019). The drawback is that the architecture is trained on purely synthetic data and the retrieved object meshes are not refined. To further improve the accuracy, a Graph U-Net based architecture (Doosti et al., 2020) is proposed. The initial 2D keypoint detections are refined using graph convolutional networks, and then an adaptive U-Net transforms them into 3D predictions of both the hand and the object pose.
2.2 Hand + Object Benchmarks
There exist quite a few datasets for 3D hand pose
and 6D object pose estimation. 3D hand pose is an-
notated on real images using manual (Sridhar et al.,
2016; Mueller et al., 2017), semi-automatic (Zim-
mermann et al., 2019), complete automatic (Simon
et al., 2017) annotation processes, and on synthetic
(Mueller et al., 2018; Zimmermann and Brox, 2017)
images using an automated process. 6D object benchmarks like the LINEMOD dataset (Hinterstoisser et al., 2012) are manually labeled; LINEMOD consists of around 1100 frames containing 15 texture-less objects. Later, the large-scale YCB video dataset (Xiang et al., 2018) with a semi-automated annotation process for 6D object pose estimation was introduced with 21 textured objects, containing 80 video sequences for training and 12 test videos. This dataset is widely applied in
many state-of-the-art works.
For hand object manipulations, several datasets
with hand and object pose have been proposed. (Has-
son et al., 2019) introduce an ObMan dataset with
hands grasping objects. The large-scale dataset con-
sists of 150,000 images that are synthetically gen-
erated. First-person hand action (Garcia-Hernando
et al., 2018) dataset provides hand object interaction
of daily actions with 3D hand joints and object pose.
The drawback is that the dataset is captured by attach-
ing magnetic sensors to the subject’s hand and object,
which in turn modifies the appearance features on
RGB images. Very recently, the HO-3D dataset (Hampali et al., 2019) was open-sourced with hand-object interactions using objects from the YCB dataset. The
dataset consists of highly occluded automatic annota-
tions, and it is very suitable for real-world hand object
interaction scenarios. The ContactPose (Brahmbhatt
et al., 2020) dataset is also a large-scale dataset con-
taining 2.9 million images of human hands grasping
objects with contact maps and this dataset is captured
using 3 Kinectv2 RGB-D cameras with known house-
hold objects with markers. In (Ye et al., 2021), object handovers from human to human are extensively evaluated, and a dataset with 18k handover videos is introduced.
3 SEMI AUTOMATIC TRAINING
INSTANCES
Hand pose estimation depends on the quality of training data from the environment, e.g. a higher data variety increases robustness. If training data captured with the camera sensor that is later used at runtime exists, performance in the real-world scenario increases. However, the creation of labeled training data is tedious and time-consuming, so we automate parts of the process to decrease creation time. For this purpose, we need sensors to detect cam-
era poses, as well as a concept for camera movements.
This leads to the idea of using a robot arm and attach-
ing a camera to it. The following section describes the
setup and the data creation pipeline in more detail.
3.1 Scenario Setup
We need to change and measure camera poses as flexibly and accurately as possible to ensure a high training data quality. Thus, our pipeline is based on a robot arm and an RGB-D camera sensor. The robot moves the camera to different poses to get different viewing angles while measuring the position of the camera. The TCP of the robot offers maximum freedom of movement and thus we attach a camera near to it. The exact camera pose is determined by the robot sensors and a camera calibration which has to be executed beforehand. Furthermore, the robot has to know its environment to guarantee collision freeness. Thus, our robot controller includes a static map of the environment, and the current camera view is added to the collision map. However, more details on collision maps and robot control are out of the scope of this work.
3.2 Pipeline
Our semi-automated dataset creation pipeline con-
sists of two main stages. The first stage is used to
capture images and generate new camera poses. The
saved data is forwarded to the second stage, a post-
processing pipeline that uses all captured informa-
tion to generate training data. Figure 2 visualizes the
pipeline with a sample scenario.
3.2.1 Move and Capture
At first, we put the object of interest near the robot and align the camera to the object. We assume the object and the robot to be placed in free space, to guarantee that the robot can move around it (considering the movability limitations of the robot). We want to generate landmarks for hand tracking, thus our object of interest is a human hand. After the first captured image and pose, the hand has to stay as static as possible; proposing an external fixation device or something similar is out of the scope of this work. Now, a computer transforms the captured information into a visualizer using the robot frame as the
origin. An external operator uses a GUI to set a point
as the center point for the next camera viewpoints.
Next, the object has to be captured from different viewing angles with respect to the center point. In principle, we could use random camera poses; instead, we need a predictable motion of the robot to increase the safety level. We define a sphere through the initial camera pose. The resulting sphere corresponds to an infinite set of viewpoints around the center point. We discretize the viewpoints using a pre-defined angle distance, e.g. 10°. The resulting points are the next camera poses in world coordinates. For each of the viewpoints, the pipeline captures the coordinates of the camera as well as a 3D point cloud. The camera pose is
used to transform the information into the robot coor-
dinate system; respectively all point clouds are trans-
formed into the same coordinate system leading to a
stitched point cloud. Next, we post-process the cap-
tured data.
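A minimal sketch of the viewpoint discretization described above, assuming a simple spherical parameterization around the selected center point; the hemisphere restriction and the look-at convention are our assumptions, and reachability checking is left to the robot's motion planner.

```python
import numpy as np

def sphere_viewpoints(center, radius, step_deg=10.0):
    """Discretize camera viewpoints on a sphere around a center point.

    Returns a list of (position, view_direction) pairs in world coordinates,
    where view_direction points from the camera towards the center.
    Only the upper hemisphere is sampled; unreachable poses are rejected
    later by the motion planner.
    """
    viewpoints = []
    step = np.deg2rad(step_deg)
    for elev in np.arange(step, np.pi / 2, step):         # elevation above the table
        for azim in np.arange(0.0, 2 * np.pi, step):       # azimuth around the center
            offset = radius * np.array([np.cos(elev) * np.cos(azim),
                                        np.cos(elev) * np.sin(azim),
                                        np.sin(elev)])
            position = np.asarray(center) + offset
            view_dir = np.asarray(center) - position
            view_dir /= np.linalg.norm(view_dir)           # camera looks at the center
            viewpoints.append((position, view_dir))
    return viewpoints

# e.g. a 10 degree discretization on a 0.4 m sphere around the selected center point
views = sphere_viewpoints(center=(0.6, 0.0, 0.2), radius=0.4, step_deg=10.0)
```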
3.2.2 Post Processing
The post-processing uses the point clouds, captured
2D images as well as camera transformations as in-
put data. Additionally, a template of desired land-
marks is required. It is not guaranteed that all landmarks are visible in each image, e.g. due to a restricted camera viewing angle or occlusion by objects. We overcome hidden landmarks by displaying the complete aligned point cloud in a GUI, as we need exact training data. Thus, an external expert has to pick the initial landmarks, e.g. the fingertips, in the aligned cloud in the GUI according to the pre-defined landmark template. The selection results in a set of point indices that are stored as ground truth information. From that, the algorithm
iterates through the captured point clouds and cam-
era poses. For each ground-truth landmark, we search for the corresponding point in the specific point cloud and set the extracted index as the landmark. However, if the Euclidean distance between the ground truth point and the new landmark is larger than the threshold τ = 0.05 cm, we define the landmark as 0, corresponding to an occluded or invisible point. The algorithm
outputs a 3D point cloud, a corresponding 2D image,
the indices as well as the positions of the landmarks in
2D and 3D. We name our dataset the robot arm semi-automatic hand dataset (RASH).
Figure 3: The architecture of 2D hand pose and 2D object corner regression using heatmaps.
Adding ground truth data for bounding boxes differs from the normal landmarks, as selecting bounding boxes in point clouds is
not intuitive. We use 2D images of the initial cloud
for selecting 2D bounding boxes. However, we need
to add depth information, to convert 2D box informa-
tion into 3D. Therefore, the depth of the object has to
be predefined, e.g. a hand has a size of about 20cm.
The 2D box relates to a set of indices corresponding
to 3D points. We select the nearest point from the
extracted points and add the pre-defined object size.
Thus, the external operator selects x and y coordinates
of the bounding box edges, and the depth is automati-
cally added. The bounding box edges can be similarly
used as ground truth landmarks. Using this setup, we capture 12,000 hand instances with a maximum labeling error of 0.5 mm for training the network.
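The landmark propagation step of the post-processing can be sketched as follows. The function name and the use of a KD-tree are illustrative choices; only the nearest-point search, the threshold τ = 0.05 cm, and the zero marking of occluded points follow the description above.

```python
import numpy as np
from scipy.spatial import cKDTree

def propagate_landmarks(gt_points, view_cloud, K, tau=0.0005):
    """Transfer landmarks picked in the stitched cloud to one captured view.

    gt_points:  (L, 3) landmarks picked by the expert, already transformed
                into this view's camera frame using the known camera pose.
    view_cloud: (N, 3) point cloud captured from this viewpoint (camera frame).
    K:          (3, 3) camera intrinsic matrix used to project 3D to 2D.
    tau:        distance threshold in meters (0.0005 m = 0.05 cm); larger
                distances mark the landmark as occluded/invisible (set to 0).
    """
    tree = cKDTree(view_cloud)
    dists, idx = tree.query(gt_points)               # nearest cloud point per landmark
    landmarks_3d = view_cloud[idx].copy()
    visible = dists <= tau
    landmarks_3d[~visible] = 0.0                     # occluded -> 0, as in the pipeline

    # Project the visible landmarks onto the 2D image plane.
    proj = (K @ landmarks_3d.T).T
    landmarks_2d = np.zeros((len(gt_points), 2))
    landmarks_2d[visible] = proj[visible, :2] / proj[visible, 2:3]
    return landmarks_2d, landmarks_3d, visible
```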
4 METHODOLOGY
In this section, we introduce the architecture for
3D hand pose regression and object pose estimation.
Given an RGB image $I \in \mathbb{R}^{height \times width \times 3}$, we regress the 2D hand pose $P_{2D} \in \mathbb{R}^{21 \times 2}$, the 2D object keypoints $O_{2D} \in \mathbb{R}^{8 \times 2}$, the 3D hand pose $P_{3D} \in \mathbb{R}^{21 \times 3}$, and the 3D object keypoints $O_{3D} \in \mathbb{R}^{8 \times 3}$.
4.1 Hand + Object Detection
Initially, we consider an independent object detec-
tion model to detect hands and known objects that
are used in the interaction environment. The ob-
ject detection architectures are widely researched for
many real-time applications, and there exist one-stage (Bochkovskiy et al., 2020; Liu et al., 2016) and two-stage models (Dai et al., 2016) with distinct back-
bones either for GPU (He et al., 2015) or for CPU
platforms (Sandler et al., 2018). In this work, we
reuse the well-known YOLOv4 (Bochkovskiy et al.,
2020) architecture with MobileNet (Sandler et al.,
2018) backbone for training the region of hands and
objects. The network is quite fast and does not affect the speed of the cascaded pipeline.
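Between detection and the heatmap network, the detected region has to be cropped and resized to the 256 × 256 input described in the next section. A minimal sketch of this step is shown below; the padding ratio and the square-crop convention are assumptions rather than details fixed by the paper.

```python
import cv2
import numpy as np

def crop_for_hrn(image, box, pad_ratio=0.2, size=256):
    """Crop a detected hand/object box with padding and resize it for the HRN.

    image: HxWx3 RGB image; box: (x1, y1, x2, y2) from the detector.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    side = max(x2 - x1, y2 - y1) * (1 + pad_ratio)      # padded square crop
    x1 = int(max(cx - side / 2, 0)); y1 = int(max(cy - side / 2, 0))
    x2 = int(min(cx + side / 2, w)); y2 = int(min(cy + side / 2, h))
    crop = cv2.resize(image[y1:y2, x1:x2], (size, size), interpolation=cv2.INTER_LINEAR)
    # Keep offset and scale so predicted keypoints can be mapped back to the image.
    scale = ((x2 - x1) / size, (y2 - y1) / size)
    return crop, (x1, y1), scale

# Example with a synthetic image and a hypothetical detector box:
crop, offset, scale = crop_for_hrn(np.zeros((480, 640, 3), np.uint8), (100, 80, 300, 320))
```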
4.2 2D Hand Pose and Object Keypoints
Once the region of hand and object is detected, we
pass the cropped region to heatmap regression net-
work (HRN) to estimate the 21 hand joints and 8 ob-
ject corners. To regress keypoints using heatmaps,
an encoder-decoder or hourglass based architecture is
considered. The hourglass-based model with residuals introduced in (Newell et al., 2016) works well for human pose estimation and was later adopted for hand keypoint regression models. Based on that, we
introduce a two-stream hourglass model to regress
heatmaps of hand and object joints. The HRN is
clearly depicted in Figure 3. As the RGB image
contains both hand and object, we extract common
features using convolutions and residuals (He et al.,
2015). Later, the information is shared with hourglass 1 to extract hand joint heatmaps and with hourglass 2 to extract object corners. The network feature dimensions before the hourglass modules are 3 × 256 × 256 → conv(64 × 128 × 128) → residual(128 × 64 × 64) → residual(128 × 64 × 64) → residual(256 × 64 × 64). The features
are directly passed through two hourglass models
without further processing to extract heatmaps of
hand joints and object corners. From the network,
we obtain 21 heatmaps for each hand and 8 heatmaps for each object present in the image. Each heatmap encodes one joint and has a size of 64 × 64. The keypoint location is the location of the maximum value in the heatmap, and this location is converted to the original image size by multiplying it with
4 (i.e., 64 × 4 = 256). An example of heatmaps is
shown in Figure 5.
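The heatmap-to-keypoint conversion described above can be written compactly as follows; this is a sketch with our own variable names, not the authors' implementation.

```python
import torch

def heatmaps_to_keypoints(heatmaps, stride=4):
    """Convert predicted heatmaps to 2D keypoints in the 256 x 256 crop.

    heatmaps: tensor of shape (J, 64, 64), one map per hand joint / object corner.
    The keypoint is the argmax location scaled back by the stride (64 * 4 = 256).
    """
    num_joints, h, w = heatmaps.shape
    flat = heatmaps.view(num_joints, -1)
    conf, idx = flat.max(dim=1)                               # peak value and flat index
    ys = torch.div(idx, w, rounding_mode="floor").float()
    xs = (idx % w).float()
    keypoints = torch.stack([xs, ys], dim=1) * stride         # back to crop resolution
    return keypoints, conf

# Example with a random stand-in for the network output (29 = 21 joints + 8 corners):
kp, conf = heatmaps_to_keypoints(torch.rand(29, 64, 64))
```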
Figure 4: The architecture of residual graph regression network for 3D pose regression.
4.3 Residual Graph Regression
Network (RGRN)
We will revisit briefly the concept of GCNs as pro-
posed in (Kipf and Welling, 2016). The aim is to learn features on graph-structured data, where convolutional filters are shared across all graph locations. The graph structure is represented in the form of an adjacency matrix $A \in \mathbb{R}^{K \times K}$, where $K$ is the number of input nodes. The input features of every node are represented in the form of a feature matrix of size $K \times H$, where $H$ is the number of input features. The output features $f(H, A)$ of each layer $l$ are obtained by multiplying the adjacency matrix with the input node features and the trainable weights $W$, and the result is then passed through a non-linear function like ReLU (Agarap, 2019).
$f(H^{(l)}, A) = \sigma(A H^{(l)} W^{(l)})$ (1)
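A minimal PyTorch sketch of the graph convolution in Eq. (1); the weight initialization is an assumption, and in practice the adjacency matrix A is usually normalized and extended with self-loops.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution layer implementing f(H, A) = sigma(A H W), cf. Eq. (1)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(in_features, out_features))
        nn.init.xavier_uniform_(self.weight)   # initialization choice is an assumption

    def forward(self, H, A):
        # H: (batch, K, in_features) node features, A: (K, K) adjacency matrix.
        return torch.relu(A @ H @ self.weight)

# Example: 29 nodes (21 hand joints + 8 object corners) with 2D input coordinates
A = torch.eye(29)                               # placeholder adjacency matrix
layer = GraphConv(2, 256)
features = layer(torch.randn(4, 29, 2), A)      # -> (4, 29, 256)
```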
As our hand model is in the form of a graph, we
use basic building blocks of graph convolutions and
combine them with residual connections (He et al.,
2015). To learn the graph structure, we use the pre-
defined kinematic structure of hand and object joints
as an adjacency matrix for the graph network. Given 2D keypoints of shape 29 × 2 as input, each graph convolution outputs 256 features and the final output layer regresses 29 × 3 features, i.e., the 3D coordinates of the hand pose and object pose. The output of each graph convolution is normalized and passed through the ReLU (Agarap, 2019) non-linear function. As the
input features are wrist-centric (i.e., the wrist joint
is (0,0,0)), we also introduce bias during training.
The purple block in Figure 4 represents the RGRN
model for 3D hand pose regression.
Figure 5: An example of a 64 × 64 heatmap for each joint.
As the number of residual layers N is not fixed, we experiment
with the value of N during the training process for
best performance. Each block consists of a graph convolution operation and the network layers are as follows: 21 × 2 → 21 × 256 → N × Residual(21 × 256 → 21 × 256) → 21 × 256 → N × Residual(21 × 256 → 21 × 256) → 21 × 3. See the Appendix for training and validation stability.
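To make the RGRN structure more tangible, the following self-contained sketch stacks residual graph blocks between an input and an output graph convolution. The exact placement of normalization and bias, and the placeholder adjacency matrix in the usage example, are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GCLayer(nn.Module):
    """Graph convolution A @ H @ W; normalization and activation are applied outside."""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.W = nn.Parameter(torch.empty(in_f, out_f))
        nn.init.xavier_uniform_(self.W)

    def forward(self, H, A):
        return A @ H @ self.W                          # (B, K, out_f)


class ResidualGraphBlock(nn.Module):
    """Graph convolution -> batch norm -> ReLU, wrapped with a skip connection."""
    def __init__(self, features):
        super().__init__()
        self.gc = GCLayer(features, features)
        self.bn = nn.BatchNorm1d(features)

    def forward(self, H, A):
        out = self.gc(H, A)
        out = self.bn(out.transpose(1, 2)).transpose(1, 2)
        return torch.relu(out) + H


class RGRN(nn.Module):
    """Maps 2D keypoints (K x 2) to wrist-centric 3D coordinates (K x 3)."""
    def __init__(self, A, hidden=256, n_blocks=3):
        super().__init__()
        self.register_buffer("A", A)                   # kinematic adjacency matrix (K x K)
        self.inp = GCLayer(2, hidden)
        self.blocks = nn.ModuleList([ResidualGraphBlock(hidden) for _ in range(n_blocks)])
        self.out = GCLayer(hidden, 3)

    def forward(self, kp2d):                           # kp2d: (B, K, 2), wrist at the origin
        h = torch.relu(self.inp(kp2d, self.A))
        for block in self.blocks:
            h = block(h, self.A)
        return self.out(h, self.A)                     # (B, K, 3)

# Usage with a placeholder adjacency (the real one encodes the hand/object kinematics):
model = RGRN(torch.eye(29), hidden=256, n_blocks=3)    # N = 3 residual blocks
pose_3d = model(torch.randn(8, 29, 2))                 # -> (8, 29, 3)
```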
4.4 Network Loss
To train the model, we must compute losses for backpropagation: the heatmap loss for the 2D hand pose $l_{hand2D}$, the 2D object corner loss $l_{obj2D}$, and the pose losses for the 3D hand pose $l_{hand3D}$ and the 3D object corner pose $l_{obj3D}$. For both the 2D and 3D loss computation, the mean squared error (MSE) loss is utilized. The individual loss computation is

$l_{n} = \| groundtruth - predicted \|^{2}$ (2)

where $n$ is $hand2D$, $obj2D$, $hand3D$, or $obj3D$. The overall heatmap loss $L_{heatmap}$ for the 2D heatmap regression network is computed as

$L_{heatmap} = l_{hand2D} + l_{obj2D}$ (3)

The overall 3D pose loss $L_{hand3D+Object3D}$ is computed as

$L_{hand3D+Object3D} = l_{hand3D} + l_{obj3D}$ (4)
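A short sketch of how the losses of Eqs. (2)-(4) can be combined in PyTorch; the dictionary keys are illustrative.

```python
import torch.nn as nn

mse = nn.MSELoss()

def training_losses(pred, gt):
    """Combine the individual MSE losses of Eq. (2) into Eqs. (3) and (4).

    pred and gt are dicts holding heatmaps ('hand2d', 'obj2d') and 3D
    coordinates ('hand3d', 'obj3d'); the key names are illustrative.
    """
    L_heatmap = mse(pred["hand2d"], gt["hand2d"]) + mse(pred["obj2d"], gt["obj2d"])  # Eq. (3)
    L_pose3d = mse(pred["hand3d"], gt["hand3d"]) + mse(pred["obj3d"], gt["obj3d"])   # Eq. (4)
    return L_heatmap, L_pose3d   # backpropagated through the HRN and RGRN, respectively
```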
5 EXPERIMENTS
5.1 Datasets
In this section, we extensively experiment with model hyperparameters and test the generalizability of the proposed model on three datasets: the HO-3D dataset (Hampali et al., 2019), which contains hands manipulating objects, the FreiHand dataset (Zimmermann et al., 2019), which contains hand poses with self-occlusions, and our RASH dataset.
Table 1: Training Parameters.
Parameter | Object Detector | 2D Hand and Object Pose | 3D Hand and Object Pose
Network | YOLOv4 | Hourglass | RGRN
Input size | 416 × 416 | 256 × 256 | 29 × 2
Output | Bounding box | 64 × 64 × 21 + 8 × 2 | 29 × 3
Epochs | 120 | 200 | 200
Learning rate | 0.0001 | 0.0001 | 0.0001/0.001
Optimizer | Adam | RMSprop | RMSprop/Adam (Kingma and Ba, 2017)
Framework | PyTorch
GPU | Nvidia RTX 2060 Super
Figure 6: The percentage of correct keypoints of 2D hand
pose on three different datasets.
1) The HO-3D dataset (Hampali et al., 2019) provides both hand and object annotations of hands manipulating objects. Although the dataset consists of 77,558 samples, only 66,034 samples are available with full hand annotations. So, we randomly split the data such that 50,000 samples are used for training, 16,325 samples for validation purposes, and the remaining 11,524 unlabeled samples for evaluation. The process of hand-object labelling (Hampali et al., 2019) is completely automated with few initialization
constraints. 2) FreiHand dataset (Zimmermann et al.,
2019) is a semi-automatically annotated hand dataset consisting of over 32,000 training samples and 4,000 test samples. Although the dataset is highly occluded, no object pose annotations are present. 3) The RASH dataset is a small-scale hand dataset captured using a Panda robot arm, consisting of 10,000 training instances and 2,000 test samples. The images are captured using an Intel RealSense 415 camera with a resolution of 640 × 480 pixels.
5.2 Evaluation Metrics, Training
Parameters, and Results
We evaluate the performance of the 2D pose and 3D pose using three different metrics: 1) The mean distance error between the predicted keypoints and the ground truth. 2) The percentage of correct keypoints (PCK), an exceedingly popular evaluation metric for both 2D and 3D pose; PCK considers a keypoint as correct if it falls within a certain threshold around the ground truth, with a pixel threshold for 2D and a millimeter threshold for 3D.
Table 2: Evaluation of the Euclidean position error (EPE) in millimeters on the validation set.
Dataset | N=2 residual blocks | N=3 residual blocks
HO-3D (Hampali et al., 2019) | 13.4 | 9.2
FreiHand (Zimmermann et al., 2019) | 13.2 | 8.8
Ours (RASH) | 10.1 | 8.2
3) The area
under the curve (AUC) of the PCK graph computed
for a threshold of 50mm.
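For reference, a small sketch of the PCK and AUC computation behind these metrics; the array shapes and the trapezoidal integration are our assumptions about an otherwise standard implementation.

```python
import numpy as np

def pck_curve(pred, gt, thresholds):
    """PCK over a set of thresholds; pred/gt have shape (num_samples, J, D).

    D=2 with pixel thresholds for 2D pose, D=3 with millimeter thresholds for 3D pose.
    """
    errors = np.linalg.norm(pred - gt, axis=-1)              # per-keypoint distance
    return np.array([(errors <= t).mean() for t in thresholds])

def auc(pck, thresholds):
    """Normalized area under the PCK curve, e.g. for thresholds up to 50 mm."""
    return np.trapz(pck, thresholds) / (thresholds[-1] - thresholds[0])

# Example: 3D PCK/AUC evaluated up to 50 mm
thresholds = np.linspace(0.0, 50.0, 101)
# pck = pck_curve(pred_mm, gt_mm, thresholds); score = auc(pck, thresholds)
```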
The complete training parameters are listed in Ta-
ble 1 and the models are trained independently de-
pending on the dataset. We evaluate the proposed model by experimenting with hyperparameters such that the best possible accuracy is achieved. In addition to training hyperparameters, we need to consider architecture parameters like the number of stacks and hidden layers.
5.2.1 2D Hand and Object Pose
For 2D hand pose estimation, we consider the PCK evaluation metric, which is plotted in Figure 6. On the HO-3D (Hampali et al., 2019) validation dataset, the PCK at a 10-pixel threshold is 0.969, and on the RASH dataset it is 0.986. The difference is due to the higher occlusion in the HO-3D (Hampali et al., 2019) dataset compared to the RASH dataset.
Figure 7: The 2D hand pose and object pose outputs. The green lines represent the ground truth and the blue lines indicate predicted keypoints.
Figure 8: Percentage of correct keypoints (PCK) evaluation
of 3D Hand pose.
Table 3: Comparison to the state-of-the-art.
Method | Area Under Curve (AUC) %
HO-3D (Hampali et al., 2019) | 79.0
FreiHand (Zimmermann et al., 2019) | 79.1
Ours on HO-3D | 79.4
Ours on FreiHand | 84.2
Similarly, the 2D PCK of object keypoints on the HO-3D dataset is 0.9 at a 10-pixel threshold. The 2D hand and object pose on HO-3D validation samples can be observed in Figure 7.
5.2.2 3D Hand and Object Pose
We train the 2D pose and 3D pose networks inde-
pendently without sharing the weights. The RGRN (see Figure 4) has a hyperparameter N that determines the number of hidden residual layers. The network is trained by setting the value of N to 2 and 3. The 3D PCK graph is shown in Figure 8. From the graph, we can clearly observe that increasing the number of hidden layers improves performance. In addition to the PCK metric, we evaluate the mean Euclidean 3D position error (EPE) for N = 2 and N = 3. The EPE for the HO-3D dataset (Hampali et al., 2019), the FreiHand dataset (Zimmermann et al., 2019), and the RASH dataset is reported in Table 2. A few results on the validation dataset are shown in Figure 9. Finally, we
test the accuracy of our proposed model on the HO-3D (Hampali et al., 2019) and FreiHand (Zimmermann et al., 2019) datasets and compare it to the state-of-the-art. For testing the accuracy on unlabeled data, we follow the instructions by (Hampali et al., 2019) and submit the outputs to a submission board. Table 3 reports the AUC (in percent) of the 3D PCK curve with a threshold of 50 mm, and we can clearly see that the architecture performs well. Although the architecture performs well on both datasets, the AUC reported on the ContactPose (Brahmbhatt et al., 2020) dataset is even higher; the difference is due to the fact that hand mesh (3D MANO) (Romero et al., 2017) models are used there to estimate the keypoint errors, whereas in this work we use direct 3D regression.
Figure 9: 3D hand and object pose on HO3D validation
samples.
5.3 Discussion and RASH Dataset
In robotics, there exist applications like the handover of objects from humans to robots and vice versa. In this work, our aim is to build an architecture that is suitable for bidirectional handover applications. As the image resolution and occlusions are higher in the HO-3D dataset, we first trained the proposed model completely on the HO-3D dataset and evaluated it in the human-robot interaction environment shown in Figure 10. The resulting 3D hand pose and 2D hand pose in the human-robot interaction environment can be observed in Figure 11. When using only the HO-3D dataset in our environment, we noticed that the 3D hand pose has high errors when there is no object in the hand (see Figure 11, left image, with 3D and 2D pose). The right image of Figure 11 shows a hand with an object in the human-robot interaction environment. Although the HO-3D dataset is a large-scale and highly occluded dataset, it does not contain enough instances of hands without objects. To make our architecture work well for bidirectional handovers, we collected our own dataset, the RASH dataset.
Figure 10: Human-robot interaction environment for han-
dover of objects.
Figure 11: Test outputs in the human-robot interaction environment; model trained on the HO-3D dataset.
Figure 12: Hand and object pose estimation in human-robot interaction environment.
The RASH dataset consists of mostly open hand images with very low occlusions. To enhance the performance, we combined training instances of the HO-3D dataset with the RASH dataset and retrained the proposed model. Figure 12 shows a few pose samples of hands without objects and hands manipulating objects in the human-robot interaction environment.
From that, we noticed a significant improvement for 3D hand poses (hands without objects). The number of parameters in the complete architecture adds up to 20.26 million. During the process of training, we utilize an Nvidia 1080 Ti graphics processor with 12 gigabytes of memory. For inference, we use an Nvidia 1660 Ti processor with 6 gigabytes of memory. The complete pipeline achieved a frame rate of 16 fps on a single GPU without visualization. It is further possible to improve the frame rate by reducing the number of stacks in the hourglass, but this compromises the accuracy of the 2D keypoint estimation. For real-time applications, the training dataset must be highly occluded; to test the proposed pipeline in a real-world application, we therefore retrained the model on the combined HO-3D and RASH datasets. To avoid the tedious process of manual labelling, we opt for semi-automatic labelling, and we are also currently working on a fully automated annotation process using the robot arm without any constraints.
6 CONCLUSIONS
We proposed a complete pipeline to regress 2D pose, 3D hand pose, and object pose. The 2D hand pose is estimated using an hourglass architecture, and the intermediate features from the 2D hourglass network are further used to regress the object keypoints in the image. To regress the 3D hand pose from the 2D hand pose, we introduced a residual graph regression network with N residual connections and achieved the best performance for N = 3. We evaluated the proposed network with three different metrics on three different datasets. We will further focus on improving the object keypoint regression and the accuracy of training instances, and on forwarding the obtained information to a grasp planner for real-time object handover interactions.
ACKNOWLEDGEMENTS
This work is funded by the Deutsche Forschungs-
gemeinschaft (DFG, German Research Foundation) -
Project-ID 416228727 - SFB 1410.
REFERENCES
Agarap, A. F. (2019). Deep learning using rectified linear
units (relu).
Bandi, C. and Thomas, U. (2020). Regression-based 3d
hand pose estimation using heatmaps. In Proceedings
of the 15th International Joint Conference on Com-
puter Vision, Imaging and Computer Graphics Theory
and Applications - Volume 5: VISAPP, pages 636–
643. INSTICC, SciTePress.
Bochkovskiy, A., Wang, C., and Liao, H. M. (2020).
Yolov4: Optimal speed and accuracy of object detec-
tion. CoRR, abs/2004.10934.
Brahmbhatt, S., Tang, C., Twigg, C. D., Kemp, C. C., and
Hays, J. (2020). ContactPose: A dataset of grasps
with object contact and hand pose. In The European
Conference on Computer Vision (ECCV).
Dai, J., Li, Y., He, K., and Sun, J. (2016). R-fcn: Object de-
tection via region-based fully convolutional networks.
Doosti, B., Naha, S., Mirbagheri, M., and Crandall, D. J.
(2020). Hope-net: A graph-based model for hand-
object pose estimation. CoRR, abs/2004.00060.
Garcia-Hernando, G., Yuan, S., Baek, S., and Kim, T.-
K. (2018). First-person hand action benchmark with
rgb-d videos and 3d hand pose annotations. In Pro-
ceedings of Computer Vision and Pattern Recognition
(CVPR).
Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J., and Yuan,
J. (2019). 3d hand shape and pose estimation from a
single RGB image. CoRR, abs/1903.00812.
Hampali, S., Oberweger, M., Rad, M., and Lepetit, V.
(2019). HO-3D: A multi-user, multi-object dataset
for joint 3d hand-object pose estimation. CoRR,
abs/1907.01481.
Hasson, Y., Tekin, B., Bogo, F., Laptev, I., Pollefeys, M.,
and Schmid, C. (2020). Leveraging photometric con-
sistency over time for sparsely supervised hand-object
reconstruction. CoRR, abs/2004.13449.
Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black,
M. J., Laptev, I., and Schmid, C. (2019). Learning
joint reconstruction of hands and manipulated objects.
CoRR, abs/1904.05767.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep
residual learning for image recognition. CoRR,
abs/1512.03385.
Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski,
G., Konolige, K., and Navab, N. (2012). Model based
training, detection and pose estimation of texture-less
3d objects in heavily cluttered scenes. In Proceedings
of the 11th Asian Conference on Computer Vision -
Volume Part I, ACCV’12, page 548–562, Berlin, Hei-
delberg. Springer-Verlag.
Iqbal, U., Molchanov, P., Breuel, T., Gall, J., and Kautz, J.
(2018). Hand pose estimation via latent 2.5d heatmap
regression. CoRR, abs/1804.09534.
Kingma, D. P. and Ba, J. (2017). Adam: A method for
stochastic optimization.
Kipf, T. N. and Welling, M. (2016). Semi-supervised clas-
sification with graph convolutional networks. CoRR,
abs/1609.02907.
Labbé, Y., Carpentier, J., Aubry, M., and Sivic, J. (2020).
Cosypose: Consistent multi-view multi-object 6d
pose estimation. CoRR, abs/2008.08465.
Lepetit, V., Moreno-Noguer, F., and Fua, P. (2009). Epnp:
An accurate o(n) solution to the pnp problem. Int. J.
Comput. Vision, 81(2):155–166.
Li, Y., Wang, G., Ji, X., Xiang, Y., and Fox, D. (2018).
Deepim: Deep iterative matching for 6d pose estima-
tion. CoRR, abs/1804.00175.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot
multibox detector. Lecture Notes in Computer Sci-
ence, page 21–37.
Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Srid-
har, S., Casas, D., and Theobalt, C. (2018). Ganerated
hands for real-time 3d hand tracking from monocular
rgb. In Proceedings of Computer Vision and Pattern
Recognition (CVPR).
Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas,
D., and Theobalt, C. (2017). Real-time hand track-
ing under occlusion from an egocentric rgb-d sensor.
In Proceedings of International Conference on Com-
puter Vision (ICCV).
Newell, A., Yang, K., and Deng, J. (2016). Stacked hour-
glass networks for human pose estimation. CoRR,
abs/1603.06937.
Park, K., Patten, T., and Vincze, M. (2019). Pix2pose:
Pixel-wise coordinate regression of objects for 6d
pose estimation. CoRR, abs/1908.07433.
Peng, S., Liu, Y., Huang, Q., Bao, H., and Zhou, X. (2018).
Pvnet: Pixel-wise voting network for 6dof pose esti-
mation. CoRR, abs/1812.11788.
Romero, J., Tzionas, D., and Black, M. J. (2017). Embod-
ied hands: Modeling and capturing hands and bodies
together. ACM Transactions on Graphics, (Proc. SIG-
GRAPH Asia), 36(6):245:1–245:17.
Rosenberger, P., Cosgun, A., Newbury, R., Kwan, J.,
Ortenzi, V., Corke, P., and Grafinger, M. (2020).
Object-independent human-to-robot handovers using
real time robotic vision. CoRR, abs/2006.01797.
Sandler, M., Howard, A. G., Zhu, M., Zhmoginov, A., and
Chen, L. (2018). Inverted residuals and linear bottle-
necks: Mobile networks for classification, detection
and segmentation. CoRR, abs/1801.04381.
Simon, T., Joo, H., Matthews, I. A., and Sheikh, Y. (2017).
Hand keypoint detection in single images using mul-
tiview bootstrapping. CoRR, abs/1704.07809.
Sridhar, S., Mueller, F., Zollhoefer, M., Casas, D.,
Oulasvirta, A., and Theobalt, C. (2016). Real-time
joint tracking of a hand manipulating an object from
rgb-d input. In Proceedings of European Conference
on Computer Vision (ECCV).
Tekin, B., Bogo, F., and Pollefeys, M. (2019). H+O: unified
egocentric recognition of 3d hand-object poses and in-
teractions. CoRR, abs/1904.05349.
Tekin, B., Sinha, S. N., and Fua, P. (2017). Real-time
seamless single shot 6d object pose prediction. CoRR,
abs/1711.08848.
Tremblay, J., To, T., Sundaralingam, B., Xiang, Y., Fox,
D., and Birchfield, S. (2018). Deep object pose es-
timation for semantic robotic grasping of household
objects. CoRR, abs/1809.10790.
Wang, C., Xu, D., Zhu, Y., Martín-Martín, R., Lu, C., Fei-
Fei, L., and Savarese, S. (2019). Densefusion: 6d ob-
ject pose estimation by iterative dense fusion. CoRR,
abs/1901.04780.
Xiang, Y., Schmidt, T., Narayanan, V., and Fox, D. (2018).
Posecnn: A convolutional neural network for 6d ob-
ject pose estimation in cluttered scenes.
Yang, W., Paxton, C., Mousavian, A., Chao, Y., Cakmak,
M., and Fox, D. (2020). Reactive human-to-robot han-
dovers of arbitrary objects. CoRR, abs/2011.08961.
Ye, R., Xu, W., Xue, Z., Tang, T., Wang, Y., and Lu, C.
(2021). H2O: A benchmark for visual human-human
object handover analysis. CoRR, abs/2104.11466.
Zhang, Y., Chen, L., Liu, Y., Zheng, W., and Yong, J.
(2020). Explicit knowledge distillation for 3d hand
pose estimation from monocular rgb. In BMVC.
Zimmermann, C. and Brox, T. (2017). Learning to estimate
3d hand pose from single rgb images. In IEEE In-
ternational Conference on Computer Vision (ICCV).
https://arxiv.org/abs/1705.01389.
Zimmermann, C., Ceylan, D., Yang, J., Russell, B. C., Ar-
gus, M., and Brox, T. (2019). Freihand: A dataset
for markerless capture of hand pose and shape from
single RGB images. CoRR, abs/1909.04349.
APPENDIX
Pipeline for Handover of Objects. In the human-robot interaction environment, it is possible that only a hand is present in the scene.
Figure 13: The architecture for bidirectional handovers.
As the network designed in Section 4 expects both hand and object images, we included another stream for the case that only a hand is present in the scene. The overall pipeline can be observed in Figure 13.
Training and Validation Loss. The training and validation loss of the proposed RGRN architecture can be seen in Figure 14. From the figure, we can clearly see that both the training and validation losses are stable.
Figure 14: Training and validation loss of RGRN architec-
ture.
Detection Output. The output of YOLOv4
(Bochkovskiy et al., 2020) on hands and objects can
be observed in Figure 15.
Figure 15: The outputs of YOLOv4 architecture for hand
and object detection.