Augmenting Human-Robot Collaboration Task by Human Hand Position Forecasting

Shyngyskhan Abilkassov¹, Michael Gentner²,³ and Mirela Popa¹

¹Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University, P.O. Box 616, 6200 MD, Maastricht, The Netherlands
²Technical University of Munich, Munich, Germany
³BMW AG, Munich, Germany

Keywords: Egocentric Vision, Human-Robot Collaboration.
Abstract: Human-robot collaboration (HRC) plays a critical role in enhancing productivity and safety across various industries. While reactive motion re-planning strategies have proven useful, there is a pressing need for proactive control that computes human intentions to enable efficient collaboration. This work addresses the challenge by proposing a deep learning-based approach for forecasting human hand trajectories and a heuristic optimization algorithm for proactively optimizing the robotic task sequencing problem. The presented human hand trajectory forecasting model achieves state-of-the-art performance on the Ego4D Future Hand Prediction benchmark in all evaluation metrics. In addition, this work presents a problem formulation and a Dynamic Variable Neighborhood Search (DynamicVNS) heuristic optimization algorithm that enable robots to pre-plan their task sequences to avoid human hands. The proposed algorithm exhibits significant computational improvements over the generalized VNS approach. The final framework efficiently incorporates the predictions made by the deep learning model into the task sequencer and is evaluated in an experimental HRC setup in which a UR10e robot performs a visual inspection task. The results indicate the effectiveness and practicality of the proposed approach, showcasing its potential to improve human-robot collaboration in various industrial settings.
1 INTRODUCTION
Human-robot collaboration (HRC) has gained signif-
icant importance in recent years due to the increas-
ing demand for automation. In collaboration with
humans, robots can improve productivity, efficiency,
and safety in the manufacturing industry (Matheson
et al., 2019). However, achieving effective HRC re-
quires robots to comprehend and respond to human
intentions and actions, which can be challenging in
dynamic environments (Li et al., 2023).
Current research in HRC has primarily focused
on developing reactive motion re-planning strategies
that can adapt to dynamic environments (Li et al.,
2021). While these reactive strategies have proven
useful in different use cases ranging from domestic
to industrial environments (Ajoudani et al., 2018),
they have limitations in anticipating and planning for
future changes in human-present environments. To
overcome these issues, researchers have emphasized
the significance of proactive control through the com-
putation of human intentions (Fan et al., 2022; Li
et al., 2023). By accurately predicting human inten-
tions, robots can plan their actions in advance, en-
abling smoother and more efficient collaboration. In
particular, forecasting human hand trajectories pro-
vides valuable information for robot motion planners
to determine the appropriate task sequence.
The hand is a natural and ubiquitous mode of expression for humans and an effective means of communication between the human and the robot (Mukherjee et al., 2022). While several works have proposed using human hand trajectory estimation to augment robot planners, these works are often tailored to specific use cases, limited in the diversity of the data used (Mainprice and Berenson, 2013; Lyu et al., 2022), or assume deterministic and static human behavior (Cheng and Tomizuka, 2021). Developing an efficient and proactive robot motion planner
necessitates the integration of efficient human hand
forecasting and robot task sequencing modules.
Robotic task sequencing is the process of optimizing the order of task visits to improve the overall performance of a robotic system in industrial applications. Although various research has focused on trajectory optimization (Ata, 2007), efficient control, and end-effector path planning (Alatartsev et al., 2014), the popular approach is to model the task sequencing problem as a Traveling Salesman Problem (TSP) (Alatartsev et al., 2015). The TSP is NP-hard, which makes it computationally infeasible to solve exactly for large instances; an efficient algorithm is therefore required to find near-optimal solutions in a short time. To address the aforementioned challenges, this work proposes a deep learning-based framework for predicting human hand trajectories and integrating the predictions into a robot task sequencer. The presented framework includes a state-of-the-art deep learning model, trained on diverse human hand motion data, that forecasts future hand trajectories. Previous approaches have attempted to solve the problem of hand position forecasting (Chen et al., 2022), but this work achieves higher evaluation results with the same memory requirements by adapting the presented model's training procedure. The predicted trajectories are then utilized by an adaptive task sequencing module, which employs a Dynamic Variable Neighborhood Search heuristic algorithm to plan the most efficient task sequence. By forecasting future human hand positions, the robot can proactively adjust its plan to avoid interrupting human intentions, resulting in seamless and efficient collaboration. This work presents the following contributions:
- A generalized model that achieves state-of-the-art performance on the future hand position forecasting task.
- A modified Travelling Salesman Problem formulation that models future hand position occupancy.
- A Dynamic Variable Neighborhood Search heuristic algorithm that plans the most efficient task sequence.
2 RELATED WORK
Future Hand Position Forecasting. Predicting hu-
man hand trajectories is a complex task that requires
understanding the intricate and dynamic nature of hu-
man movement. Recent works addressed the hand
forecasting task, but they are limited by the scope
of the applied datasets, which capture hand interac-
tions in limited settings (Fan et al., 2018; Liu et al.,
2020; Liu et al., 2022). Fan et al. introduced a
two-stream convolutional neural network architecture
that predicts future object and hand positions in video
frames based on previous frames (Fan et al., 2018).
Their innovation lies in forecasting the precise loca-
tions of objects in videos, a task not previously opti-
mized for. However, the evaluation was limited to a
specific dataset of human interactions, which may not
fully represent daily hand interactions.
Liu et al. developed a deep-learning model for
anticipating human-object interactions in egocentric
videos, introducing the concept of motor attention to
capture intentional hand movements (Liu et al., 2020).
This model jointly predicts hand trajectories, interac-
tion hotspots, and action labels, outperforming pre-
vious methods on benchmark datasets. However, it
relies on manual annotations and may face scalability
challenges with other datasets.
Liu et al. presented the Object-Centric Trans-
former (OCT) model to forecast future hand-object
interactions in egocentric videos (Liu et al., 2022).
This model captures object-centric interactions and
hand-object motion dependencies, demonstrating im-
proved performance on a new dataset generated from
existing ones. This work’s limitations include the
dataset’s limited size and the assumption that objects
are static during interactions.
Robotics Task Sequencing. Simultaneously, the development of a proactive and dynamic task sequencer poses its own challenges. Several works ad-
dressed the robotic task sequencing optimization
problem based on the Traveling Salesman Problem
formulation (Dubowsky and Blubaugh, 1989; Edan
et al., 1991; Zacharia and Aspragathos, 2005; Ko-
lakowska et al., 2014). However, these approaches
have not addressed the dynamic time windows for-
mulation due to the forecasting of human inten-
tions. Dubowsky and Blubaugh introduced an ATSP-
based method to plan time-optimal robotic manipula-
tor motions for point-to-point tasks (Dubowsky and
Blubaugh, 1989). They used dynamic programming to minimize task completion time but did not consider multiple inverse kinematics solutions, limiting optimality. Edan et al. aimed to minimize robot cycle
time while accounting for kinematics and dynamics
(Edan et al., 1991). Their approach involved a TSP
formulation and neural network solving but suffered
from extensive computational time due to frequent
cost matrix recalculations.
Zacharia et al. proposed an optimization-based
TSP solution considering multiple inverse kinemat-
ics solutions using Genetic Algorithms (Zacharia and
Aspragathos, 2005). While efficient, it is heuristic in
nature, so optimal solutions cannot be guaranteed.
Kolakowska et al.'s recent approach employs
mathematical constraints and a corresponding solver
for up to ten points (Kolakowska et al., 2014). How-
ever, the exponential computational growth with the
number of points limits its applicability to large sce-
narios.
Human-Robot Collaboration Using Human Hand
Position Prediction. Mainprice and Berenson intro-
duced a framework for safe human-robot collabora-
tion in shared workspaces (Mainprice and Berenson,
2013). They utilized a prediction algorithm to con-
struct a 3D workspace occupancy space, enhancing
safety and efficiency. However, the limited class and trajectory diversity restricts real-world applicability.
Landi et al. combined model-based and data-
driven approaches to predict human arm movements
(Landi et al., 2019). They tracked hand positions with
a Kinect camera, fitting them to a model and incorpo-
rating neural networks. However, the focus on detecting reaching motions towards the robot and the computational complexity are among its limitations.
Cheng and Tomizuka proposed a hierarchical
framework for long-term human hand trajectory pre-
diction and action duration estimation (Cheng and
Tomizuka, 2021). They employed a sigma-lognormal
function and an online algorithm for parameter ad-
justment. However, their assumption of deterministic
human actions may not hold in real-world scenarios.
Lyu et al. presented an efficient HRC pipeline
encompassing trajectory prediction, target estimation,
and robot trajectory generation (Lyu et al., 2022).
While this pipeline ensures safe collaboration, it re-
quires evaluation on more complex tasks and assumes
a static environment, limiting its real-world applica-
bility.
3 FUTURE HAND POSITION
FORECASTING
In this section, the deep learning model used to pre-
dict future hand positions from a video sequence is
presented and evaluated. The model is trained and evaluated on the Ego4D dataset and achieves state-of-the-art performance on the Future Hand Prediction benchmark.
3.1 Ego4D Dataset
Training deep learning models requires a diverse
dataset that contains annotations of human hands in
natural daily interactions. The Ego4D dataset (Grau-
man et al., 2022) is a large-scale egocentric video
dataset and benchmark suite that contains data suit-
ing that task. It consists of 3670 hours of record-
ings capturing daily life activities across 74 world-
wide locations and nine countries. The dataset in-
cludes diverse occupational backgrounds and encom-
passes hand-object interactions in various environ-
ments. It provides hand annotations along with a
hand position forecasting benchmark for evaluation.
The benchmark focuses on predicting future hand positions in future key frames based on a recording of hands over two consecutive seconds. The benchmark defines five key frames: the contact frame (x_c), the pre-condition frame (x_p), and three frames preceding the pre-condition frame by 0.5 s, 1.0 s, and 1.5 s, denoted as x_{p1}, x_{p2}, and x_{p3}, respectively.
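To make the key-frame layout concrete, the short sketch below maps a pre-condition time t_p and contact time t_c (in seconds) to the five benchmark frame indices; the helper name, the fps argument, and the rounding convention are illustrative assumptions rather than part of the benchmark code.

    # Hypothetical helper: derive the five key-frame indices of the FHP
    # benchmark from the pre-condition time t_p and contact time t_c.
    def key_frame_indices(t_p: float, t_c: float, fps: float) -> dict:
        times = {
            "x_p3": t_p - 1.5,  # 1.5 s before the pre-condition frame
            "x_p2": t_p - 1.0,
            "x_p1": t_p - 0.5,
            "x_p": t_p,         # pre-condition frame
            "x_c": t_c,         # contact frame
        }
        return {name: round(t * fps) for name, t in times.items()}

    print(key_frame_indices(t_p=3.0, t_c=3.4, fps=30))
    # {'x_p3': 45, 'x_p2': 60, 'x_p1': 75, 'x_p': 90, 'x_c': 102}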
3.2 Evaluation Metric
The evaluation metric is based on the Future Hand
Prediction (FHP) benchmark presented as part of the
Ego4D dataset (Grauman et al., 2022). The bench-
mark is based on two error metrics: Mean Key
Frame Displacement Error (M.Disp.) and Contact
Key Frame Displacement Error (C.Disp.). Mean Key
Frame Displacement Error is defined in Eq. 1.
D_m = \frac{1}{n} \sum_{i \in H_t} \lVert h_i - \hat{h}_i \rVert    (1)

H_t denotes the set of visible hand positions in key frames, n denotes the size of the set H_t, h_i refers to the predicted hand position, and \hat{h}_i refers to the ground truth hand location. It should be noted that if the hand is not present in the annotation, zeros are padded into \hat{h}_i.
The Contact Key Frame Displacement Error is defined in Eq. 2.

D_c = \lVert h_c - \hat{h}_c \rVert    (2)

h_c refers to the predicted hand position in the contact frame, while \hat{h}_c refers to the ground truth hand position in the contact frame.
It should be noted that the presented evaluation
metric does not penalize the model if it gives predic-
tions on frames without hands.
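A minimal NumPy sketch of both metrics follows, assuming predictions and zero-padded ground truth are given as arrays together with a visibility mask; the array shapes and function names are illustrative assumptions.

    import numpy as np

    def mean_keyframe_disp(pred, gt, visible):
        # Eq. 1: mean L2 displacement over visible hand positions.
        # pred, gt: (num_keyframes, num_hands, 2) pixel coordinates, with gt
        # zero-padded where a hand is absent; visible: boolean mask of shape
        # (num_keyframes, num_hands) marking annotated hands (the set H_t).
        disp = np.linalg.norm(pred - gt, axis=-1)
        return float(disp[visible].mean())

    def contact_keyframe_disp(pred_c, gt_c):
        # Eq. 2: L2 displacement for the contact frame only; inputs are (2,) points.
        return float(np.linalg.norm(np.asarray(pred_c) - np.asarray(gt_c)))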
3.3 UniFormer Model
The UniFormer model is used for forecasting future
hand positions. The UniFormer is a unified trans-
former architecture for efficient spatiotemporal repre-
sentation learning from high-dimensional videos (Li
et al., 2022). It addresses the challenges of spatiotem-
poral redundancy and dependency by combining the
strengths of 3D CNNs and vision transformers. The
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications
264
UniFormer architecture serves as a backbone in all
experiments conducted in this work. The UniFormer
is used as the feature extraction backbone and is fol-
lowed by a linear mapping function as a regressor.
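As an illustration of this arrangement, the following PyTorch sketch pairs a stand-in backbone with the linear regression head and the auxiliary existence branch introduced in Section 3.4; the module names, feature dimension, and stand-in backbone are assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class HandForecaster(nn.Module):
        # Sketch: a video backbone followed by a linear regressor. `backbone`
        # stands in for UniFormer and is assumed to map a clip to a pooled
        # feature vector of size feat_dim.
        def __init__(self, backbone: nn.Module, feat_dim: int):
            super().__init__()
            self.backbone = backbone
            # 21 targets (Section 3.4): 2 hands x 2 coords x 5 key frames + delay
            self.regressor = nn.Linear(feat_dim, 21)
            # existence probability used by the classification loss (Section 3.4)
            self.existence = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

        def forward(self, clip: torch.Tensor):
            feat = self.backbone(clip)                # (batch, feat_dim)
            return self.regressor(feat), self.existence(feat)

    # toy usage with a trivial stand-in backbone over a (B, C, T, H, W) clip
    model = HandForecaster(nn.Sequential(nn.Flatten(), nn.LazyLinear(256)), 256)
    coords, exist = model(torch.randn(2, 3, 4, 32, 32))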
3.4 Multitask Learning Loss
This work utilizes a multi-task learning loss to facilitate the learning of future hand positions and occupancy representations. Firstly, the regression task involves predicting hand positions in spatial coordinates for both hands in five key frames. Each prediction target vector consists of 20 values: two factors represent the left and right hand, two the x and y coordinates, and five the key frames. In addition to the annotations present in the dataset, this work introduces annotations for learning the contact delay time c_t. This is accomplished by calculating the difference between the ground truth contact frame time and the frame time of the first key frame. Consequently, the model becomes capable of learning and predicting the precise time delay between the pre-condition and contact frames. The resulting target representation vector comprises 21 values employed in the regression task.
The regression component measures the Smooth
L1 loss between the predicted and ground truth val-
ues. Subsequently, the losses for all 21 points are
summed and divided by the total sum of mask val-
ues. This loss component encourages the model to
generate accurate predictions for all values at the ex-
pense of producing false positives and is defined in
Equations 3 and 4.
l_i = \begin{cases} 0.5 \, (h_i - \hat{h}_i)^2 & \text{if } |h_i - \hat{h}_i| < 1 \\ |h_i - \hat{h}_i| - 0.5 & \text{otherwise} \end{cases}    (3)

RLoss = \frac{\sum_{i=1}^{N} l_i}{\sum_{i=1}^{N} m_i}    (4)
The classification component introduces an addi-
tional branch in the model output that predicts the
existence mask of the hand in a future key frame or
contact with the object. This branch outputs a single
value between 0 and 1, representing the probability of
the target’s existence. The ground truth for this branch
is a binary value of 1 (indicating target existence) or
0 (indicating target non-existence). The binary cross-
entropy loss is computed by comparing the predicted
existence probability with the ground truth existence
label as shown in Equation 5.
CLoss = -\sum_{i=1}^{N} \hat{m}_i \log(m_i)    (5)
The total loss is then formulated as a combina-
tion of the regression loss and the existence loss, suit-
ably weighted. The following equation represents the
computation of the total loss:
TotalLoss = \alpha \cdot RLoss + \beta \cdot CLoss    (6)
In Equation 6, the weights α and β are assigned
to balance the contributions of the regression and the
binary cross-entropy loss, respectively.
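For concreteness, the minimal PyTorch sketch below combines the masked Smooth L1 term (Eqs. 3 and 4) with the existence term (Eq. 5) into the weighted total of Eq. 6; the tensor shapes, the masking convention, and the use of the standard two-term binary cross-entropy are assumptions on top of the equations above, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def multitask_loss(pred, target, mask, exist_prob, exist_gt,
                       alpha=1.0, beta=1.0):
        # pred, target, mask: (batch, 21) regression values and annotation
        # mask; exist_prob, exist_gt: (batch, 1) existence probability and
        # binary label for the classification branch.
        l = F.smooth_l1_loss(pred, target, reduction="none")   # Eq. 3 per element
        r_loss = (l * mask).sum() / mask.sum().clamp(min=1.0)  # Eq. 4
        c_loss = F.binary_cross_entropy(exist_prob, exist_gt)  # Eq. 5 (two-term BCE)
        return alpha * r_loss + beta * c_loss                  # Eq. 6

    # toy call with random tensors of the assumed shapes
    loss = multitask_loss(torch.randn(4, 21), torch.randn(4, 21),
                          torch.ones(4, 21), torch.rand(4, 1), torch.ones(4, 1))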
3.5 EgoVLP Pretraining
In this work, clip-level annotations from EgoVLP are
utilized for model pretraining. Specifically, the verb-
filtered clip-level annotations provided by Chen et al.
(Chen et al., 2022) are employed. The authors highlight
that video-language pretraining based on the entire
Ego4D dataset significantly improves model perfor-
mance on various downstream tasks. Their work sug-
gests that using pretrained backbones directly yields
inferior performance due to the distribution gap be-
tween egocentric and exocentric data.
4 DYNAMIC TASK SEQUENCING
In this section, the Dynamic Traveling Salesman Problem with Time Windows formulation is presented, along with the heuristic optimization algorithm.
The trajectories predicted by the deep learning model
are utilized by the Dynamic Variable Neighborhood
Search heuristic algorithm to plan the most efficient
task sequence. By forecasting future human hand po-
sitions, the robot can proactively adjust its plan to
avoid interrupting human intentions, resulting in a
seamless and efficient collaboration.
4.1 Dynamic Traveling Salesman
Problem with Time Windows
Formulation
This work presents a formulation of the Dynamic
Traveling Salesman Problem with Time Windows (D-
TSPTW), introducing a modification to the traditional
TSPTW problem by incorporating time variability
into node parameters. The introduction of time vari-
ability in the D-TSPTW implies that the problem
graph G = (V, A), along with its node parameters, can
undergo changes over time. The main difference with
a well-established Time-Dependent Traveling Sales-
man Problem with Time Windows (TD-TSPTW) is
that D-TSPTW assumes changes in the customer con-
figurations, while TD-TSPTW assumes changes in
travel times (Vu et al., 2020).
To define the problem, specific restrictions are im-
posed. Firstly, the graph structure and node param-
eters remain unchanged during the execution of one
cycle. Here, a cycle is defined as a sequence of ac-
tions performed by an agent, which includes servic-
ing a node, traveling to another node, and waiting
for the start time of the next node. Additionally,
D-TSPTW assumes that only the following parame-
ters can change between each cycle: the number of
present nodes, their start times, and due times. Node
coordinates and service times are assumed to remain
constant throughout the execution of the task.
The D-TSPTW task is formalized as a tour on a time-expanded network denoted as D = (N, A). The time in this notation is split into discretized cycles, as described earlier. The node set N consists of nodes (i, t), where i \in V and t \in [a_i, b_i], the time window for node i. The travel arc set A comprises travel arcs represented as ((i, t), (j, \hat{t})), where i \neq j, (i, j) \in A, t \geq a_i, and \max(a_j, t + \tau_{ij}(t)) \leq \hat{t} \leq b_j.
The binary variable x_a is equal to 1 if and only if the arc a = ((i, t), (j, \hat{t})) is visited. Furthermore, c_a = c_{ij}(t) denotes the non-negative travel cost of arc a. Thus, the optimization task can be defined as an integer programming problem, as shown in Equation 7.

\min \sum_{a \in A} c_a x_a
\text{s.t.} \quad \sum_{a \in A} x_a = 1
\hat{t} \geq t + c_a - M(1 - x_a) \quad \forall a \in A
a_i \leq t \leq b_i \quad \forall i \in V
x_a \in \{0, 1\} \quad \forall a \in A    (7)
The objective function in Equation 7 aims to min-
imize the total cost of the tour. The first constraint
ensures that each arc in the set A is visited exactly
once. The second constraint enforces the requirement
that the visiting time of node j must be greater than
or equal to the sum of the visiting time of node i and
the associated arc cost.
This work assumes that the number of cycles is 5, where each cycle corresponds to a particular time in the future, as denoted in the key frame definitions in Section 3.1.
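For illustration, the sketch below enumerates the time-expanded node set N and arc set A of this formulation under a simple integer time discretization; the function names, the travel-time callback, and the toy instance are assumptions made for exposition, not the paper's implementation.

    from itertools import product

    def build_time_expanded_network(windows, travel_time):
        # windows: node i -> (a_i, b_i); travel_time(i, j, t) -> tau_ij(t).
        # Nodes are pairs (i, t) with t inside node i's time window.
        nodes = [(i, t) for i, (a, b) in windows.items() for t in range(a, b + 1)]
        arcs = []
        for (i, t), (j, t_hat) in product(nodes, nodes):
            if i == j:
                continue
            a_j, b_j = windows[j]
            # an arc is admissible if j is reachable by t_hat within j's window
            if max(a_j, t + travel_time(i, j, t)) <= t_hat <= b_j:
                arcs.append(((i, t), (j, t_hat)))
        return nodes, arcs

    # toy instance: three nodes with short windows and unit travel times
    nodes, arcs = build_time_expanded_network(
        {0: (0, 2), 1: (1, 3), 2: (2, 4)}, lambda i, j, t: 1)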
4.2 Dynamic Variable Neighborhood
Search Heuristic Algorithm
A heuristic solver for the Dynamic Traveling Sales-
man Problem with Time Windows (D-TSPTW) is
now presented. The motivation for this work comes
from the two-stage heuristic presented by da Silva
and Urrutia (Da Silva and Urrutia, 2010) for the tradi-
tional TSPTW problem. Their approach combines the
Variable Neighborhood Search (VNS) for generating
feasible solutions and the Variable Neighborhood De-
scent (VND) for optimizing local solutions. The algo-
rithm they presented demonstrated computational ef-
ficiency compared to state-of-the-art approaches and
achieved superior results in terms of execution time
on most of the benchmark instances (Da Silva and Ur-
rutia, 2010).
The approach proposed by da Silva and Urrutia is
limited to the static TSPTW problem and does not ac-
count for variations in node parameters as introduced
in the extended D-TSPTW problem formulation. Nevertheless, it can be applied in a straightforward manner to solve the D-TSPTW problem by tackling each cycle individually and will be used as a baseline in our experiments.
To improve the optimization performance, this
work proposes an extension of the GVNS opti-
mization algorithm to solve the D-TSPTW problem
called DynamicVNS. The proposed algorithm is de-
signed to tackle the dynamic nature of the prob-
lem and adapt to changing node parameters. The
main difference with the algorithm proposed by da
Silva and Urrutia is in the construction phase. The
GetFeasibleSolution() algorithm, shown in Algo-
rithm 1, is responsible for obtaining a feasible so-
lution in the construction phase for the D-TSPTW
problem. It takes the objective function f (·) and
the number of nodes n as inputs, and optionally, a
previous solution previousSolution. The algorithm
repeatedly searches for feasible solutions, employing perturbation and local search operations using the Local1Shift() function until a feasible solution is found. If previousSolution is provided, the algorithm
starts with this solution. Otherwise, it begins with
a new solution using the NewSolution() function,
which generates a path depending on an initializa-
tion strategy. The initialized path can be sorted by
customers’ ready or due times or randomly shuffled.
The perturbation is controlled by the variable k, which is increased iteratively to explore different neighborhoods until a feasible solution is found or until k_max is reached. As a result, each consecutive expanded problem formulation is solved until the last one.
5 RESULTS
This section begins by showcasing the outcomes of
the UniFormer model training and evaluation exper-
iments. Following that, it provides an explanation
Data: f(·), n, previousSolution = None
Result: x
repeat
    k ← 1;
    if previousSolution ≠ None then
        x ← previousSolution;
    else
        x ← NewSolution(n);
    end
    x ← Local1Shift(x);
    while x is unfeasible and k < k_max do
        x′ ← Shake(x, k);
        x′ ← Local1Shift(x′);
        if f(x′) < f(x) then
            x ← x′;
            k ← 1;
        else
            k ← k + 1;
        end
    end
until x is feasible;
Algorithm 1: GetFeasibleSolution() algorithm.
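A compact Python rendering of Algorithm 1 is sketched below under simplifying assumptions: the operators NewSolution, Local1Shift, and Shake are supplied by the caller (trivial placeholders are used here), and the objective f is taken to return the total constraint violation, so that f(x) = 0 marks feasibility, following the construction-phase convention of da Silva and Urrutia (2010).

    import random

    def get_feasible_solution(f, n, previous_solution=None, k_max=10,
                              new_solution=None, local_1shift=None, shake=None):
        # placeholder operators; real ones come from the VNS implementation
        new_solution = new_solution or (lambda m: random.sample(range(m), m))
        local_1shift = local_1shift or (lambda x: x)
        shake = shake or _relocate_k
        while True:                                   # repeat ... until feasible
            k = 1
            x = list(previous_solution) if previous_solution else new_solution(n)
            x = local_1shift(x)
            while f(x) > 0 and k < k_max:             # f(x) == 0 <=> feasible
                x_prime = local_1shift(shake(x, k))
                if f(x_prime) < f(x):
                    x, k = x_prime, 1                 # improvement: reset k
                else:
                    k += 1
            if f(x) == 0:
                return x

    def _relocate_k(x, k):
        # perturbation placeholder: relocate k randomly chosen nodes in the tour
        x = list(x)
        for _ in range(k):
            i, j = random.sample(range(len(x)), 2)
            x.insert(j, x.pop(i))
        return x

In DynamicVNS, this routine would be called with previous_solution set to the tour found in the preceding cycle, which is how the algorithm preserves temporal information across the expanded problem formulations.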
of the DynamicVNS heuristic optimization algorithm
experiments. The section concludes by presenting the
HRC setup, including both simulation and real robot
experiments.
5.1 UniFormer Model Training
The UniFormer model was first pretrained on the video-language EgoClip dataset and then fine-tuned for the future hand prediction task with the following configuration. Training was performed with a batch size of 16 and 8 segments per sample. The input size and short side size were set to 320. Each input consisted of 4 frames. The optimizer used was AdamW, with a learning rate of 1e-3 and beta values of 0.9 and 0.999; weight decay was set to 0.05. The training process spanned 30 epochs, with 5 warm-up epochs. During testing, 2 segments were used, and 3 crops were taken per segment. The sampling strategy involved sampling 8 frames in 30 temporal views, which showed the best performance across multiple experiments. Moreover, experiments showed that better evaluation results on the validation sets could be achieved by reducing the learning rate to 1e-4, removing abnormal videos, and employing early stopping. Finally, the model was trained using a multi-task learning approach, incorporating the best learning parameters and sampling strategy. The learning task employed the multi-task loss defined in Equations 3, 4, 5, and 6.
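To make the reported configuration concrete, a minimal PyTorch sketch of the optimizer and warm-up setup is given below; the placeholder model stands in for the UniFormer-based forecaster, and the linear ramp-up is an assumption, as the exact warm-up policy is not specified.

    import torch

    model = torch.nn.Linear(512, 21)  # placeholder for the UniFormer forecaster

    # AdamW with the reported hyperparameters
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                                  betas=(0.9, 0.999), weight_decay=0.05)
    # 5 warm-up epochs out of 30 (assumed linear ramp-up)
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1,
                                               total_iters=5)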
Despite the evaluation benchmark not explic-
itly considering the computed multi-task output, the
model achieved state-of-the-art performance on the
FHP benchmark. The final model’s performance is
compared with previous state-of-the-art approaches in
Table 1, demonstrating superior results on the test and
validation data. All presented UniFormer-based re-
sults use the (320, 8, 30) sampling strategy for com-
mon evaluation.
5.2 Qualitative Results
Figure 1 showcases qualitative results from the final trained model on sample video snippets extracted from the validation set. The figure depicts the predictions made on the five key frames (x_{p3}, x_{p2}, x_{p1}, x_p, x_c) in sequential order. Green dots represent ground truth hand positions, while blue dots depict the model's predictions. The visualization illustrates that the model can predict human hand positions in future frames across varying people, environments, and lighting conditions, and the predicted trajectory closely follows the intended human trajectory with only minor differences.
Despite minor differences, the model’s overall
performance demonstrates its ability to accurately
predict human hand positions in various scenarios.
These qualitative results provide valuable insights
into the model’s capabilities and highlight its poten-
tial for real-world applications.
Figure 1: Qualitative results showing predictions made by the final model on the 5 key frames in sequential order (x_{p3}, x_{p2}, x_{p1}, x_p, x_c). Ground truth hand positions are represented as green dots, and predictions made by the model as blue dots.
5.3 DynamicVNS Implementation and
Evaluation
The DynamicVNS algorithm, described in the methodology section, was implemented in Python 3.10. The algorithm's performance was evaluated using three different route initialization strategies: random shuffling (Random), sorting based on node due time b_i (Due Time), and sorting based on node ready time a_i (Ready Time). The execution time of the DynamicVNS algorithm was recorded for each initialization strategy and compared to the GVNS (Da Silva and Urrutia, 2010). Table 2 presents the performance differences for each strategy. Each strategy-algorithm pair was evaluated ten times on the same benchmark instance, and the execution time was recorded. Table 2 displays the mean and standard deviation of the computational times across all runs.

Table 1: Evaluation of the best-performing models on the test set and comparison with previously presented state-of-the-art.

Set    Method                           Left Hand            Right Hand
                                        M.Disp.   C.Disp.    M.Disp.   C.Disp.
Test   I3D (Grauman et al., 2022)       52.98     56.37      53.69     56.17
       UniFormer (Chen et al., 2022)    43.85     53.33      46.25     53.38
       UniFormer + Multi-Task Loss      43.00     53.47      44.36     53.42

Table 2: Comparison experiments of the DynamicVNS and GVNS algorithms on time-unrolled sequencing tasks using different initial path generation strategies (mean and standard deviation of computational time over ten runs).

        Random                 Due Time               Ready Time
        DynamicVNS   GVNS      DynamicVNS   GVNS      DynamicVNS   GVNS
mean    0.326        4.814     0.563        28.200    0.341        0.346
std     0.00179      0.56678   0.00976      0.24479   0.00846      0.00406
The results clearly show that DynamicVNS out-
performs the GVNS algorithm on all task instances
while generating solutions with the same travel cost.
The algorithm exhibits a noticeable speedup when
using the random and due time initialization strate-
gies, but the speedup is smaller when using the ready
time initialization strategy. The superior performance
of DynamicVNS can be attributed to its inherent de-
sign, which preserves and utilizes temporal informa-
tion from previous runs. It should be noted that while
the random shuffling method shows the fastest execu-
tion time in the presented table, it was not able to find
the optimal solutions in some of the benchmark in-
stances. Therefore, the ready-time initialization strat-
egy offers the best speed and robustness trade-off.
5.4 HRC Experiments
Robotic task sequencing experiments were conducted
in simulation and on a real robot setup. First, a simulation environment was constructed in the Gazebo simulation software. The simulation environ-
ment consists of a UR10e robot with a mounted cam-
era and a table with four objects on top. The sim-
ulation environment can be seen in Figure 2. Initial
experiments with pre-recorded hand movements were
conducted in the simulation environment to validate
robot task-planning behavior.
The real setup with the same configuration can be
seen in Figure 3. The only difference is the overhead-
mounted camera that captures the hand movements
and sends the images to the camera stream. Multiple
interaction experiments were conducted to confirm
the feasibility of the proposed framework for real-
time HRC application. Experiments have shown that
the forecasting algorithm can accurately predict hand
positions, and the task sequencing algorithm produces
explainable task sequences that do not interfere with
human intentions.
Figure 2: Simulation environment setup consisting of a sim-
ulated UR10e robot and tabletop objects.
Figure 3: Real setup consisting of UR10e robot, overhead
mounted camera, and tabletop objects.
6 CONCLUSIONS
The comprehensive framework developed in this paper provides an overview of the development of a proactive robotic planner. By leveraging deep models for future hand position forecasting and integrating them with task sequencing algorithms, safer and more productive human-robot interactions can be facilitated in diverse real-world applications. As a di-
rection for future work, further refinements and optimization of the framework can be explored, alongside more sophisticated travel cost estimation based on robot motion profiles, integration of obstacle avoidance modules, and more use-case experiments. Moreover, investigations into end-to-end approaches to human intention forecasting for robot task sequencing may prove effective.
REFERENCES
Ajoudani, A., Zanchettin, A. M., Ivaldi, S., Albu-Schäffer, A., Kosuge, K., and Khatib, O. (2018). Progress and prospects of the human–robot collaboration. Autonomous Robots, 42:957–975.
Alatartsev, S., Belov, A., Nykolaychuk, M., and Ortmeier,
F. (2014). Robot trajectory optimization for the re-
laxed end-effector path. In 2014 11th Int. Conf.
on Informatics in Control, Automation and Robotics
(ICINCO), volume 1, pages 385–390. IEEE.
Alatartsev, S., Stellmacher, S., and Ortmeier, F. (2015).
Robotic task sequencing problem: A survey. Journal
of intelligent & robotic systems, 80:279–298.
Ata, A. A. (2007). Optimal trajectory planning of manipu-
lators: a review. Journal of Engineering Science and
technology, 2(1):32–54.
Chen, G., Xing, S., Chen, Z., Wang, Y., Li, K., Li, Y., Liu,
Y., Wang, J., Zheng, Y.-D., Huang, B., et al. (2022).
Internvideo-ego4d: A pack of champion solutions to
ego4d challenges. arXiv preprint arXiv:2211.09529.
Cheng, Y. and Tomizuka, M. (2021). Long-term trajectory
prediction of the human hand and duration estimation
of the human action. IEEE Robotics and Automation
Letters, 7(1):247–254.
Da Silva, R. F. and Urrutia, S. (2010). A general vns heuris-
tic for the traveling salesman problem with time win-
dows. Discrete Optimization, 7(4):203–211.
Dubowsky, S. and Blubaugh, T. (1989). Planning time-
optimal robotic manipulator motions and work places
for point-to-point tasks. IEEE Transactions on
Robotics and Automation, 5(3):377–381.
Edan, Y., Flash, T., Peiper, U. M., Shmulevich, I., and
Sarig, Y. (1991). Near-minimum-time task planning
for fruit-picking robots. IEEE transactions on robotics
and automation, 7(1):48–56.
Fan, C., Lee, J., and Ryoo, M. S. (2018). Forecasting hands
and objects in future frames. In Proc. of the European
Conf. on Computer Vision (ECCV) Workshops.
Fan, J., Zheng, P., and Li, S. (2022). Vision-based holistic
scene understanding towards proactive human–robot
collaboration. Robotics and Computer-Integrated
Manufacturing, 75.
Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari,
A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M., Liu,
X., et al. (2022). Ego4d: Around the world in 3,000
hours of egocentric video. In Proc. of the IEEE/CVF
Conf. on Computer Vision and Pattern Recognition,
pages 18995–19012.
Kolakowska, E., Smith, S. F., and Kristiansen, M. (2014).
Constraint optimization model of a scheduling prob-
lem for a robotic arm in automatic systems. Robotics
and Autonomous Systems, 62(2):267–280.
Landi, C. T., Cheng, Y., Ferraguti, F., Bonfè, M., Secchi, C., and Tomizuka, M. (2019). Prediction of human arm target for robot reaching movements. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5950–5957.
Li, K., Wang, Y., Gao, P., Song, G., Liu, Y., Li, H., and
Qiao, Y. (2022). Uniformer: Unified transformer for
efficient spatiotemporal representation learning. arXiv
preprint arXiv:2201.04676.
Li, S., Han, K., Li, X., Zhang, S., Xiong, Y., and Xie, Z.
(2021). Hybrid trajectory replanning-based dynamic
obstacle avoidance for physical human-robot interac-
tion. Journal of Intelligent & Robotic Systems, 103:1–
14.
Li, S., Zheng, P., Liu, S., Wang, Z., Wang, X. V., Zheng,
L., and Wang, L. (2023). Proactive human–robot col-
laboration: Mutual-cognitive, predictable, and self-
organising perspectives. Robotics and Computer-
Integrated Manufacturing, 81:102510.
Liu, M., Tang, S., Li, Y., and Rehg, J. M. (2020). Fore-
casting human-object interaction: joint prediction of
motor attention and actions in first person video. In
Computer Vision–ECCV 2020, pages 704–721.
Liu, S., Tripathi, S., Majumdar, S., and Wang, X. (2022).
Joint hand motion and interaction hotspots prediction
from egocentric videos. In Proc. of the IEEE/CVF
Conf. on Computer Vision and Pattern Recognition,
pages 3282–3292.
Lyu, J., Ruppel, P., Hendrich, N., Li, S., Görner, M., and Zhang, J. (2022). Efficient and collision-free human-robot collaboration based on intention and trajectory prediction. IEEE Trans. on Cognitive and Developmental Systems.
Mainprice, J. and Berenson, D. (2013). Human-robot col-
laborative manipulation planning using early predic-
tion of human motion. In 2013 IEEE/RSJ Interna-
tional Conference on Intelligent Robots and Systems,
pages 299–306. IEEE.
Matheson, E., Minto, R., Zampieri, E. G., Faccio, M.,
and Rosati, G. (2019). Human–robot collaboration
in manufacturing applications: A review. Robotics,
8(4):100.
Mukherjee, D., Gupta, K., Chang, L. H., and Najjaran,
H. (2022). A survey of robot learning strategies
for human-robot collaboration in industrial settings.
Robotics and Computer-Integrated Manufacturing,
73:102231.
Vu, D. M., Hewitt, M., Boland, N., and Savelsbergh, M.
(2020). Dynamic discretization discovery for solving
the time-dependent traveling salesman problem with
time windows. Transportation science, 54(3):703–
720.
Zacharia, P. T. and Aspragathos, N. (2005). Optimal robot
task scheduling based on genetic algorithms. Robotics
and Computer-Integrated Manufacturing, 21(1):67–
79.