Towards We-intentional Human-Robot Interaction using Theory of Mind and Hierarchical Task Network

Maitreyee (https://orcid.org/0000-0002-3036-6519) and Michele Persiani (https://orcid.org/0000-0001-5993-3292)
Department of Computing Science, Umeå University, Universitetstorget 4, Umeå, Sweden
Keywords: We-intention, Joint Intention, Joint Activity, Human-Robot Interaction, We-mode, I-mode, Hierarchical Task Network, Dialogue Interaction.
Abstract: A joint activity between a human and a robot agent requires them not only to form a joint intention and share a mutual understanding about it, but also to determine their type of commitment. Such commitment types allow the robot agent to select appropriate strategies based on what can be expected from the others involved in performing the given joint activity. This work proposes an architecture embedding commitments as we-intentional modes in a belief-desire-intention (BDI) based Theory of Mind (ToM) model. Dialogue mediation gathers observations facilitating the ToM to infer the joint activity, and a hierarchical task network (HTN) plans the execution. The work is ongoing, and the proposed architecture is currently being implemented to be evaluated in human-robot interaction studies.
1 INTRODUCTION
For successful joint activities, a human and a robot agent take a we-intentional stance (Tuomela and Miller, 1988; Tuomela, 2005), aiming to act on a joint intention. In day-to-day scenarios, such interactions, for example about medication reminders, health care, or finding objects in the house, may require participants to sometimes commit to a shared goal and at other times pursue their private ones.
An example of the former case is when the human and the robot agent deliberate on how to find an object such as a newspaper, reading glasses, or a cell phone, prepare a meal, or plan and clean the house together. An instance of the latter case would be when the robot persuades the human agent to take scheduled medication or to manage daily physical well-being by rating any discomfort, taking some rest, or booking an appointment to seek medical care.
Previous research has already discussed the content required for human-robot joint actions (Clodic et al., 2014; Levine and Williams, 2018; Vinanzi et al., 2021). However, another factor that has received little attention is the stance (Schweikard and Schmid, 2020) an agent should take to perform the joint activity (Grynszpan et al., 2019; van der Wel, 2015). A research question to ask in such a case is:
How can an agent's stance be modeled for deliberation and persuasion during a joint activity, and what kind of mediation is required to achieve it?
To address this research gap we examine different modes of we-intention (Tuomela and Miller, 1988; Tuomela, 2006) determining the stance of the agents within a theory of mind (ToM) framework. Furthermore, we use dialogue communication to increase the information available to the robot when computing its ToM model. Finally, through hierarchical task networks (Georgievski and Aiello, 2015) we create probability distributions over plans to be executed legibly or optimally depending on the calculated information gain.
The proposed architecture requires a model of shared cognition and mind (Thellman and Ziemke, 2019; Bianco and Ognibene, 2019) allowing the robot agent to attribute behavior to the mental attitudes of humans or other artificial agents, similar to the concept of Theory of Mind (ToM) (Scassellati, 2002). In robotics, ToM reasons about the state of mind of other involved agents at different orders, of which the two most applicable are the first and second order (Hellström and Bensch, 2018). From the robot's standpoint, first order is the robot's reasoning about the human agent, and second order is the robot's reasoning about what the human reasons about the robot. Humans can be considered rational agents with mental attitudes composed of their knowledge about the world, their motivation, and their
goals required to mediate in their complex and unpredictable environment.
A Belief-Desire-Intention (BDI) model (Rao et al., 1995) can be used to represent agents with the mental attitudes mentioned before. In this work, the agents are represented as BDI models on which the robot agent computes first and second order ToM. The interaction between a robot and a human agent can be structured by a shared task domain. For example, in a health-care scenario a possible task is taking pills, where the robot should bring pills and/or water to its human owner. It is therefore necessary that both human and robot comply with a structure, both in terms of the tasks and of the interactions it allows. In this context we study how a robot agent can form a we-intention (Tuomela and Miller, 1988) with a human.
A we-intention is a type of aim-intention in which all the involved participants share a set of actions to be performed. We-intentional agents form a joint intention to perform an action together, namely a joint action, in one of two modalities: we-mode, when the agents act as group members “committed to collectively agreed upon goals”, and i-mode, when agents are part of a group but “committed to their own private goal” (Tuomela, 2006; Tuomela, 2005). These two modes imply legible or optimal mechanisms for the joint activity, respectively. Our contribution provides mechanisms for a robot agent to create a BDI-based ToM for executing joint activities with humans, realized by dialogue-based observation gathering and by planning the shared task using a hierarchical task network.
Section 2 briefly introduces theory of mind, motivates we-intentions for joint activity, and concludes with related work. Section 3 then presents the formal architecture created for the robot's BDI ToM, dialogue mediation, task execution, and the hierarchical task network. The article concludes with an illustrative example and ongoing work in Sections 4 and 5.
2 BACKGROUND AND RELATED WORK
This section provides the interpretations of Theory of Mind (ToM) and of stance/commitment in we-intention (Tuomela and Miller, 1988) relevant for understanding this work.
2.1 Theory of Mind
A field recently investigated to create models for belief reasoning is theory of mind (ToM) (Scassellati, 2002; Thellman and Ziemke, 2019). Theory of mind relates to the ability of agents to attribute mental states and beliefs to themselves or other agents, and to create a point of view of a situation in terms of beliefs, goals, and intentions that is not their own but rather belongs to others. A first order theory of mind is expressed in the sentence “Bob thinks that Alice thinks X”, or in other words, Bob has an estimate of Alice's mental state, believing she is thinking X. Higher order theories deepen these levels of reasoning by extending the thinking chain. A second order reasoning would be “Carl thinks that [Bob thinks that Alice thinks X]”, with brackets added to highlight the recursion. In this case Carl holds an estimate of Bob's mental state. Arbitrarily higher orders of reasoning follow the same incremental structure.

Theory of mind is considered a cornerstone of shared intentionality (Tomasello et al., 2005). Human beings are able to understand each other and collaborate thanks to their innate will to share intentions and psychological states.
2.2 We-intentional Stance
The intentions that agents' minds create in order to perform actions and achieve goals as a collective can be defined and studied under the umbrella term of ‘collective intentionality’. Explicitly discussed in phenomenology, existential philosophy, and sociological theory, collective intentionality gained attention in practical philosophy when introduced by Sellars (Wilfrid, 1980) as ‘we-referential intentions’, or simply ‘we-intentions’. He argues that we-intentions have the following characteristics: (a) they are attitudes of individuals, (b) they involve a non-parochial attitude of the individuals towards their group, and (c) there are two types of intentions: a primary i-intention and a we-derivative i-intention towards the joint intention.
Joint intention can be expressed as a case of performing joint actions by agents with joint agency (Tuomela and Miller, 1988), we-intention, and a shared mutual belief. The central tenet of we-intention entails a participation intention to perform a joint action, requiring agents to jointly intend to participate action-ally, that is, to contribute to the performance of the joint actions. A we-intending agent has a rational belief that it can perform its part of a joint action A with some probability, and that it can perform A collectively with the other agents with some non-zero probability.
The attitudes of we-intending agents allow the joint intention to exist in either i-mode or we-mode for a group, as illustrated next.
2.2.1 We-mode of We-intention
An agent in we-mode functions as a group member satisfying and committed to the group's goal g in the joint intention, participating in the joint action under the following set of requirements (Tuomela, 2006): (a) agent A_i intends to do its part of g, (b) agent A_i believes that the other agents in the group will perform their parts of g, and (c) agent A_i believes that there is a mutual belief in the group that g will be effectuated to achieve the joint intention.
2.2.2 I-mode of We-intention
An agent pursuing i-mode functions as part of a group satisfying the group's goal g while being committed to its private goal pg in the joint intention, participating in the joint action under the following set of requirements (Bratman, 1993): (a) agent A_i intends to satisfy g, (b) agent A_j intends to satisfy g, (c) A_i and A_j do so by meshing the sub-plans of their private goals pg, and (d) they have a mutual belief about (a), (b), and (c).
2.3 Joint Intention and We-intention in Human Robot Interaction (HRI)
(Clodic et al., 2014) highlight the components required for joint activities between human and robot. Their work discusses low-level processing to infer the intentions of other agents, to jointly direct attention, and to form a shared mutual belief about the task at hand.

Some previous work (Rauenbusch and Grosz, 2003; Kamar et al., 2009) proposed team rationality for building collaborative multi-agent systems. For example, in (Rauenbusch and Grosz, 2003) the authors used SharedPlans (Grosz and Kraus, 1996) and Propose Trees to model collaboration as a multi-agent planning problem, where a rational team will perform an action only if the benefit of performing it exceeds its cost.
The authors of (Devin et al., 2017) illustrated a blocks-world scenario for a human and a robot agent collaborating on a shared plan, with a focus on flexibility and non-intrusiveness. An extension of HATP (Lallement et al., 2014), a hierarchical task network planner for human-robot shared planning, was proposed with a conflict resolution mechanism to account for flexible plan interaction, and with minimal dialogue interaction for non-intrusiveness. Pike (Levine and Williams, 2018) was introduced for a robot agent to concurrently infer the human's intention and adapt to it. The robot agent reasons on specific combinations of controllable (actions the robot agent can choose to do) and uncontrollable (actions performed by the human, the environment, or other agents in the environment) possibilities for effective plans to jointly achieve a task. The combinations are constrained by the ordering of actions, time constraints, and unpredictable situations, and a temporal plan network generates real-time plans for execution. The authors of (Vinanzi et al., 2021) assume trust and intention recognition to be essential to collaborative joint activities with mutual understanding, interaction, and coordination. Their work differs from most research by determining the robot agent's trust in the human partner when performing joint activities. It performs a simulated evaluation of a learning and probabilistic cognitive architecture in which the robot agent performs the joint activity by estimating the intention of, and its trust in, the human agent.
The literature suggests that most of the work concerning joint activity in HRI relates to intention recognition and human-aware planning, i.e., what the robot should do at the perception and action level, without distinguishing the mode in which a joint activity could be pursued. Some work relating to such modality has been investigated within the context of agency. For joint activity between humans, the author of (van der Wel, 2015) studies the effect of we-mode processing of joint actions on the sense of control of the participating human agents. The author studies dyads of human agents during a joint action with different roles, finding that a we-mode stance comes with a high sense of control over one's own performance as well as over that of the other. However, online intention negotiation reduced the sense of control and shifted the stance of the dyad to i-mode.
3 METHODOLOGY
We assume that the robot is already aware of some of the goals that the human might want to pursue as a joint activity with the robot. In a day-to-day health-care scenario, the human might want the robot to remind them about taking medication on time, to help them find objects such as a newspaper, reading glasses, or their cell phone, or to make them aware of any persisting health issue they have been ignoring and to discuss possible solutions. Either the human or the robot may initiate a dialogue pursuing a goal with a joint activity by first facing the prospective participant. The initiating agent then proposes a joint intention, which can either be co-created by the other agent, resulting in we-/i-mode behavior, or break down because the other agent is engaged in or prioritizes another activity, or lacks the knowledge and/or functions to participate.
[Figure 1: Robot model and second-order theory of mind as equivalent Bayesian networks, each defined over (B, Π, O) as P_R(B, Π, O) and P_R^H(B, Π, O). The cross-entropy H(P_R, P_R^H | o) measures the difference between the robot's and the user's estimated state of mind a posteriori of the observations o gathered through the dialogue with the user.]
To this end, we only formulate the co-creation of the joint intention. Co-creation of the joint intention is realized with a ToM in which the robot agent computes an information gain between its own and the human agent's belief-desire-intention (BDI) models. The information gain then directs the agent to carry out the plan legibly or optimally.

The next sections illustrate the procedure the robot uses to (1) create its ToM model; (2) gather observations by dialogue mediation; (3) regularize an HTN plan constrained by human expectations; and (4) execute the plan legibly or optimally.
The formal procedure in the text uses the following symbols: P_R denotes the probability distribution of the robot, and P_R^H the probability distribution over the human as computed by the robot. b ∈ B denotes a belief from the set of beliefs, G is the goal, and o ∈ O is the set of observations gathered from the dialogue interaction. π ∈ Π denotes a plan from the set of task networks. H denotes the cross-entropy between the human's and the robot's intentions.
3.1 Theory of Mind: Human and Robot as BDI Models
The general architecture to plan and execute joint tasks in we-mode is shown in Figure 1. It is composed of two Belief-Desire-Intention (BDI) models in a configuration that describes a second-order theory of mind between the robot and the human agent. On the left side is the BDI model that the robot uses to plan the interaction. Its joint distribution is:

P_R(B, Π, O) = P_R(O|Π) P_R(Π|B) P_R(B)    (1)

where P_R(B) is the probability distribution of beliefs and desires (here put together for simplicity). Sampling from P_R(B) yields a candidate belief and desire of the robot, which it can use to compute its intention. P_R(Π|B) is the intention model, which computes intentions in the form of plans. Sampling P_R(Π | B = b) yields a candidate robot plan that is consistent with b. Finally, P_R(O|Π) is the observation model, which provides the robot with knowledge of how its intentions are perceived, either in the form of dialogue or of physical actions.
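As an illustration, the following Python sketch treats the BDI model of Eq. 1 as a chain of sampling steps: a belief is drawn from P_R(B), a plan from P_R(Π|B), and observations from P_R(O|Π). All distributions, predicates, and function names here are illustrative placeholders, not part of the actual implementation.

```python
import random

# Sketch of the robot's BDI model P_R(B, Pi, O) of Eq. 1 as a sampling chain.
# The predicate names, probabilities, and plan fragments are placeholders.

P_B = {"human_wants_pills": 0.7, "human_wants_water": 0.3}  # P_R(B)

def sample_belief():
    """Sample a candidate belief/desire b ~ P_R(B)."""
    return [p for p, theta in P_B.items() if random.random() < theta]

def sample_plan(belief):
    """Sample a plan pi ~ P_R(Pi | B=b); a real system would call an HTN planner."""
    plan = []
    if "human_wants_pills" in belief:
        plan += ["fetch_pills", "remind_human"]
    if "human_wants_water" in belief:
        plan += ["fetch_water"]
    return plan

def sample_observation(plan):
    """Sample o ~ P_R(O | Pi): how the intention surfaces as dialogue or actions."""
    return [f"observed({a})" for a in plan]

b = sample_belief()
pi = sample_plan(b)
print(b, pi, sample_observation(pi))
```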
On the right side is instead the second-order theory of mind of the human agent, which the robot uses to compute its expectations on the interaction. Again, the theory of mind is a BDI model, denoted P_R^H. Its joint probability distribution is defined as:

P_R^H(B, Π, O) = P_R^H(O|Π) P_R^H(Π|B) P_R^H(B)    (2)

For simplicity, we set P_R and P_R^H to be equivalent Bayesian networks. This relies on the assumption of having a correct human agent model that matches the model used by the robot, which can be achieved by training the human agent before the interactions. Notice that while the BDI models are structurally the same, the random variables can be differently distributed between them, allowing us to model uncertainty in the estimates of the human agent's beliefs or intentions.
Crucial to the functioning of the proposed theory of mind model is the distance measure H(P_R, P_R^H), which the robot uses to compute the distance between its own and the human agent's probabilistic BDI models. We refer to this degree of matching between the two models as the general understanding that the human agent has of the interaction (Hellström and Bensch, 2018).
A we-mode interaction implies that the human agent consistently understands what is happening during the interaction, i.e., the state of mind of the interaction. To operate in we-mode from the robot's side therefore means to perform in a way that keeps the ToM “synchronized”. This can, for example, be achieved by producing observations that increase the understanding of the interaction. In this context, the Information Gain (IG) that observations produce is:

IG(P_R, P_R^H | o) = H(P_R, P_R^H) − H(P_R, P_R^H | o)    (3)
The information gain will be used to guide the execution of the robot's plan during the two main phases of the interaction, namely the dialogue phase, where robot and human agent mediate the task (who will do what) of their joint activity, and the execution phase, where they jointly perform the agreed-upon activity.
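As a minimal sketch of Eq. 3, assuming both BDI models are reduced to categorical distributions over the same set of candidate intentions, the cross-entropy and information gain could be computed as follows (the distributions and their values are invented for illustration):

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x) over a shared support."""
    return -sum(p[x] * math.log(q.get(x, eps)) for x in p)

def information_gain(p_robot, p_human_prior, p_human_posterior):
    """IG = H(P_R, P_R^H) - H(P_R, P_R^H | o), as in Eq. 3."""
    return cross_entropy(p_robot, p_human_prior) - cross_entropy(p_robot, p_human_posterior)

p_robot = {"remind_pills": 0.8, "bring_water": 0.2}
p_human_prior = {"remind_pills": 0.5, "bring_water": 0.5}        # before the dialogue
p_human_posterior = {"remind_pills": 0.75, "bring_water": 0.25}  # after observing o

print(information_gain(p_robot, p_human_prior, p_human_posterior))  # > 0: o helped
```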
3.2 Dialogue Interaction and Gathering Observations
In order to mediate what they will do, the human agent and the robot perform a dialogue to decide on a set of commitments. These commitments are expressed in the form of action frames that are used to constrain the execution of the planner. Table 2 shows an example dialogue and how utterances are associated with action frames.
As proposed in previous research (Persiani and Hellström, 2020), we utilize Semantic Role Labeling to classify action frames from utterances. For example, the utterance “Can you remind me to take my pills in fifteen minutes?” could be parsed into the semantic frame (verb: remind, object: pills, recipient: me), while a classifier maps the semantic frame to a corresponding task. For example, if the task domain defines an action frame (remind ?agent ?recipient ?object ?time), the classifier could map the utterance to the partially specified action frame (remind robot0 human0 pills0 15minutes), where robot0, human0, pills0, and 15minutes are identifiers belonging to the planning instance. Not all arguments of the action frame have to be specified, and the missing ones are found during the subsequent inference steps.
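A minimal sketch of this frame-grounding step is given below, with the Semantic Role Labeling stage stubbed out; the frame schema and identifiers (robot0, human0, pills0) follow the paper's example, while the function names are our own illustrative choices:

```python
# Frame grounding sketch: from a semantic-role-labeled utterance to a partially
# specified action frame. The SRL step is stubbed; identifiers follow the
# paper's example, function names are illustrative.

ACTION_FRAMES = {"remind": ["?agent", "?recipient", "?object", "?time"]}

def srl_parse(utterance):
    """Stub for Semantic Role Labeling; a real system would run an SRL model."""
    return {"verb": "remind", "recipient": "me", "object": "pills", "time": "fifteen minutes"}

def to_action_frame(roles):
    """Ground the semantic frame against the task domain; unknown slots stay open."""
    verb = roles.get("verb")
    if verb not in ACTION_FRAMES:
        return None
    bindings = {
        "?agent": "robot0",  # the addressee of the request
        "?recipient": "human0" if roles.get("recipient") == "me" else "?recipient",
        "?object": "pills0" if roles.get("object") == "pills" else "?object",
        "?time": "15minutes" if roles.get("time") else "?time",
    }
    return [verb] + [bindings[slot] for slot in ACTION_FRAMES[verb]]

print(to_action_frame(srl_parse("Can you remind me to take my pills in fifteen minutes?")))
# ['remind', 'robot0', 'human0', 'pills0', '15minutes']
```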
We utilize a finite state machine of dialogue acts, representing the utterances in each turn, to model a dialogue scenario about medication, as illustrated in Table 2. A set of dialogue acts was selected for the purpose of this work from (Harry et al., 2017): Greeting, GoodBye, Question, Offer, CounterOffer, AcceptOffer, RejectOffer, Inform. To organize the utterances such that they fulfill the joint activity, Table 1 contains the effects of the dialogue acts with respect to three sets: θ is the set of pending questions, informs, and offers, while R and O are respectively the sets of rejected and accepted commitments. Either of the agents can initiate a dialogue with a greeting and end it with a goodbye, without any effects on the commitments. A question or an inform instantiates an action frame that one of the agents wants to achieve during the joint activity. An offer provides an agent with a possible action that can be performed during the joint activity. A counter-offer rejects an action a_1 and instead proposes an alternative a_2. An accept and a reject can be used to accept or reject proposed commitments.

At the end of the dialogue, the set o = {a_0, ..., a_n} identifies the action frames accepted through the dialogue. This set is utilized to constrain the planning procedure, as we describe next.
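The following sketch illustrates the bookkeeping of Table 1, assuming action frames are represented as hashable tuples; the preconditions and effects follow the table, and everything else is illustrative:

```python
# Bookkeeping of Table 1: theta holds pending proposals, R rejected and
# O_accepted accepted commitments. Action frames are hashable tuples.

theta, R, O_accepted = set(), set(), set()

def offer(a):
    if a not in theta and a not in R:       # precondition of Offer
        theta.add(a)                        # effect: a enters theta

def counter_offer(a1, a2):
    if a1 in theta and a2 not in R and a2 not in O_accepted:
        theta.discard(a1)                   # a1 withdrawn, a2 proposed instead
        theta.add(a2)

def accept(a):
    if a in theta:
        theta.discard(a)
        O_accepted.add(a)

def reject(a):
    if a in theta:
        theta.discard(a)
        R.add(a)

frame = ("remind", "robot0", "human0", "pills0", "15minutes")
offer(frame)
accept(frame)
print(O_accepted)   # o = the set of accepted action frames given to the planner
```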
3.3 Task Execution: Legibly or Optimally
After the dialogue, the set of mediated action frames is utilized to constrain inference using the theory of mind architecture. Two main types of inference are required for obtaining a we-mode interaction: intent recognition, which from the action frames finds the most likely intention being expressed, and legible behavior, which uses the found intention to change the original robot plan towards more legible versions.

Intent recognition is performed by solving the following equation:
g_H, π_H = argmax_{b ∈ B, π ∈ Π} P_R^H(b, π | o) ∝ P_R^H(b, π, o)    (4)
which corresponds to finding the most likely goal and plan that the second-order theory of mind yields while being constrained by o. In other words, the human agent's intention is found by simulating its model under the constraints.
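A brute-force sketch of Eq. 4 is shown below: candidate (belief, plan) pairs of the human model are enumerated, those inconsistent with the accepted action frames o are discarded, and the most likely remaining pair is returned. Candidate generation is stubbed out and the probabilities are invented:

```python
# Brute-force intent recognition (Eq. 4): keep the most likely (belief, plan)
# pair of the human model that is consistent with the accepted frames o.

def consistent(plan, o):
    """A plan is consistent with o if it contains every accepted action frame."""
    return all(frame in plan for frame in o)

def recognize_intent(candidates, o):
    """argmax over (unnormalized) P_R^H(b, pi, o), restricted to consistent plans."""
    feasible = [(b, pi, p) for b, pi, p in candidates if consistent(pi, o)]
    return max(feasible, key=lambda t: t[2])[:2] if feasible else None

candidates = [  # (belief, plan, unnormalized probability), from simulating P_R^H
    ("wants_pills", ["fetch_pills", "remind"], 0.6),
    ("wants_water", ["fetch_water"], 0.4),
]
print(recognize_intent(candidates, o=["remind"]))  # ('wants_pills', [...])
```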
Legible behavior is obtained by making the robot plan consistent with the human agent's intention. This is expressed by the following equation:

b_R, π_R = argmax_{b ∈ B, π ∈ Π} P_R(π | o) + α H(P_R^H(b_H, π_H) ‖ P_R(b, π))    (5)

The right part regularizes the planner towards intentions that are expected by the human agent model, i.e., similar to g_H and π_H.
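The following sketch approximates Eq. 5; for simplicity the divergence term is replaced by a plan-overlap score, so this illustrates the regularization idea rather than the exact equation:

```python
# Regularized plan selection in the spirit of Eq. 5. The divergence term is
# replaced here by a simple plan-overlap score; values are invented.

def overlap(pi, pi_human):
    """Cheap stand-in for the divergence term: fraction of shared actions."""
    return len(set(pi) & set(pi_human)) / max(len(pi_human), 1)

def select_plan(robot_candidates, pi_human, alpha=1.0):
    """argmax of P_R(pi | o) plus alpha times agreement with the human's plan."""
    return max(robot_candidates,
               key=lambda c: c[1] + alpha * overlap(c[0], pi_human))[0]

robot_candidates = [            # (plan, P_R(pi | o))
    (["fetch_pills", "remind"], 0.5),   # legible: matches the human's expectation
    (["remind"], 0.7),                  # optimal: shorter, but less expected
]
print(select_plan(robot_candidates, ["fetch_pills", "remind"]))  # the legible plan wins
```

Setting α to zero recovers purely optimal (i-mode) behavior, while larger values trade plan cost for legibility.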
3.4 Implementation through Hierarchical Task Network
We implemented the BDI models P_R(O, Π, B) and P_R^H(O, Π, B) by specifying planning instances using the Hierarchical Task Network (HTN) formalism. A planning instance for the robot is obtained by specifying the tuple ⟨P_R, T_R, M_R, A_R, I_R, G_R, O_R⟩, where I_R is the set of ground predicates in the initial state, G_R is the goal task to be decomposed into primitive actions, O_R is the set of objects available to ground the available predicates P_R, T_R is the set of tasks available to the planner, M_R is the set of methods to decompose tasks into sub-tasks, and A_R is the set of available primitive actions. Refer to (Bercher et al., 2014) for a complete description of the HTN formalism. Similarly, the second-order theory of mind model has components ⟨P_R^H, T_R^H, M_R^H, A_R^H, I_R^H, G_R^H, O_R^H⟩.
Table 1: Dialogue acts for the SDS that allow mediation of actions and goals with respect to the sets of offered, rejected, and accepted commitments.

Dialogue Act                Precondition                     Effect
Question, x, ĝ              ∅                                g = ĝ
Offer, x, a                 a ∉ θ ∧ a ∉ R                    a ∈ θ
CounterOffer, x, a_1, a_2   a_1 ∈ θ ∧ a_2 ∉ R ∧ a_2 ∉ O      a_1 ∉ θ ∧ a_2 ∈ θ
Accept, x, a                a ∈ θ                            a ∉ θ ∧ a ∈ O
Reject, x, a                a ∈ θ                            a ∉ θ ∧ a ∈ R
Inform, x, a_1...a_n        a_1...a_n ∈ θ                    a_1...a_n ∉ θ ∧ a_1...a_n ∈ O
Table 2: Robot and human mediating joint intention about medication, committed to the mutually agreed upon group goal (we-mode).

     Sentence                                                 Speech act   Action frame
H    Hello Robo                                               Greeting
H    Can you remind me to take my pills in fifteen minutes?   Question     remind robot0 human0 pills 15minutes
R    Yes sure!                                                Accept
R    Do you want me to bring you some water as well?          Offer        bring robot0 human0 water ?when
H    Yes, that would be great.                                Accept
R    Okay! I will remind you to take pills in fifteen
     minutes and bring you some water.                        Inform
H    Thanks a lot!                                            GoodBye
R    No problem!                                              GoodBye
To implement the equivalent Bayesian networks of Figure 1, we set the descriptive components of the planning instances of the robot and of the theory of mind to be equivalent, i.e., P_R = P_R^H, T_R = T_R^H, M_R = M_R^H, A_R = A_R^H, and O_R = O_R^H, with the only probabilistic parts being I_R, G_R, I_R^H, and G_R^H. We can realize the probability distribution over the possible HTN instances describing the robot state through a combination of Bernoulli distributions for the beliefs I_R and a categorical distribution for the possible goals G_R (and similarly for I_R^H and G_R^H):
P_R(B) = P_R(I) P_R(G)    (6)

P_R(I; θ_R) = ∏_i P(p_i ∈ I_R; θ_{p_i}),    P(p_i ∈ I_R) = θ_i    (7)

P_R(G; θ_R) = P(G | {g_0, ..., g_m}),    P(G = ⟨G_R⟩ | {g_0, ..., g_m}) = θ_j,    ∑_j θ_j = 1    (8)
Sampling a belief from P_R(B) yields an initial state and a goal task for the HTN planner. The planning model P_R(Π|B) is implemented by a planner of choice compatible with the underlying HTN requirements. For our experiments we are using the PANDA planner (Bercher et al., 2014).
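A sketch of sampling a belief from P_R(B), as defined by Eqs. 6-8, is given below: independent Bernoulli variables decide which ground predicates enter the initial state I_R, and a categorical variable selects the goal task G_R. The predicates, goals, and parameter values are illustrative only:

```python
import random

# Sampling b ~ P_R(B) per Eqs. 6-8: Bernoulli variables for the predicates of
# the initial state I_R, a categorical variable for the goal task G_R.

BELIEF_THETAS = {                 # theta_i for each candidate predicate p_i
    "at(pills0, shelf0)": 0.9,
    "thirsty(human0)": 0.4,
}
GOAL_THETAS = {                   # theta_j over candidate goal tasks, sums to 1
    "remind_medication": 0.7,
    "bring_water": 0.3,
}

def sample_htn_instance():
    initial_state = {p for p, th in BELIEF_THETAS.items() if random.random() < th}
    goal = random.choices(list(GOAL_THETAS), weights=list(GOAL_THETAS.values()))[0]
    return initial_state, goal    # handed to the HTN planner (e.g. PANDA)

print(sample_htn_instance())
```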
As also commonly proposed in research on planning, we model the human agent as expecting rational behavior from the robot, i.e., by associating higher probabilities with cheap plans. Therefore, the probability of a plan P_R(Π = π | B = b) is defined as a function of rationality:

P_R(Π = π | B = b) = α exp{τ(|π_opt,b| − |π|)}    (9)

where |π_opt,b| is the length of an optimal plan for a belief b, while |π| is the length of the considered plan. α is a normalizing constant and τ a temperature parameter. Eq. 9 gives high likelihood to plans with lower cost, with the maximum likelihood associated with plans of the same cost as the optimal plan. Sampling from the planning model can, for example, be done through diverse planning techniques (Katz and Sohrabi, 2020).
P_R^H(B) and P_R^H(Π|B) are similarly defined.
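As a worked sketch of Eq. 9, assuming unit action costs so that a plan's cost equals its length, the normalized plan probabilities over a candidate plan set could be computed as:

```python
import math

# Plan probabilities per Eq. 9, with unit action costs (cost = plan length).
# alpha is obtained by normalizing over the candidate plan set.

def plan_probabilities(plans, optimal_length, tau=1.0):
    """P(pi | b) proportional to exp(tau * (|pi_opt,b| - |pi|))."""
    weights = [math.exp(tau * (optimal_length - len(pi))) for pi in plans]
    z = sum(weights)                       # normalizer, i.e. 1/alpha
    return [w / z for w in weights]

plans = [["remind"],
         ["fetch_pills", "remind"],
         ["fetch_pills", "fetch_water", "remind"]]
print(plan_probabilities(plans, optimal_length=1))
# Cheaper plans get higher probability; the optimal-length plan is the mode.
```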
4 ILLUSTRATIVE EXAMPLE AND ONGOING WORK
To better show the functioning of the proposed architecture we present an illustrative scenario with the example dialogues of Tables 3 and 4. We assume a household scenario where a robot agent cohabits with a human, supporting them in daily tasks. Among these tasks, the robot could be asked to perform the joint activity of cleaning the house, where they plan and execute tasks to clean windows, mirrors, and other surfaces, vacuum the floor, or organize the house.
Table 3: Robot and human mediating joint intention about cleaning the house, committed to the mutually agreed upon group goal (we-intention in we-mode).

     Sentence                                             Speech act         Action frame
H    Hello                                                Greeting
H    Can you help me with cleaning the house now?         Question           clean robot0 house0 now
R    Yes sure! What would you like me to do?              Accept
H    I would like you to go and vacuum the rooms
     once I have dusted their surfaces.                   Offer              vacuum robot0 room0 floor afterDusting
R    Okay!                                                Accept
R    I will vacuum the floors once you have finished
     dusting the room. Is that correct?                   Inform, Question
H    Yes, thanks.                                         Accept
R    No problem.                                          GoodBye
Depending on the human's expectations of the robot, the proposed scenario can unfold into the two dialogues shown in Table 3 and Table 4. In the first, the human asks the robot to help clean the house by performing the single task of vacuuming the rooms, under the condition that the robot should vacuum each room after the human has cleaned the surfaces of that room. In the second case, the human instructs the robot to perform the tasks of cleaning the surfaces and vacuuming two rooms of the house (the living and guest rooms) on its own, while the human cleans the other rooms. A set of commitments for the corresponding joint activity (cleaning the house) is created using the action frames of the vacuum action in Example 3, and of the vacuum and clean-surface actions in Example 4. Depending on the dialogue scenario that the human chose, the action frames are used to guide the execution of the joint activity in the following ways. Firstly, they act as constraints while computing the intention of the human. The human's intention is computed to know what the human expects from the joint activity (we-mode in Example 3 and i-mode in Example 4). Since in the example dialogues the human uses the Offer speech act to set the goal task, what remains to compute is the plan part of the intention. To do so, the human model P_R^H is simulated with the set of action frames as constraints (Eq. 4), i.e., only intentions containing the mentioned action frames are considered. After finding the human intention, the robot computes its own intention. It can perform the activity either in i-mode or in we-mode: the robot computes an optimal plan when operating in i-mode for the dialogue of Example 4, while the plan is regularized to match the expectations of the human for the dialogue of Example 3. Such plans are created using the HTN as illustrated in Section 3.4.
This work reports on ongoing research, and we are currently implementing the different components for realizing the proposed architecture from an Activity Theoretical perspective (Vygotsky, 1978; Kaptelinin and Nardi, 2006; Leontiev, 1978). Activity theory facilitates the study of human activity at the intersection of human psychology, social relationships, and institutions (Roth, 2014). We modify the proposed BDI model in our architecture with the elements (subject, object, tool, division of labor, rules, and community) from Engeström's Activity System Model (ASM) (Engeström, 1999), for dialogues between robots and humans in a health-care setting. As a first step of validation, a qualitative study was conducted with 20 human participants, who watched recordings of a human and a robot interacting in a Wizard of Oz setup. The study aimed to understand how humans perceived HRI interactions for joint activity in the we-/i-mode of we-intention. The qualitative results highlighted that the behavior, roles, relationships, and dialogue strategies were perceived as intuitive, intelligent, and human-like; however, the robot was expected to adapt and improve towards a more empathetic behavior. The results from our qualitative study will now be used to inform the development of our architecture.
5 CONCLUSIONS
With this position paper we proposed a theory-of-mind architecture for joint human-robot activities. Joint activities are first mediated through a dialogue, allowing human and robot to specify what their commitments inside the joint activity will be. Then, from the robot's perspective, a joint activity can be executed in i-mode or we-mode. While operating in i-mode, the robot, committed to its private goal, executes the joint activity in the most optimal way given its capabilities, without considering the fact that it is collaborating with a human. This can be desirable in some scenarios. When robot and human are instead committed to the same goal, operating in we-mode, a legibility requirement is introduced to make the robot plan similar to the human's expectations.
Table 4: Robot and human mediating joint intention about cleaning the house, committed to their own private goals (we-intention in i-mode).

     Sentence                                              Speech act              Action frame
H    Hi, can you help me with cleaning the house now?      Greeting and Question   clean robot0 human0 house now
R    Yes sure! What would you like me to do?               Accept
H    I would like you to dust the surfaces and vacuum
     the floors of the living room and guest room,
     while I clean the other rooms.                        Offer                   dust,vacuum robot0 room0 living room1 guest
R    Yes sure!                                             Accept
R    I will clean the living and guest room.               Offer
H    Thanks, that would be great.                          Accept
R    No problem!                                           GoodBye
We proposed the formalization of the architecture, together with illustrative examples showing its functioning in simple scenarios where human and robot mediate joint tasks such as bringing the human their medication. Future work will conduct Human-Robot Interaction (HRI) studies to evaluate the proposed architecture.
REFERENCES
Bercher, P., Keen, S., and Biundo, S. (2014). Hybrid planning heuristics based on task decomposition graphs. In Seventh Annual Symposium on Combinatorial Search.

Bianco, F. and Ognibene, D. (2019). Transferring adaptive theory of mind to social robots: Insights from developmental psychology to robotics. In Social Robotics, pages 77-87. Springer International Publishing.

Bratman, M. E. (1993). Shared intention. Ethics, 104(1):97-113.

Clodic, A., Alami, R., and Chatila, R. (2014). Key elements for joint human-robot action. In Robo-Philosophy, volume 273 of Sociable Robots and the Future of Social Relations, pages 23-33, Aarhus, Denmark. IOS Press Ebooks.

Devin, S., Clodic, A., and Alami, R. (2017). About decisions during human-robot shared plan achievement: Who should act and how? In Social Robotics, pages 453-463, Cham. Springer International Publishing.

Engeström, Y. (1999). Activity theory and individual and social transformation. Perspectives on Activity Theory, 19(38):19-38.

Georgievski, I. and Aiello, M. (2015). HTN planning: Overview, comparison, and beyond. Artificial Intelligence, 222:124-156.

Grosz, B. and Kraus, S. (1996). Collaborative plans for complex group action. Artificial Intelligence.

Grynszpan, O., Sahaï, A., Hamidi, N., Pacherie, E., Berberian, B., Roche, L., and Saint-Bauzel, L. (2019). The sense of agency in human-human vs human-robot joint action. Consciousness and Cognition, 75:102820.

Harry, B., Volha, P., David, T., and Jan, A. (2017). Dialogue Act Annotation with the ISO 24617-2 Standard, pages 109-135. Springer International Publishing.

Hellström, T. and Bensch, S. (2018). Understandable robots - what, why, and how. Paladyn, Journal of Behavioral Robotics, 9(1):110-123.

Kamar, E., Gal, Y., and Grosz, B. J. (2009). Incorporating helpful behavior into collaborative planning. In Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS). Springer Verlag.

Kaptelinin, V. and Nardi, B. A. (2006). Acting with Technology: Activity Theory and Interaction Design. The MIT Press, USA.

Katz, M. and Sohrabi, S. (2020). Reshaping diverse planning. In AAAI, pages 9892-9899.

Lallement, R., de Silva, L., and Alami, R. (2014). HATP: An HTN planner for robotics. In Planning and Robotics (PlanRob), ICAPS Workshop, volume 2, page 8, Portsmouth, USA. AAAI Press.

Leontiev, A. N. (1978). Activity, Consciousness, and Personality. Prentice-Hall.

Levine, S. J. and Williams, B. C. (2018). Watching and acting together: Concurrent plan recognition and adaptation for human-robot teams. J. Artif. Int. Res., 63(1):281-359.

Persiani, M. and Hellström, T. (2020). Intent recognition from speech and plan recognition. In International Conference on Practical Applications of Agents and Multi-Agent Systems, pages 212-223. Springer.

Rao, A. S., Georgeff, M. P., et al. (1995). BDI agents: From theory to practice. In ICMAS, volume 95, pages 312-319, USA. MIT Press.

Rauenbusch, T. W. and Grosz, B. J. (2003). A decision making procedure for collaborative planning. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS '03, pages 1106-1107. Association for Computing Machinery.

Roth, W.-M. (2014). Activity Theory, pages 25-31. Springer New York, New York, NY.

Scassellati, B. (2002). Theory of mind for a humanoid robot. Autonomous Robots, 12:13-24.

Schweikard, D. P. and Schmid, H. B. (2020). Collective Intentionality. In Zalta, E. N., editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2020 edition.

Thellman, S. and Ziemke, T. (2019). The intentional stance toward robots: Conceptual and methodological considerations. In The 41st Annual Conference of the Cognitive Science Society, July 24-26, Montreal, Canada, pages 1097-1103.

Tomasello, M., Carpenter, M., Call, J., Behne, T., and Moll, H. (2005). Understanding and sharing intentions: The origins of cultural cognition. Behavioral and Brain Sciences, 28(5):675-691.

Tuomela, R. (2005). We-intentions revisited. Philosophical Studies, 125(3):327-369.

Tuomela, R. (2006). Joint intention, we-mode and I-mode. Midwest Studies In Philosophy, 30(1):35-58.

Tuomela, R. and Miller, K. (1988). We-intentions. Philosophical Studies, 53(3):367-389.

van der Wel, R. P. (2015). Me and we: Metacognition and performance evaluation of joint actions. Cognition, 140:49-59.

Vinanzi, S., Cangelosi, A., and Goerick, C. (2021). The collaborative mind: Intention reading and trust in human-robot interaction. iScience, 24(2):102130.

Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Mental Processes. (E. Rice, Ed. & Trans.).

Wilfrid, S. (1980). On reasoning about values. American Philosophical Quarterly, 17:81-101. JSTOR.