Sentiment-Based Engagement Strategies for Intuitive Human-Robot
Interaction
Thorsten Hempel (https://orcid.org/0000-0002-3621-7194), Laslo Dinges (https://orcid.org/0000-0003-0364-8067) and Ayoub Al-Hamadi (https://orcid.org/0000-0002-3632-2402)
Neuro-Information Technology, Faculty of Electrical Engineering and Information Technology,
Otto von Guericke University, Magdeburg, Germany
Keywords:
Human-Robot Interaction, Approaching Strategy, Sentiment Estimation, Emotion Detection, Anticipating
Human Behaviors, Approaching People, Productive Teaming.
Abstract:
Emotion expressions serve as important communicative signals and are crucial cues in intuitive interactions
between humans. Hence, it is essential to include these fundamentals in robotic behavior strategies when inter-
acting with humans to promote mutual understanding and to reduce misjudgements. We tackle this challenge
by detecting and using the emotional state and attention for a sentiment analysis of potential human inter-
action partners to select well-adjusted engagement strategies. This way, we pave the way for more intuitive
human-robot interactions, as the robot’s action conforms to the person’s mood and expectation. We propose
four different engagement strategies with implicit and explicit communication techniques that we implement
on a mobile robot platform for initial experiments.
1 INTRODUCTION
Once utopian, robots are increasingly moving from
industrial and laboratory settings to the real world
in order to assist humans in everyday tasks. In this
process, human-robot interaction became a central
point of interest in robotic research that investigates
the manifold challenges in performing interactive and
collaborative tasks. As a general principle, the robot
needs to be equipped with the necessary skills to en-
able intuitive interactions with arbitrary human inter-
action partners, regardless of the corresponding task,
the human’s intention and communication behavior.
To this end, a fundamental objective is the anticipa-
tion of appropriate strategies to proactively approach
or evade people based on the situation-specific con-
text, such as the person's mood, attitude (Müller-
Abdelrazeq et al., 2019; Elprama et al., 2016) and
resentments towards robots (Naneva et al., 2020;
Miller et al., 2021). First, it requires the determina-
tion of people’s interest in interacting at all (Satake
et al., 2009; Finke et al., 2005). Thereupon, the robot
has to find an appropriate strategy for either engaging
people to establish an interaction or to avoid a con-
frontation if desired. In order to reach these under-
lying decisions, the robot has to cope with a number of
sub-tasks: estimating people’s focus of attention (Dini
et al., 2017), predicting their mental state, and carry-
ing out a (dis-)engagement strategy (Avelino et al.,
2021).
The estimation of interaction willingness in par-
ticular is a very challenging task, as this mental state
tends to provide only vague social signals. As an
alternative approach, many methods (Oertel et al.,
2020) instead detect engagement in order to determine
whether a person has already approached the robot
and is waiting for the robot's reaction. This state expresses
itself more clearly, e.g., in voice commands (Foster
et al., 2017), gestures, and proxemics, but it under-
mines the proactive approach and neglects situations
where the human partner is in need of help but is too
uncertain or not aware of the robot.
In this work, we close the gap between the lack of
proactive behavior and advanced mental state predic-
tions by introducing a sentiment-based engagement
strategy.
By combining the focus of attention and emotions,
we derive a sentiment analysis that allows us to select
fine-grained behavior patterns that exceed current bi-
nary engagement and disengagement strategies. We
implement our approach on a mobile robot platform
and execute our strategies using explicit and implicit
embodied communication. Our models are built on
minimum-sized networks in order to run our
method on mobile systems with limited computa-
tional resources.
2 RELATED WORK
In recent years, proxemic rules have been a popular
tool to draw conclusions about the different states of
human-robot interactions (Mumm and Mutlu, 2011)
and how spatial zones can be leveraged to improve
them (Syrdal et al., 2007). (Repiso et al., 2018) pre-
dict appropriate encounter points to achieve a natu-
ral engagement with a group of people. Similarly,
(Satake et al., 2009) proactively approach detected
people to offer help. (Fischer et al., 2021) verbally
greet people and dynamically adapt the voice volume
based on the distance to the target person. Follow-
ing (Kendon, 1990), (Carvalho et al., 2021) apply the
Kendon’s greeting model to approach people while
tracking the human mental states of the interaction us-
ing multiple features, such as gaze, facial expressions,
and gestures. Yet, all of these methods neglect whether the
human counterpart is actually interested and ready to
initiate an interaction. This is problematic, as in the
case of a negative attitude, the robot's interaction ef-
forts can be perceived as rude and annoying (Bröhl
et al., 2019). An initial attempt to address this was
presented by (Kato et al., 2015), who only approach
and offer help if they detect signs of attention towards
the robot. However, this still does not incorporate the actual
mental attitude of the human counterpart towards the
interaction itself.
The recent advances in the area of deep learning
have opened up new possibilities for the visual analysis of
humans, ranging from general detection tasks to the
estimation of facial micro-expressions. This allows
the aggregation of additional relevant information,
such as emotions (Chuah and Yu, 2021; Spezialetti
et al., 2020), to better understand the behavior and in-
tentions of the human interaction partner. To the best
of our knowledge, we are the first to leverage this ad-
vancement by predicting sentiment states based on vi-
sual emotion estimation to improve interaction strat-
egy for robots. We summarize our main contributions
as follows:
We propose an image-based sentiment analysis to
gauge the current interaction preference of poten-
tial human interaction partners.
We propose four different behavior strategies (En-
gage, Attract, Ignore, Avoid) that comprise a
number of different explicit and implicit commu-
nication modalities.
Figure 1: The TIAGo robot platform with the key operations used in our method: yaw/tilt head, rotate base, lift torso, and speech.
We design ultra-lightweight models to implement
our method and integrate it on a mobile robot plat-
form.
We conduct initial experiments in laboratory set-
tings for an early evaluation.
3 METHOD
This section explains each step of our proposed
method for sentiment-based interaction behavior.
The sentiment analysis is based on two main fea-
tures: head pose and emotion estimation. The head
pose estimation is a leading indicator to gauge the cur-
rent visual focus of attention of the person. The emo-
tion estimation predicts the current emotional state of
the person. Together, they allow us to infer whether
a person a) is seeking interaction, b) is un-
decided about it, or c) does not want to engage with
the robot at all. Determining the sentiment state en-
ables the robot to perform a dedicated behavior that cor-
responds with the expected reaction of the person
and thus makes the interaction more intuitive. Figure 2
illustrates an overview of our proposed system. In the
following subsections, we give details about each
component of the system and their interplay.
3.1 Robot Platform
We use the TIAGo robot from PAL Robotics, a mobile
service robot for indoor environments. It is equipped
with an RGB-D camera mounted inside its head, which
can be yawed and pitched to dynamically perceive the
environment. For grasping and moving objects, the
robot has a manipulator arm with a five-finger gripper.
The torso can be lifted to adjust the robot’s height, and
integrated speakers allow the output of voice com-
mands. These abilities allow the combination of ex-
plicit and implicit communication modalities that we
use in our method.
Head Following: The robot’s head aims at and fol-
lows a moving person as an implicit signal of at-
tentiveness without actively approaching.
Body Following: If a person exceeds the range of
the robot’s moving head, the entire robot base ro-
tates in order to keep track of the person. This
behavior communicates an attention commitment
and is therefore an even stronger signal than
head following.
Torso Lifting: Lifting the torso is a rather subtle
communication cue, as it is perceived similarly to
the social convention of standing up while greeting
someone.
Speech: Approaching people with speech rep-
resents an explicit communication strategy that
more strongly commits them to a reaction. Hence,
speech should only be used if the approached per-
son is highly expected to welcome an interaction
(a sketch of how these primitives can be commanded
follows below).
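As a rough illustration, the following Python sketch shows how these communication primitives could be commanded through standard ROS interfaces. It is a minimal sketch under assumptions: the topic and joint names (/mobile_base_controller/cmd_vel, /head_controller/command, /torso_controller/command, head_1_joint, head_2_joint, torso_lift_joint) follow common TIAGo conventions and may differ on a concrete setup, and the speech modality (the robot's text-to-speech interface) is omitted.

#!/usr/bin/env python
# Minimal sketch of the low-level communication primitives; topic and joint
# names are assumptions based on common TIAGo conventions, not the deployed
# configuration.
import rospy
from geometry_msgs.msg import Twist
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint


class CommunicationPrimitives:
    def __init__(self):
        # velocity commands for rotating the base towards a person
        self.base_pub = rospy.Publisher('/mobile_base_controller/cmd_vel',
                                        Twist, queue_size=1)
        # joint trajectories for the pan/tilt head and the lifting torso
        self.head_pub = rospy.Publisher('/head_controller/command',
                                        JointTrajectory, queue_size=1)
        self.torso_pub = rospy.Publisher('/torso_controller/command',
                                         JointTrajectory, queue_size=1)

    @staticmethod
    def _trajectory(joints, positions, seconds):
        traj = JointTrajectory()
        traj.joint_names = joints
        point = JointTrajectoryPoint()
        point.positions = positions
        point.time_from_start = rospy.Duration(seconds)
        traj.points.append(point)
        return traj

    def look_at(self, pan, tilt):
        # head following: keep the detected face close to the camera center
        self.head_pub.publish(self._trajectory(
            ['head_1_joint', 'head_2_joint'], [pan, tilt], 1.0))

    def rotate_base(self, yaw_rate):
        # body following: rotate in place when the head range is exceeded
        cmd = Twist()
        cmd.angular.z = yaw_rate
        self.base_pub.publish(cmd)

    def lift_torso(self, height=0.30):
        # subtle "standing up" cue while greeting
        self.torso_pub.publish(self._trajectory(
            ['torso_lift_joint'], [height], 3.0))


if __name__ == '__main__':
    rospy.init_node('communication_primitives_demo')
    primitives = CommunicationPrimitives()
    rospy.sleep(1.0)          # give the publishers time to connect
    primitives.lift_torso(0.25)

In practice, the pan and tilt targets for look_at would be derived from the face position in the camera image, for example with a simple proportional controller that keeps the detected face near the image center.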
3.2 Visual Attention
Typical indicators for gauging the human visual at-
tention are the gaze and, more coarsely, the head
pose. As the gaze depends on the eyes, which take up
only a small part of face images, gaze estimation is prone to er-
rors. Therefore, we estimate the visual attention of
the surrounding persons based on a head pose predic-
tion algorithm. At first, we locate faces in the image
stream using an ultra light SSD face detector. Then,
the face crops are further processed by the 6DRep-
Net (Hempel et al., 2022), that uses a rotation repre-
sentation to directly regress yaw, pitch, and roll an-
gle of the faces. To ensure real-time processing capa-
bilities, we replace the original 6DRepNet backbone
with the efficient MobileNetv3-Small (Howard et al.,
2019). The new head pose prediction model is able to
maintain 90% of its accuracy compared to the origi-
nal model, but is downsized to only one tenth of its
parameter count.
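To make the attention criterion concrete, the following sketch shows how an estimated head pose could be thresholded into a binary facing/averted decision. The face_detector and head_pose_net callables are placeholders for the ultra-light SSD face detector and the MobileNetV3-based 6DRepNet variant described above, and the angle thresholds are illustrative assumptions rather than values evaluated in this work.

YAW_LIMIT_DEG = 25.0      # assumed tolerance, not an evaluated value
PITCH_LIMIT_DEG = 20.0    # assumed tolerance, not an evaluated value

def visual_attention(frame, face_detector, head_pose_net):
    """Return a (bbox, facing_robot) tuple for every detected face."""
    results = []
    for (x1, y1, x2, y2) in face_detector(frame):   # pixel coordinates
        crop = frame[y1:y2, x1:x2]                   # face crop for the pose net
        yaw, pitch, _roll = head_pose_net(crop)      # Euler angles in degrees
        facing = abs(yaw) < YAW_LIMIT_DEG and abs(pitch) < PITCH_LIMIT_DEG
        results.append(((x1, y1, x2, y2), facing))
    return results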
3.3 Emotion Estimation
In order to assess the attitude towards robots as well
as differentiated subjective states such as dislike or
willingness to cooperate, basic emotions can be used
as indicative features. Naturally, these are frequently
communicated by facial expressions. Hence, au-
tomatic facial expression recognition is an essen-
tial method to generate features for Human-Robot-
Collaboration and attention prediction.
Previous facial expression recognition methods
adhere to the established pipeline of face detection,
landmark extraction, and action unit (AU) detection. Then,
emotions are classified using these AUs as feature
vectors (Wegrzyn et al., 2017; Vinkemeier et al.,
2018; Werner et al., 2017). However, end-to-end
learning on a single but more comprehensive and bet-
ter generalizing database often outperforms such tra-
ditional approaches. The AffectNet database (Molla-
hosseini et al., 2017), for example, contains about 500k
in-the-wild samples for neutral and the basic emo-
tions happy, disgust, fear, surprise, anger, and sad-
ness. Furthermore, it is the only database that also
includes ground truth for valence and arousal (VA),
which are, in contrast to the basic emotion classes,
continuous labels in [-1, 1] that can be used for re-
gression tasks. VA is less intuitively interpretable by
humans, but it allows differentiation between facial
expressions of varying degrees. This is particularly
useful for capturing modest but long-term changes
in expressed emotion. In addition, a multitask net-
work (simultaneous training of classification and re-
gression) also improved the accuracy of the classifi-
cation. Furthermore, valence summarizes essential
information about several emotions in one parameter
(negative, neutral, positive), which may facilitate the
development of heuristic behavior rules for some sce-
narios.
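A multitask head of this kind can be realized with one shared backbone and two small output branches, one for the emotion classes and one for valence/arousal regression. The following PyTorch sketch is illustrative only: the backbone choice, the number of classes, and the loss weighting are assumptions rather than the exact configuration used in our experiments.

import torch.nn as nn
from torchvision import models

class MultitaskEmotionNet(nn.Module):
    def __init__(self, num_classes=8):   # class count depends on the label set
        super().__init__()
        # torchvision >= 0.13; use pretrained=False on older versions
        backbone = models.mobilenet_v3_large(weights=None)
        feat_dim = backbone.classifier[0].in_features
        backbone.classifier = nn.Identity()        # keep only the feature vector
        self.backbone = backbone
        self.cls_head = nn.Linear(feat_dim, num_classes)                  # emotion classes
        self.va_head = nn.Sequential(nn.Linear(feat_dim, 2), nn.Tanh())   # VA in [-1, 1]

    def forward(self, x):
        feats = self.backbone(x)
        return self.cls_head(feats), self.va_head(feats)

def multitask_loss(logits, va_pred, labels, va_true, va_weight=1.0):
    # both losses backpropagate through the shared backbone, which couples
    # the classification and regression tasks
    return (nn.functional.cross_entropy(logits, labels)
            + va_weight * nn.functional.mse_loss(va_pred, va_true))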
In our former work, we proposed a multitask
network based on YOLOv3 (Dinges et al., 2021).
For the current paper, we compared EfficientNetV2-
S, ResNet18, and MobileNetV3-Large. Despite its
more poorly converging training loss, MobileNet, which is
optimized for limited hardware, achieved almost the
same accuracy on the test set as EfficientNet (devia-
tion < 1%). In addition, we used a weighted random
sampler to handle the imbalanced dataset and re-trained
the network on the training and test set to generate
an application model with increased robustness to un-
seen data.
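The class-balancing step can be sketched as follows. The batch size and sampling with replacement are assumptions, but the principle matches the weighted random sampler described above: every sample is drawn with a probability inversely proportional to the size of its emotion class, so rare classes appear as often as frequent ones during training.

import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def balanced_loader(dataset, labels, batch_size=64):
    labels = np.asarray(labels)                  # integer class index per sample
    class_counts = np.bincount(labels)
    sample_weights = 1.0 / class_counts[labels]  # one weight per sample
    sampler = WeightedRandomSampler(
        weights=torch.as_tensor(sample_weights, dtype=torch.double),
        num_samples=len(labels),
        replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)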
3.4 Sentiment Analysis and
Approaching Strategy
Detecting emotional expressions combined with the
visual focus of attention allows the robot to perceive important
communicative signals. In human-human interac-
tion, for example, you may be more likely to greet a stranger who
smiles at you but avoid eye contact if that stranger
is scowling.
Figure 2: Method overview of our proposed system (perception: face detection, head pose estimation, emotion estimation; sentiment analysis: emotional state with focus of attention; behavior strategies: Engage, Attract, Avoid, Ignore). Detected faces are used to estimate the head pose and the corresponding emotions based on facial expressions. Both pieces of information are used to determine the emotional state together with the person's visual focus of attention, which indicates the sentiment and allows a fine-tuned behavior strategy.
Figure 3: Behavior strategies based on the emotional state (Happy, Uncertain, Surprise, Neutral, Sad, Fear, Disgust, Anger) and the attentive state (facing or averted). Engage: lift torso, follow with head and base, and approach by speech. Attract: lift torso and follow with head. Avoid: look down and avoid attraction. Ignore: follow with head.
We transfer these principles to determine sentiment
states based on the combination of the different
emotions and visual attention. If a person shows
visible signs of fear or anger while being attentive
towards the robot, this indicates a negative attitude
towards the robot. In this case, a passive
robot behavior without hectic movements can help to
defuse the situation and to calm the person. If a per-
son is fearful but not attentive towards the robot, the
reason for the fear might not be the robot. Execut-
ing a more active robot behavior to show awareness
could therefore be a more suitable strategy. Hence,
our sentiment analysis allows us to compartmental-
ize the common hard binary engagement decision into
softer behaviors. Instead of choosing between engaging
and not engaging, we choose between four different
behaviors: Engage, Attract, Avoid, Ignore.
Engage: Engaging extends the torso of the
robot to signal awareness of the person's pres-
ence. Additionally, voice commands greet
the person to actively initiate an interaction. This
strategy is triggered when the attention is directed at
the robot with a clearly positive emotional attitude.
Attract: Positive emotional attitude without atten-
tion towards the robot leads to more passive be-
havior. The goal is to signal an offer of inter-
action and help without actively committing to an
interaction. This is done by extending the torso
to show awareness and facing the person, but without
a verbal greeting. This way, a non-verbal offer of
support (especially in case of uncertainty and sad-
ness) is subtly made that can be easily accepted
or ignored.
Avoid: The avoiding strategy is executed when
a person showing strong negative emotions (fear,
disgust, anger) is facing the robot. In this case, the
robot avoids eye contact (keeps the face outside the
camera center) and takes a passive role. Especially
in case of fear, the goal is to increase the feeling of
safety by avoiding unexpected motions.
Ignore: If the negative emotions are not directed
at the robot, the robot mostly ignores the person,
only following them with its head to capture changes
in emotional or attention states.
Figure 3 shows all combinations of emotional and at-
tention states with the corresponding behavior strategies.
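The resulting decision logic can be expressed as a small lookup, sketched below in Python. The grouping of emotions into positive, mildly negative, and strongly negative sets follows our reading of the strategy descriptions above and may differ in detail from the deployed rule set.

from enum import Enum

class Behavior(Enum):
    ENGAGE = 'engage'    # lift torso, follow with head and base, speak
    ATTRACT = 'attract'  # lift torso, follow with head
    AVOID = 'avoid'      # look down, avoid attraction
    IGNORE = 'ignore'    # follow with head only

POSITIVE = {'happy', 'surprise', 'neutral'}
MILD_NEGATIVE = {'uncertain', 'sad'}            # assumed grouping
STRONG_NEGATIVE = {'fear', 'disgust', 'anger'}

def select_behavior(emotion, facing_robot):
    if emotion in STRONG_NEGATIVE:
        return Behavior.AVOID if facing_robot else Behavior.IGNORE
    if emotion in MILD_NEGATIVE:
        return Behavior.ATTRACT                 # subtle, non-verbal offer of support
    return Behavior.ENGAGE if facing_robot else Behavior.ATTRACT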
4 IMPLEMENTATION
We used the ROS (https://www.ros.org/) middleware to implement our pro-
posed solution on our robot platform, as this also
offers a simplified integration into other ROS-based
robot platforms. Face detection, head pose and emo-
tion estimation are executed in separate nodes that
forward the necessary information to perform the sen-
timent analysis and the action controls. While the lat-
ter is performed on the robot itself, the inference of
Figure 4: Sentiment analysis in a laboratory setting. The
person is tracked using a face detector, followed by a head
pose and emotion estimation. As the person shows atten-
tiveness and happiness, a positive sentiment towards the robot
is expected and a proactive initiation strategy is selected.
the neural networks is outsourced to a separate note-
book that is mounted on the shoulder of the robot.
This ensures the complete mobility of the robot, while
exploiting additional computational resources. The
notebook has a Quadro RTX 4000 processing all
models in parallel with the following inference times:
Face Detection: 6.7 ms
Head Pose Estimation: 1.4 ms
Emotion Estimation: 6.3 ms
The visual perception can therefore be performed in
real time and is able to continuously track people in
the surroundings.
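The glue between perception and action can be pictured as a small ROS node that fuses the perception outputs and publishes the selected strategy. The topic names and message contents below are illustrative placeholders rather than the deployed interface, and select_behavior refers to the mapping sketched in Section 3.4.

#!/usr/bin/env python
# Sketch of the sentiment node wiring; topics and message formats are assumed.
import rospy
from std_msgs.msg import String
from sentiment_mapping import select_behavior   # hypothetical module (Section 3.4)


class SentimentNode:
    def __init__(self):
        self.emotion = None
        self.facing = None
        self.pub = rospy.Publisher('/behavior_strategy', String, queue_size=1)
        rospy.Subscriber('/perception/emotion', String, self.on_emotion)
        rospy.Subscriber('/perception/attention', String, self.on_attention)

    def on_emotion(self, msg):
        self.emotion = msg.data                  # e.g. 'happy'
        self.update()

    def on_attention(self, msg):
        self.facing = (msg.data == 'facing')     # 'facing' or 'averted'
        self.update()

    def update(self):
        if self.emotion is None or self.facing is None:
            return
        behavior = select_behavior(self.emotion, self.facing)
        self.pub.publish(String(data=behavior.value))


if __name__ == '__main__':
    rospy.init_node('sentiment_analysis')
    SentimentNode()
    rospy.spin()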
5 EXPERIMENT
We conducted an initial experiment in a laboratory
setting to test our implemented method on the TIAGo
robot platform. The objective of the experiment was
to gain experience about the model’s performances in
real application scenarios. Figure 4 shows a snapshot
from one of our tests. The robot platform detects a
face in the environment and predicts head pose and
emotional state. The focus of attention is on the robot
(indicated by the green arrow) in conjunction with ex-
pressing happiness, so the robot initiates an interac-
tion by following the engage strategy.
While the models provide very robust predic-
tions, the implementation is currently entirely based
on single-shot detections. Yet, over the duration of
face tracking, facial expressions change intermit-
tently and short distractions occasionally make the
head pose oscillate. This leads to abrupt changes of
states within short time periods. Hence, to improve
the robustness of the sentiment analysis, conclusions
about the current state should incorporate the predic-
tions from multiple time frames.
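A straightforward realization is a sliding-window majority vote over the per-frame predictions, as sketched below. The window length is an assumption, and a more principled temporal-probabilistic model is left for future work (see Section 6).

from collections import Counter, deque

class TemporalFilter:
    def __init__(self, window=15):               # roughly 0.5 s at 30 fps (assumed)
        self.emotions = deque(maxlen=window)
        self.attention = deque(maxlen=window)

    def update(self, emotion, facing):
        # append the per-frame predictions and return smoothed values
        self.emotions.append(emotion)
        self.attention.append(facing)
        smoothed_emotion = Counter(self.emotions).most_common(1)[0][0]
        smoothed_facing = sum(self.attention) > len(self.attention) / 2
        return smoothed_emotion, smoothed_facing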
6 CONCLUSION
In this paper, we present a sentiment-based method-
ology to improve the intuitiveness and reasoning in
human-robot interactions. Our method focuses on the
visual perception of mental states and the focus of at-
tention to derive a sentiment analysis for suitable in-
teraction initiation strategies. We also take into ac-
count the case where people are not interested in an
engagement at all, which has been mainly neglected
in the recent literature. To implement our method,
we trained lightweight perception models that we in-
tegrate on a mobile robot platform. Finally, we con-
ducted initial experiments using a mobile robot plat-
form to analyze the performance of our models and
the overall system.
In the future, we will extend our work by
embedding our sentiment analysis in a temporal-
probabilistic framework to improve the robustness in
real-world scenarios. Here, we will also incorpo-
rate the valence and arousal predictions to evaluate
the emotion’s intensity and transitions between emo-
tional states. Further possible enhancements are the use
of the robot's mobility to actively engage and dis-
engage people and the addition of interaction modal-
ities, such as following people by command, getting
out of the way of passing people, or approaching a
place that offers a better overview of the surround-
ings.
Finally, we will conduct a study with test subjects
to evaluate our approach in terms of intuitiveness and
sympathy.
ACKNOWLEDGEMENTS
This work is funded and supported by the Fed-
eral Ministry of Education and Research of Ger-
many (BMBF) (AutoKoWAT-3DMAt under grant Nr.
13N16336) and German Research Foundation (DFG)
under grants Al 638/13-1, Al 638/14-1 and Al 638/15-
1.
REFERENCES
Avelino, J., Garcia-Marques, L., Ventura, R., and
Bernardino, A. (2021). Break the ice: a survey on
socially aware engagement for human–robot first en-
counters. International Journal of Social Robotics,
13:1851 – 1877.
Bröhl, C., Nelles, J., Brandl, C., Mertens, A., and Nitsch,
V. (2019). Human-robot collaboration acceptance
model: Development and comparison for Germany,
Japan, China and the USA. I. J. Social Robotics, 11:709–
726.
Carvalho, M., Avelino, J., Bernardino, A., Ventura, R.
M. M., and Moreno, P. (2021). Human-robot greet-
ing: tracking human greeting mental states and acting
accordingly. 2021 IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems (IROS), pages
1935–1941.
Chuah, S. H.-W. and Yu, J. (2021). The future of ser-
vice: The power of emotion in human-robot interac-
tion. Journal of Retailing and Consumer Services,
61:102551.
Dinges, L., Al-Hamadi, A., Hempel, T., and Al Aghbari,
Z. (2021). Using facial action recognition to evaluate
user perception in aggravated hrc scenarios. In 2021
12th International Symposium on Image and Signal
Processing and Analysis (ISPA), pages 195–199.
Dini, A., Murko, C., Yahyanejad, S., Augsdörfer, U., Hof-
baur, M., and Paletta, L. (2017). Measurement and
prediction of situation awareness in human-robot in-
teraction based on a framework of probabilistic atten-
tion. In 2017 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS), pages 4354–
4361.
Elprama, S., El Makrini, I., Vanderborght, B., and Jacobs,
A. (2016). Acceptance of collaborative robots by fac-
tory workers: a pilot study on the role of social cues
of anthropomorphic robots.
Finke, M., Koay, K. L., Dautenhahn, K., Nehaniv, C., Wal-
ters, M., and Saunders, J. (2005). Hey, i’m over here
- how can a robot attract people’s attention? In RO-
MAN 2005. IEEE International Workshop on Robot
and Human Interactive Communication, 2005., pages
7–12.
Fischer, K., Naik, L., Langedijk, R. M., Baumann, T.,
Jelínek, M., and Palinko, O. (2021). Initiating human-
robot interactions using incremental speech adapta-
tion. In Companion of the 2021 ACM/IEEE Interna-
tional Conference on Human-Robot Interaction, HRI
’21 Companion, page 421–425, New York, NY, USA.
Association for Computing Machinery.
Foster, M. E., Gaschler, A., and Giuliani, M. (2017). Au-
tomatically classifying user engagement for dynamic
multi-party human–robot interaction. International
Journal of Social Robotics, 9.
Hempel, T., Abdelrahman, A. A., and Al-Hamadi, A.
(2022). 6d rotation representation for unconstrained
head pose estimation. In 2022 IEEE International
Conference on Image Processing (ICIP), pages 2496–
2500.
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B.,
Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V.,
Le, Q. V., and Adam, H. (2019). Searching for mo-
bilenetv3. In Proceedings of the IEEE/CVF Interna-
tional Conference on Computer Vision (ICCV).
Kato, Y., Kanda, T., and Ishiguro, H. (2015). May I
help you? - Design of human-like polite approach-
ing behavior. In 2015 10th ACM/IEEE International
Conference on Human-Robot Interaction (HRI), pages
35–42.
Kendon, A. (1990). Conducting Interaction: Patterns of Be-
havior in Focused Encounters. Cambridge University
Press, Cambridge, U.K.
Miller, L., Kraus, J., Babel, F., and Baumann, M. (2021).
More than a feeling—interrelation of trust layers in
human-robot interaction and the role of user disposi-
tions and state anxiety. Frontiers in Psychology, 12.
Mollahosseini, A., Hasani, B., and Mahoor, M. H. (2017).
Affectnet: A database for facial expression, valence,
and arousal computing in the wild. IEEE Transactions
on Affective Computing, 10:18–31.
Müller-Abdelrazeq, S. L., Schönefeld, K., Haberstroh, M.,
and Hees, F. (2019). Interacting with Collaborative
Robots—A Study on Attitudes and Acceptance in In-
dustrial Contexts, pages 101–117. Springer Interna-
tional Publishing, Cham.
Mumm, J. and Mutlu, B. (2011). Human-robot proxemics:
Physical and psychological distancing in human-robot
interaction. In 2011 6th ACM/IEEE International
Conference on Human-Robot Interaction (HRI), pages
331–338.
Naneva, S., Gou, M. S., Webb, T. L., and Prescott, T. J.
(2020). A systematic review of attitudes, anxiety, ac-
ceptance, and trust towards social robots. Interna-
tional Journal of Social Robotics, pages 1–23.
Oertel, C., Castellano, G., Chetouani, M., Nasir, J., Obaid,
M., Pelachaud, C., and Peters, C. (2020). Engagement
in human-agent interaction: An overview. Frontiers in
Robotics and AI, 7.
Repiso, E., Garrell, A., and Sanfeliu, A. (2018). Robot
approaching and engaging people in a human-robot
companion framework. In 2018 IEEE/RSJ Interna-
tional Conference on Intelligent Robots and Systems
(IROS), pages 8200–8205.
Satake, S., Kanda, T., Glas, D. F., Imai, M., Ishiguro, H.,
and Hagita, N. (2009). How to approach humans?-
strategies for social robots to initiate interaction. In
2009 4th ACM/IEEE International Conference on
Human-Robot Interaction (HRI), pages 109–116.
Spezialetti, M., Placidi, G., and Rossi, S. (2020). Emotion
recognition for human-robot interaction: Recent ad-
vances and future perspectives. Frontiers in Robotics
and AI, 7.
Syrdal, D. S., Lee Koay, K., Walters, M. L., and Dauten-
hahn, K. (2007). A personalized robot companion?
- the role of individual differences on spatial prefer-
ences in hri scenarios. In RO-MAN 2007 - The 16th
IEEE International Symposium on Robot and Human
Interactive Communication, pages 1143–1148.
Vinkemeier, D., Valstar, M. F., and Gratch, J. (2018). Pre-
dicting folds in poker using action unit detectors and
decision trees. In FG, pages 504–511.
Wegrzyn, M., Vogt, M., Kireclioglu, B., Schneider, J., and
Kissler, J. (2017). Mapping the emotional face. How
individual face parts contribute to successful emotion
recognition. PLOS ONE, 12(5):e0177239.
Werner, P., Handrich, S., and Al-Hamadi, A. (2017). Fa-
cial action unit intensity estimation and feature rel-
evance visualization with random regression forests.
In 2017 Seventh International Conference on Affective
Computing and Intelligent Interaction (ACII), volume
2018-Janua, pages 401–406. IEEE.