Be Persuasive! Automatic Transformation of Virtual Agent’s Head and Facial Behavior

Afef Cherni, Roxane Bertrand and Magalie Ochs

Aix-Marseille Univ., IN2P3, CNRS, France

Aix-Marseille Univ., LPL, CNRS, France

Aix-Marseille Univ., LIS, CNRS, France

Keywords: Multimodal Cues, Persuasion, Embodied Conversational Agent, Machine Learning Methods, Mathematical Convolution.

Abstract: The persuasiveness of a virtual agent refers to its ability to influence, persuade, or motivate users to take specific actions or adopt certain attitudes or beliefs. Virtual agents can use their multimodal capabilities, including non-verbal cues, to enhance their persuasiveness. In this paper, we present a new tool called THRUST (from neuTral Human face to peRsUaSive virTual face) that automatically generates the head movements and facial expressions of a persuasive virtual character. The tool relies on a machine learning approach applied to a corpus of human videos to identify the non-verbal persuasive cues. A convolution-based model then transforms neutral non-verbal behavior into persuasive non-verbal behavior simulated on a virtual face. Videos generated by the tool have been evaluated through a subjective perceptive study with about 90 participants. The results show that the virtual agent’s head and facial behaviors generated by the THRUST tool are perceived as persuasive, thus validating the proposed approach.

1 INTRODUCTION

One challenge facing the field of intelligent virtual agent research is the automatic generation of behaviors for embodied conversational agents, especially concerning social and emotional dimensions. In this article, we focus on the generation of a persuasive virtual agent’s behavior. According to (Miller, 2013), persuasion can be defined as “any message that is intended to shape, reinforce or change the responses of another or others”. As highlighted in (Burgoon et al., 1990) and (Chidambaram et al., 2012), the persuasiveness of a message does not depend solely on its content but also relies largely on its multimodal components, encompassing different verbal, vocal, and mimo-gestural levels (such as facial expressions, gestures, and pitch). This article specifically focuses on the social cues conveyed through non-verbal signals, such as facial expressions and head movements, that a virtual agent can express to enhance its persuasiveness. It does not address the argumentative aspects of persuasion, such as identifying arguments to convince, arranging the order in which arguments are presented, and responding to arguments raised by the persuadee.

Our ultimate objective is to create an Embodied Conversational Agent (ECA) with persuasive capabilities to encourage the elderly population to engage in physical activity. In the field of intelligent virtual agents, numerous persuasive virtual agents have already been created (e.g. (Lisetti et al., 2013; Petukhova et al., 2017; Nguyen et al., 2007)). The main method for modeling persuasive behavior involves identifying behavioral cues that have an impact on perceived persuasiveness and integrating these cues into virtual agents. The literature emphasizes certain human behavioral cues related to persuasion, such as body movements (Burgoon et al., 1990) and prosody (Petukhova et al., 2017). In the domain of virtual agents, empirical research has shown the importance of certain verbal and non-verbal cues in enhancing the persuasiveness of the virtual agent (Ghazali et al., 2018; Chidambaram et al., 2012). However, as far as we know, there is currently no multimodal behavioral model that generates the behavioral cues that a virtual agent should display to be perceived as persuasive.

In this article, we present a new software tool called THRUST (from Neutral Human Face to Persuasive Virtual Face) together with its subjective evaluation. The tool is designed to automatically convert a video of a human into a video of a virtual character that exhibits persuasive non-verbal behavior. Specifically, the tool automatically extracts the human’s head movements and facial expressions, applies modifications based on a proposed computational model, and reproduces the resulting head and facial movements on a virtual face. The main focus of the paper is the computational model that transforms the head and facial movements obtained from the human face into persuasive movements that are then mimicked on the virtual face. The model is evaluated through a perceptive study to validate that the generated animations are perceived by users as persuasive.

Cherni, A., Bertrand, R. and Ochs, M.

Be Persuasive! Automatic Transformation of Virtual Agent’s Head and Facial Behavior.

DOI: 10.5220/0012429700003636

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 16th International Conference on Agents and Artificial Intelligence (ICAART 2024) - Volume 1, pages 359-366

ISBN: 978-989-758-680-4; ISSN: 2184-433X

Developing a persuasive behavior model raises several challenges, in particular the precise identification of the behavioral cues associated with persuasion. These are the cues that should be modified on the virtual agent to simulate persuasive behavior. For this purpose, we first investigate the relevant behavioral cues of persuasion, using machine learning techniques to explore them in a corpus of human videos. In particular, we explore the POM corpus (Park et al., 2014), which is, to the best of our knowledge, the only multimedia corpus with annotations of perceived persuasiveness. The POM corpus contains web videos of individuals discussing diverse topics in front of a camera. In our machine learning approach, we pay particular attention to the interpretability of the model in order to identify features that can be easily understood and replicated on virtual agents. Our aim in this research is not to create a classification model to assess persuasiveness but to use machine learning to identify the relevant features of behavioral persuasiveness. Based on the identified persuasive behavioral cues, the THRUST tool converts the cues extracted from the human face into persuasive ones. From the POM corpus, we propose a dictionary that establishes reference points reflecting persuasive non-verbal behavior. A convolution-based model, built on this dictionary, is integrated into the THRUST tool to compute the persuasive behavior of the virtual agent.

This paper is organized as follows. In Section 2, we discuss theoretical and empirical research works that explore the behavioral cues related to persuasion. In Section 3, we give an overview of the architecture of the THRUST tool. Section 4 details the machine learning framework and Section 5 the convolution-based model. In Section 6, we present the implementation and the evaluation of the tool. We conclude in Section 7.

2 RELATED WORK

Several research studies, particularly in the human-human interaction field, have explored the effectiveness of certain behavioral cues. For instance, (Burgoon et al., 1990; Petukhova et al., 2017; Miller et al., 1976) have emphasized the importance of various multimodal behavioral cues. In the present article, we focus on non-verbal cues related to persuasion. As highlighted by (Burgoon et al., 1990), gestures, body movements, smiles and facial expressions are important non-verbal cues that enhance persuasiveness.

At the interactional level, several works have studied the positive impact of mimicry on persuasion (Tanner and Chartrand, 2006). In this article, we analyse a corpus of monologues, which excludes the possibility of studying the interactional level. Other contextual elements, such as the appearance of the persuader (Burgoon et al., 1990), may impact the perceived persuasion. Given the size of the considered corpus and its lack of contextual variability, we do not, as a first step, consider the influence of the context. Based on the research showing the importance of face and head movements for persuasion (Burgoon et al., 1990), we consider in our study the facial expressions, through the study of action units, and the head movements. These behavioral cues, used as features of the learned models, are presented in more detail in Section 4.3.

In the Intelligent Virtual Agent domain, two main approaches to automatically generating the behavior of a virtual agent are identified in the literature. The first approach relies on rule-based systems that exploit linguistic information from the text and the meaning of gestures, facial expressions or head movements to determine the appropriate signals to express (e.g. (Cassell, 2001; Marsella et al., 2013)). Rule-based approaches remain very limited, given the variability of human expressions across modalities. In a much more recent approach, machine learning methods are used to automatically generate co-verbal gestures (e.g. (Chiu and Marsella, 2014)), facial expressions and body movements from speech (e.g. (Habibie et al., 2021)), or from speech and text to take into account both acoustic and semantic information (e.g. (Ahuja et al., 2020; Kucherenko et al., 2020)). Most studies are based on deep neural networks (e.g. (Chiu and Marsella, 2014; Hasegawa et al., 2018; Kucherenko et al., 2020)) and, more recently, on GAN architectures (e.g. (Ahuja et al., 2020; Habibie et al., 2021)). Compared to existing works, the originality of the work presented in this article is threefold: (1) we generate non-verbal behavior not from speech or text, but from a video of a human with a neutral attitude; (2) we generate facial and head movements, whereas most existing models consider body and head movements; and (3) unlike existing works, which do not allow the generation of socio-emotional behaviors, we propose the automatic generation of persuasive behavior.

Figure 1: Global architecture of the system to automatically transform a neutral human video to a persuasive virtual character video. Input: a video of a human with automatic extraction of head and facial movements using OpenFace. Model: a computational model to automatically transform neutral non-verbal features to persuasive non-verbal features. Output: a video of a virtual character replicating the behavior of the human but with persuasive head and face movements.

From a machine learning perspective, few research works have investigated persuasion. The main work has been conducted by Park et al. (Park et al., 2014; Park et al., 2016; Nojavanasghari et al., 2016) on the Persuasive Opinion Multimedia (POM) corpus, consisting of 1000 movie review videos obtained from a social multimedia website called ExpoTV.com. As proposed by Park et al. (Park et al., 2014; Park et al., 2016; Nojavanasghari et al., 2016), we use machine learning algorithms to explore persuasiveness. However, our work differs from theirs in several aspects: (1) contrary to Park et al., in order to obtain explainable models, we do not use deep learning methods but “white box” classifiers such as SVM and Random Forest; (2) still with interpretability in mind, we consider non-verbal features that can be simulated on a virtual agent; (3) last but not least, our final objective is not to create a prediction model but to explore the non-verbal cues and to use machine learning-based methods in order to create a persuasive artificial agent. In the next section, we present the architecture of the proposed tool to automatically generate a persuasive virtual speaker from a human one.

https://github.com/isir/greta/wiki

3 ARCHITECTURE

The THRUST tool takes as input the video of a human and provides as output a video of a virtual agent replicating the same human’s behavior but in a persuasive way. The tool is composed of three main modules: the Input module, the Model module, and the Output module. The architecture is illustrated in Figure 1. We describe each module in the following.

In the Input module, the system takes a video of a human speaking in a neutral way. At this step, the OpenFace tool (https://www.cl.cam.ac.uk/research/rainbow/projects/openface/) is used to extract the human’s head and facial movements. These measures, noted $(U_i)_{i=1,\dots,N}$, where $U_i$ designates the i-th measured feature characterizing the face and head movements, are saved and used as input of the Model module, which transforms them into a set of features $(W_i)_{i=1,\dots,N}$ characterizing the head and face movements of a persuasive speaker. For this purpose, a combination of machine learning methods and convolution-based techniques is used. The machine learning methods are applied to an existing corpus to identify the features relevant to persuasiveness (Steps 1, 2, and 3 in Figure 1), as explained in detail in Section 4. Note that the resulting learnt model is also used as a classifier to automatically determine whether the behavior in a video (of a human or of a virtual agent) is persuasive (Step 2 in Figure 1). In this way, the learnt model is used to confirm that the transformed feature vector $(W_i)_{i=1,\dots,N}$ is indeed classified as persuasive (as depicted in Figure 1 by the dotted arrow from the “output” box to the “classification” box). In the subsequent steps (Steps 4 and 5 in Figure 1), a convolution-based method is used to determine how to modify the features to make them persuasive (details of these steps are given in Section 5). Finally, in order to deliver the same speech as the original video, but with persuasive head and face movements, the Output module uses the embodied conversational agent Greta to simulate the set of variables $(W_i)_{i=1,\dots,N}$ and generate the video. The vector $(W_i)_{i=1,\dots,N}$ denotes the values of the head and face movements extracted from the original human video and modified to be persuasive (Figure 1).

4 MACHINE LEARNING FRAMEWORK

4.1 Corpus and Features Extraction

In Step 1 (Figure 1), we select a corpus and extract the features from its videos. Few corpora are currently available in the research community to study persuasiveness. In this work, we consider the Persuasive Opinion Multimedia (POM) corpus (Park et al., 2014). This corpus is freely available and contains videos of speakers trying to convince their audience on different subjects. The POM corpus consists of 1000 movie review videos obtained from a social multimedia website called ExpoTV.com. It contains different conversational videos cut into a total of 1096 thin slices. Each slice was annotated by several native English-speaking workers from the United States.

Based on the theoretical and empirical research on persuasion presented above (Section 2), we consider the following groups of features: facial action units (AU1, AU2, AU4, AU5, AU6, AU7, AU9, AU10, AU12, AU14, AU15, AU17, AU18, AU20, AU23, AU24, AU25, AU16, AU28, AU43), emotions (Anger, Contempt, Disgust, Joy, Fear, Surprise, Confusion, Frustration), head movements (displacement and rotation along the (x, y, z) axes, speed of the head movement and its acceleration along the (x, y, z) axes) and acoustic descriptors (fundamental frequency $f_0$ and peak slope). For each feature, we computed the mean, median, maximum, minimum, standard deviation and variance.
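The per-feature summary described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the feature names in the example, and the use of population (rather than sample) standard deviation and variance are assumptions.

```python
import statistics


def summarize_feature(values):
    """Return the six descriptors listed in the text for one feature's time series."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        # Assumption: population statistics; the paper does not say which variant was used.
        "std": statistics.pstdev(values),
        "var": statistics.pvariance(values),
    }


def summarize_sequence(frames):
    """frames maps a (hypothetical) feature name to its per-frame values.

    Returns one flat dictionary such as {"AU12_mean": ..., "AU12_median": ..., ...}.
    """
    return {
        f"{name}_{stat}": value
        for name, series in frames.items()
        for stat, value in summarize_feature(series).items()
    }


# Toy sequence with two made-up features of four frames each.
example = {"AU12": [0.0, 1.0, 2.0, 1.0], "head_rot_x": [0.1, 0.1, 0.3, 0.3]}
features = summarize_sequence(example)
```

Each video slice is thus reduced to a fixed-length vector (six values per feature), which is the kind of input a classical classifier expects.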

4.2 Formalization of the Classification Problem

As illustrated by Step 2 (Figure 1), to identify the importance of the features in the perception of persuasion, we consider a classification task: given the features as input, the classifiers have to predict whether the corresponding behavior is persuasive. As a first step, we consider a binary classification to simplify the learning problem (i.e. predicting whether the behavior is persuasive or not).

4.3 Automatic Analysis of the Vocal and Non-Verbal Cues of Persuasion

In Step 3 (Figure 1), the objective is to compare the performances of the classifiers and then to select the most important features, i.e. those that ensure the highest prediction performances. We experiment with different classifiers: Naive Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF). Compared to neural models, these methods are well suited to small datasets and have the advantage of interpretability. All experiments were performed with 10-fold cross-validation (CV), where each CV was run 10 times. In order to estimate the performances of the different classifiers, we compute scores from classifiers returning random predictions, to establish baselines. We consider three different strategies: uniform (generates predictions uniformly at random) (noted BR), stratified (generates predictions with respect to the training set’s class distribution) (noted BU0) and most frequent (always predicts the most frequent class in the training set) (noted BU1). For each fold of the cross-validation, the random classifiers are fitted on the training set and used to generate predictions on the validation set, for each strategy. Each classifier is trained on 80% of the corpus and tested on 20% of the corpus.
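The three baseline strategies can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names and the seeding are assumptions, and the paper does not state how the baselines were implemented (the strategy names happen to match options of scikit-learn's `DummyClassifier`).

```python
import random
from collections import Counter


def fit_baseline(train_labels, strategy, seed=0):
    """Return a zero-argument function that draws one baseline prediction."""
    rng = random.Random(seed)
    counts = Counter(train_labels)
    classes = sorted(counts)
    if strategy == "uniform":
        # Every class is equally likely, whatever the training distribution.
        return lambda: rng.choice(classes)
    if strategy == "stratified":
        # Classes are drawn with the frequencies observed in the training set.
        weights = [counts[c] for c in classes]
        return lambda: rng.choices(classes, weights=weights, k=1)[0]
    if strategy == "most_frequent":
        # Always predict the majority class of the training set.
        majority = counts.most_common(1)[0][0]
        return lambda: majority
    raise ValueError(f"unknown strategy: {strategy}")


def predict_baseline(train_labels, n_test, strategy, seed=0):
    """Fit a baseline on the training labels and predict n_test labels."""
    draw = fit_baseline(train_labels, strategy, seed)
    return [draw() for _ in range(n_test)]


# Toy, unbalanced training labels: 1 = persuasive, 0 = not persuasive.
train = [1] * 7 + [0] * 3
majority_predictions = predict_baseline(train, 5, "most_frequent")
```

A real classifier is only considered informative if its scores differ significantly from those of such baselines on the same folds.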

The performances of the classifiers are evaluated through the classical metrics of accuracy and weighted F1 score (to cope with the unbalanced classes). Moreover, we test whether the differences between the obtained F1 scores are statistically significant. A Student’s t-test is performed to compare the F1 scores of the classifiers with those of the baselines obtained by the k-fold cross-validation. This test is one of the recommended methods to compare the performance of machine learning algorithms (Dietterich, 1998).
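To make the two metrics concrete, the sketch below computes accuracy and the weighted F1 score (per-class F1 averaged with weights proportional to each class's support) for a toy prediction. The function names and the toy labels are illustrative assumptions, not the authors' evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 score of one class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by the number of true samples per class."""
    n = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, cls) * y_true.count(cls) / n
        for cls in set(y_true)
    )


# Toy unbalanced example: three persuasive (1) samples and one non-persuasive (0).
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
acc = accuracy(y_true, y_pred)          # 3 correct out of 4
f1w = weighted_f1(y_true, y_pred)       # 0.75 * 0.8 + 0.25 * (2/3)
```

Weighting by support prevents the minority class from dominating or vanishing in the average, which is why this variant is commonly used with unbalanced classes.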

In order to evaluate the importance of each group of features for predicting persuasion, we compute the performance scores of the classifiers considering each group of features and combinations of groups as input. The results show that the emotions do not enable us to obtain significant differences with the baselines. In other words, the emotions are not sufficient to predict persuasion. In the same way, the groups of features containing only head movements or only acoustic features lead to performances not significantly different from the baselines. However, the facial expression features provide good performance scores with significant differences from the baselines (with RF, accuracy score = 0.71, F1 weighted score = 0.74 and p-value < 0.05). Considering combinations of groups of features, the results reveal that the combination of non-verbal and vocal cues significantly improves the accuracy score (with RF, accuracy score = 0.74, F1 weighted score = 0.82 and p-value < 0.005). These results are in line with the research on persuasion showing the importance of multimodality for perceived persuasion. Finally, the best accuracy score is obtained by combining facial expression features, head movements and vocal features with a Random Forest classifier: accuracy score = 0.81, F1 weighted score = 0.72 and p-value < 0.0005.
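The group-wise comparison described above amounts to scoring every non-empty combination of feature groups. The sketch below only enumerates and ranks the combinations; the group identifiers are illustrative, and `score_fn` is a hypothetical placeholder for whatever cross-validated scoring is applied to a given combination.

```python
from itertools import combinations

# Illustrative identifiers for the four feature groups named in Section 4.1.
GROUPS = ("action_units", "emotions", "head_movements", "acoustic")


def group_combinations(groups=GROUPS):
    """All non-empty combinations of feature groups, from single groups to all groups."""
    return [
        combo
        for size in range(1, len(groups) + 1)
        for combo in combinations(groups, size)
    ]


def rank_combinations(score_fn, groups=GROUPS):
    """Score each combination with score_fn (hypothetical) and sort best first."""
    scored = [(score_fn(combo), combo) for combo in group_combinations(groups)]
    return sorted(scored, reverse=True)


# Toy scoring function used only to exercise the code: it counts the groups.
ranking = rank_combinations(len)
```

With four groups this yields 15 candidate inputs; in practice `score_fn` would train and cross-validate a classifier restricted to the features of the selected groups.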

In the following steps, since we also obtain good results with the combination of facial expression features and head movements (accuracy score equal to 0.74 and F1 weighted score equal to 0.82), we focus on these non-verbal cues, which can be simulated on the embodied conversational agent Greta (an open-source platform to create Embodied Conversational Agents: https://github.com/isir/greta/wiki).

Figure 2: Screenshots taken at the same instant of two animations generated from the same input video of a human: (1) neutral attitude generated without transformation and (2) persuasive attitude generated by the model of the THRUST tool.

5 CONVOLUTION-BASED MODEL

In this section, we describe Steps 4 and 5 illustrated in Figure 1. These steps consist in computing the non-verbal cues (facial expressions and head movements identified in the previous steps) of the virtual speaker to enhance its persuasiveness. The computation is based on the POM corpus, which contains neutral and persuasive sequences. It is important to note that, in this study, we define a neutral attitude as the act of speaking without attempting to be persuasive. It does not mean that the neutral face is not persuasive.

For each relevant non-verbal feature, we generate a signal that describes its average dynamic across all persuasive or neutral sequences in the POM corpus. For this purpose, we treat each slice (Ambady and Rosenthal, 1992) in the POM corpus as a sample and we use the average value of each non-verbal behavior’s dynamic as a reference. The generated signals characterize the typical values of the non-verbal cues associated with a persuasive attitude. Each signal is associated with one non-verbal cue. These signals serve as references that the non-verbal cues have to follow when we generate the persuasive behavior. The reference values are noted $(M_i)_{i=1,\dots,N}$, where $M_i$ corresponds to the i-th reference of the non-verbal cues indexed by $i = 1, \dots, N$ (for example, AU1 and AU2 as facial action units and the head position along the (x, y, z) axes). To generate persuasive non-verbal cues, we modify the $U_i$ value using a convolution product between $U_i$ and $M_i$, which involves averaging the input based on the properties of the reference $M_i$. A re-sampling step may be necessary at this stage to avoid a size mismatch between $U_i$ and $M_i$. It should be noted that our convolution-based strategy is only applied to the non-verbal cues that Greta takes into account, namely head movements along the (x, y, z) axes and specific AUs (AU1, AU2, AU4, AU5, AU6, AU7, AU12). In the next section, we present the implementation and the evaluation of the THRUST tool integrating the models presented above.
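As a rough illustration of the convolution step, the sketch below builds a reference signal by averaging resampled sequences and then convolves an input feature with that reference. Several choices are assumptions made for the example and are not specified in the text: linear interpolation for the re-sampling, re-sampling the reference to the length of the input, normalizing the reference to unit sum so that the convolution acts as a weighted average, and cropping the result to the input length. All function names are hypothetical.

```python
def resample(signal, length):
    """Linearly interpolate `signal` onto `length` evenly spaced points."""
    if length == 1 or len(signal) == 1:
        return [signal[0]] * length
    step = (len(signal) - 1) / (length - 1)
    out = []
    for k in range(length):
        pos = k * step
        lo = min(int(pos), len(signal) - 2)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[lo + 1] * frac)
    return out


def average_reference(sequences, length):
    """Average several sequences of one feature, after bringing them to a common length."""
    resampled = [resample(seq, length) for seq in sequences]
    return [sum(column) / len(column) for column in zip(*resampled)]


def convolve_same(u, m):
    """Discrete convolution of u with m (normalized to unit sum), cropped to len(u)."""
    total = sum(m)
    kernel = [w / total for w in m] if total else list(m)
    full = [0.0] * (len(u) + len(kernel) - 1)
    for i, u_i in enumerate(u):
        for j, k_j in enumerate(kernel):
            full[i + j] += u_i * k_j
    start = (len(kernel) - 1) // 2
    return full[start:start + len(u)]


def to_persuasive(u, reference):
    """Resample the reference to the input length, then convolve the input with it."""
    return convolve_same(u, resample(reference, len(u)))


# Toy example: an impulse in the input takes the shape of the (normalized) reference.
reference = average_reference([[0.0, 2.0], [2.0, 4.0]], 3)   # average of two short sequences
w = to_persuasive([0.0, 0.0, 1.0, 0.0, 0.0], [1.0, 2.0, 1.0])
```

In the paper's notation, `u` plays the role of one $U_i$, `reference` of the matching $M_i$, and the returned list of the transformed feature $W_i$.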

6 IMPLEMENTATION AND EVALUATION

The process outlined in Figure 1 has been successfully implemented and is now fully operational. As described in Section 3, we use OpenFace to extract the features from the input (the video of a human with a neutral attitude). We then developed our Model module (Steps 1 to 6) in Python to convert neutral human face and head movements into persuasive virtual ones. The output of our Python script is played with the Greta tool. At this final step, we used the virtual female character Emma to create the final output of the THRUST tool (the video of a persuasive Embodied Conversational Agent). The entire code of the tool, as well as a tutorial video, is provided on the authors’ software page as an open-source project (https://test.i2m.univ-amu.fr/perso/cherni.a/Software.html). In the following sections, we present an objective and a subjective evaluation of the tool.

6.1 Objective Evaluation

The videos generated by our tool show a noticeable difference in terms of facial and head movements between the videos of the virtual speaker without transformation (i.e. the non-verbal cues extracted from the human video are directly replicated on the virtual speaker; called neutral videos) and those transformed by the models (i.e. the non-verbal cues extracted from the human video are transformed by the models and replicated on the virtual speaker; called persuasive videos). The generated neutral videos appear to contain very little movement, while the generated persuasive videos show more eyebrow movements, smiles and head movements. To assess the effectiveness of our tool, we propose an objective evaluation based on the learnt classifier. As described previously (Section 4.2), to construct the THRUST tool, we developed an accurate classifier for predicting persuasion. We propose to use this classifier to verify that the generated videos are correctly classified as persuasive. For this purpose, we consider the best classifier, which was the Random Forest. To evaluate the model objectively, we generated 24 videos. Four persons were filmed (2 female and 2 male). We asked each person to say two different predefined sentences in a neutral way. The sentences are related to our use case, e.g. “Through physical activity you can overcome disabling pain and improve your general well-being”. The two sentences have the same length, and uttering each of them takes around 10 seconds. In total, we recorded 8 videos of human speakers of about 10 seconds each. Using the THRUST tool, we generated two kinds of video using the same virtual speaker (Figure 2):

• videos of a neutral virtual speaker, which correspond to the replay of the recorded human features on the virtual face;

• videos of a persuasive virtual speaker, which correspond to the recorded human features transformed by our model and replicated on the virtual speaker.

For each recorded human video, we generated these two kinds of video. Moreover, we created baseline videos, which replay on the virtual face a noisy version of the recorded human features (we used a classical additive white Gaussian noise). The baseline videos were created in order to compare videos with the same amount of movement (persuasive and baseline videos) with the neutral videos, which contain few movements. In total, we had 24 videos to evaluate.
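The baseline condition can be illustrated with a short sketch that adds white Gaussian noise to one feature's time series. The function name, the noise level `sigma` and the seed are assumptions for the example; the text does not report the noise parameters used.

```python
import random


def add_white_gaussian_noise(signal, sigma, seed=None):
    """Return a copy of `signal` with independent zero-mean Gaussian noise added to each frame."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# Toy feature trajectory (e.g. one head rotation angle over five frames).
neutral = [0.00, 0.01, 0.02, 0.01, 0.00]
baseline = add_white_gaussian_noise(neutral, sigma=0.05, seed=42)
```

Because the noise samples are independent from frame to frame, the resulting animation contains movement without the temporal structure that the reference signals impose in the persuasive condition.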

The generated videos were provided to the classifier to evaluate objectively whether they were classified as persuasive. Note that only the non-verbal features of the videos were used for the classification. The results show that all the persuasive videos are classified as persuasive, whereas the neutral videos and the baseline videos are classified as non-persuasive. This objective evaluation constitutes a first validation step of the proposed tool. However, such an evaluation is not sufficient, since the classifier and the THRUST tool were created from the same data. To complete this evaluation, we present in the next section a subjective evaluation to assess the generated videos with users.

6.2 Subjective Evaluation

The THRUST tool has been evaluated through a perceptive study. We describe the protocol of the experiment and the results of the evaluation in the following.

Videos. In this subjective evaluation, we consider the same 24 generated videos used for the objective evaluation (previous section): 8 videos corresponding to a persuasive virtual speaker, 8 to a neutral virtual speaker and 8 to the baseline with randomly generated animations. As for the objective evaluation, we consider only the non-verbal cues. Consequently, the videos were played without sound to avoid the lip synchronization problem and the impact of the speech on the perception.

Questionnaire. We asked participants to evaluate the perceived persuasiveness of the virtual speaker in the videos before and after the transformation by our model (i.e. the neutral and persuasive videos) and in the baseline videos. For this purpose, we consider 2 questions: QI “Did you find the character persuasive in the video?” and QII “Did you find the character animations convincing in the video?” (translation from French). The responses to each question were given on a Likert scale from 1 to 5.

Participants. A total of 89 persons (51 female and 38 male) participated online in the experiment. They were recruited through French mailing lists. The mean age of the participants is 34.2 years (SD = 11.78).

Task. Each participant had to watch each video and to indicate their perception through the two questions on the virtual speaker’s persuasiveness. The experiment took place online. The order of the videos and of the questions was counterbalanced to avoid an order effect on the results.

Examples of videos: https://www.youtube.com/playlist?list=PL6t9zd1YosSWFwosMBWoPXCPk-0TT0tUB

Results. The scores of the participants were analyzed using a Two-Way Repeated Measures ANOVA. We checked normality using skewness and kurtosis. The distribution of each measure is normal. In Table 1, we report the descriptive statistics of the results.
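A skewness and kurtosis check of this kind can be sketched as follows. The moment-based estimators are standard, but the acceptance thresholds are illustrative assumptions (the text does not state which cut-offs were applied), and the function names are hypothetical.

```python
def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for a symmetric distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(xs):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3


def looks_normal(xs, max_abs_skew=2.0, max_abs_kurt=7.0):
    """Rule-of-thumb check; the default thresholds are assumptions, not the paper's values."""
    return abs(skewness(xs)) <= max_abs_skew and abs(excess_kurtosis(xs)) <= max_abs_kurt


# Toy Likert-style scores.
scores = [1, 2, 3, 4, 5]
skew = skewness(scores)
kurt = excess_kurtosis(scores)
```

Values of both statistics close to zero indicate a distribution whose shape is compatible with the normality assumption of the ANOVA.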

We report the results considering each question

separately.

Concerning the first question QI “Did you find the character persuasive in the video?”, the results show a significant effect of the agent’s behavior on the users’ perception, F(1.48, 89) = 12.66, p < 0.001. The persuasive virtual speaker (M = 2.43, SD = .08) was perceived as significantly more persuasive than the baseline virtual speaker (M = 1.93, SD = .08), p < 0.001, and the neutral virtual speaker (M = 2.18, SD = .08), p = 0.001. This result validates the proposed approach, showing that the videos generated by the THRUST tool are perceived as significantly more persuasive than the videos generated without transformation or with randomly generated non-verbal cues. Note that no main effect of users’ gender was found, although women (M = 2.28, SD = 0.07) tended to perceive the agents as more persuasive than men did (M = 2.08, SD = 0.08), with a p-value equal to 0.08. Interestingly, the sentence used for the recording has a significant impact on the perceived persuasiveness, F(1, 89) = 17.41, p < 0.001, even though the videos had no sound. The videos recorded with sentence 2 (M = 2.25, SD = 0.06) were perceived as more persuasive than the videos recorded with sentence 1 (M = 2.11, SD = 0.06).

Concerning the question QII “Did you find the character animations convincing in the video?”, the results are consistent with those for question QI. The main effect of the agent’s behavior is significant, F(1.54, 89) = 13.35, p < .001. The persuasive virtual speaker (M = 2.51, SD = .08) was perceived as more convincing than the baseline (M = 2.02, SD = .09), p < .001, and the neutral virtual speaker (M = 2.27, SD = .08), p = .003. No main effect of users’ gender was found, although women (M = 2.37, SD = .07) tended to find the agents more convincing than men did (M = 2.16, SD = .08), p = .06. The sentence used for the recording has a significant impact on the perceived convincing aspect, F(1, 89) = 29.38, p < .001. The videos recorded with sentence 2 (M = 2.34, SD = .06) were perceived as more convincing than the videos recorded with sentence 1 (M = 2.18, SD = .05). These results confirm that the videos generated by the THRUST tool are perceived as significantly more convincing than the videos generated without transformation or with randomly generated non-verbal cues.

Discussion. The results of the perceptive study enable us to validate the videos generated by the THRUST tool. The animations of the virtual speaker are perceived as significantly more persuasive and convincing after the transformation by the THRUST tool. The significant differences with the baseline videos, which contain randomly generated animations, show that the perception of persuasiveness does not stem from the mere fact that the persuasive videos contain more facial and head movements than the neutral ones, but from the animations generated by the tool. Moreover, we evaluated the tool with different human videos as input, both female and male, which shows that the tool provides persuasive output whichever human appears in the input video. A point of attention is the speech. Even though the videos were played without sound, it surprisingly appears that what is said has a significant impact on the perceived persuasiveness. A more fine-grained analysis considering a larger set of sentences should be conducted to explain this result. The THRUST tool has been evaluated with only one specific female virtual agent. Videos generated with virtual agents of different appearances should be evaluated to completely assess the efficiency of the tool.

7 CONCLUSION AND PERSPECTIVES

The main goal of the presented work is to develop
and validate a new tool that can transform a video of
a neutral human face into a video of a virtual agent
whose behavior expresses persuasiveness. For this
purpose, we have proposed a tool called THRUST
(from neuTral Human face to peRsUaSive virTual
face) based on machine learning techniques and on
a convolution-based model. The tool automatically
computes the facial and head movements of a
persuasive virtual speaker. The tool has been evaluated
through an objective and a subjective study. Both
evaluations, conducted on a set of videos, have enabled
us to demonstrate that the virtual speaker’s behaviors
computed by the tool are perceived as persuasive. Even
if the evaluation was limited to a specific set of videos,
features and virtual agents, these results constitute a
first validation of the proposed approach for the
automatic generation of persuasive behavior.

Since the THRUST tool is based on the POM corpus,
which is the only corpus with persuasion annotations,
and was created using the open-source toolboxes Greta
and OpenFace, which only use some non-verbal cues,
there are certain limitations in the choice of features.
In future work, we propose to expand our analysis and
include other multimodal features, particularly vocal
features, to enhance the persuasive model and to
develop an automated artificial agent capable of
expressing persuasive speech.

Table 1: Descriptive statistics on the results of the subjective evaluation. The statistics (M: mean; STD: standard deviation)
are reported for each question asked to the participants (QI and QII) and for each sentence said by the virtual character
(Sentence 1 and 2), considering the gender of the participant (women and men) and the condition (baseline, neutral and
persuasive; see Section 6.1).

                          Women                                      Men
             Baseline     Neutral      Persuasive     Baseline     Neutral      Persuasive
             M     STD    M     STD    M     STD      M     STD    M     STD    M     STD
Sentence 1
  QI         1.86  0.06   2.32  0.13   2.37  0.06     1.85  0.07   2.33  0.08   1.98  0.06
  QII        1.94  0.56   2.54  0.14   2.40  0.1      1.97  0.08   2.39  0.15   1.91  0.02
Sentence 2
  QI         2.26  0.03   2.10  0.03   2.76  0.09     1.79  0.06   2.02  0.14   2.62  0.02
  QII        2.33  0.02   2.20  0.04   2.91  0.13     1.86  0.18   2.00  0.20   2.85  0.03

ACKNOWLEDGMENTS

This research was funded by the French National Re-

search Agency as part of the COPAINS project.

REFERENCES

Ahuja, C., Lee, D. W., Ishii, R., and Morency, L.-P. (2020). No gestures left behind: Learning relationships between spoken language and freeform gestures. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1884–1895.

Ambady, N. and Rosenthal, R. (1992). Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological bulletin, 111(2):256.

Burgoon, J. K., Birk, T., and Pfau, M. (1990). Nonverbal behaviors, persuasion, and credibility. Human communication research, 17(1):140–169.

Cassell, J., Vilhjálmsson, H. H., and Bickmore, T. (2001). BEAT: the Behavior Expression Animation Toolkit.

Chidambaram, V., Chiang, Y.-H., and Mutlu, B. (2012). Designing persuasive robots: how robots might persuade people using vocal and nonverbal cues. In Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction, pages 293–300.

Chiu, C.-C. and Marsella, S. (2014). Gesture generation with low-dimensional embeddings. In AAMAS.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923.

Ghazali, A. S., Ham, J., Barakova, E. I., and Markopoulos, P. (2018). Poker face influence: persuasive robot with minimal social cues triggers less psychological reactance. In 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), pages 940–946. IEEE.

Habibie, I., Xu, W., Mehta, D., Liu, L., Seidel, H.-P., Pons-Moll, G., Elgharib, M., and Theobalt, C. (2021). Learning speech-driven 3D conversational gestures from video. In IVA.

Hasegawa, D., Kaneko, N., Shirakawa, S., Sakuta, H., and Sumi, K. (2018). Evaluation of speech-to-gesture generation using bi-directional LSTM network. In IVA.

Kucherenko, T., Jonell, P., van Waveren, S., Henter, G. E., Alexandersson, S., Leite, I., and Kjellström, H. (2020). Gesticulator: A framework for semantically-aware speech-driven gesture generation. In ICMI.

Lisetti, C., Amini, R., Yasavur, U., and Rishe, N. (2013). I can help you change! an empathic virtual agent delivers behavior change health interventions. ACM Transactions on Management Information Systems (TMIS), 4(4):1–28.

Marsella, S., Xu, Y., Lhommet, M., Feng, A., Scherer, S., and Shapiro, A. (2013). Virtual character performance from speech. In SIGGRAPH, pages 25–35.

Miller, G. R. (2013). On being persuaded: Some basic distinctions. Sage Publications, Inc.

Miller, N., Maruyama, G., Beaber, R. J., and Valone, K. (1976). Speed of speech and persuasion. Journal of personality and social psychology, 34(4):615.

Nguyen, H., Masthoff, J., and Edwards, P. (2007). Persuasive effects of embodied conversational agent teams. In International Conference on Human-Computer Interaction, pages 176–185. Springer.

Nojavanasghari, B., Gopinath, D., Koushik, J., Baltrušaitis, T., and Morency, L.-P. (2016). Deep multimodal fusion for persuasiveness prediction. In ICMI.

Park, S., Shim, H. S., Chatterjee, M., Sagae, K., and Morency, L.-P. (2014). Computational analysis of persuasiveness in social multimedia: A novel dataset and multimodal prediction approach. In ICMI.

Park, S., Shim, H. S., Chatterjee, M., Sagae, K., and Morency, L.-P. (2016). Multimodal analysis and prediction of persuasiveness in online social multimedia. ACM TiiS, 6(3):1–25.

Petukhova, V., Raju, M., and Bunt, H. (2017). Multimodal markers of persuasive speech: Designing a virtual debate coach. In INTERSPEECH, pages 142–146.

Tanner, R. and Chartrand, T. (2006). The convincing chameleon: The impact of mimicry on persuasion. ACR North American Advances.

ICAART 2024 - 16th International Conference on Agents and Artiﬁcial Intelligence
