It’s not Just What You Do but also When You Do It: Novel Perspectives for Informing Interactive Public Speaking Training

Beatrice Biancardi (1), Yingjie Duan (2), Mathieu Chollet (3) and Chloé Clavel (2)

(1) LINEACT CESI, Nanterre, France
(2) LTCI, Télécom Paris, IP Paris, Palaiseau, France
(3) School of Computing Science, University of Glasgow, Glasgow, U.K.

ORCID: B. Biancardi https://orcid.org/0000-0002-6664-6117, M. Chollet https://orcid.org/0000-0001-9858-6844, C. Clavel https://orcid.org/0000-0003-4850-3398
Keywords:
Affective Computing, Human Communication Dynamics, Social Signals, Public Speaking.
Abstract:
Most of the emerging public speaking training systems, while very promising, leverage temporal-aggregate
features, which do not take into account the structure of the speech. In this paper, we take a different perspec-
tive, testing whether some well-known socio-cognitive theories, like first impressions or primacy and recency
effect, apply in the distinct context of public speaking perception. We investigated the impact of the temporal
location of speech slices (i.e., at the beginning, middle or end) on the perception of confidence and persua-
siveness of speakers giving online movie reviews (the Persuasive Opinion Multimedia dataset). Results show
that, when considering multi-modality, the middle part of the speech is usually the most informative. Additional
findings also suggest the value of leveraging local interpretability (by computing SHAP values) to provide
feedback directly, both at a specific time (which speech part?) and for a specific behaviour modality or feature
(which behaviour?). This is a first step towards the design of more explainable and pedagogical interactive
training systems. Such systems could be more efficient by focusing on improving the speaker’s most impor-
tant behaviour during the most important moments of their performance, and by situating feedback at specific
places within the total speech.
1 INTRODUCTION
Soft skills have been identified as key competencies
for work in the 21st century (Sharma and Sharma,
2010). Among them, public speaking constitutes a
real challenge: estimates indicate that 15% to 30% of
the population suffers from public speaking anxiety
(Tillfors and Furmark, 2007).
The automatic evaluation of public speaking per-
formance remains a complex task for which exist-
ing approaches still show some limitations, due to its
subjectivity and the challenges posed by the multi-
modality of human communication. An additional
problem is encountered when automatic evaluations
are used to provide feedback to the user. Indeed,
most of the models used to predict communicative
skills, and more broadly socio-emotional behaviours,
are based on “black box” models (e.g., deep neural
networks), whose opacity makes them ill-suited to
produce explainable feedback to users about their per-
formance. This weakens the current potential of public speaking skills training applications, in particular by limiting the pedagogical explanations they can offer.
In this paper, we propose a novel approach to-
wards the aim of facilitating explainability in public
speaking training systems. In particular, we are inter-
ested in whether specific moments during a speaker’s
speech have a different impact on the perception
of their performance. If this is the case, a speaker should pay more attention to their behaviours during these specific moments. In particular, we in-
vestigate whether some well-known effects of socio-
cognitive theories, such as first impressions (Ambady
and Skowronski, 2008) or primacy and recency effect
(Ebbinghaus, 1913), apply in the distinct context of
public speaking.
We aim to answer the following research question:
“Is the impact of speakers’ behaviours on the ob-
server’s perception of their performance different ac-
cording to WHEN these behaviours are realised dur-
ing the speech? If yes, which part of the speech is the
most important?”
Automatic assessment of a speaker’s performance
could benefit from this information by assigning dif-
ferent weights to different behaviours considering
when they are realised during the speech. In addition,
a training system could be more efficient by focusing
on improving the speaker’s most important behaviour
during the most important moments of their perfor-
mance, and by situating feedback at specific places
within the total speech.
2 RELATED WORK
2.1 Public Speaking Assessment
Multi-modal modelling of public speaking in differ-
ent contexts has been extensively studied. These con-
texts include job interviews (e.g., (Hemamou et al.,
2019)), student presentations (e.g., (Nguyen et al.,
2012)), academic talks (Curtis et al., 2015) or polit-
ical speech (e.g., (Hirschberg and Rosenberg, 2005)).
The results of these studies highlight that several be-
havioural descriptors can be used as cues of a good
speaking performance. Among them: fundamental
frequency F0, speaking rate, the use of 1st-person
pronouns (Hirschberg and Rosenberg, 2005); motion
energy, tense voice quality, reduced pause timings
(Scherer et al., 2012); flow of speech, vocal variety,
eye contact (Batrinca et al., 2013); overall speaker’s
movement normalised by the head movements (Cur-
tis et al., 2015); vocal expressivity, pitch mean and the
ratio of speech and pauses (Wörtwein et al., 2015).
On the other hand, disfluencies have been found to be
negatively correlated with the speaker’s performance
(Strangert and Gustafson, 2008). In general, speech
and lexical features perform better than visual ones,
but multi-modal models achieve the best performance
(Chen et al., 2015; Wörtwein et al., 2015).
The above studies analysed time-aggregated features; however, a few others explored different approaches. For example, Ramanarayanan et al. (Ramanarayanan et al., 2015) focused on the temporal evolu-
tion of a speaker’s performance during a presentation,
by including in their analyses time-series features.
Haider et al. (Haider et al., 2020) proposed a novel ac-
tive data representation method to automatically rate
segments of full video presentations, based on un-
supervised clustering. Chollet and Scherer (Chollet
and Scherer, 2017) investigated the use of thin slices
of behaviours (Ambady and Rosenthal, 1992) for as-
sessing public speaking performance. Their results
showed that it is possible to predict ratings of perfor-
mance using audio-visual features of 10-second thin
slices randomly selected from the full video. A sim-
ilar effect was also found in the context of job inter-
views. The analyses in (Hemamou et al., 2021) on
peaks of attention slices (of a duration between 0.5
and 3.3 seconds) during asynchronous job interviews
showed that these slices were systematically different
from random slices. They occurred more often at the
beginning and at the end of a response, and were bet-
ter than random slices at predicting hirability.
2.2 Public Speaking Training
In addition to automatically assessing public speaking
quality, several authors also focused on feedback gen-
eration to help speakers improve their performance.
We can divide existing interactive systems according to the temporality of the feedback they provide: real-time feedback (e.g., (Damian et al.,
2015; Tanveer et al., 2015; Chollet et al., 2015)) and
after-speech report (e.g., (Zhao et al., 2017)). Real-
time feedback can provide visual information such as
graphs or icons, or can be communicated through virtual humans (acting as a coach or a virtual audience).
After-speech reports usually include an interface dis-
playing the video of the speaker’s performance along
with personalised feedback information.
2.3 Our Positioning
Temporal Position Matters...
We aim to investigate if the differences in the be-
haviours related to high and low public speaking per-
formance are more discriminative at particular mo-
ments of the speech. Previous studies demonstrated
that it is possible to predict a speaker’s performance
from thin slices randomly selected from a presenta-
tion (Chollet and Scherer, 2017; Nguyen and Gatica-
Perez, 2015), but they did not focus on the location
of these slices. Our general hypothesis is that not
only what happens is important, but when it happens
is important as well. Some previous works suggest
that the moments that are most important in a speech
are the beginning and the end. For example, the primacy and recency effect (Ebbinghaus, 1913) is exploited by politicians as a persuasive strategy in their speeches (e.g., (Hongwei et al., 2020)). If the primacy
and recency effect applies to our context, the discrim-
ination between high and low performance should be
related to the behaviours occurring at the beginning
and at the end of the speech, while what happens in
the middle should have less impact in the prediction
of speech quality. In contrast, first impressions the-
ory (Ambady and Skowronski, 2008) argues that per-
ceivers form an impression of others at the earliest
instants of an interaction (the earliest instants of the
speech in our case), and that this first impression is
hard to modify subsequently. If this theory applies to
our context, we should find a significant impact of the
speakers’ behaviour at the beginning of their speech,
and what happens during the rest of the speech should
have less impact in predicting their performance. Fi-
nally, it could be that what is important for a speaker
is to maintain the listener’s attention during all the
speech. In this case, their behaviour at the middle
of the speech should be more informative about their
performance.
...Also When Giving Feedback
Our goal is to develop a public speaking training sys-
tem, which can offer personalised after-speech reports
providing localised, actionable hints on a variety of
behaviours. If our hypothesis that different parts of
a speech vary in their importance is confirmed, then a
feedback system should reflect this in the advice pro-
vided to users. Our main contribution in this paper
is a step towards more explainable and pedagogical
interactive systems. We propose a SHAP-based ap-
proach that aims to provide feedback in a localised way, at the modality or feature level, using a purely data-driven method.
3 METHODOLOGY
3.1 The POM Dataset
The Persuasive Opinion Multimedia (POM) dataset
(Park et al., 2014) includes 1000 movie review videos
obtained from a social multimedia website called Ex-
poTV.com. The videos are relatively short (mean du-
ration = 93 ± 31 seconds). Each video contains a
movie review given by one person talking in front of
the camera. Persuasiveness and other high-level at-
tributes and personality traits have been annotated for
each speaker by three raters, on a 7-point Likert scale.
The final value for each dimension is the mean of the
scores given by the three raters.
3.2 Labels
In the studies presented in Section 2, most of the items
used to assess a speaker’s performance are explicitly
related to their verbal and non-verbal behaviours. A
few items are related to the raters’ perception of the
speakers, beyond their behaviour, and mainly concern
the perceived level of confidence and persuasiveness
of the speaker. We focus on these two dimensions
since we are interested in how annotators’ perception
of the speaker is influenced by their behaviours.
As we want to discriminate between performances
in terms of quality, we only consider speakers who
obtained high and low scores of persuasiveness or
confidence. Speakers obtaining persuasiveness scores
higher than 5 are taken as high-persuasiveness speak-
ers, while speakers obtaining persuasiveness scores
lower than 3 are taken as low-persuasiveness ones.
Since the confidence ratings are slightly positively skewed,
we consider scores higher than 6 to select high-
confidence speakers, while speakers obtaining confi-
dence scores lower than 3 are taken as low-confidence
ones. The final set used in our study contains
162 high-persuasiveness, 114 low-persuasiveness, 94
high-confidence and 61 low-confidence samples.
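To make the class construction concrete, the following is a minimal sketch of this thresholding, assuming the mean ratings are available in a pandas DataFrame with hypothetical column names 'persuasiveness' and 'confidence' (the paper does not specify its data layout).

```python
import pandas as pd

def label_speakers(annotations: pd.DataFrame) -> pd.DataFrame:
    """Binarise mean ratings into high/low classes; mid-range speakers stay unlabelled."""
    out = annotations.copy()
    # Persuasiveness: scores > 5 -> high (1), scores < 3 -> low (0), otherwise excluded.
    out["persuasive_label"] = pd.NA
    out.loc[out["persuasiveness"] > 5, "persuasive_label"] = 1
    out.loc[out["persuasiveness"] < 3, "persuasive_label"] = 0
    # Confidence ratings are positively skewed, so the high threshold is stricter (> 6).
    out["confidence_label"] = pd.NA
    out.loc[out["confidence"] > 6, "confidence_label"] = 1
    out.loc[out["confidence"] < 3, "confidence_label"] = 0
    return out

# Usage: keep only the labelled extremes for a given dimension.
# labelled = label_speakers(pom_annotations).dropna(subset=["confidence_label"])
```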
3.3 Features
3.3.1 Audio Features
We used openSMILE (Eyben et al., 2010) to extract
88 features from the extended Geneva Minimalistic
Acoustic Parameter Set (eGeMAPS) proposed by Ey-
ben et al. (Eyben et al., 2016). This feature set in-
cludes prosodic, voice quality and some spectral fea-
tures like MFCCs. The default statistical functionals
(e.g., mean, standard deviation) were computed for
each feature. In addition, features related to speech
flow (speech and articulation rates, use of pauses)
were extracted from the aligned transcripts.
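The extraction itself was done with the openSMILE toolkit; as an illustration only, a comparable set of eGeMAPS functionals can be obtained with the opensmile Python wrapper (the audio file name below is hypothetical, and the exact configuration used by the authors may differ).

```python
import opensmile

# eGeMAPS functionals: 88 features per audio file, comparable to the set described above.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevels.Functionals,
)

features = smile.process_file("review_0001_middle.wav")  # hypothetical slice file
print(features.shape)  # one row of 88 functionals
```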
3.3.2 Text Features
We counted the number of occurrences of unigrams and bigrams in the corresponding transcripts. We used lemmas of words extracted with the NLTK lemmatizer (Bird et al., 2009) for unigrams and bigrams, and selected unigrams and bigrams occurring more than 100 times in the corpus. We used spaCy (https://github.com/explosion/spaCy) to extract the POS of each word and selected unigrams, bigrams and trigrams occurring more than 20 times. We also
extracted 93 features of Linguistic Inquiry and Word
Count (LIWC) (Pennebaker et al., 2015).
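A rough sketch of a comparable text pipeline is shown below, combining the NLTK lemmatizer, spaCy POS tags and scikit-learn's CountVectorizer. Note that min_df counts documents rather than total corpus occurrences, so it only approximates the frequency thresholds reported above; LIWC is omitted here because it is a proprietary lexicon.

```python
import spacy
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")      # requires the small English spaCy model
lemmatizer = WordNetLemmatizer()        # requires NLTK's WordNet data

def lemmas(text: str) -> str:
    # Lemmatise each token before counting word n-grams.
    return " ".join(lemmatizer.lemmatize(tok) for tok in text.lower().split())

def pos_sequence(text: str) -> str:
    # Replace each token by its coarse POS tag for POS n-gram counting.
    return " ".join(tok.pos_ for tok in nlp(text))

transcripts = ["I really liked this movie", "the plot was very weak"]  # toy data

# Word uni-/bi-grams; the corpus-frequency threshold (>100) is approximated by min_df.
word_vec = CountVectorizer(ngram_range=(1, 2), min_df=1)
word_counts = word_vec.fit_transform([lemmas(t) for t in transcripts])

# POS uni-/bi-/tri-grams with their own threshold (>20 in the paper).
pos_vec = CountVectorizer(ngram_range=(1, 3), min_df=1)
pos_counts = pos_vec.fit_transform([pos_sequence(t) for t in transcripts])
```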
3.3.3 Visual Features
We used OpenFace 2.2 (Baltrusaitis et al., 2018) to
extract Action Units (AU) related features of both
presence and intensity (see Table 1 for more details),
as well as head pose features.
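As an illustration of how such descriptors can be derived from OpenFace output, the sketch below computes AU-presence statistics and a nod count from a per-frame CSV; the column names (e.g. AU12_c, pose_Rx) follow OpenFace's output format, but the exact episode and peak/valley definitions used in the paper are not specified, so treat the details as assumptions.

```python
import pandas as pd
from scipy.signal import find_peaks

def au_presence_stats(frames: pd.DataFrame, au: str = "AU12") -> dict:
    """Duration (frames), number of separate episodes and mean episode length of one AU."""
    present = frames[f"{au}_c"].to_numpy() > 0.5   # binary presence per frame
    duration = int(present.sum())
    episodes = int((present[1:] & ~present[:-1]).sum() + present[0])  # rising edges
    average = duration / episodes if episodes else 0.0
    return {"duration": duration, "episodes": episodes, "average": average}

def count_nods(frames: pd.DataFrame) -> int:
    """Count nods as peaks plus valleys of the head pitch angle (pose_Rx)."""
    rx = frames["pose_Rx"].to_numpy()
    peaks, _ = find_peaks(rx)
    valleys, _ = find_peaks(-rx)
    return len(peaks) + len(valleys)

# frames = pd.read_csv("review_0001.csv", skipinitialspace=True)  # OpenFace output
# print(au_presence_stats(frames, "AU06"), count_nods(frames))
```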
3.3.4 Feature Groups
The features described above were grouped according
to their modality (i.e., text, audio or visual) and also
Table 1: The features computed for our study, belonging to three modalities: audio, text and visual.

Audio
  eGeMAPS (88 features):
    prosodic features: pitch, loudness, etc.
    voice quality features: formants, jitter, shimmer, etc.
    spectral features: MFCC 1-4, spectral flux, etc.
  Flow of speech:
    speech_rate_words = (nWords + nPauses) / 10
    art_rate_words = nWords / (10 - durationPauses)
    pause_rate = nPauses / 10
    pause_ratio = nPauses / nWords
    pause_mean_dur = durationPauses / nPauses
    pause_perc = durationPauses / 10

Text
  LIWC (93 features):
    syntactic-related categories: Ppron, Verb, etc.
    lexical-related categories: Social, Work, etc.
  N-gram counts (606 features): unigrams and bigrams of lemmas occurring >100 times
  POS n-gram counts (1310 features): unigrams, bigrams and trigrams of POS occurring >20 times

Visual
  Presence of AU:
    duration = sum(AU presence)
    episodes = number of separate episodes
    average = duration / episodes
  Intensity of AU:
    int_mean, int_sd, int_range (int_max - int_min)
  Head pose:
    count of nods: number of peaks and valleys of pose Rx
    count of shakes: number of peaks and valleys of pose Ry
    count of tilts: number of peaks and valleys of pose Rz
Table 2: Feature Groups and Corresponding Features.
Lexical: count of n-grams and lexical-related categories in LIWC
Syntactic: count of POS n-grams and syntactic-related categories in LIWC
Prosody: prosodic features in eGeMAPS and flow-of-speech features
Voice Quality: voice quality features in eGeMAPS
Spectral: spectral features in eGeMAPS
Facial Expression: features of presence of AU and features of intensity of AU
Head Pose: features of head pose
combined in multi-modal groups (i.e., audio+text, au-
dio+visual, text+visual). In addition, when comput-
ing the SHAP values (see Section 4.2.2), we also cat-
egorised the features into higher-level groups. For au-
dio features, we considered three groups: Prosody,
Voice Quality and Spectral. For text features, we di-
vided them into Lexical and Syntactic. For visual
features, we categorised them into Facial Expressions
and Head Poses. All the groups and corresponding
features are listed in Table 2.
4 EXPERIMENTS AND RESULTS
4.1 Experimental Setting
4.1.1 Slices Datasets
To address our research question, we used thin slices
to investigate the effect of different moments of the
speech on the perception of the speaker. In line with
previous work (e.g., (Chollet and Scherer, 2017)) we
fixed the duration of the windows to 10 seconds. For
each video, we extracted the following windows: start
(the first 10s), middle (a 10s window randomly se-
lected from any moment after the first 30s and before
the last 30s of the video) and end (the last 10s). These
slices were grouped into three new datasets (start-dataset, middle-dataset and end-dataset), according to which part each slice belongs to.
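For illustration, the slice boundaries can be computed as below; the fallback used when a video is too short to leave room for a middle window is our assumption, since the paper does not discuss this case.

```python
import random

def slice_windows(duration: float, win: float = 10.0, margin: float = 30.0, seed: int = 0):
    """Return (start, end) boundaries in seconds for the start, middle and end slices."""
    rng = random.Random(seed)
    start = (0.0, win)
    end = (duration - win, duration)
    # Middle slice: drawn uniformly at random, excluding the first and last 30 s.
    lo, hi = margin, duration - margin - win
    mid_start = rng.uniform(lo, hi) if hi > lo else (duration - win) / 2  # fallback (assumption)
    return {"start": start, "middle": (mid_start, mid_start + win), "end": end}

print(slice_windows(93.0))  # e.g. for a 93-second review video
```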
4.1.2 Classification Models
The aim of this paper is not to obtain state-of-the-art
performance in classification accuracy, but rather to
provide insights about the importance of the various
speech parts and the relative contributions of differ-
ent modalities to each of these parts. Accordingly, we
chose a Support Vector Machine (SVM) as the baseline model for the following experiments. We applied feature selection to keep the most important and relevant features, reduce redundant ones and improve the performance of the model.
Similar to the method used in (Nojavanasghari et al.,
2016), we performed a z-test between the features
extracted from high and low performance instances,
then selected features with p < 0.05.
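A minimal sketch of this selection step is given below, assuming a two-sample z-test per feature as implemented in statsmodels; the paper does not state which implementation was used.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

def select_features(X: np.ndarray, y: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Keep columns whose means differ between high (y=1) and low (y=0) instances (p < alpha)."""
    keep = []
    for j in range(X.shape[1]):
        _, p_value = ztest(X[y == 1, j], X[y == 0, j])
        if p_value < alpha:                 # NaN p-values (constant features) are dropped
            keep.append(j)
    return np.asarray(keep, dtype=int)

# X_selected = X[:, select_features(X, y)]
```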
Multi-modal features were generated through early fusion. As for the hyperparameters of the model (C and γ), we selected the best combination from lists of values (C in [1, 10, 20] and γ in [0.001, 0.01, 0.1, 1, 'auto']) using 5-fold cross-validation.
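With scikit-learn, this grid search could look as follows; the RBF kernel and the feature standardisation step are assumptions, as the paper does not state them explicitly.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

param_grid = {
    "svc__C": [1, 10, 20],
    "svc__gamma": [0.001, 0.01, 0.1, 1, "auto"],
}
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))    # scaling/kernel: assumptions
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")  # 5-fold CV, F1 as in the paper

# search.fit(X_train, y_train); best_svm = search.best_estimator_
```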
For each slice dataset (i.e., start-dataset, middle-dataset and end-dataset), as well as for the original dataset of the full videos, we took 80% as the training
set and the rest as the test set. We trained models on
a binary classification task (high and low confidence
or persuasiveness) by using features from a single
modality or combined features from different modali-
ties, and looked at the F1-scores (because our datasets
are imbalanced, see Section 3.2). Due to the relatively
small size of our dataset, the F1-score varies when we
use different random seeds to split the dataset. There-
fore, we sampled the F1-score 300 times using differ-
ent random seeds and calculated its 95% confidence
interval. The results are shown in Tables 3 and 4.
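A sketch of this evaluation protocol, assuming a stratified 80/20 split and a normal-approximation confidence interval (the paper does not detail how the interval was computed):

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def f1_confidence_interval(model, X, y, n_runs: int = 300, test_size: float = 0.2):
    """Repeat the train/test split over n_runs seeds and return mean F1 with a 95% CI."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y)  # stratify: assumption
        scores.append(f1_score(y_te, model.fit(X_tr, y_tr).predict(X_te)))
    scores = np.asarray(scores)
    half_width = 1.96 * scores.std(ddof=1) / np.sqrt(n_runs)   # normal approximation
    return scores.mean(), (scores.mean() - half_width, scores.mean() + half_width)
```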
4.2 Results
4.2.1 Thin Slices vs Full Video
In Tables 3 and 4, we report F1-scores on confidence
and persuasiveness for different modalities and differ-
ent slices of the video. In both tables, we can notice
that the F1-scores vary across the different feature sets
and the considered slices.
The results show that using the full video leads to
a higher performance compared to the slices, in most
of the cases (audio, text, audio+visual and text+visual
for confidence ratings; audio, text, visual and au-
dio+visual for persuasiveness).
What is interesting is that, for both confidence and persuasiveness, the best performance is obtained when considering the middle slice (audio+text and all modalities for confidence; audio+text, text+visual and all modalities for persuasiveness). In particular, the best absolute score for both confidence and persuasiveness prediction is obtained when considering audio+text features in the middle slice.
4.2.2 Temporal Location of Behaviours
With eXplainable Artificial Intelligence (XAI) devel-
oping rapidly in recent years, many excellent tools
have emerged to help us interpret our models. Among
them, SHAP (Shapley Additive Explanations), pro-
posed by (Lundberg and Lee, 2017), is used to ex-
plain the output of any machine learning model, by
showing how much each feature, or group of features, contributes, either positively or negatively, to the target variable. A SHAP analysis of the different behavioural features can give us more details about how informative each behaviour is, and when. In Figures
1 and 2, the mean absolute SHAP values of (a) be-
haviour modalities and (b) feature groups (see Table
2 for more details) are provided, relative to the mod-
els predicting confidence (Figure 1) or persuasiveness
(Figure 2) quality. From these figures, we can see that, even if the text modality is generally the most informative for the models (see Figures 1a and 2a), there are some variations across the speech moments.
For example, syntactic features are more informative
to predict the speaker’s confidence during the middle
slice compared to the other moments of the speech
(Figure 1b).
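As an illustration of how such grouped importances could be computed, the sketch below applies SHAP's model-agnostic KernelExplainer to the SVM decision function and sums the per-feature mean absolute SHAP values within each group of Table 2; the choice of explainer and of the aggregation are assumptions, since the paper does not specify them.

```python
import numpy as np
import shap

def group_importance(model, X_background, X_explain, groups):
    """Aggregate mean absolute SHAP values per feature group (e.g. 'Prosody', 'Lexical').

    `groups` maps a group name to the list of column indices belonging to it.
    KernelExplainer is model-agnostic, so it works with the trained SVM pipeline.
    """
    explainer = shap.KernelExplainer(model.decision_function, X_background)
    shap_values = explainer.shap_values(X_explain)       # shape: (n_samples, n_features)
    per_feature = np.abs(shap_values).mean(axis=0)       # mean |SHAP| per feature
    return {name: per_feature[np.asarray(idx)].sum() for name, idx in groups.items()}

# groups = {"Prosody": prosody_idx, "Lexical": lexical_idx, ...}  # indices from Table 2
# importance = group_importance(best_svm, X_train[:100], X_test, groups)
```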
5 DISCUSSION
The results from Tables 3 and 4 show that in general
using the entire video allows for a better performance
when predicting public speaking quality, compared to
specific thin slices. This is consistent with previous
results in (Chollet and Scherer, 2017; Nguyen and
Gatica-Perez, 2015), where it was observed that for
confidence, using the full video still performs better than using thin slices alone.
We recall that the focus of this work is not on
the use of thin slices in general but rather on the im-
pact of the temporal position of these slices. Our aim
is to analyse public speaking from the perspective of socio-cognitive theories such as the primacy and recency effect or first impressions. From this point of view, some results are worth discussing. In par-
ticular, the best absolute performance of the models
was obtained when looking at the middle slice of the
speech. This could indicate that what is important for
a speaker is to maintain the audience’s attention and
interest also after a first impression is formed. These
results contrast with previous findings: for example, in (Hemamou et al., 2019) it was found that slices at the beginning and end of a speech performed better than random slices in predicting a speaker's hirability. However, the methods and datasets used differ from ours, so further investigation is required to compare these findings.
Beyond the results specific to our research ques-
tion, we can notice a slightly lower performance
when predicting persuasiveness level compared to
confidence. This could be explained by the lower
inter-rater agreement for persuasiveness (Park et al., 2014) and confirms that these dimensions, even if
correlated, represent different aspects of the speaker’s
performance, e.g., persuasiveness is more related to
dominance than confidence (Burgoon et al., 2002).
Table 3: The prediction F1-scores of confidence for different feature sets and different slices.
Confidence Start Middle End Full
Audio 0.738 (0.729, 0.747) 0.761 (0.753, 0.768) 0.707 (0.699, 0.715) 0.804 (0.797, 0.812)
Text 0.835 (0.828, 0.842) 0.882 (0.876, 0.888) 0.794 (0.786, 0.801) 0.884 (0.878, 0.890)
Visual 0.746 (0.739, 0.753) 0.741 (0.733, 0.748) 0.746 (0.739, 0.753) 0.740 (0.733, 0.748)
Audio + Text 0.852 (0.845, 0.859) 0.906 (0.901, 0.912) 0.827 (0.820, 0.834) 0.896 (0.890, 0.901)
Audio + Visual 0.782 (0.774, 0.790) 0.799 (0.792, 0.807) 0.802 (0.795, 0.810) 0.839 (0.833, 0.846)
Text + Visual 0.865 (0.859, 0.871) 0.888 (0.882, 0.894) 0.880 (0.874, 0.886) 0.889 (0.884, 0.895)
All 0.871 (0.865, 0.918) 0.900 (0.895, 0.906) 0.889 (0.884, 0.895) 0.893 (0.887, 0.898)
Table 4: The prediction F1-scores of persuasiveness for different feature sets and different slices.
Persuasiveness Start Middle End Full
Audio 0.617 (0.611, 0.623) 0.601 (0.595, 0.608) 0.624 (0.618, 0.631) 0.702 (0.696, 0.709)
Text 0.733 (0.728, 0.739) 0.787 (0.782, 0.792) 0.727 (0.721, 0.733) 0.800 (0.795, 0.805)
Visual 0.619 (0.613, 0.624) 0.619 (0.613, 0.624) 0.619 (0.613, 0.624) 0.622 (0.616, 0.628)
Audio + Text 0.741 (0.736, 0.747) 0.832 (0.827, 0.837) 0.737 (0.731, 0.742) 0.812 (0.807, 0.817)
Audio + Visual 0.641 (0.635, 0.648) 0.630 (0.623, 0.636) 0.653 (0.647, 0.659) 0.688 (0.682, 0.694)
Text + Visual 0.763 (0.757, 0.768) 0.831 (0.826, 0.836) 0.782 (0.776, 0.787) 0.802 (0.798, 0.807)
All 0.765 (0.760, 0.770) 0.828 (0.823, 0.833) 0.794 (0.788, 0.780) 0.812 (0.807, 0.817)
Figure 1: Mean absolute SHAP values of (a) behaviour modalities and (b) feature groups, relative to the models predicting confidence quality using the different slices (beginning, middle or end) or the entire video (Lex.: lexical, Syn.: syntactic, FE: facial expression, Pro.: prosody, Spe.: spectral, VQ: voice quality, HP: head pose).
Figure 2: Mean absolute SHAP values of (a) behaviour modalities and (b) feature groups, relative to the models predicting persuasiveness quality using the different slices (beginning, middle or end) or the entire video (Lex.: lexical, Syn.: syntactic, FE: facial expression, Pro.: prosody, Spe.: spectral, VQ: voice quality, HP: head pose).
In addition, once again in line with results from (Park et al., 2014) and other previous works (Chen et al., 2015; Wörtwein et al., 2015), using uni-modal visual features yielded the lowest performance for both confidence (Table 3) and persuasiveness (Table 4) prediction. This could suggest that in public speaking assessment non-verbal behaviours need to be contextualised according to what is said and how it is said (i.e., in combination with the text and audio modalities).
The results shown in Figures 1 and 2 also sug-
gest the value of leveraging the local interpretability of the SHAP-based approach to provide feedback directly, both at a specific time (which speech part?) and for a specific behaviour modality or feature (which behaviour?). Endowing interactive training systems
with this information would allow them to provide
more adapted and hopefully more useful feedback to
speaker trainees. This could take the form of a report
highlighting the different feature groups and associ-
ated behaviours that contributed positively and nega-
tively to a specific assessment.
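Purely as an illustration of what such a report could look like, the sketch below turns signed per-group SHAP contributions for each slice into short textual hints; the data layout and the wording are hypothetical.

```python
def feedback_report(slice_shap: dict, top_k: int = 2) -> str:
    """Turn signed per-group SHAP contributions into a short after-speech report.

    `slice_shap` maps a slice name ('start', 'middle', 'end') to a dict of
    {feature group: signed contribution} for one speaker.
    """
    lines = []
    for part, contributions in slice_shap.items():
        ranked = sorted(contributions.items(), key=lambda kv: kv[1])   # most negative first
        weakest = ", ".join(name for name, _ in ranked[:top_k])
        strongest = ", ".join(name for name, _ in ranked[-top_k:])
        lines.append(f"{part.capitalize()} of the speech: strengths in {strongest}; "
                     f"focus on improving {weakest}.")
    return "\n".join(lines)

print(feedback_report({"middle": {"Prosody": -0.12, "Lexical": 0.30, "Head Pose": 0.05}}, top_k=1))
```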
The main limitation of our study is that the results
we obtained could be related to the particular characteristics of the POM dataset. The duration of the videos is relatively short (93 ± 31 seconds) and the content of
the speech very specific (movie reviews). In the case
of longer videos, such as TED Talks (https://www.ted.com/) for instance,
other moments of the speech could be more discrimi-
native. However, the findings of the present study still
support the hypothesis that the impact of a speaker’s
behaviour on the perception of their performance is
different according to when these behaviours are re-
alised during the speech, and this should be taken into
account by public speaking training systems. Further
investigations could elucidate whether what happens
in the middle part of the speech is still important in
different contexts or whether first impressions or pri-
macy and recency effect apply in those cases.
6 CONCLUSION
In this paper, we proposed a novel perspective to anal-
yse public speaking performance. In order to facilitate
explainability of the assessment of a speaker’s perfor-
mance and in turn provide a more pedagogical training system, we investigated the impact of the temporal
location of speech slices on the perception of confi-
dence and persuasiveness of the speaker. We found
that, when considering multi-modality, the middle part of the speech is usually the most informative. In or-
der to use model-learned knowledge to give feedback,
we discussed a SHAP-based feedback approach, with
the aim of providing feedback in a localised way, at the modality or feature level, using a purely data-driven method.
This is a first step towards the design of more
explainable and pedagogical interactive training sys-
tems. Such systems could be more efficient by fo-
cusing on improving the speaker’s most important be-
haviour during the most important moments of their
performance, and by situating feedback at specific
places within the total speech. In future work, we
plan to apply the same perspective by implementing
more powerful models such as attention-based neu-
ral models and validate our results on larger datasets.
We are also interested in whether the results hold for longer speeches, since observers' attention may evolve differently over longer durations.
ACKNOWLEDGEMENTS
This work was partially funded by the Carnot in-
stitutes TSN and M.I.N.E.S. under the InterCarnot
contract 200000830 AI4SoftSkills and the ANR-21-
CE33-0016-02 REVITALISE project.
REFERENCES
Ambady, N. and Rosenthal, R. (1992). Thin slices of ex-
pressive behavior as predictors of interpersonal con-
sequences: A meta-analysis. Psychological bulletin,
111(2):256.
Ambady, N. and Skowronski, J. J. (2008). First impressions.
Guilford Press.
Baltrusaitis, T., Zadeh, A., Lim, Y. C., and Morency, L.-
P. (2018). Openface 2.0: Facial behavior analysis
toolkit. In 2018 13th IEEE International Conference
on Automatic Face Gesture Recognition (FG 2018),
pages 59–66.
Batrinca, L., Stratou, G., Shapiro, A., Morency, L.-P., and
Scherer, S. (2013). Cicero-towards a multimodal vir-
tual audience platform for public speaking training. In
International workshop on intelligent virtual agents,
pages 116–128. Springer.
Bird, S., Klein, E., and Loper, E. (2009). Natural language
processing with Python: analyzing text with the natu-
ral language toolkit. O'Reilly Media, Inc.
Burgoon, J. K., Dunbar, N. E., and Segrin, C. (2002). Non-
verbal influence. The persuasion handbook: Develop-
ments in theory and practice, pages 445–473.
Chen, L., Leong, C. W., Feng, G., Lee, C. M., and Soma-
sundaran, S. (2015). Utilizing multimodal cues to au-
tomatically evaluate public speaking performance. In
2015 International Conference on Affective Comput-
ing and Intelligent Interaction (ACII), pages 394–400.
IEEE.
Chollet, M. and Scherer, S. (2017). Assessing public speak-
ing ability from thin slices of behavior. In 2017 12th
IEEE International Conference on Automatic Face
& Gesture Recognition (FG 2017), pages 310–316.
IEEE.
Chollet, M., Wörtwein, T., Morency, L.-P., Shapiro, A., and
Scherer, S. (2015). Exploring feedback strategies to
improve public speaking: an interactive virtual audi-
ence framework. In Proceedings of the 2015 ACM In-
ternational Joint Conference on Pervasive and Ubiq-
uitous Computing, pages 1143–1154.
Curtis, K., Jones, G. J., and Campbell, N. (2015). Effects of
good speaking techniques on audience engagement. In
Proceedings of the 2015 ACM on International Con-
ference on Multimodal Interaction, pages 35–42.
Damian, I., Tan, C. S., Baur, T., Schöning, J., Luyten, K., and André, E. (2015). Augmenting social interac-
tions: Realtime behavioural feedback using social sig-
nal processing techniques. In Proceedings of the 33rd
annual ACM conference on Human factors in comput-
ing systems, pages 565–574.
Ebbinghaus, H. (1913). Memory: a contribution to exper-
imental psychology. 1885. New York: Teachers Col-
lege, Columbia University.
Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg,
J., André, E., Busso, C., Devillers, L. Y., Epps, J.,
Laukka, P., Narayanan, S. S., and Truong, K. P.
(2016). The geneva minimalistic acoustic parameter
set (gemaps) for voice research and affective com-
puting. IEEE Transactions on Affective Computing,
7(2):190–202.
Eyben, F., Wöllmer, M., and Schuller, B. (2010). Opens-
mile: the munich versatile and fast open-source au-
dio feature extractor. In Proceedings of the 18th ACM
international conference on Multimedia, pages 1459–
1462.
Haider, F., Koutsombogera, M., Conlan, O., Vogel, C.,
Campbell, N., and Luz, S. (2020). An active data
representation of videos for automatic scoring of oral
presentation delivery skills and feedback generation.
Frontiers in Computer Science, 2:1.
Hemamou, L., Felhi, G., Vandenbussche, V., Martin, J.-C.,
and Clavel, C. (2019). Hirenet: A hierarchical atten-
tion model for the automatic analysis of asynchronous
video job interviews. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 33, pages
573–581.
Hemamou, L., Guillon, A., Martin, J.-C., and Clavel, C.
(2021). Multimodal hierarchical attention neural net-
work: Looking for candidates behaviour which impact
recruiter’s decision. IEEE Transactions on Affective
Computing.
Hirschberg, J. B. and Rosenberg, A. (2005). Acous-
tic/prosodic and lexical correlates of charismatic
speech.
Hongwei, Z. et al. (2020). Analysis of the persuasive meth-
ods in barack obama’s speeches from the social psy-
chology’s perspectives. The Frontiers of Society, Sci-
ence and Technology, 2(10).
Lundberg, S. M. and Lee, S.-I. (2017). A unified approach
to interpreting model predictions. Advances in neural
information processing systems, 30.
Nguyen, A.-T., Chen, W., and Rauterberg, M. (2012). On-
line feedback system for public speakers. In 2012
IEEE Symposium on E-Learning, E-Management and
E-Services, pages 1–5. IEEE.
Nguyen, L. S. and Gatica-Perez, D. (2015). I would hire you
in a minute: Thin slices of nonverbal behavior in job
interviews. In Proceedings of the 2015 ACM on inter-
national conference on multimodal interaction, pages
51–58.
Nojavanasghari, B., Gopinath, D., Koushik, J., Baltrušaitis, T., and Morency, L.-P. (2016). Deep multimodal fu-
sion for persuasiveness prediction. In Proceedings
of the 18th ACM International Conference on Multi-
modal Interaction, pages 284–288.
Park, S., Shim, H. S., Chatterjee, M., Sagae, K., and
Morency, L.-P. (2014). Computational analysis of per-
suasiveness in social multimedia: A novel dataset and
multimodal prediction approach. In Proceedings of
the 16th International Conference on Multimodal In-
teraction, pages 50–57.
Pennebaker, J. W., Boyd, R. L., Jordan, K., and Blackburn,
K. (2015). The development and psychometric prop-
erties of liwc2015. Technical report.
Ramanarayanan, V., Leong, C. W., Chen, L., Feng, G., and
Suendermann-Oeft, D. (2015). Evaluating speech,
face, emotion and body movement time-series fea-
tures for automated multimodal presentation scoring.
In Proceedings of the 2015 ACM on International
Conference on Multimodal Interaction, pages 23–30.
Scherer, S., Layher, G., Kane, J., Neumann, H., and Camp-
bell, N. (2012). An audiovisual political speech analy-
sis incorporating eye-tracking and perception data. In
LREC, pages 1114–1120.
Sharma, G. and Sharma, P. (2010). Importance of soft skills
development in 21st century curriculum. International
Journal of Education & Allied Sciences, 2(2).
Strangert, E. and Gustafson, J. (2008). What makes a good
speaker? subject ratings, acoustic measurements and
perceptual evaluations. In Ninth Annual Conference of
the International Speech Communication Association.
Tanveer, M. I., Lin, E., and Hoque, M. (2015). Rhema:
A real-time in-situ intelligent interface to help people
with public speaking. In Proceedings of the 20th in-
ternational conference on intelligent user interfaces,
pages 286–295.
Tillfors, M. and Furmark, T. (2007). Social phobia in
swedish university students: prevalence, subgroups
and avoidant behavior. Social psychiatry and psychi-
atric epidemiology, 42(1):79–86.
Wörtwein, T., Chollet, M., Schauerte, B., Morency, L.-P.,
Stiefelhagen, R., and Scherer, S. (2015). Multimodal
public speaking performance assessment. In Proceed-
ings of the 2015 ACM on International Conference on
Multimodal Interaction, pages 43–50.
Zhao, R., Li, V., Barbosa, H., Ghoshal, G., and Hoque,
M. E. (2017). Semi-automated and collaborative on-
line training module for improving communication
skills. Proceedings of the ACM on Interactive, Mobile,
Wearable and Ubiquitous Technologies, 1(2):1–20.