Considerations for Face-based Data Estimates: Affect Reactions to Videos
Gustaf Bohlin¹, Kristoffer Linderman¹, Cecilia Ovesdotter Alm² and Reynold Bailey²
¹Malmö University, Sweden
²Rochester Institute of Technology, U.S.A.
Keywords: Affective Reactions, Facial Expression Estimates, Face-based Pulse Estimates.
Abstract: Video streaming is becoming the new standard for watching videos, providing an opportunity for affective video recommendation that leverages noninvasive sensing data from viewers to suggest content. Face-based data has the distinct advantage that it can be collected noninvasively with minimal equipment such as a simple webcam. Face recordings can be used for estimating individuals' emotional states based on their facial movements and also for estimating pulse as a signal for emotional reactions. We provide a focused case-based contribution by reporting on methodological challenges experienced in a research study with face-based data estimates which are then used in predicting affective reactions. We build on lessons learned to formulate a set of recommendations that can be useful for continued work towards affective video recommendation.
1 INTRODUCTION
Face-based data has the distinct advantage that it can be collected noninvasively with minimal equipment such as a simple webcam. Face recordings can be used to estimate an individual's emotion based on their facial movements. Pulse can also be estimated from videos of the face, providing additional cues about the viewer's emotional state.
We provide a focused case study contribution by reporting on methodological challenges with face-based data estimates, experienced in the context of predictive modeling as a step towards affective video recommendation. We captured webcam recordings of users' faces and upper bodies as they watched video clips intended to evoke reactions of anger, fear, happiness, sadness, or surprise. We report on the use of face-based estimates of emotional facial movements and face-based pulse for computational modeling to predict a user's rating of a video, comparing against the explicit self-reported rating.
Video streaming is becoming the new standard for watching videos, creating a need to suggest content to users. Although an individual's rating is valuable to a recommendation system, most people do not rate the videos they watch (for example, 50% of the subjects involved in this study indicated that they never rate videos). This provides an opportunity for affective video recommendation that leverages noninvasive sensing data from viewers to suggest new content. Affective video recommendation sets out to analyze users' emotions while they are watching videos in order to determine what to recommend.
Figure 1: Facial expression and pulse were both captured using standard webcams, as demonstrated here by one of the authors. The green box indicates where the pulse was tracked.
2 RELATED WORK
Face-based estimates are commonly used in inference of emotional experiences. For example, (Busso et al., 2004) explored multimodal emotion recognition using speech and facial expressions. An actor read sentences while being recorded, and face markers were utilized to interpret facial muscular movement. Similarly, (Ioannou et al., 2005) extracted face features and explored the understanding of users' emotional states with a neurofuzzy method and facial animation parameters. As another example, (Tarnowski et al., 2017) used a Microsoft Kinect to record a 3D model of subjects' faces with numerous facial points. They recognized seven emotions using facial expressions, a k-NN classifier, and a neural network. Work in affective computing has also focused on reactions for specific emotions. For instance, (Shea et al., 2018) studied intuitively extracted reactions to surprise, spanning multiple modalities. Estimated facial expressions were particularly important for identifying naturally occurring surprise reactions.
More specifically, affective computing methods have been considered promising for video recommendation and classification. (Zhao et al., 2013) presented a framework for recognizing human facial expressions to create a classifier, which identified what genre viewers watched from their facial expressions. However, as they drew on acted facial data, reactions tended to involve exaggerations rather than less direct, more intuitive and natural expressions, resulting in modeling unsuitable for actual practical use.
The use of facial data towards video recommendation was explored by (Rajenderan, 2014). To facial expressions, Rajenderan added analysis of the pulse modality, calculated with a method called photoplethysmography, developed at MIT by (Poh et al., 2011). The work was continued by (Diaz et al., 2018), with a focus on estimating and visualizing viewers' dominant emotions over the course of a video. The photoplethysmography method uses fluctuations in skin color related to blood volume and the proportion of reflected light to help estimate the viewer's pulse. For this case study, we also apply photoplethysmography for non-invasive pulse estimation (see Figure 1), with recalibration occurring between each video viewed by the subject in the study.
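As a rough illustration of the underlying principle only (not the (Poh et al., 2011) pipeline, which applies blind source separation to the full RGB signal, nor the exact software used in this study), the sketch below averages one color channel over a tracked face region per frame, band-pass filters the trace to plausible heart rates, and reads the pulse off the dominant frequency. The function name, filter settings, and synthetic test signal are our own illustrative choices.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def estimate_pulse_bpm(channel_means, fps=30.0):
    """Estimate pulse (BPM) from the per-frame mean intensity of one color
    channel averaged over a tracked face region (cf. the green box in Figure 1).
    channel_means: 1-D array with one value per video frame. Illustrative only."""
    x = np.asarray(channel_means, dtype=float)
    x = x - x.mean()                                   # remove the DC component
    # Band-pass to plausible heart rates: 0.75-4.0 Hz (45-240 BPM).
    nyq = fps / 2.0
    b, a = butter(3, [0.75 / nyq, 4.0 / nyq], btype="band")
    filtered = filtfilt(b, a, x)
    # Dominant frequency via FFT, converted to beats per minute.
    spectrum = np.abs(np.fft.rfft(filtered))
    freqs = np.fft.rfftfreq(len(filtered), d=1.0 / fps)
    band = (freqs >= 0.75) & (freqs <= 4.0)
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return peak_hz * 60.0

# Synthetic check: a 72 BPM (1.2 Hz) oscillation plus noise over a 50 s window,
# mirroring the calibration length used in the study.
fps, secs = 30.0, 50
rng = np.random.default_rng(0)
t = np.arange(int(fps * secs)) / fps
signal = 0.5 * np.sin(2 * np.pi * 1.2 * t) + 0.2 * rng.normal(size=len(t))
print(round(estimate_pulse_bpm(signal, fps)))          # approximately 72
```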
3 METHODS
Conducting an experiment that entirely focuses on face-based capture provides an opportunity to reflect on challenges that occur when working with face-based data estimates. We build on this experience in presenting examples that illustrate methodological considerations and summarize lessons learned, as a springboard to formulate a set of recommendations that can be useful for continued work towards affective video recommendation. This section describes how we collected face data and processed the face-based estimates for use in predictive modeling, taking a step towards video recommendation.
3.1 Data Collection
Equipment. The equipment used for data collection included two standard webcams operating in real-time: the Logitech C922 Pro Stream Webcam and the Logitech Pro 9000 Webcam. One webcam was used to capture the subject's facial expressions, while the other was used to estimate the subject's pulse using the aforementioned photoplethysmography method. Additional hardware included a desktop computer with a 24" monitor, external loudspeakers, a keyboard, and a mouse.
Stimuli. The experiment included carefully selected short video clips with content from movies, TV programs, and other videos. The clips were intended to elicit reactions corresponding to five major emotions: happiness, sadness, anger, fear, and surprise. The emotional impact of the videos was assessed jointly by the authors. Three clips were included per emotion category, for a total of 15 video clips. We avoided content that might cause strong discomfort. Table 1 provides an example from each emotion category. Subjects consented to participating in the IRB-approved study.
Procedure. The data collection process is illustrated in Figure 2. Each subject was given oral instructions outlining the experiment. After completing the consent form and receiving a walk-through of the experiment, participants filled out a demographic survey. They did not know which clips they would be watching prior to viewing the videos.
(Diaz et al., 2018) discussed that an experimenter being present could potentially have an effect on a viewer's emotional expressions. Accordingly, the subject was alone in the room during the experiment to mitigate any such effect. For each video:
1. the subject was shown an image with instructions for the pulse calibration
2. a 50-second countdown video was shown in order to calibrate the pulse estimation
3. the video was shown
4. the subject filled out a survey regarding the video they watched
Figure 2: Data collection procedure. Subjects viewed fifteen videos to elicit affective reactions from five emotion categories.
This was repeated 15 times, and each subject watched all videos. The order of the videos displayed was randomized for each subject. The survey after each video included the following questions, adapted from (Diaz et al., 2018):
- Have you seen this video before?
- What did you feel when watching the video?
- On a scale of 1 to 5, how would you rate the video?
- Would you want to watch similar videos? (Yes, No, Maybe)
For the second question, the subject could choose any one or more of the following emotions as applicable: happiness, sadness, anger, fear, surprise, and other (please specify). At the end of the experiment the subjects were thanked for their participation and received a cash payment of $12 USD.
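For illustration only, the sketch below summarizes the per-subject protocol as a loop. The clip identifiers, file names, and the show/survey callables are hypothetical placeholders, not the software actually used in the study.

```python
import random

CLIPS = [f"clip_{i:02d}" for i in range(1, 16)]   # 3 clips x 5 emotion categories (hypothetical IDs)
EMOTIONS = ["happiness", "sadness", "anger", "fear", "surprise", "other"]

def run_session(subject_id, show, survey):
    """Run one subject through the protocol: randomized clip order, pulse
    calibration before each clip, and the post-clip survey items listed above."""
    order = random.sample(CLIPS, len(CLIPS))       # new random order per subject
    responses = []
    for clip in order:
        show("calibration_instructions.png")       # step 1: instruction image
        show("countdown_50s.mp4")                  # step 2: 50 s pulse calibration
        show(clip)                                 # step 3: the stimulus clip
        responses.append({                         # step 4: post-video survey
            "subject": subject_id,
            "clip": clip,
            "seen_before": survey("Have you seen this video before?"),
            "felt": survey(f"What did you feel? (any of {EMOTIONS})"),
            "rating_1_to_5": survey("On a scale of 1 to 5, how would you rate the video?"),
            "watch_similar": survey("Would you want to watch similar videos? (Yes/No/Maybe)"),
        })
    return responses
```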
The facial expressions were processed using Affectiva (McDuff et al., 2016) in iMotions, focusing on automatically inferred high-level facial expressions such as joy and anger, as shown in Table 2. All features extracted from iMotions were represented as a numeric value reflecting the confidence that the feature was expressed. Every feature extracted was then aggregated in five ways (min, max, average, median, and standard deviation) for subsequent modeling analysis.
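A minimal sketch of this aggregation step, assuming a frame-level export with one confidence column per expression; the column names and toy values below are illustrative, not the actual iMotions/Affectiva export schema.

```python
import pandas as pd

# Frame-level data: one row per frame, one confidence column per expression.
# Column names and values are illustrative placeholders.
frames = pd.DataFrame({
    "subject": [1, 1, 1, 1],
    "clip": ["clip_01"] * 4,
    "joy": [0.02, 0.85, 0.91, 0.10],
    "anger": [0.00, 0.01, 0.03, 0.00],
})

# The five per-feature aggregates computed over each (subject, clip) pair.
aggregates = (
    frames.groupby(["subject", "clip"])[["joy", "anger"]]
    .agg(["min", "max", "mean", "median", "std"])
)
# Flatten the MultiIndex columns to names like "joy_max" for the classifier.
aggregates.columns = ["_".join(col) for col in aggregates.columns]
print(aggregates)
```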
The estimated pulse was processed into three types of features: (1) pulse derivative, or the change in pulse from one sample to the next; (2) absolute pulse derivative, meaning the absolute value of the pulse derivative; and (3) pulse derivative direction, represented as 1 (increasing), 0 (no change), or -1 (decreasing). We used measures of change as opposed to the exact estimated pulse values because of differences in pulse between individuals as well as concerns about inaccurate values; by focusing on measures of change we mitigated such issues and centered on trends instead.
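A minimal numpy sketch of these three change-based features, assuming a per-sample series of pulse estimates; the function name is our own.

```python
import numpy as np

def pulse_change_features(pulse_estimates):
    """Derive the three change-based pulse features from a per-sample series:
    (1) derivative, (2) absolute derivative, (3) direction in {-1, 0, 1}."""
    pulse = np.asarray(pulse_estimates, dtype=float)
    derivative = np.diff(pulse)                    # change from one sample to the next
    return {
        "pulse_derivative": derivative,
        "abs_pulse_derivative": np.abs(derivative),
        "pulse_derivative_direction": np.sign(derivative).astype(int),
    }

features = pulse_change_features([72, 72, 74, 73, 73, 76])
print(features["pulse_derivative_direction"])      # [ 0  1 -1  0  1]
```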
Subjects. The data collection involved 32 volunteers (17 female and 15 male) recruited on campus through study announcements. Twenty-six reported an age between 18 and 24, and six reported an age between 25 and 44. In the quantitative analysis, five subjects were excluded because of data quality concerns.
Table 1: Examples of the video clips used to elicit emotional reactions. Length of clip in parentheses.
Happy (Despicable Me 2): A man has a date with a woman that goes well. The day after, he is happy and dances around the town. (1:57)
Sad (Marley & Me): A dog is being put down. Flashbacks of the dog's life in a happy family are shown. (2:02)
Anger (Witness): A group of Amish people are entering a town. When they enter, a gang of youths harasses them, as they are unwilling to fight back. (1:16)
Fear (Shining): A video of a frightened boy followed by a slow pan through an empty living room paired with menacing sound. (1:22)
Surprise (Magic show): A man is performing magic where he produces birds out of nowhere. In the end he also reveals a woman that could not be seen during the performance. (2:01)
Figure 3: Theoretical diagram explicating the three approaches for considering data from the viewing process.
Table 2: Facial expression features: Anger, Sadness, Disgust, Joy, Surprise, Fear, Contempt, Smile.
Computational Modeling. As a step towards affective video recommendation, and given the modest size of the dataset, we used a Support Vector Classifier (SVC) from scikit-learn (https://scikit-learn.org) to implement predictive modeling of the subjects' ratings of the videos (5 classes) and whether they would watch similar videos again (3 classes). For the machine learning model, two types of face-based estimated feature modalities were used: facial expressions and pulse. We used ablation to tune the model to well-performing features. For every model trained, accuracy was calculated as the average score across folds in a leave-one-subject-out scheme. We also explored Decision Tree and Random Forest methods, and we compared against a baseline classifier from scikit-learn.
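A hedged sketch of this evaluation setup with scikit-learn, pairing an SVC against a dummy baseline under leave-one-subject-out folds. The placeholder data, feature dimensionality, and the most-frequent baseline strategy are assumptions, and the study's actual feature sets came from the ablation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# X: one row of aggregated face/pulse features per (subject, clip) pair;
# y: the target (video rating 1-5, or the 3-class "watch similar?" answer);
# groups: the subject ID per row, so each fold holds out one subject entirely.
rng = np.random.default_rng(0)
X = rng.normal(size=(27 * 15, 40))                 # placeholder feature matrix
y = rng.integers(1, 6, size=27 * 15)               # placeholder 5-class ratings
groups = np.repeat(np.arange(27), 15)

cv = LeaveOneGroupOut()
svc_scores = cross_val_score(SVC(), X, y, cv=cv, groups=groups)
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=cv, groups=groups
)
print(f"SVC mean accuracy:      {svc_scores.mean():.2f}")
print(f"Baseline mean accuracy: {baseline_scores.mean():.2f}")
```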
To investigate which data from the viewing process enabled prediction, we considered three approaches in the feature aggregation, as shown in Figure 3: (1) the entire video, (2) 10-second windows anchored at time points where viewers demonstrated a significant change in their estimated pulse (with 5 seconds before and after), and (3) the last 10 seconds of the clip, where the clip highlight tended to occur.
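The three windowing strategies can be expressed as index selections over a per-clip feature time series, as in the sketch below. The function name is our own, and the detection of "significant" pulse changes is assumed to happen upstream.

```python
import numpy as np

def window_indices(n_samples, fps, approach, pulse_change_times=None):
    """Return sample indices for one of the three aggregation windows:
    'entire' - the whole clip;
    'pulse'  - 10 s centered on each significant pulse-change time (5 s before/after);
    'end'    - the last 10 s of the clip.
    pulse_change_times: times (seconds) where the estimated pulse changed markedly."""
    if approach == "entire":
        return np.arange(n_samples)
    if approach == "end":
        return np.arange(max(0, n_samples - int(10 * fps)), n_samples)
    if approach == "pulse":
        idx = []
        for t in pulse_change_times or []:
            center = int(t * fps)
            idx.append(np.arange(max(0, center - int(5 * fps)),
                                 min(n_samples, center + int(5 * fps))))
        return np.unique(np.concatenate(idx)) if idx else np.array([], dtype=int)
    raise ValueError(approach)

# Example: a 2-minute clip at 30 fps with pulse-change events at 35 s and 80 s.
idx = window_indices(n_samples=120 * 30, fps=30, approach="pulse",
                     pulse_change_times=[35, 80])
print(len(idx))   # 600 samples: two non-overlapping 10 s windows
```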
4 FINDINGS
We first include example findings from analyzing the data from the participants that shed light on methodological challenges with face-based capture. Second, we provide results from the classification developed based on the face-based estimated features. We also discuss limitations of this study.
Figure 4: A viewer looks away.
4.1 Examples of Challenges
The following examples are indicative of challenges encountered when affective estimates are based merely on face-based capture.
Example 1: Looking Away. Subjects were not monitored by a person in the room. Several subjects at some point in the data collection process began to look around the room or at their phone. When this happened, the pulse estimation lost track of their face, and it also temporarily obstructed the facial expression processing; see Figure 4. A contributing reason could be leaving subjects alone in the experimentation room.
Example 2: Abundance of Joy. While more prominent for happy videos, facial indicators of joy occur for videos of various emotion categories, as shown in Figure 5. We suspect that this is due to subjects smiling when experiencing emotions other than joy. For instance, they could be smiling as a sign of frustration or smiling at something nice in a sad scene. (Hoque et al., 2012) reported on a study which explored how smiles occurred with such noncanonical emotional reactions.
Example 3: Other Visual Reactions. There are strong visual cues for emotional reactions that extend beyond facial movements, which a human could recognize immediately, but that may be missed by facial expression analysis focusing on facial movements. Figure 6 shows a subject shedding tears while viewing a clip intended to elicit sadness; the analysis based on facial movements detected only a small amount of sadness for short periods of time, as indicated by the circled episodes in Figure 6.
4.2 Subpar Predictive Modeling with Face-based Estimations
The results of the ablation, and in turn the best performing classifiers, are in Table 3. Face-based features from across the entire clip appear to generate more accurate models; however, the subpar prediction performance (only a slight improvement over the comparison baseline) suggests that sole reliance on face-based features, which suffer from methodological challenges during capture or processing, did not aid robust prediction, at least not in this case.
There are also other issues with face-based estimates. For instance, one face experienced repeated track loss even though the subject remained still, highlighting nonrobustness to the range of faces.
Limitations. The videos used were intended to elicit emotional reactions yet were intentionally mild to mitigate emotional triggers, and they were at most 3 minutes long, which may have resulted in less emotional expression or the absence of such reactions. In addition, the 27 participants constitute a modest sample size, with implications for the effectiveness of the machine learning modeling.
5 DISCUSSION
This case study identified challenges for face-based data estimates with implications for producing reliable data, data analysis, and predictive modeling of affective reactions, summarized here:
1. Users may not face the camera, or their faces may be obstructed and not capturable.
2. Users often multitask and distribute their attention, which limits face-based estimation.
3. Models are biased towards expected behaviors and fail to identify reactions when users behave unexpectedly.
4. Models do not yet robustly account for the full range of human diversity.
We recognize that several scenarios need to be explored further, such as what happens when multiple faces are tracked simultaneously in a group, as well as the impact of lighting conditions.
Figure 5: Abundance of joy. While happy videos are clearly marked by the estimated smiles, so are videos from other categories.
Figure 6: Tears shed when watching a sad clip, a visual cue going beyond the analyzed facial movements.
Table 3: Mean accuracy from leave-one-subject-out evaluation using SVC with ablation.
Classifier | Question | Entire video | End instances | Pulse instances | No. labels
SVC | Which rating? | 36% | 33% | 31% | 5
Baseline | Which rating? | 27% | 28% | 30% | 5
SVC | Watch similar videos? | 48% | 47% | 47% | 3
Baseline | Watch similar videos? | 46% | 46% | 47% | 3
6 CONCLUSION
Affective video recommendation is an emerging field. While face-based data is an intuitive, unobtrusive modality to consider for this application, methodological
challenges introduce complications, as illustrated in this case study. To begin responding to these challenges, we formulate three recommendations towards human-aware affective video recommendation. First, systems should detect loss of attention measurement and adapt when needed, raising attention with visual or audio cues. Second, robust systems must also leverage multimodal sources of human behavioral data when face-based estimates fail to provide adequate input. Third, face-based software models must be trained with large and diverse sample sizes, accounting for unexpected and uncooperative behaviors.
ACKNOWLEDGEMENTS
We are grateful to Malmö University and Rochester Institute of Technology for their support of the student authors. This material is based upon work supported by the National Science Foundation under Award No. IIS-1559889. Any opinions, findings, and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.
REFERENCES
Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C. M., Kazemzadeh, A., Lee, S., Neumann, U., and Narayanan, S. (2004). Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI '04, pages 205–211, New York, NY, USA. ACM.
Diaz, Y., Alm, C., Nwogu, I., and Bailey, R. (2018). Towards an affective video recommendation system. In Workshop on Human-Centered Computational Sensing at PerCom, pages 137–142.
Hoque, M. E., McDuff, D. J., and Picard, R. W. (2012). Exploring temporal patterns in classifying frustrated and delighted smiles. IEEE Transactions on Affective Computing, 3(3):323–334.
Ioannou, S., Raouzaiou, A., Tzouvaras, V., Mailis, T., Karpouzis, K., and Kollias, S. (2005). Emotion recognition through facial expression analysis based on a neurofuzzy network. Neural Networks, 18:423–435.
McDuff, D., Mahmoud, A. N., Mavadati, M., Amr, M., Turcot, J., and Kaliouby, R. E. (2016). AFFDEX SDK: A cross-platform real-time multi-face expression recognition toolkit. In Kaye, J., Druin, A., Lampe, C., Morris, D., and Hourcade, J. P., editors, CHI Extended Abstracts, pages 3723–3726. ACM.
Poh, M.-Z., McDuff, D. J., and Picard, R. W. (2011). Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE Transactions on Biomedical Engineering, 58(1):7–11.
Rajenderan, A. (2014). An affective movie recommendation system. Master's thesis, Rochester Institute of Technology.
Shea, J. E., Alm, C. O., and Bailey, R. (2018). Contemporary multimodal data collection methodology for reliable inference of authentic surprise.
Tarnowski, P., Kołodziej, M., Majkowski, A., and Rak, R. J. (2017). Emotion recognition using facial expressions. Procedia Computer Science, 108:1175–1184. International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland.
Zhao, S., Yao, H., and Sun, X. (2013). Video classification and recommendation based on affective analysis of viewers. Neurocomputing, 119:101–110.