Measuring Emotion Intensity: Evaluating Resemblance in Neural
Network Facial Animation Controllers and Facial Emotion Corpora
Sheldon Schiffer
Department of Computer Science, Occidental College, 1600 Campus Road, Los Angeles, U.S.A.
Keywords: Autonomous Facial Emotion, Emotion AI, Neural Networks, Animation Control, Video Corpora.
Abstract: Game developers must increasingly consider the degree to which animation emulates the realistic facial
expressions found in cinema. Employing animators and actors to produce cinematic facial animation by
mixing motion capture and hand-crafted animation is labour intensive and costly. Neural network controllers
have shown promise toward autonomous animation that does not rely on pre-captured movement. Previous
work in Computer Graphics and Affective Computing has shown the efficacy of deploying emotion AI in
neural networks to animate the faces of autonomous agents. However, a method of evaluating resemblance
of neural network behaviour in relation to a live-action human referent has yet to be developed. This paper
proposes a combination of statistical methods to evaluate the behavioural resemblance of a neural network
animation controller and the single-actor facial emotion corpora used to train it.
1 INTRODUCTION
As expensive as they are to design and produce,
photo-realistic human agents have become a common
attraction in contemporary video game design of non-
player characters (NPCs). To get them to behave with
emotional veracity, video game developers are using
AI techniques to control facial expressions.
Developers may choose between at least two
approaches. The first evolved through the Computer
Graphics research community. It prioritizes mimetic
resemblance of movement and modelling to the
appearance of a performing actor or model. The
second was developed by the Affective Computing
community, and prioritizes emotional resemblance,
which is the ability of the avatar to autonomously
elicit a facial expression based on an integrated
emotion model. These two approaches have evolved
over several decades using different models of
simulation. The former creates a system of expression
generation based on appearances on the surface of the
agent’s face. The latter attempts to encapsulate an
emotion generation system that is located “inside” the
agent. Both approaches rely on the same two
components: (1) a collection of video samples of the
face from which to extract structured data about a
subject’s facial state and (2) a Neural Network (NN)
controller trained from the aforementioned data, and
programmed to control an avatar’s facial mesh that
resembles the face of the performing actor found in
the collection of video samples.
In this paper we ask, how does a researcher
evaluate the resemblant quality of a NN facial
animation controller and the facial emotion video
corpora on which it was based? An evaluation
technique must determine if the video corpus and NN
architecture that drives an NPC’s facial expression
animation behave in an objectively similar manner to
the original actor’s facial elicitation as depicted in the
video corpus. Past research in the graphics
community has focused evaluation procedures on the
error reported by algorithms that render resulting
animation frames in relation to pre-defined visemes
or expressions. But autonomous emotion in a virtual
agent cannot be performed precisely the same way for
every stimulus. Such consistency would be perceived
as uncanny and mechanical. Thus, we propose a statistical approach that treats variation in facial expressions as arising from unknown probabilistic causes and describes a range of acceptable “error”.
A concern for accuracy is also shared by Affective
Computing researchers. The process of validating the
accuracy of facial emotion elicitation video corpora
has primarily been used for research in the production
of NNs for Facial Emotion Recognition (FER)
software systems. The primary intention of corpus
validation has been to warrant that the emotion label value assignments for each frame (for static images) or each clip (for dynamic images) are statistically consistent. The system of giving intensity
values to emotion names identified in static photos
evolved from a century-old method of recognition
techniques (Darwin and Prodger, 1872/1998).
Contemporary emotion recognition classifies facial
muscle group behaviours into culturally and
linguistically determined emotion names, or “labels”
(Ekman 2006). The use of FERs provides a ground
truth referent on which to model the facial
expressions for NPCs. Developers of games and
interactive media need a method to determine if the two components that influence the behaviours of an animated character or agent, the NN and the video corpus that trained it, are producing facial emotion
elicitations as intended. Thus, video game developers
of photo-realistic characters can draw from the
techniques of both graphical and affective
computation to determine the emotionally resemblant
quality of their corpora and NN. Using some aspects
of both approaches, a method of corpora production
and evaluation can provide consistently evaluated
data sources for training NN controllers. Two
statistical techniques are proposed that provide a
preliminary basis for analysis.
2 RELATED WORK
Research in computer graphics and affective computing was consulted to develop a process for evaluating the resemblance produced by NN controllers and the corpora used to train them.
2.1 Example-Based Animation
New methods of simulating facial elicitation in
Computer Graphics prioritize graphical accuracy of
modelling and animation over emotional autonomy.
Several studies by Paier et al. propose a “hybrid approach” that uses “example-based” video clips for
frame-by-frame facial geometry modelling, texture
capture and mapping, and motion capture (Paier et al.,
2021). In their experiment, a performing actor speaks
a few lines or elicits a set of idiosyncratically defined
gestures. The recording or real time live stream
provides information for automatic geometry and
dynamic facial texture generation (Paier et al., 2020).
Then, a NN using a variational auto-encoder (VAE)
integrates motion for mesh deformation, while
another NN selects animation sequences from an
annotated database. Database annotation of animation
has demonstrated the efficacy of movement data
classification of a single actor that can be used later
for semi-autonomous expression utterances. Their
approach demonstrates highly resemblant avatar
animation for short single-word utterances or single-
gesture elicitations.
The assumption that speech is a primary modality for determining emotional states reflects the belief that facial elicitation is most reliably understood as a function of word utterance. The
emphasis on speech synchronization assumes that the
expressive meaning of an intended facial elicitation
will correspond to the semantic context of the spoken
word. This emphasis is found in a study by
Suwajanakorn et al. (2017) that uses the vast
collections of video samples of a U.S. President.
From a 17-hour corpus, the investigators mapped
speech from persons who were not the subject of their
video corpus, onto a moving and speaking face of the
presidential subject. Their method discovered that
optimal training of their NN benefited from
expression positions of the face of both past and
future video frames to best predict how to synthesize
deformations of the mouth right before, during and at
the completion of spoken utterances. Thus, their NN
incorporated Long Short-Term Memory (LSTM)
cells to predict mouth animation synthesis for the
video of upcoming visemes. For the experiments
conducted for this research, we also deployed LSTM
cells and found them useful for the same benefit.
From the standpoint of a designer of autonomous
agents for video games however, neither approach
mentioned thus far provides a model of fully
autonomous elicitation in response to measurable
stimuli. Both examples show that the use of single-
actor or single-subject corpora is viable for training a
NN to simulate the facial expressions of an actor’s character design or that of a real person. Viability is
made possible with a NN that learns the dynamics of
facial expression based on labelled visemes. This
technique we also integrated through training with
multiple video clips of an actor repeating a
performance in reaction to the same stimuli. This
approach proved useful in our development of NNs
targeted to train specific emotions.
Another distinction in the Computer Graphics
approach is that their corpora structures do not
correlate with widely used psychological
classifications of emotions and their elicitation (e.g.
the six to twelve basic emotions identified by social
and computational psychologists). Neither Paier’s nor
Suwajanakorn’s research discloses a classification system of emotions; therefore, the meaningfulness of their synthesized expressions relies on arbitrarily selected spoken semantics rather than an independently systematized semantics of facial expression. The
graphics approach instead prioritizes frame-by-frame
facsimile of labelled visemes on a real human source
as displayed by its simulated avatar. Nonetheless, the
mimetic quality of results created by the workflows
of both Paier et al. and Suwajanakorn et al. must be
considered for autonomous facial emotion elicitation.
2.2 Performance and Appraisal Theory
Affective Computing has been much more focused on
the history of encoding models of emotions.
Automatic facial expression generation evolved over
several decades, led in part by computational
psychologists and developers. Academic software
developers of virtual agents designed synthetic
emotion elicitation systems for fundamental posture,
gesture and facial expressions that infer emotional
states. The Oz Project, a collection of video game
experiments and research papers realized by Loyall,
Bates and Reilly in the late 1990s, made use of
emotion generation processes actors use to prepare
and train for performance (Loyall, 1997). The Method
approach, a series of practiced exercises as developed
by Russian theater director Constantin Stanislavsky
(Moore, 1984), was combined in the Oz Project with
the emotion system structure of Ortony, Clore &
Collins (OCC) known as Appraisal Theory (Ortony et
al., 1990). Together, these approaches were
implemented in the virtual agents of their experiments
(Bates et al., 1994). While the Oz Project made use of
a theory for motivating emotion elicitation for AI
agents, it did not use an exhaustive dataset from
which to draw machine learnable examples into a
model, as recently evolved by Schiffer et al. (2022).
2.3 Facial Corpora Production
The primary use of facial emotion corpora is to
provide a ground truth baseline for general emotion
recognition systems. This goal is different than
creating a corpus for simulation systems. Therefore,
our research had to adapt the progress of recognition
applications for our intended simulation application.
Published corpora reports indicate their baseline
definitions of static and/or dynamic emotion
elicitations of the human face. Distinctions between
corpora consist of two fundamental feature
categories: the method of production of the video
clips, and the method of validation of the corpus. Clip
production or selection methods diverge in the choice
to use actors as practiced by Bänziger et al. (2011)
versus non-actors (Vidal et al. 2020). Busso et al.
(2017) produced a corpus with tightly scripted
scenarios, while Metallinou et al. (2010) used more
improvisational techniques. As our ultimate goal was
to systematize simulated facial elicitation, we sought
to discover how actors with a scripted scenario
written to generate emotions were used in facial
emotion corpora. We considered how Lucey et al.
(2010) and Bänziger et al. (2011) produced clips in a
controlled studio environment to support an actor’s
undistracted preparation techniques.
Since our objective was simulation of a limited and
targeted set of facial elicitations, evaluation methods
we developed required fewer elicitation variations than
a corpus or NN designed for generic emotion
recognition. Nearly all corpora reference the Facial
Action Coding System (FACS) that correlates groups
of muscles, called Action Units (AUs), to
manipulations of the face to form expressions of at
least six basic emotion labels: anger, disgust, fear,
happiness, sadness, surprise (Cohn et al. 2007). These
labels are used to estimate emotion intensity or provide
perceived levels of arousal and valence (Soleymani,
2014). Our approach used the classification capacity of
FERs to develop single-actor facial emotion corpora on
a targeted set of emotions.
Drawing from the work of Bänziger et al. (2011),
our system similarly segregates all emotion label
values by intensity on the Russell Circumplex Model
(Posner et al., 2005). We sought to train NNs for
specific emotion labels where each would be
designated to control a set of facial AUs of an avatar’s
wireframe mesh modeled after the face of a
performing actor of the corpus. This approach allows
for classifying resemblance by evaluating the
intensity difference between the emotion label values
of the performing actor and their animated avatar.
2.4 Neural Networks for NPCs
Kozasa et al. (2006) showed a preliminary use of an
affective model for an emotive facial system in an
NPC based on a dataset of expressions. Their system used a 3-layer feed-forward artificial neural network to train an NPC from “invented” parameter data, since they claimed that no databases existed at the time to train their model. Later, using appraisal
theory-based design from virtual agents,
Mascarenhas et al. (2021) integrated the FAtiMA
architecture with a NN model in educational games.
Khorrami et al. (2016) showed that using LSTM cells for emotion recognition on facial video
improves NN performance. The method proposed in this
paper also adopts Long Short-Term Memory (LSTM)
cells in a NN. Unlike the methods that this paper’s
research proposes, these previous works did not use
NN models whose emotion elicitation training is
drawn from single-actor video corpora, but instead
chose corpora with clips of multiple human subjects.
3 PRODUCTION METHODS
Unlike many corpora, this research uses a single actor
as the subject of the corpus. The intended use is to train a
NN to control facial animation of a photo-realistic
NPC in a 3d video game. The NPC becomes the
actor’s character Avatar. The general usefulness of
this research is the method of corpora production and
evaluation. A brief overview of our corpus production
and neural network design follows.
3.1 Corpus Production
There are two phases for our corpus production: first,
designing a dyadic conversational scenario and
second, rehearsing and recording video clips.
Scenario design consisted of two characters for actors
to perform asynchronously following a dialog
behaviour tree in the form of a directed acyclic graph.
Actors were cast and rehearsed in preparation of the
video clip recording. One actor played the Stimulus
Source character and recorded a video edited
beforehand of all path variations as if they were
addressing the other character, the Emotion Model.
Then the Emotion Model performed back to the
camera reacting to synchronized video from the pre-
recorded performance paths of the Stimulus Source as
shown in Figure 1.
Figure 1: Setup for sample recording.
The design of the dialog behaviour tree consisted
of distinct fixed start and end nodes with three layers
of six nodes in between. These intermediate layers
allow two possible nodes of dialog turns. No node or
edge could be repeated within a path of the tree as
illustrated in Figure 2. The rules of the graph allow 32
paths through the tree. Each edge segment (the circle
labelled letters) of the tree had a targeted emotion
label. Segments used the same lexical content from
the tree, though all paths had distinct dialog
sequences. With the variations in paths, the dialog
behaviour tree created a permutable performance
structure with stimuli for elicitations to occur.
Figure 2: Box nodes at dialog turns (3, 4, 5, 6) and monolog
events (0, 1, 2, 7). Edges represent mental actions. Orange-
colored nodes were used in the data analysis. As an acyclic
graph, a path can use edge C xor F and I xor L.
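To illustrate how the path count follows from the acyclic structure, the sketch below counts start-to-end paths over a small hypothetical adjacency list; the node names are placeholders and do not reproduce the exact topology of Figure 2, but the same traversal applied to the corpus tree produces its 32 paths.

# Hypothetical adjacency list standing in for the dialog behaviour tree; the
# actual topology of Figure 2 (including its xor-constrained edges) differs,
# but the path-counting logic is the same.
EXAMPLE_TREE = {
    "start": ["n1", "n2"],
    "n1": ["n3", "n4"],
    "n2": ["n3", "n4"],
    "n3": ["end"],
    "n4": ["end"],
    "end": [],
}

def count_paths(tree, node="start", goal="end"):
    """Count distinct start-to-end paths through an acyclic dialog tree."""
    if node == goal:
        return 1
    return sum(count_paths(tree, nxt, goal) for nxt in tree[node])

# count_paths(EXAMPLE_TREE) == 4; the corpus tree yields 32 such paths.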
Each path was recorded 9 times using three
distinct degrees of intensity of action: high, medium,
and low, to provide more variation to train the NN as
practiced by Wingenbach, Ashwin and Brosnan
(2016). Thus 9 clips times 32 paths yielded 288 total
clips of the Emotion Model.
3.2 Post-Processing Emotion Analysis
Each clip was post-processed by the FER, Noldus
FaceReader 8, for frame-based emotion analysis. The
output data consisted of 7 normalized emotion label values: one for each of six emotions (happiness, sadness, anger, fear, surprise, disgust) plus neutral. A new tuple of seven normalized values was output 3 times
each second. Noldus FaceReader is a tested and
ranked FER system that has produced emotion
recognition validation results that match the accuracy
of human annotators (Skiendziel, Rösch and
Measuring Emotion Intensity: Evaluating Resemblance in Neural Network Facial Animation Controllers and Facial Emotion Corpora
163
Schultheiss 2019). Furthermore, the recognition accuracy rate of Noldus FaceReader has been documented to be as high as 94%. Its output is machine-
readable text consisting of tuple instances of emotion
recognition scores for each frame of video. These
scores became the data used for training the NN using
the Python language and the TensorFlow library. The
FER was used to analyse all emotion values, plus
neutral.
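As an illustration of this post-processing step, the minimal Python sketch below loads one clip’s FER output into a frame-indexed table. The tab-separated layout and the column names (“Video Time” plus the seven label columns) are assumptions, since the actual FaceReader export format depends on the software version and project settings.

import pandas as pd

# Column names below are assumptions about the exported analysis log.
EMOTION_COLUMNS = ["Neutral", "Happy", "Sad", "Angry",
                   "Surprised", "Scared", "Disgusted"]

def load_fer_log(path: str) -> pd.DataFrame:
    """Read one clip's FER output into a frame-indexed table of the seven
    normalized emotion label values (three tuples per second)."""
    df = pd.read_csv(path, sep="\t")
    df = df[["Video Time"] + EMOTION_COLUMNS].copy()
    # Coerce any non-numeric entries (e.g. failed-fit markers) to NaN, then drop.
    df[EMOTION_COLUMNS] = df[EMOTION_COLUMNS].apply(pd.to_numeric, errors="coerce")
    return df.dropna().reset_index(drop=True)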
3.3 From Emotion Model to Avatar
The Emotion Model was orthographically
photographed from overlapping angles to produce a
photorealistic head mesh that resembles the actor.
The head mesh was generated by the FaceBuilder
plugin for the 3d animation software system Blender.
FaceBuilder is a modelling tool for supporting 3d
head animation with a facial rig whose vertex groups
are controlled by shape key actuators within Blender.
These shape keys were designed to move the same
alignments of facial muscle groups defined in the
AUs of FACS. The head mesh and the shape keys
embedded in the facial rig were deployed in the game
engine Unity 2022. The shape keys were put into
autonomous motion by programmable blend shapes
in Unity that receive streamed emotion data from the
NN animation controller responding to the face of the
Stimulus Source. The embedded NN receives the
FER data and “reacts” to it in a way that intends to
statistically resemble the character behaviour the
actor created in the video clips in the single-actor
corpus. The NN generates prediction data in the form
of normalized emotion label values as a means of
autonomously controlling the FaceBuilder head mesh
to animate facial expressions.
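The mapping itself is implemented as blend shapes inside Unity; the Python sketch below only illustrates the underlying idea of scaling AU-aligned shape keys by a predicted emotion intensity. The shape-key names and weights are hypothetical placeholders, not the rig’s actual actuators.

# Hypothetical mapping from an emotion label to the AU-aligned shape keys
# that express it; the real mapping is realised as blend shapes in Unity.
SHAPE_KEYS_FOR_EMOTION = {
    "Sad":   {"AU1_InnerBrowRaiser": 0.6, "AU4_BrowLowerer": 0.4,
              "AU15_LipCornerDepressor": 1.0},
    "Angry": {"AU4_BrowLowerer": 1.0, "AU23_LipTightener": 0.7},
}

def shape_key_weights(emotion: str, intensity: float) -> dict:
    """Scale each shape key of the emotion by the predicted intensity (0..1)."""
    intensity = max(0.0, min(1.0, intensity))
    return {key: weight * intensity
            for key, weight in SHAPE_KEYS_FOR_EMOTION.get(emotion, {}).items()}

# Example: shape_key_weights("Sad", 0.42) scales the sadness keys by 0.42.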
Figure 4: Developing Emotion Model avatar with Blender
plugin Keen Tools FaceBuilder.
3.4 Recurrent Neural Network Design
Among the clips generated for this research, 68.8% of
the corpus (198 clips) was used only for training the
NN. 20.1% of the corpus (58 clips) was used only for
validation. The remaining 11.1% of the corpus (32
clips) was used to test the NN’s behavioural
resemblance to the actor corpora on which it was
trained.
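A minimal sketch of this clip-level split follows; it assumes the 288 clips are available as a list of file paths and that the assignment was randomized, which is not specified above.

import random

def split_corpus(clip_paths, seed=0):
    """Split the 288 corpus clips into training (198), validation (58) and
    test (32) subsets at the clip level, never at the frame level."""
    clips = sorted(clip_paths)
    random.Random(seed).shuffle(clips)   # randomized assignment (assumed)
    return clips[:198], clips[198:256], clips[256:288]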
The principal components of the NN follow a
Recurrent Neural Network (RNN) design. Each
component of the neural network was selected for its
probabilistic ability to choose values of coefficient
weights and biases for specific input features of the
data that the NN was trying to predict. Predicting the
facial elicitation of game characters based on training
data from an actor’s performance requires spatial and
temporal data representation. For our experiments,
facial feature positions were estimated from their
spatial contexts using Dense architectures (fully
connected). We used a Dense layer of perceptrons that was fed by two layers of bi-directional LSTM cells.
The LSTM layers auto-regressively receive data from
10 seconds in the past using 3 instances of emotion
label data per second. Since the data for this
experiment was fed pre-processed emotion data
tables (as opposed to a live video stream), the NN
analysed 10 seconds into the future as well. Temporal
relations of elicitation events in the data were
processed by LSTM cell layers, while spatial
relations of facial features were handled by the Dense
cell layer. Each emotion label was assigned its own
NN, so the designed recurrent NN was cloned into a
team of 7 NNs and trained on synchronized data
generated from each elicited emotion from the
Emotion Model and the Stimulus Source.
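A minimal TensorFlow/Keras sketch consistent with this description is given below: two bi-directional LSTM layers feeding Dense layers, cloned once per emotion label. The window length (30 steps of past and 30 of future context at 3 samples per second), layer widths, output head, loss and optimizer are assumptions not stated above.

import tensorflow as tf

N_LABELS = 7       # six emotion labels plus neutral from the FER output
PAST_STEPS = 30    # 10 s of history at 3 emotion tuples per second
FUTURE_STEPS = 30  # 10 s of lookahead, available because clips are pre-processed

def build_emotion_controller(units: int = 64) -> tf.keras.Model:
    """One controller per emotion label: windows of Stimulus Source emotion
    tuples in, a single predicted Emotion Model intensity (0..1) out."""
    window = PAST_STEPS + 1 + FUTURE_STEPS
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(window, N_LABELS)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units, return_sequences=True)),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units)),
        tf.keras.layers.Dense(units, activation="relu"),
        # Output head (assumed): one normalized intensity for the target label.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Cloned into a team of 7, one model per emotion label in the FER output.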
4 EVALUATION METHODS
The evaluation methods proposed provide the
developer of NPCs a quantitative process that
measures behavioural difference. Optimally
minimized difference in data can be interpreted as
statistical resemblance. It is our intention to
demonstrate statistically, that given the same or
similar stimulus, prediction data from the NN can
control animation that resembles observed FER-
generated data of the human Emotion Model on
which it was trained. The resemblance then depends
on minimizing the amount of error between predicted
behaviour performed by the Avatar and observed data
performed by the Emotion Model. But the predicted
data is not a single set for each instance in time.
Instead, each time-instance within a path through the
dialog behaviour tree is a video frame shared by at
least 9 video clips and their edge segments, as well as
other paths that share the same edge segment.
Therefore, since all the video clips are precisely
HUCAPP 2023 - 7th International Conference on Human Computer Interaction Theory and Applications
164
synchronized, each of the frames in the experiment
has a mean emotion value drawn from at least 9 clips
of the same edge segment. And this value can be used
to calculate error in relation to the predicted value at
that frame demonstrated by the NN. By looking at the
difference within ±1 standard deviation, two useful
statistical properties provide the results to determine
statistical resemblance. By calculating the Percentage
of Extreme Residuals (PER) that fall outside of ±1
standard deviation, a first test of resemblance can be
applied to the prediction data. To support those
results, the Root Mean Square Error (RMSE)
provides an amplification of variance. If the neutral
emotion label values are used as a benchmark, RMSE
becomes an additional statistical property to show if
a NN facial animation controller resembles the
character facial movement the actor generated, and if
the corpus that trained the NN is sufficiently robust to
confirm resemblance to any emotion labels.
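Because the clips are frame-synchronized, the observed values for one emotion over one edge segment can be stacked into a clips-by-frames array, from which the per-frame mean and the ±1 standard deviation band follow directly, as in the sketch below (array names are illustrative).

import numpy as np

def observed_band(observed: np.ndarray):
    """observed: array of shape (n_clips, n_frames) holding one emotion's FER
    values for every synchronized clip covering the same edge segment.
    Returns the per-frame mean and the lower/upper limits of the ±1 SD band."""
    mu = observed.mean(axis=0)
    sigma = observed.std(axis=0)
    return mu, mu - sigma, mu + sigma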
4.1 Percent of Extreme Residuals
For each emotion and for any segment or combination
of segments of a path through the dialog behaviour
tree, it is useful to know how many frames have mean
emotion values that fall outside ±1 standard deviation
from the mean of observed values at that frame. For
this research, the Percentage of Extreme Residuals
(PER) is calculated as follows:
PER = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\big(\,|p_i - \mu_i| > \sigma_i\,\big), \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{j=1}^{m} (x_{ij} - \mu_i)^2}
where n is the size of the set of all predicted emotion values and p_i is the predicted value at the i-th frame for emotion values in the dialog behaviour tree edge segment set. For each p_i whose absolute difference from the frame mean exceeds σ_i, the standard deviation of the observed emotion values at the i-th frame of an edge segment, the sum is incremented by 1 before dividing by n. To calculate ±σ_i, m is the count of all video clips j that cover the edge segment at the i-th frame, x_ij is the observed value of clip j at the i-th frame, and μ_i is the mean of all observed values at the i-th frame for a given emotion and edge segment.
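A direct numpy implementation of this definition might look as follows, assuming the observed values are stacked as a clips-by-frames array and the predictions as a frame-length vector.

import numpy as np

def percent_extreme_residuals(predicted: np.ndarray, observed: np.ndarray) -> float:
    """predicted: shape (n_frames,), the NN output for one emotion label.
    observed: shape (n_clips, n_frames), synchronized FER values for the same
    edge segments. A frame counts as an extreme residual when the prediction
    falls outside the per-frame mean ± 1 standard deviation."""
    mu = observed.mean(axis=0)
    sigma = observed.std(axis=0)
    extreme = np.abs(predicted - mu) > sigma
    return float(extreme.mean())   # e.g. 0.0794 for neutral in Table 2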
PER is the first measurement to consider.
Validation of a NN facial animation controller
should not require the facial mesh behaviour to
exactly animate the same way every time when it
receives the same input from the Stimulus Source.
But how wide should a range of variance be to seem
human-like? Consider ±1 standard deviation. With
the mean of any emotion value at each frame as a
reference, a range of resemblance can be defined
around the mean by using the standard deviation. The
experiments of this research treat as resemblant the frames where the predicted emotion value of a given frame falls within ±1 standard deviation of the mean. For this condition to occur, the predicted value falls within a value space containing roughly 68% of the observed values (assuming an approximately normal distribution). Frames that fall outside ±1 standard deviation are determined to be not resemblant.
4.2 Root Mean Square Error
Statistical methods proposed in this research consider
the mathematical characteristics of non-linear
regressive models deployed in NN design. The NN
deployed in this research uses RNN components: a
layer of LSTM cells that include sigmoid and tanh as
component functions, both of which are nonlinear.
Since the regression model embedded in the NN used
non-linear functions to autonomously elicit emotion,
the experiment used statistical methods that interpret
variance and are suited to non-linear regression.
The Root Mean Square Error (RMSE) is used to
measure facial elicitation resemblance as a function
of variance between two synchronized time series
data sets: observed test data and generated prediction
data. The differences between the observed and predicted values, referred to as residuals, are aggregated in magnitude across the data set. The resulting value is always positive, where
0 is a lower bound representing a perfect fit between
the observed and predicted data. The mean of
observed values for each emotion at each frame was
used as a baseline to compare amplified variance
between the emotion labels. The RMSE is calculated
as follows:
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - y_{p_i})^2}
where y_i is the observed value at the i-th frame, y_{p_i} is the predicted value for the i-th frame, and n is the population count, or total number of values in the population of observed values.
observed values. Under the radical sign, RMSE
aggregates and amplifies the variance by squaring the
summed difference before dividing by the population
count n. A subset of squared differences will be
magnified quadratically and summed once before
being squared. Meanwhile, a subset of squared and
summed smaller differences will diminish
quadratically. Thus, the extreme differences, large or
small, will be amplified. Since emotion values of
video frames of facial emotion elicitations are
normalized to fractions in a 0 to 1 scale,
magnification of variance provides a visible contour
of the behaviour of residual values in relation to the
mean of observed values. A low RMSE will lean
toward greater resemblance provided the PER is also
resemblant.
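The corresponding computation is a short numpy sketch, shown below for a per-frame mean of observed values and a synchronized prediction vector.

import numpy as np

def rmse(observed_mean: np.ndarray, predicted: np.ndarray) -> float:
    """Root Mean Square Error between the per-frame mean of observed values
    and the synchronized NN predictions (both of length n)."""
    residuals = observed_mean - predicted
    return float(np.sqrt(np.mean(residuals ** 2)))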
5 RESULTS
Statistical error scores were computed to determine the behavioural resemblance between the per-frame mean of observed emotion values and the predicted values for each of the 6 emotions plus neutral. See Table 1.
Table 1: Sums of Emotion Values (Emotions Analysed on 126 Frames; 14 Clips with Edges B, G, H, I, K, M, N, P).

Emotion      Frame Value Sum   % of Neutral
Neutral      871.41            --
Angry        60.64             6.95
Disgusted    7.95              0.91
Happy        40.90             4.69
Sad          705.88            81.00
Scared       54.65             6.3
Surprised    54.40             6.24
Neutral was also computed and is used as a
benchmark from which to evaluate the accuracy of
other emotions. Neutral is the absence of emotion and
is theoretically at 1.0 when all other emotion values
are 0.0. Neutral nearly always has the highest
summation of accumulated emotion values over time
as its values increase each time the face returns to
neutral-dominant positions during transition to and
during the listening phase of dyadic conversation. In
all tests for this research, neutral value summations
exceeded all other emotion value summations for any
edge segment. Neutral therefore has the highest
probability of yielding the highest value instances of
any randomly analysed frame.
The highest summation of observed values should
provide the lowest percentage of PER errors for the
predicted values of emotion elicitations. As shown in
Table 2, neutral PER is 0.0794, the lowest of all
recognizable labels. Therefore, following the behaviour of neutral, the emotion with the next highest frame value sum may provide proportionally resemblant results: the higher the percentage of an emotion’s sum relative to the sum of neutral, the more resemblant that emotion’s values are expected to be in relation to the neutral scores. Table 1 shows the emotion with
the highest percentage close to neutral for frame value
sums is sadness with 81%. The next highest is anger
at 6.95%. With such a distant second position and so
far off the benchmark of neutral, one should doubt the
resemblance of anger, while taking note of sadness.
Table 2: Error for Edges B, G, H, I, K, M, N, P (Emotions Analysed on 126 Frames; Error Between Mean of Observed and Predicted Data).

Emotion      meanSD   RMSE     meanRMSE   PER
Neutral      0.1680   0.1857   0.1791     0.0794
Anger        0.0516   0.0783   0.0705     0.2698
Disgust      0.0071   0.0140   0.0093     0.2619
Happiness    0.0555   0.0891   0.0646     0.4365
Sadness      0.1982   0.2438   0.2354     0.1984
Fear         0.0378   0.0481   0.0431     0.1984
Surprise     0.0597   0.0801   0.0667     0.1984
The next consideration is the spread of emotion
values for each frame of each emotion. One may
notice in Table 3 that neutral again behaves as a
benchmark, evenly distributing values with a
relatively smooth and centred distribution. Sadness, the emotion closest to neutral, also shows an even distribution, though somewhat skewed toward the lower quintiles. All other emotions are far less evenly distributed, with most of their values compressed into the first and second quintiles.
Table 3: Proportion of Values per Emotion Value Range (Emotions Analysed on 126 Frames; 14 Clips with Edges B, G, H, I, K, M, N, P).

Emotion      <0.2    <0.4    <0.6    <0.8    <1.0
Neutral      0.054   0.302   0.339   0.229   0.076
Anger        0.959   0.036   0.004   0.002   0.0
Disgust      0.999   0.001   0.0     0.0     0.0
Happiness    0.977   0.010   0.004   0.004   0.005
Sadness      0.258   0.255   0.239   0.202   0.045
Fear         0.979   0.018   0.002   0.0     0.0
Surprise     0.954   0.033   0.007   0.006   0.0
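The proportions in Table 3 can be reproduced by binning each emotion’s frame values into the five 0.2-wide ranges of the normalized scale, as in the sketch below.

import numpy as np

def value_proportions(values: np.ndarray) -> np.ndarray:
    """Proportion of one emotion's frame values falling in each 0.2-wide bin
    of the normalized 0..1 range, matching the columns of Table 3."""
    counts, _ = np.histogram(values, bins=[0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
    return counts / counts.sum()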
Lastly, again consider the error data as seen in
Table 2. Interpreting error requires the mean emotion
values of any emotion to be high enough so that the
region of the standard deviation is nearly all above
zero. If the standard deviation region is clipped by a
zero-value line, then the prediction values will likely
rest above near-zero as well, providing no “bottom
room” to dip below the standard deviation region.
Unlike sadness (Figure 6) and neutral (Figure 5), the predicted data for scared, surprised, happiness and disgust show that the NN and corpora of this research are unreliable for those labels: their standard deviation regions drift over the zero-value line, causing the PER and RMSE to appear to support accuracy when in fact the NN is reacting with very little elicitation response to the given stimulus (flat-lining). Most interestingly, sadness (Figure 6) and, to some degree, anger (Figure 7) show promising responsiveness to the stimulus, reacting in ways similar to the per-frame mean of emotion values in the observed test data, as indicated by their PER scores in Table 2. The RMSE score for sadness, at 0.2438, is 0.0581 higher than that of neutral at 0.1857. Anger shows a lower RMSE, but its values are still too low to be fully reliable, with much of its standard deviation region clipped by the zero line and the observed test data also dropping to zero for nearly 20 frames.
The difference between the RMSE and the
meanRMSE is that while the RMSE score looks only
at the difference between the mean of the observed
test data and the predicted data, the meanRMSE is the
mean of all the plotted RMSE scores shown in red in each of Figures 5, 6 and 7. The plotted values show
the RMSE for the chosen emotion at each of the 126
frames examined in relation to the same synchronized
frame in the predicted value. The fact that the two
RMSE scores are close in value, provides a check on
the accuracy of the error assessment process.
Figure 5: Neutral provides a benchmark for other emotions.
Figure 6: Mean of Observed values in midrange with
standard deviation region unclipped.
One inexplicable anomaly is an apparent minimum PER value of 0.1984 in the test results shown in Table 2. Mathematically, for the 126-frame results, exactly 25 frames fell outside the standard deviation for each of three emotions, sadness, fear and surprise, and 25/126 ≈ 0.1984. It remains unclear whether this is a coincidence or caused by the test design.
6 CONCLUSION
Thus far, this paper has identified several statistical
properties relative to neutral: RMSE, PER, and
Spread of Emotion Values. What has been
demonstrated is that at least one emotion label,
sadness, was successfully simulated. What might be
useful is a classification system for each of these
measurements that would provide discrete labels
within a range. Such a classification system could
indicate if the results will lead to a NN model that will
output predicted emotion values that range from
resemblant to not resemblant in relation to those
values elicited by its human actor referent. The aim
of this research is to expand the creative process of
character design for video games beyond the modeller
and animator and toward the skills of the actor. For
the methods proposed to become useful, they must
also produce salient results that confirm resemblance.
Thus far, this research has demonstrated a statistical
method to validate resemblance. Further investigation
should confirm its viability as a method of production
for game character production workflow.
REFERENCES
Bates, J., Loyall, A. B., Reilly, W., (1994). An Architecture
for Action, Emotion, and Social Behavior, In Artificial
Social Systems: Fourth European Workshop on
Modeling Autonomous Agents in a Multi-Agent World,
Springer-Verlag, Berlin.
Barros, P., Churamani, N., Lakomkin, E., Siquiera, H.,
Sutherland, A., Wermter, S. (2018). The OMG-
Emotion Behavior Dataset. In Proceedings of the
International Joint Conference on Neural Networks.
Bänziger, T., Mortillaro, M., Scherer, K.R., (2011).
Introducing the Geneva Multimodal Expression Corpus
for Experimental Research on Emotion Perception. In
Emotion. vol. 12, no. 5. American Psychological
Association, New York, NY, USA. 1161-1179.
Busso, C., Burmania, A., Sadoughi. N. (2017). MSP-
IMPROV: An Acted Corpus of Dyadic Interactions to
Study Emotion Perception. In Transactions on
Affective Computing, vol.10, no. 10. 67-80. New York:
IEEE.
Cohn, J., Ambadar, Z., Ekman, P. (2007). Observer-Based
Measurement of Facial Expression with the Facial
Action Coding System, in Handbook of Emotion
Elicitation and Assessment, eds. Coan, J. A., and Allen,
J. B., Oxford University Press.
Darwin, C., Prodger, P. (1872/1998). The Expression of the
Emotions in Man and Animals. Oxford University
Press, USA.
Ekman, P. (Ed.). (2006). Darwin and Facial Expression: A
Century of Research in Review. Cambridge, MA:
Malor Books, Institute for the Study of Human
Knowledge.
Khorrami, P., Le Paine, T., Brady, K., Dagli, C. and Huang,
T.S., (2016). How Deep Neural Networks Can Improve
Emotion Recognition on Video Data, in IEEE
International Conference on Image Processing 2016,
New York, NY, USA: IEEE, pp. 619-623.
Kozasa, C, Fukutake, H., Notsu, H., Okada, Y., Niijima, K.,
(2006). Facial Animation Using Emotional Model,
International Conference on Computer Graphics,
Imaging and Visualization, pp. 428-433.
Lewinski, P., Den Uyl, T. M., Butler, C. (2014). Automated
Facial Coding: Validation of Basic Emotions and FACS
AUs in FaceReader. Journal of Neuroscience,
Psychology, and Economics 7.4 (2014): 227.
Loyall, A. B., (1997). Believable Agents: Building
Interactive Personalities (No. CMU-CS-97-123),
Carnegie-Mellon University, Department of Computer
Science, accessed 12 October 2022 at: https://www.cs.cmu.edu/afs/cs/project/oz/web/papers/CMU-CS-97-123.pdf
Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar, Z.,
Matthews, I. (2010). The Extended Cohn-Kanade
Dataset (CK+). In 2010 IEEE Computer Society
Conference on Computer Vision and Pattern
Recognition-Workshops. pp. 94-101.
Mascarenhas, S., Guimarães, M., Santos, P.A., Dias, J.,
Prada, R., Paiva, A., (2021). FAtiMA Toolkit - Toward an Effective and Accessible Tool for the Development of Intelligent Virtual Agents and Social Robots. arXiv preprint arXiv:2103.03020.
Metallinou, A., Lee, C., Busso, C., Carnicke, S.,
Narayanan, S. (2010). The USC CreativeIT Database:
A Multimodal Database of Theatrical Improvisation. In
Proceedings of Multimodal Corpora: Advances in
Capturing, Coding and Analyzing Multimodality.
Moore, S. (1984). The Stanislavski System: The
Professional Training of an Actor, Penguin Books, New
York, NY, USA, pp. 41-46.
Ortony, A., Clore, G. L., and Collins, A., (1990). The Cognitive Structure of Emotions. Cambridge, UK: Cambridge University Press, pp. 34-58.
Paier, W., Hilsmann, A., and Eisert, P. (2021). Example-
Based Facial Animation of Virtual Reality Avatars
Using Auto-Regressive Neural Networks. IEEE
Computer Graphics and Applications, 41(4), pp. 52-63.
Paier, W., Hilsmann, A., and Eisert, P. (2020). Neural face
models for example-based visual speech synthesis. In
European Conference on Visual Media Production, pp.
1-10.
Posner, J., Russell, J. A., and Peterson, B. S. (2005). The
circumplex model of affect: An integrative approach to
affective neuroscience, cognitive development, and
psychopathology. Development and psychopathology,
17(3), 715-734.
Schiffer, S., Zhang, S., Levine, M. (2022). Facial Emotion
Expression Corpora for Training Game Character
Neural Network Models. VISIGRAPP.
Skiendziel, T., Rösch, A. G., Schultheiss, O.C. (2019).
Assessing the Convergent Validity Between Noldus
FaceReader 7 and Facial Action Coding System
Scoring. In PloS one 14.10 (2019): e0223905.
Soleymani, M., Larson, M., Pun, T., and Hanjalic, A.
(2014). Corpus Development for Affective Video
Indexing. In IEEE Transactions on Multimedia, 16(4),
pp. 1075-1089.
Suwajanakorn, S., Seitz, S. M., Kemelmacher-Shlizerman,
I. (2017). Synthesizing Obama: Learning Lip Sync
from Audio. ACM Transactions on Graphics, 36(4), pp.
1-13.
Vidal, A. Salman, A. Lin, W., Busso, C. (2020). MSP- Face
Corpus: A Natural Audiovisual Emotional Database. In
Proceedings of the 2020 International Conference on
Multimodal Interaction, pp. 397-405.
Wingenbach, T., Ashwin, C., Brosnan, M. (2016).
Validation of the Amsterdam Dynamic Facial
expression set – Bath Intensity Variation (ADFES-
BIV), In PLoS ONE 11(1): e0147112.