Assessing Facial Expressions in Virtual Reality Environments
Catarina Runa Miranda and Verónica Costa Orvalho
Instituto de Telecomunicações, Universidade do Porto, Porto, Portugal
Keywords:
Facial Motion Capture, Emotion and Expressions recognition, Virtual Reality.
Abstract:
Humans rely on facial expressions to transmit information, such as mood and intentions, usually not provided by
the verbal communication channels. Recent advances in consumer-level Virtual Reality (VR) (Oculus VR 2014)
created a shift in the way we interact with each other and with digital media. Today, we can enter a virtual
environment and communicate through a 3D character. Hence, to reproduce the users' facial expressions in VR
scenarios, we need on-the-fly animation of the embodied 3D characters. However, current facial animation
approaches based on Motion Capture (MoCap) fail under the persistent partial occlusions produced by VR headsets.
The only available solution for this occlusion problem is not suitable for consumer-level applications, as it
depends on complex hardware and calibrations. In this work, we propose consumer-level methods for facial MoCap
in VR environments. We start by deploying an occlusion-support method for generic facial MoCap systems. Then,
we extract facial features to create Random Forests algorithms that estimate emotions and movements in occluded
facial regions. Through our novel methods, MoCap approaches are able to track non-occluded facial movements and
estimate movements in occluded regions, without additional hardware or tedious calibrations. We deliver and
validate solutions that facilitate face-to-face communication through facial expressions in VR environments.
1 INTRODUCTION
In the last two decades, we have lived through a revolution in global digital interaction and communication
between humans (Jack and Jack, 2013). We erased geographic barriers and started communicating with each other
through phones, computers and, more recently, inside virtual environments using Virtual Reality (VR) headsets.
The Oculus VR company was responsible for bringing this hardware to the consumer level, making this way of
interaction more appealing to common users (Oculus VR 2014). However, VR communication remains a challenge.
Human communication relies strongly on a synergistic combination of verbal (e.g. speech) and non-verbal
(e.g. facial expressions and gestures) signals between interlocutors (Jack and Jack, 2013). Past communication
technologies, like phones and computers, adopted image streams (e.g. webcams) coupled with speech to transmit
both signals, creating more realistic and complete experiences (Lang et al., 2012). In VR scenarios, we cannot
use an image stream, since we interact with the virtual world embodied in 3D characters (Biocca, 1997;
Slater, 2014). As a result, the demand for on-the-fly algorithms for 3D character animation and interaction is
even higher. Beyond unlocking both communication channels (i.e. verbal and non-verbal), the believable
animation of 3D characters driven by the user's movements enhances the three components of the sense of
embodiment in VR environments: self-location, agency and body ownership (Biocca, 1997; Kilteni et al., 2012).
Even with technological advances in Computer Vision (CV) and Computer Graphics (CG), the reproduction of human
facial expressions as facial animation of 3D characters is still hard to achieve (Pighin and Lewis, 2006). To
automate facial animation, facial Motion Capture (MoCap) has been widely used to trigger animation (Cao et al.,
2014; von der Pahlen et al., 2014; Cao et al., 2013; Li et al., 2013; Weise et al., 2011). However, these
approaches are not suitable for consumer-level VR applications, since they either require expensive setups
(von der Pahlen et al., 2014) or complex manual calibrations (Cao et al., 2013; Li et al., 2013; Weise et al.,
2011), or do not support the persistent partial occlusion of the face produced by VR headsets (Cao et al., 2014).
To overcome the tracking problem created by persistent partial occlusions, Li et al. (Li et al., 2015) proposed
a hardware-based solution using an RGB-D camera for capture and strain gauges (i.e. flexible metal foil sensors)
attached to the VR headset to measure the upper face movements that are occluded. But
again, this approach is not suitable for the general user. It requires a complex calibration, composed of a
hardware calibration to the user and a blendshape calibration to trigger animation. At the moment, this is the
only on-the-fly MoCap-based facial animation solution compatible with VR environments.
Contributions: This work delivers and validates consumer-level real-time methods for: (i) facial MoCap under
the persistent partial occlusions created by VR headsets and (ii) prediction of facial expressions in the
occluded face region using movements tracked in the non-occluded region. Compared to the literature, we reduce
user-dependent calibration and hardware requirements, requiring only a common RGB camera for capture. Our
methods make current facial MoCap approaches compatible with VR environments and enable the extraction of key
facial movements from the bottom and upper face regions. The tracked movements and detected emotions can be
combined to trigger on-the-fly facial animation, enabling non-verbal communication in VR scenarios, or used as
input for emotion-based applications, like emotional gaming (e.g. Left 4 Dead 2 by Valve).
2 BACKGROUND
In this section, we review the literature on two topics: (i) facial MoCap solutions for the persistent partial
occlusions created by VR Head Mounted Displays (HMDs) and (ii) the impact of partial occlusions on facial
expressiveness. The first topic presents state-of-the-art facial MoCap solutions that overcome the persistent
occlusion issue. Then, in (ii), we explore how these occlusions restrict face-to-face communication and their
impact on facial expressiveness. Finally, we look for a connection between occluded and non-occluded facial
parts, which we use as a guide for the methodology definition.
2.1 Persistent Partial Occlusions:
A Today’s Problem
In the literature, we can find several promising solutions for real-time automatic facial MoCap (Cao et al.,
2014; von der Pahlen et al., 2014; Cao et al., 2013; Li et al., 2013; Weise et al., 2011). However, the arrival
of commercial consumer-level HMDs (Oculus VR 2014) raised a new issue: the real-time automatic tracking of
faces partially occluded by hardware (i.e. persistent partial occlusions of the face) (Slater, 2014). Current
MoCap approaches adopt model-based trackers, which produce cumulative errors in the presence of persistent
partial occlusions (Cao et al., 2014). Due to the absence of VR devices in the mass market, this issue was
almost ignored for years, resulting in a lack of technological solutions for face-to-face communication in VR
environments. Only in 2015, Li et al. (Li et al., 2015) highlighted this problem and proposed a hardware-based
tracking solution. This solution uses an RGB-D camera combined with eight ultra-thin strain gauges placed on
the foam liner for surface strain measurements, to track the upper face movements occluded by the HMD. The
first limitation of this approach is the long initial calibration required to fit the measurements to each
individual's face using a training sequence of FACS (Ekman and Friesen, 1978). Also, on subsequent wearings by
the same person, a shorter calibration is needed to re-adapt the hardware measurements. This training step
allows the detection of the user's upper and bottom face expressions and activates a blendshape rig containing
the full range of FACS shapes (Ekman and Friesen, 1978). Besides the manipulation complexity, the solution also
presents drifts and decreased accuracy due to variations in the pressure distribution caused by HMD placement
and head orientation. As a consequence, the positioning of the HMD straps influences eyebrow movement detection
(Li et al., 2015). Li et al.'s solution is currently the only one available to overcome the persistent partial
occlusion issue, making this an open research topic in CV algorithms for facial MoCap.
2.2 Partial Occlusions and
Expressiveness
Every day, human communication uses facial expressions and emotions to transmit and enhance information not
provided by speech (Lang et al., 2012). Even through technology, we always search for a way to use the
non-verbal communication channel: for example, using a video stream of our faces, or virtual representations,
like emoticons, cartoons or 3D characters with pre-defined facial expressions. Understanding facial expressions
and improving their representation in 3D characters is one of the key challenges of CG and plays an important
role in the digital economy (Jack and Jack, 2013). This role is even more relevant now, with the recent
advances in VR communication at the consumer level (Biocca, 1997). But how can we use the common facial
animation solutions, like MoCap, if the user's face is occluded? Are we able to represent faces using
information only from the bottom of the face? To answer these questions, we provide a literature overview of
the impact of the several face regions on non-verbal communication. The goal is to understand how a partial
occlusion of the face affects communication. We also searched for a relationship between occluded and
non-occluded facial parts through emotion-based and biomechanics studies. This information was used to build
one of this work's hypotheses.
In a study about face perception (Fuentes et al., 2013), we concluded that humans have independent shape
representations of the upper and bottom parts of the face. Similar conclusions are found in the emotion
perception literature, where the mouth and eyes play different roles (Eisenbarth and Alpers, 2011; Lang et al.,
2012; Bombari et al., 2013). In (Eisenbarth and Alpers, 2011; Bombari et al., 2013), it is shown that, according
to the emotion presented, participants used information from the eyes, the mouth or both. More precisely, for
happy expressions participants used information from the mouth; for sad and angry, from the eyes; and for fear
and neutral, both mouth and eyes were used. For additional information about non-verbal communication, we refer
the reader to (Lang et al., 2012). Taking these statements into account, if we occlude a certain region of the
face, face-to-face communication is affected and we may not be able to decode expressions properly.
Consequently, tracking only certain facial regions, like the mouth, is not enough for emotion recognition, for
proper communication and to generate believable facial animation of 3D characters.
From the biomechanical point of view, we know that facial muscles work synergistically to create expressions.
The muscles interweave with one another, making it difficult to delineate their boundaries, since their terminal
ends are interlaced with other muscles. A detailed review of facial anatomy and biomechanics can be found in
Chapter 3 of the book Computer Facial Animation (Parke and Waters, 1996). Several studies in CG applied the
biomechanical approach to create coding systems. These coding systems parameterize the human face, enabling a
faster generation of facial expressions in 3D characters (Ekman and Friesen, 1978; Pandzic and Forchheimer,
2003; Magnenat-Thalmann et al., 1988). However, they do not provide a clear solution for estimating facial
expressions constrained to certain regions of the face. Furthermore, the definition and prediction of facial
expressions is even harder when the diversity of facial expressions is considered. Scott McCloud (McCloud,
2006) explains the infinite possibilities of facial expression combinations, i.e. the way mixing any two
universal emotions can generate a third expression, which, in many cases, is also distinct and recognizable
enough to earn its own name.
Analyzing the literature, we can thus conclude that the occlusions generated by VR devices affect communication
and that using only the information from non-occluded regions is not enough to animate a 3D character. However,
biomechanics and facial animation coding systems show a connection between the different facial regions and how
diverse and complex the space of possible expressions is. Based on these statements, we describe a novel
methodology to overcome the occlusion problem of facial MoCap and then to assess facial expressions using
non-occluded face information.
3 METHODOLOGY
The literature overview of the previous section allowed us to formulate the following hypothesis:
it is possible to estimate upper face expressions and emotions using only the bottom face's movements.
Therefore, we deliver VR consumer-level methods that:
overcome the persistent partial occlusion issue in MoCap, making the tracking of the bottom face's movements possible;
recognize the universal emotions, plus neutral (Ekman and Friesen, 1975; Jack and Jack, 2013), using the bottom face's movements;
estimate upper face movements (i.e. eyebrow movements) using information tracked from the bottom part of the face.
Figure 1 shows the connection between our VR methods. We start by presenting a method that makes generic MoCap
systems compatible with the persistent partial occlusions produced by VR headsets. Then, applying this
algorithm, we are able to properly track the bottom face features and use them to develop methods that predict
the following facial expressions: (i) the universal emotions, plus neutral (Ekman and Friesen, 1975; Jack and
Jack, 2013) and (ii) eyebrow movements. Combining the aforementioned methods, we make possible the MoCap of
upper and bottom face movements and the estimation of facial emotions under the persistent partial occlusions
created by VR headsets.
As setup, we suggest using a Head Mounted Camera (HMC) combined with the VR HMD (see Figure 2). First, we
justify the adoption of an HMC as capture hardware: when users are inside the VR environment, they are not
aware of the physical space around them. The VR devices substitute the users' sensory input and transform the
meaning of their motor outputs with reference to an exactly knowable alternate reality (Slater, 2014).
Figure 1: VR methods' framework.
Hence, users move and react to impulses from the VR environment. If we want to capture their faces, we have to
attach a capture device (i.e. a camera) to their bodies, so that the device follows their movements (see the
HMC in Figure 2). A static camera is not viable, because users cannot place themselves in a position proper for
capture. A similar setup was also proposed by Li et al. (Li et al., 2015), but we removed the strain sensors.
In the next subsections, we provide a complete de-
scription of the VR methods.
3.1 VR Persistent Partial Occlusions: A
Novel Method
To deploy our occlusion-support method for facial MoCap, we start from the following observation: we know the
kind of occlusion created by the HMD, so we know which part of the face is occluded. We also know that MoCap
algorithms fail in these situations because they rely on a face model; when the face is occluded, this model no
longer fits, since a full face is not being captured.
Figure 2: VR setup definition.
Figure 3: VR method: Persistent partial occlusions. From left to right: calibration image without VR HMD; our
method uses the cut point (red circle) to cut the calibration image and overlay it on subsequent images: at
left, what the facial MoCap method sees is a full face and, at right, the real image.
As a solution, we use the knowledge that the occluded region is the upper part of the face to "re-create" the
whole face.
Our novel method overlays the upper part of the face captured in a neutral pose during calibration. Firstly, we
assume that the highest visible point of the face is the nose and define it as the cut point (this point can be
changed to fit the occlusion created by a particular HMD). Then, we detect the cut point with the MoCap system,
cut the upper part of the calibration image (i.e. the streamed frame) from the nose up, and overlay it on all
subsequent camera/video frames. Hence, the occluded part of the face is replaced with a static neutral face,
and the MoCap system is able to detect the features in the combined half-static/half-expressive face (see
Figure 3). We ensure a proper re-creation of the face because the HMC removes the user's head movements, i.e.
the user's face is in the same position during calibration and in the subsequent streamed images.
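As an illustration, the following minimal NumPy sketch shows how such an overlay could be implemented, assuming
the HMC keeps the head fixed in the image so that a constant cut row approximates the nose position detected at
calibration; the frame sizes and the row value are placeholders, not part of our implementation (in practice the
frames would come from the HMC stream, e.g. via OpenCV).

```python
import numpy as np

def build_overlay(calibration_frame, cut_row):
    # Keep the upper part of the neutral calibration frame (captured without the HMD),
    # from the top of the image down to the cut point (e.g. the nose row).
    return calibration_frame[:cut_row].copy()

def recreate_face(frame, overlay):
    # Paste the neutral upper face over the occluded region of the incoming frame,
    # producing the half-static / half-expressive face handed to the MoCap tracker.
    out = frame.copy()
    out[:overlay.shape[0]] = overlay
    return out

# Hypothetical usage with placeholder frames; the cut row would come from the
# nose landmark detected in the calibration image.
calibration = np.zeros((480, 640, 3), dtype=np.uint8)   # neutral face, no HMD worn
live_frame = np.zeros((480, 640, 3), dtype=np.uint8)    # current frame, upper face occluded
overlay = build_overlay(calibration, cut_row=240)
recreated = recreate_face(live_frame, overlay)           # fed to the tracker instead of live_frame
```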
3.2 VR Assessing Facial Expressions
During the development of VR facial expressions
method, we applied the facial features and machine learning know-how from our previous real-time emotion
recognition research (Loconsole et al., 2014). For this novel method, we set the following goals: real-time
recognition of the universal emotions (Ekman and Friesen, 1975) and upper face expression prediction under VR
scenarios. We aim to track facial expressions beyond emotions alone, in order to obtain a wider range of facial
expressions and a better coverage and representation of the diversity of faces (McCloud, 1993). In contrast to
the emotion classification method of (Loconsole et al., 2014), where we needed to reduce the number of tracked
features, in VR scenarios we have to maximize the information tracked in the bottom part of the face. Therefore,
the feature extraction method should retrieve enough information to allow an accurate prediction of facial
expressions by the machine learning algorithm.
As a solution, we propose to use all the features tracked in the bottom face region (see the blue rectangle in
Figure 4) and apply a geometrical feature extraction algorithm. This algorithm is defined as the Euclidean
distance between the neutral face features (stored during the calibration step of the persistent partial
occlusions method) and the features of the current frame (i.e. the current instant in time).
Summarizing, for each tracked feature p at a given instant i, we calculate the distance D(p_i, p_c):

D(p_i, p_c) = \sqrt{(p_i(x) - p_c(x))^2 + (p_i(y) - p_c(y))^2} = \lVert p_i - p_c \rVert,

where:
p_i is the 2D bottom face feature p at the instant i in time;
p_c is the 2D bottom face feature p of the neutral expression captured during calibration;
\lVert p_i - p_c \rVert is the norm between p_i and p_c in Cartesian space.
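For illustration, here is a minimal Python/NumPy sketch of this geometric feature extraction, assuming the
tracker returns 2D pixel coordinates for the bottom face landmarks; the landmark values below are hypothetical.

```python
import numpy as np

def geometric_features(current_pts, neutral_pts):
    # Euclidean distance D(p_i, p_c) between each tracked bottom-face feature p_i
    # and its neutral (calibration) counterpart p_c.
    current_pts = np.asarray(current_pts, dtype=float)   # shape (N, 2): (x, y) landmarks
    neutral_pts = np.asarray(neutral_pts, dtype=float)
    return np.linalg.norm(current_pts - neutral_pts, axis=1)

# Hypothetical example with three bottom-face landmarks (pixel coordinates):
neutral = [(100, 200), (140, 205), (180, 200)]
current = [(100, 210), (141, 212), (180, 209)]
features = geometric_features(current, neutral)  # feature vector fed to the classifiers
```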
Since the occlusion produced varies according to the VR headset used, we also created machine learning models
that assess facial expressions using the bottom face features with and without the nose features. The bottom
face features without the nose features can be used with different kinds of HMDs, since the nose region is the
one affected by the device size.
To create the machine learning models that predict the emotions and upper face expressions, we used the
Cohn-Kanade (CK+) database (Lucey et al., 2010). The CK+ database contains posed and spontaneous sequences from
210 participants (cross-cultural adults of both genders). Each sequence starts with a neutral expression and
proceeds to a peak expression. These sequences are FACS coded and emotion labeled. The transition between
neutral and peak expression allowed us to detect spontaneous expressions and not only pure full expressions.
To implement the algorithms, we adopted the GPU version of Random Forests (Breiman, 2001) from OpenCV (OpenCV,
2014) to generate the respective machine learning models for real-time prediction. As the facial MoCap system
for testing, we deployed the Saragih et al. (Saragih et al., 2011) tracker (see the green tracking landmarks in
Figure 4).
3.2.1 VR Emotion Recognition: Novel Method
In the preprocessing stage, we create the Random Forests model that is used to predict emotions in real-time
(Loconsole et al., 2014). To build the model for emotion classification, we applied the facial MoCap method to
each database sequence and extracted the bottom face features. Using the first frame of the sequence as the
neutral expression, for each subsequent frame in the sequence we calculate the distance D(p_i, p_c) between the
bottom face features of the current frame and those of the neutral expression's frame. Thus, to train the
machine learning model for emotion recognition, we used the aforementioned geometrical extraction algorithm:
the distances D(p_i, p_c) of the bottom face features of each frame. As the response value for each calculated
distance vector, we used the respective CK+ emotion label (see the blue processes in Figure 4).
As observed in Figure 2, at runtime we apply our occlusion-support method once and store the neutral face
features. This step is executed only one time per user. Afterwards, at runtime, the adapted facial MoCap system
delivers the bottom face's movements and the distance D(p_i, p_c) is calculated for each feature p. The
resulting group of distances is used as input to the Random Forests classifier, which predicts the user's
emotion represented by those distances and the respective accuracy percentage.
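The following sketch illustrates this training/prediction pipeline using scikit-learn's RandomForestClassifier
as a CPU stand-in for the GPU OpenCV Random Forests of our implementation; the data arrays are random
placeholders standing in for the CK+ distance features and emotion labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data: one row per CK+ frame, one column per bottom-face
# distance D(p_i, p_c); labels are the emotion codes of each frame.
rng = np.random.default_rng(0)
X_train = rng.random((500, 20))          # stand-in for the extracted distance vectors
y_train = rng.integers(0, 7, size=500)   # stand-in for 6 emotions + neutral

emotion_model = RandomForestClassifier(n_estimators=100, random_state=0)
emotion_model.fit(X_train, y_train)

# Runtime: distances of the current tracked frame relative to the stored neutral frame.
current_distances = rng.random((1, 20))  # stand-in for the D(p_i, p_c) vector of one frame
emotion = emotion_model.predict(current_distances)[0]
confidence = emotion_model.predict_proba(current_distances).max()  # reported alongside the label
```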
3.2.2 VR Facial Expressions Predictor: Novel
Method
To build the upper face expressions model, we also applied the distance between neutral and expressive bottom
face features as the geometric extraction algorithm. However, we had to define the movements we wanted to
predict in order to create specific tags for the training process. For simplicity, we set as upper face
expressions the prediction of eyebrow movements, i.e. detecting whether the eyebrows are going up or down, and
"how much" they are moving compared to a neutral position. This last parameter is measured as a percentage of
movement up/down relative to the neutral expression.
Figure 4: VR methods: Expressions predictor training (purple) and Emotion predictor training (blue) with CK+
database.
Similarly to the assumption made in (Fuentes et al., 2013), we assume symmetry of the eyebrow movements. To
define the tags, we calculated the Euclidean distance D(p_i, p_c) between the neutral position of the eyebrows
and their positions in the other frames of the sequence. If the average displacement of the eyebrow features
indicated that they were going up, we tagged the frame "up"; conversely, if the eyebrows went down, we tagged it
"down" (we used the image coordinate system, so this distance is negative when the eyebrows go up and
vice-versa). For each tagged frame of the sequence we also saved the percentage of movement relative to the
neutral position (up or down). As a result, for each frame of each participant's sequence in the CK+ database we
tagged: eyebrows "up" or "down", plus the percentage of movement. The purple processes in Figure 4 illustrate
the method's framework.
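A minimal sketch of this tagging step is given below, assuming image coordinates (y grows downward) and a
hypothetical maximum eyebrow excursion used to normalize the percentage, which is an assumption of the sketch
rather than a value specified by our method.

```python
import numpy as np

def tag_eyebrows(neutral_brow_pts, frame_brow_pts, max_displacement=20.0):
    # Image coordinates grow downward, so a negative mean vertical displacement
    # means the eyebrows moved up relative to the neutral frame.
    neutral = np.asarray(neutral_brow_pts, dtype=float)
    current = np.asarray(frame_brow_pts, dtype=float)
    dy = float(np.mean(current[:, 1] - neutral[:, 1]))
    tag = "up" if dy < 0 else "down"
    # max_displacement is a hypothetical normalization constant for the percentage.
    percentage = min(abs(dy) / max_displacement, 1.0) * 100.0
    return tag, percentage

# Hypothetical example: two eyebrow landmarks that rose by about 5 pixels.
tag, pct = tag_eyebrows([(120, 150), (160, 150)], [(120, 145), (161, 146)])
```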
In the preprocessing stage, we trained two Random Forests models with the same input data, the distances
D(p_i, p_c) between the neutral and current bottom face features, but with one of the following response values
for each frame of each CK+ sequence:
"up" and the percentage of movement, if the eyebrows are rising;
"down" and the percentage of movement, if the eyebrows are descending.
Since we use a GPU implementation of the classifier, with high computational performance, we trained two models
to maximize the prediction accuracy of the eyebrow movements: one to predict the rising movement and another to
predict the opposite. At runtime, we apply the defined geometrical feature extraction to the bottom face
features tracked by the adapted MoCap. The extracted features are used as input to both Random Forests
classifiers, which retrieve one of the predictions:
1. eyebrows "rising" and the percentage of movement;
2. eyebrows "descending" and the percentage of movement.
Since we use two different classifiers, there is a possibility that both models simultaneously return an "up"
and a "down" movement. As a solution, our method compares the accuracies of the two classifiers' predictions,
and the result delivered is the one with the higher accuracy.
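A minimal sketch of this conflict resolution, assuming scikit-learn-style classifiers (as in the earlier
sketches) whose highest class probability is taken as the prediction confidence:

```python
import numpy as np

def resolve_eyebrow_prediction(up_model, down_model, distances):
    # Both classifiers score the same feature vector; when they disagree,
    # the prediction with the higher confidence wins.
    x = np.asarray(distances, dtype=float).reshape(1, -1)
    up_conf = float(up_model.predict_proba(x).max())
    down_conf = float(down_model.predict_proba(x).max())
    if up_conf >= down_conf:
        return "up", up_conf
    return "down", down_conf
```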
4 RESULTS AND VALIDATION
In this section, we show the results and statistical val-
idation of the methods proposed. Statistical analy-
sis was performed using R software (R Core Team,
2013).
4.1 VR Persistent Partial Occlusions
To test our occlusion method, we applied it to the Saragih et al. (Saragih et al., 2011) and Cao et al.
(Cao et al., 2014) MoCap systems (see Figures 5 and 6, respectively). In Figure 7, we test a generic partial
occlusion created by a piece of paper.
As observed in Figures 5, 6 and 7, our occlusion-support method adapts MoCap systems, making them compatible
with persistent partial occlusions. The "paper" test case represents a generic occlusion created by an arbitrary
VR device. We conclude that our method is not only adaptable to different MoCap systems, but can also handle
generic partial occlusions created by different VR HMDs.
4.2 VR Assessing Facial Expressions
We divided the validation of our prediction methods into two steps: (i) statistical validation and (ii) visual
validation.
To statistically validate our machine learning classifiers, we adopted k-Fold Cross Validation (k-Fold CRM)
with k=10 (Rodriguez et al., 2010).
Figure 5: VR method results: Persistent Partial Occlusions method applied to the Saragih et al. (Saragih et al.,
2011) MoCap. The real image (left), our method's result and what the MoCap processes (middle), and the final
result from our method (right).
Figure 6: VR method results: Persistent Partial Occlusions method applied to the Cao et al. (Cao et al., 2014)
MoCap.
Figure 7: VR method results: Persistent Partial Occlusions method applied to the Cao et al. MoCap algorithm
(Cao et al., 2014) to overcome a general occlusion created by a piece of paper.
The k-Fold CRM divides the input data into k slices and iterates k times: in each iteration, a classifier is
trained on k-1 slices and the remaining slice is used as the test set, allowing us to calculate the accuracy of
each of the k classifiers. The final accuracy value is the average of the k calculated accuracies. Thus, for
each method we analyze the k-Fold CRM accuracy under different scenarios. We highlight that this validation
procedure ensures that the test dataset is never the same as the training dataset; therefore, prediction
accuracies are not calculated with test data contained in the training dataset.
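For illustration, the sketch below reproduces this 10-fold validation scheme with scikit-learn's
cross_val_score; the feature matrix and labels are random placeholders standing in for the CK+ distance
features and emotion labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 20))              # placeholder feature distances
y = rng.integers(0, 7, size=500)       # placeholder emotion labels

# 10-fold cross validation: each fold is held out once while the model trains on the rest.
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=10)
print(round(scores.mean() * 100, 2), "+/-", round(scores.std() * 100, 2))
```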
Furthermore, we provide a statistical analysis of sensitivity versus specificity and positive versus negative
predictive value (i.e. pred. in the Tables) (Parikh et al., 2008). Sensitivity measures the performance of the
classifier in correctly predicting the actual class of an item, while specificity measures its performance in
not assigning that class to an item of a different class. Summarizing, sensitivity and specificity measure the
true positive and true negative performance, respectively. We added the positive and negative predictive value
analysis because these values reflect the probability that a positive/negative prediction is correct, given
knowledge about the prevalence of each class in the analyzed data.
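As a reference for how these four measures relate to the confusion counts, here is a minimal sketch with
hypothetical one-vs-rest counts (the numbers are illustrative only):

```python
def binary_metrics(tp, fp, fn, tn):
    # One-vs-rest counts for a single class (e.g. one emotion against all others).
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    return sensitivity, specificity, ppv, npv

# Hypothetical confusion counts for one class:
print(binary_metrics(tp=120, fp=30, fn=25, tn=425))
```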
At the end of this section, we visually validate our VR methods regarding occlusions, emotion recognition and
facial expressions prediction. The visual validation data was acquired in our laboratory and is not part of the
training dataset (learning was performed with the CK+ database). The visual data was not acquired with an HMC,
but we asked the participants to avoid extreme head movements. As a result, we were able to test our VR
occlusion-support method and the facial expressions methods simultaneously.
4.2.1 VR Emotion Recognition
Using the k-Fold CRM, we validated the method under two emotion recognition scenarios: (i) the six universal
emotions of Ekman and Friesen (Ekman and Friesen, 1975), plus neutral; (ii) the four universal emotions of Jack
(Jack and Jack, 2013), plus neutral. The six universal emotions (Ekman and Friesen, 1975) are the ones commonly
used and accepted in the literature. However, recent advances in the psychology of emotions show that these
emotions are not reproducible across different cultures. The non-universality of Ekman's emotions is explored
in the survey (Jack and Jack, 2013). This study argues that only a subset of the six "universal" emotions is
universally recognized, i.e. Joy/Happy, Surprise, Anger and Sad/Sadness. This subset excludes fear and disgust,
since these emotions present low recognition across cultures, being biologically adaptive movements derived
from the emotions surprise and anger, respectively (Jack and Jack, 2013).
Table 1 shows the k-Fold CRM accuracies for the two scenarios.
Table 1: k-Fold CRM accuracy comparison for scenario (i) and scenario (ii). Results in percentage (%).
Emotions k-Fold Accuracy (%) 95% Confidence Interval
Six (Ekman and Friesen, 1978) 64.80 [61.72;67.79]
Four (Jack and Jack, 2013) 69.07 [65.59;72.40]
In Table 1, we observe an increase in accuracy when recognizing four emotions, compared to the six-emotion
classification. This result is not surprising, since we are reducing the number of emotions predicted. In
addition, we note that the bottom face features allow only a weak recognition of facial emotions, resulting in
accuracies lower than 70%.
In more detail, Tables 2 and 3 report a statistical analysis of each emotion recognition obtained with the
Random Forests classifier for scenarios (i) and (ii), respectively.
Both statistical analyses resulted in a p-value lower than 2.2 × 10^-16 at a significance level of 5%, which
validates our method's hypothesis: classifying the six/four universal emotions using bottom face feature
tracking. Specifically, for scenario (i) in Table 2, we observe an overall low sensitivity for the classified
emotions (with the exceptions of Joy/Happy and Neutral). The opposite is observed for specificity. This
indicates that the method does not detect a given class with high accuracy, but it also rarely predicts it
incorrectly. The predictive values, weighted using information about class prevalence in the population, show
an overall increase of accuracy for true positives while the negative values are maintained. For example, for
Surprise, despite our classifier only being able to positively identify surprise 59.40% of the time, there is a
71.82% chance that, when it does, the classification is correct. Looking at Table 3, compared to the results of
scenario (i) in Table 2, we observe an increase in sensitivity while maintaining high specificity. In general,
the same is observed in the positive and negative predictive values. This is expected, since decreasing the
number of emotion classes reduces the degree of confusion and leads to a better split between classes, resulting
in a better emotion recognition method. These results confirm the statement of the Background section, i.e.
bottom face features provide incomplete information about the facial expression of emotions. Still, our method
performs better when the four universal emotions (Jack and Jack, 2013) are classified.
4.2.2 VR Facial Expressions Predictor
To analyze and validate the VR facial expressions predictor, we executed the k-Fold cross-validation for the
eyebrows "rising" classifier and for the eyebrows "descending" classifier. Taking into account the variation of
nose tracking with the type of HMD used, we propose to study the influence of tracking the nose features
(subset S1) versus not tracking them (subset S2) on the prediction of eyebrow movements. The average k-Fold CRM
accuracies and the respective confidence intervals can be found in Table 4.
In Table 4, we observe a small decrease in accuracy when the nose feature tracking is removed. However, the
confidence intervals show that this decrease is only significant for eyebrows "up" detection. Our method
achieves high performance for eyebrows "up" estimation (at least 85%) compared to eyebrows "down" estimation
(at least 66%). The different results arise from the fact that we are using an emotion database for training,
where there is more data describing the "rising" movement than the opposite (i.e. only the anger and sadness
emotions usually present this facial expression behavior (Ekman and Friesen, 1978)).
Similarly to the emotion recognition method, we present the statistical analysis of sensitivity/specificity and
positive/negative predictive values for both eyebrow movements using the subsets S1 and S2.
Both p-values of this further analysis are lower than the significance level (i.e. a p-value of
2.2 × 10^-16 < 0.05). Therefore, both methods are suitable for eyebrow movement estimation using the bottom
face's movements. Table 5 shows that the method is able to classify the eyebrows "up" movement accurately, with
the exception of specificity using subset S2. So, removing the nose feature tracking leads, essentially, to a
decrease in the classifier's accuracy in not giving incorrect predictions. However, when we take into account
the prevalence of the class in the population, the overall prediction accuracy for both positive and negative
values increases, presenting values above 84.04%.
Table 6 contains the statistical analysis for the prediction of the eyebrows "descending" movement with (S1)
and without (S2) nose feature tracking.
Observing Table 6, we see that our method correctly predicts the "descending" movement of the eyebrows at least
73.18% of the time and does not predict this movement incorrectly at least 63.97% of the time. The lower values
are obtained for subset S2; however, the differences in performance between the subsets are not significant.
Similar behavior is observed when taking into account the prevalence of the class in the population: the
positive/negative predictive values are not significantly different from the sensitivity/specificity. As
expected from the previous k-Fold CRM results, the prediction of the "descending" movement presents lower
performance compared to the prediction of the opposite movement. Again, this result is due to the low prevalence
of the "down" class in the population. This statement is confirmed by the lower influence shown in the positive
and negative predictive values when compared to sensitivity and specificity, respectively.
Table 2: Statistical analysis of scenario (i). Results in percentage (%).
Anger Disgust Fear Joy Sadness Surprise Neutral
Sensitivity 53.15 39.44 26.09 81.29 12.70 59.40 90.80
Specificity 86.55 97.70 95.84 95.17 99.13 96.35 85.39
Positive pred. 40.21 57.14 39.34 75.90 50.00 71.82 75.51
Negative pred. 91.56 95.40 92.62 96.45 94.31 93.81 94.92
Table 3: Statistical analysis of scenario (ii). Results in percentage (%).
Anger Joy Sadness Surprise Neutral
Sensitivity 75.50 77.85 13.80 68.75 80.09
Specificity 76.16 95.14 99.07 98.39 91.34
Positive pred. 45.06 81.46 66.67 88.51 80.44
Negative pred. 92.31 94.00 89.52 94.59 91.16
Table 4: k-Fold CRM accuracy comparison for the facial expressions assessed (eyebrows up or down) with subsets
S1 and S2. Results in percentage (%).
Eyebrows movements k-Fold Accuracy (%) 95% Confidence Interval
Up S1 91.47 [89.76;92.98]
Up S2 87.02 [84.97;88.89]
Down S1 70.63 [67.99;73.18]
Down S2 69.13 [66.40;71.76]
Table 5: Eyebrows Up prediction - statistical analysis for subsets S1 and S2. Results in percentage (%).
Eyebrows Up S1 S2
Sensitivity 97.34 96.27
Specificity 71.79 59.18
Positive pred. 92.04 87.65
Negative pred. 92.31 84.06
Table 6: Eyebrows Down prediction - statistical analysis for subsets S1 and S2. Results in percentage (%).
Eyebrows Down S1 S2
Sensitivity 77.13 73.18
Specificity 62.73 63.97
Positive pred. 71.57 72.09
Negative pred. 69.28 65.23
Summarizing, our facial expressions prediction methods are suitable for estimating eyebrow movements using
features from the bottom of the face, especially for estimating the "rising" movement. This conclusion
corroborates the hypothesis of this work: our results reflect a connection between bottom and upper face
behaviors.
4.2.3 VR Assessing Facial Expressions: Visual
Validation
Applying the methods to videos where participants expressed emotions (Ekman and Friesen, 1975), we can visually
check the performance of the methods: occlusion support, emotion recognition and expression prediction. We
chose a non-VR scenario in order to verify whether the upper face movements and emotions predicted (using only
the bottom face's movements) match the original facial expressions. Results can be observed in Figures 8, 9, 10
and 11.
Looking through the Figures, we verify that our occlusion method is able to "re-create" the face even without
an HMC. Regarding emotion recognition using only the facial features (green dots), Figures 8, 9 and 10 show
three examples of correct classification. Figure 11 presents an example of an incorrect emotion recognition:
the classifier returned Anger when the emotion label of the video was Sad. This confusion is expected, since
the bottom face features inherent to the Anger and Sad emotions are identical (Ekman and Friesen, 1975).
Regarding the facial expressions prediction method, in Figures 8 and 11 we observe that the algorithm correctly
estimates eyebrows "down", which is confirmed by the original images. The same is seen in Figure 9 for the
eyebrows "up" predictor. Moreover, in Figure 10, comparing the eyebrows of the analyzed image and the original
image, we observe no movement, which translates into a correct estimation of no movement by both predictors.
Figure 8: VR Assessing Facial Expressions: Emotion Recognition result (blue) and Expression Predictor result
(green). Note that our emotion and movement predictions match the eyebrow movements of the original image
(green box).
Figure 9: VR Assessing Facial Expressions: Emotion Recognition result (blue) and Expression Predictor result
(red). Note that our emotion and movement predictions match the eyebrow movements of the original image
(green box).
Figure 10: VR Assessing Facial Expressions: Correct Emotion Recognition result (blue) and no Expression
Predictor result, since there is no movement. Check the original image in the green box.
Figure 11: VR Assessing Facial Expressions: Incorrect Emotion Recognition result (blue) and Expression Predictor
result. Check the original image to see that the Expression Predictor is correct (green box).
5 CONCLUSIONS
This work delivers VR consumer-level methods that achieve three goals: making MoCap systems compatible with
persistent partial occlusions, real-time recognition of universal emotions, and real-time prediction of upper
face movements using bottom face feature tracking. Combining the three deployed methods, we are able to track
facial expressions in real-time from both non-occluded and occluded facial regions. These methods lead to
improvements in the three components of the sense of embodiment, i.e. they enhance the sense of self-location,
agency and body ownership within VR environments (Kilteni et al., 2012).
Analyzing the results, we conclude that the three proposed goals were achieved. We deliver a method that makes
MoCap systems able to track bottom face features under generic partial occlusions created by different HMDs.
Note that we do not deliver a method able to overcome generic and unpredicted facial occlusions, since we
require knowledge of the occluded area. Then, using these facial features, we defined methodologies for
real-time recognition of the four universal emotions (Jack and Jack, 2013) with an accuracy of 69.07% and for
the prediction of facial movements in the occluded regions, i.e. eyebrows "rising" with an accuracy of 91.47%
and "descending" with an accuracy of 70.63%. The results obtained with the facial expressions prediction method
confirmed our hypothesis: although the bottom face features are not enough to describe the six emotions of
Ekman and Friesen (Ekman and Friesen, 1975), our facial expressions predictor decodes a connection between
bottom face and upper face features. As explained in the methodology, the combination of the tracked and
predicted emotions and expressions gives us access to a wide range of facial expressions, enabling us to
represent the diversity of faces (McCloud, 1993). This conclusion opens new lines of research on predicting
more complex facial movements, even when we are not able to track them using CV algorithms. Furthermore, our
methods' outputs enable the real-time animation of 3D characters, since we deliver facial feature information
combined with emotions, suitable for driving different types of rigs.
Beyond 3D character animation, our methods are suitable for emotion-based applications, like affective virtual
environments, advertising or emotional gaming.
As future work, we aim to define a transfer algorithm and use the estimated movements and emotions to trigger
facial animation. Furthermore, we intend to study how estimating additional facial behaviors (e.g. forehead and
eye movements) and combining speech data can improve the animation and user embodiment in VR environments.
ACKNOWLEDGEMENTS
This work is supported by Instituto de Telecomunicações (Project Incentivo ref: Projeto
Incentivo/EEI/LA0008/2014 and project UID ref: UID/EEA/5008/2013) and the University of Porto. The authors
would like to thank Elena Kokkinara from Trinity College Dublin and Pedro Mendes from the University of Porto
for their support given at the beginning of the project.
REFERENCES
Biocca, F. (1997). The cyborg’s dilemma: Progressive
embodiment in virtual environments. Journal of
Computer-Mediated Communication, 3(2):0–0.
Bombari, D., Schmid, P. C., Schmid Mast, M., Birri, S.,
Mast, F. W., and Lobmaier, J. S. (2013). Emotion
recognition: The role of featural and configural face
information. The Quarterly Journal of Experimental
Psychology, 66(12):2426–2442.
Breiman, L. (2001). Random forests. Machine learning,
45(1):5–32.
Cao, C., Hou, Q., and Zhou, K. (2014). Displaced dynamic
expression regression for real-time facial tracking and
animation. ACM Transactions on Graphics (TOG),
33(4):43.
Cao, C., Weng, Y., Lin, S., and Zhou, K. (2013). 3d shape
regression for real-time facial animation. ACM Trans.
Graph., 32(4):41.
Eisenbarth, H. and Alpers, G. W. (2011). Happy mouth
and sad eyes: scanning emotional facial expressions.
Emotion, 11(4):860.
Ekman, P. and Friesen, W. (1978). Facial Action Coding
System: A Technique for the Measurement of Facial
Movement. Consulting Psychologists Press, Palo Alto.
Ekman, P. and Friesen, W. V. (1975). Unmasking the face:
A guide to recognizing emotions from facial cues.
Fuentes, C. T., Runa, C., Blanco, X. A., Orvalho, V., and
Haggard, P. (2013). Does my face fit?: A face image
task reveals structure and distortions of facial feature
representation. PloS one, 8(10):e76805.
Jack, R. E. and Jack, R. E. (2013). Culture and facial expressions of emotion. Visual Cognition, 00(00):1–39.
Kilteni, K., Groten, R., and Slater, M. (2012). The sense of
embodiment in virtual reality. Presence: Teleopera-
tors and Virtual Environments, 21(4):373–387.
Lang, C., Wachsmuth, S., Hanheide, M., and Wersing, H.
(2012). Facial communicative signals. International
Journal of Social Robotics, 4(3):249–262.
Li, H., Trutoiu, L., Olszewski, K., Wei, L., Trutna, T.,
Hsieh, P.-L., Nicholls, A., and Ma, C. (2015). Fa-
cial performance sensing head-mounted display. ACM
Transactions on Graphics (Proceedings SIGGRAPH
2015), 34(4).
Li, H., Yu, J., Ye, Y., and Bregler, C. (2013). Realtime facial
animation with on-the-fly correctives. ACM Transac-
tions on Graphics, 32(4).
Loconsole, C., Runa Miranda, C., Augusto, G., Frisoli, G., and Costa Orvalho, V. (2014). Real-time emotion
recognition: a novel method for geometrical facial features extraction. 9th International Joint Conference on
Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP 2014), 01:378–385.
Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar,
Z., and Matthews, I. (2010). The extended cohn-
kanade dataset (ck+): A complete dataset for action
unit and emotion-specified expression. In Computer
Vision and Pattern Recognition Workshops (CVPRW),
2010 IEEE Computer Society Conference on, pages
94–101. IEEE.
Magnenat-Thalmann, N., Primeau, E., and Thalmann, D.
(1988). Abstract muscle action procedures for human
face animation. The Visual Computer, 3(5):290–297.
McCloud, S. (1993). Understanding comics: The invisible
art. Northampton, Mass.
McCloud, S. (2006). Making Comics: Storytelling Secrets of Comics, Manga and Graphic Novels. William Morrow
Paperbacks.
OpenCV (2014).
Pandzic, I. S. and Forchheimer, R. (2003). MPEG-4 facial animation: the standard, implementation and
applications. Wiley.
Parikh, R., Mathai, A., Parikh, S., Sekhar, G. C., and
Thomas, R. (2008). Understanding and using sensi-
tivity, specificity and predictive values. Indian journal
of ophthalmology, 56(1):45.
Parke, F. I. and Waters, K. (1996). Computer facial anima-
tion, volume 289. AK Peters Wellesley.
Pighin, F. and Lewis, J. (2006). Performance-driven facial
animation. In ACM SIGGRAPH.
R Core Team (2013). R: A Language and Environment for
Statistical Computing. R Foundation for Statistical
Computing, Vienna, Austria. ISBN 3-900051-07-0.
Rodriguez, J., Perez, A., and Lozano, J. (2010). Sensitiv-
ity analysis of k-fold cross validation in prediction er-
ror estimation. Pattern Analysis and Machine Intelli-
gence, IEEE Transactions on, 32(3):569–575.
Saragih, J. M., Lucey, S., and Cohn, J. F. (2011). De-
formable model fitting by regularized landmark mean-
shift. International Journal of Computer Vision,
91(2):200–215.
Slater, M. (2014). Grand challenges in virtual environ-
ments. Frontiers in Robotics and AI, 1:3.
von der Pahlen, J., Jimenez, J., Danvoye, E., Debevec, P.,
Fyffe, G., and Alexander, O. (2014). Digital ira and
beyond: creating real-time photoreal digital actors. In
ACM SIGGRAPH 2014 Courses, page 1. ACM.
Weise, T., Bouaziz, S., Li, H., and Pauly, M. (2011).
Realtime performance-based facial animation. ACM
Transactions on Graphics (TOG), 30(4):77.