Expressive Talking Head for Interactive Conversational Systems
Paula Dornhofer Paro Costa and José Mario De Martino
School of Electrical and Computer Engineering, Dept. of Computer Engineering and Industrial Automation,
University of Campinas, Campinas, São Paulo, Brazil
1 STAGE OF THE RESEARCH
The present doctoral research project is being devel-
oped in the following sequential phases:
1. Creation of a motion capture data and audiovi-
sual expressive speech database for Brazilian Por-
tuguese;
2. Study of sample-based synthesis techniques suit-
able for the reproduction of expressive speech;
3. Development of an expressive talking head for
Brazilian Portuguese;
4. Proposal of a strategy to generalize the synthesis
methodology to a compact database of samples
captured under partially controlled conditions.
The project is entering its third phase, moving toward the proposal of an expressive speech synthesis methodology, with a pilot implementation being developed for Brazilian Portuguese.
2 OUTLINE OF OBJECTIVES
The objective of this project is to develop an expressive speech facial animation synthesis methodology capable of improving the videorealism of the current state-of-the-art expressive talking heads in terms of the range of expressed emotional states and the ability to reproduce non-verbal communication signaling, with the ultimate goal of creating animated faces capable of inspiring user trust and empathy.
We aim at applications where intelligent systems are capable of emulating our natural and intuitive face-to-face communication mechanisms and in which the talking head acts as an embodied conversational agent (ECA) (Figure 1).
Objectives that run in parallel are: to obtain a pilot implementation of a Brazilian Portuguese expressive talking head and to contribute a comprehensive audiovisual expressive speech corpus for this language.
Figure 1: Intelligent system face-to-face communication
metaphor.
3 RESEARCH PROBLEM
The automated synthesis of videorealistic talking heads capable of expressing emotions remains a challenging problem in computer graphics. Even the most sophisticated models, capable of rendering impressive high-quality face images, are not capable of inspiring user empathy and trust without the appropriate modeling of the non-verbal communication signals, a task that still requires human intervention. On the other hand, there is a great demand for expressive talking heads for applications such as virtual assistants, tutors, game characters, newscasters, social agents, and for controlled experiments in psychology and behavioral sciences.
For centuries, painters, cartoonists, and animators have been able to reproduce realistic images of expressive faces following rules learned from the exhaustive observation of human behavior. Similarly, the problem of developing computer-automated videorealistic facial animation can be stated as the problem of making the machine learn how humans behave while expressing different emotional and cognitive states, and deriving models to reproduce them: a typical supervised machine learning task.
This doctoral project concentrates on two main as-
pects of this problem: the computational model for
emotions and the visual modeling of the face.
Categorical emotion models, like the “big
six” (Ekman, 1972) universally recognizable facial
expressions (anger, disgust, fear, happiness, sadness,
surprise) have been widely applied to the develop-
ment of expressive avatars. However, we argue that
this modeling lacks completeness when the objective
is the implementation of interactive embodied conver-
sational agents. In order to capture user empathy, for example, a virtual airport assistant announcing a flight cancellation should express not one of the plain “big six” emotions but a more complex set of facial expressions that deals with the user's frustration, indicating that the system recognizes the cancellation as an undesirable event. For this purpose, “appraisal” models, which take into consideration the evaluation process that leads to an emotional response, seem to provide a more encompassing characterization of emotions. In particular, we adopt the model proposed by Ortony, Clore and Collins (OCC model), since it presents a concise but comprehensive vocabulary of 22 emotions that arise as reactions to events, agents or objects (Ortony et al., 1988).
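To make the event/agent/object distinction concrete, the following minimal sketch (in Python) encodes a subset of the OCC emotion types by the kind of stimulus they react to. The enumeration and the helper function are illustrative conventions of ours, not part of the OCC model or of our system's implementation.

```python
from enum import Enum


class Stimulus(Enum):
    EVENT = "consequence of an event"
    AGENT = "action of an agent"
    OBJECT = "aspect of an object"


# Minimal sketch: a few representative OCC emotion types grouped by the
# kind of stimulus that triggers them (Ortony et al., 1988). The full
# model distinguishes 22 types; only a subset is listed here.
OCC_EMOTIONS = {
    "joy": Stimulus.EVENT,          # desirable event for oneself
    "distress": Stimulus.EVENT,     # undesirable event for oneself
    "hope": Stimulus.EVENT,         # prospect of a desirable event
    "fear": Stimulus.EVENT,         # prospect of an undesirable event
    "pride": Stimulus.AGENT,        # praiseworthy action by oneself
    "admiration": Stimulus.AGENT,   # praiseworthy action by another
    "reproach": Stimulus.AGENT,     # blameworthy action by another
    "love": Stimulus.OBJECT,        # liking an appealing object
    "hate": Stimulus.OBJECT,        # disliking an unappealing object
}


def emotions_for(stimulus: Stimulus) -> list[str]:
    """Return the sketched emotion labels associated with a stimulus kind."""
    return [name for name, s in OCC_EMOTIONS.items() if s is stimulus]


if __name__ == "__main__":
    print(emotions_for(Stimulus.EVENT))
```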
Another important question is how to synthesize photorealistic appearances and to reproduce the dynamics of the speech combined with the expression of emotions. In other words, how to deceive human observers: specialists trained from birth to detect the smallest variations in the signals conveyed by the voice, the face and the body. In this work we explore the image-based, or 2D, synthesis technique as a means to obtain inherently photorealistic expressive faces, avoiding the typical synthetic look of model-based (3D) facial animation systems.
4 STATE OF THE ART
Since the pioneering work of Parke (Parke, 1972), many others have contributed different approaches to improve the level of videorealism of synthetic talking faces: (Bregler et al., 1997), (Ezzat and
Poggio, 1998), (Brand, 1999), (Cosatto and Graf,
2000), (Pasquariello and Pelachaud, 2002), (Ezzat
et al., 2002).
Over the last decade, there has been an emerging interest in adding to synthetic talking heads the capability of expressing emotions.
In (Chuang and Bregler, 2005), for example, the authors focus on the difficulties of editing motion capture data. In their proposal, they take an expressionless speech performance as input, analyze the content and
modify the facial expression according to a statistical
model. The expressive face is then retargeted onto a
3D character using blendshape animation. The paper
presents the results of the methodology using as train-
ing data three short video sequences including three
basic expressions: neutral, angry, and happy.
In (Beskow and Nordenberg, 2005), an expres-
sive 3D talking head is implemented using an MPEG-
4 compatible model. An amateur actor was recorded
portraying five different emotions (happy, sad, angry,
surprised, and neutral) and a Cohen-Massaro coartic-
ulation model was trained for each emotion.
A 3D expressive speech-driven facial animation
system is also presented in (Cao et al., 2005). In
this system the inputs are the speech to be animated
and a set of emotional tags. A high-fidelity motion
capture database was built with a professional actor
representing five emotions: frustrated, happy, neutral,
sad and angry. The motion capture data, together with
the timed phonetic transcript of the recorded utter-
ances, were used to construct what the authors call
an “anime” graph, where an “anime” corresponds to
a dynamic definition of a viseme. The synthesis consists of searching for the best path in this graph by minimizing a cost function that penalizes discontinuous and unnatural visual transitions.
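The core idea of such unit-selection synthesis can be illustrated with a simple dynamic-programming search. The sketch below (Python) is our own illustration of a minimum-cost path search over per-step candidate units, not the actual implementation of (Cao et al., 2005); the candidate structure and cost function are hypothetical.

```python
import numpy as np


def select_unit_sequence(candidates, concat_cost):
    """Illustrative dynamic-programming search: pick one candidate unit per
    time step so that the sum of concatenation costs between consecutive
    units is minimal.

    candidates  : list of lists, candidates[t] holds the units available
                  at step t (e.g. "anime" instances matching the phoneme).
    concat_cost : function (unit_a, unit_b) -> float penalizing visually
                  discontinuous or unnatural transitions.
    """
    T = len(candidates)
    # cost[t][j] = minimal accumulated cost of a path ending at candidates[t][j]
    cost = [np.zeros(len(c)) for c in candidates]
    back = [np.zeros(len(c), dtype=int) for c in candidates]

    for t in range(1, T):
        for j, unit in enumerate(candidates[t]):
            step = [cost[t - 1][i] + concat_cost(prev, unit)
                    for i, prev in enumerate(candidates[t - 1])]
            back[t][j] = int(np.argmin(step))
            cost[t][j] = min(step)

    # Backtrack the best path from the cheapest final unit.
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return [candidates[t][j] for t, j in enumerate(path)]
```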
Four emotions (neutral, happy, angry, sad) were
captured with a motion capture system in (Deng et al.,
2006). The resulting material was used to build a
coarticulation model and a visual appearance model
that are combined to generate a 3D facial animation
at synthesis time.
As a novel approach to modeling emotions in a facial animation system, in (Jia et al., 2011) the authors
parameterize eleven emotions (neutral, relax, sub-
missiveness, surprise, happiness, disgust, contempt,
fear, sorrow, anxiety and anger) according to the PAD
(Pleasure, Arousal, Dominance) dimensional emotion
model. In this system, the acoustic features of the
speech are used to drive an MPEG-4 model.
More recently, (Anderson et al., 2013) presented a 2D VTTS (visual text-to-speech) system capable of synthesizing a talking head given an input text and a set of continuous expression weights. The face is
modeled using an active appearance model (AAM),
from a corpus containing six emotions: neutral, ten-
der, angry, afraid, happy and sad.
These works illustrate, through a diverse range of
approaches, the challenges imposed by this research
problem.
5 METHODOLOGY
5.1 Corpus
In order to study different aspects of expressive speech, ten professional actors, all native speakers of Brazilian Portuguese, were divided into two types of experiments. The first experiment consisted in asking the
ExpressiveTalkingHeadforInteractiveConversationalSystems
21
Figure 2: Motion capture.
actors to represent the emotions of the OCC model.
For that purpose, 22 recording scripts were designed consistently with the cognitive state descriptions given by the model, ensuring that each utterance contains occurrences of all Brazilian Portuguese context-dependent visemes (visual phonemes), as proposed in (De Martino et al., 2006). Each speech segment was recorded with a neutral expression, followed by the actor's representation of the corresponding emotional state. In a second experiment, the objective was to obtain data to investigate how the expressive signals observed in the face are modulated by different personality traits. For that purpose, we adopted the simpler modeling of the “big six” emotions, and the actors were asked to represent them with three different extroversion characteristics: shy, neutral/balanced and extroverted. In
this experiment, the speech segments were the same
for all emotions and personalities.
Each experiment consisted of a motion capture
session and a video session. During the mocap ses-
sion, 63 markers distributed over the actor’s face/head
were tracked by 8 infrared Vicon cameras (Figure 2).
In the video session, the actor repeated the same per-
formance in front of a chroma-key background, with-
out markers, makeup, or facial and hair adornments.
Both experiments resulted in a set of motion capture and video samples of expressive faces from 4 actresses and 6 actors, with ages ranging from 20 to 60 years.
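For bookkeeping purposes, each recorded sample can be indexed by actor, experiment, emotion, extroversion trait and modality. The record below is a hypothetical sketch of such an index entry (field names and the example path are illustrative, not the actual organization of our database).

```python
from dataclasses import dataclass


@dataclass
class CorpusSample:
    """Hypothetical index record for one recorded utterance (field names
    are illustrative, not the actual database schema)."""
    actor_id: int          # 1..10 (4 actresses, 6 actors)
    experiment: int        # 1 = OCC emotions, 2 = "big six" x extroversion
    emotion: str           # e.g. "relief" (OCC) or "anger" (big six)
    extroversion: str      # "shy", "neutral" or "extroverted" (experiment 2 only)
    modality: str          # "mocap" (63 markers, 8 cameras) or "video"
    script_id: int         # which recording script was spoken
    path: str              # location of the mocap or video file (illustrative)


sample = CorpusSample(actor_id=3, experiment=2, emotion="anger",
                      extroversion="shy", modality="video",
                      script_id=1, path="corpus/a03/e2/anger_shy_01.mp4")
```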
5.2 Visual Face Model
Taking samples from the audiovisual corpus as train-
ing data, our methodology involved the exploratory
study of the Active Appearance Model (AAM) as a
facial synthesis model.
Active appearance modeling is capable of parameterizing an image database of facial expressions in terms of shape and texture parameters, making possible, given some constraints, the generation of facial expressions and head poses that were not present in the original database. Moreover, by applying an analysis-by-synthesis approach, the derived AAM can be used as a machine learning tool to analyze the expressive speech dynamics of the original corpus and to derive rules to reproduce it.
A well-known problem of AAM, however, is its smoothing effect on the final images and the blurred aspect typically observed inside the mouth, a region that experiences great appearance variation but has no fiducial anchor points. In this study we explored different setups for deriving the AAM in order to determine the limits of this model given the characteristics of our corpus.
In the typical implementation of AAM, the coordinates of the feature points $(x_1, y_1), \ldots, (x_k, y_k)$ are organized as a shape vector $s = [x_1, y_1, \ldots, x_k, y_k]$. Principal component analysis (PCA) is applied to the training base of shape vectors and the shape model is represented by the linear combination of the mean facial shape $s_0$ and $n$ facial shape eigenvectors:

$$s = s_0 + \sum_{i=1}^{n} p_i s_i \qquad (1)$$
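A minimal sketch of how the shape model of Equation (1) can be fitted and used, assuming the landmark coordinates are already aligned (e.g., by a Procrustes step not detailed here) and using scikit-learn's PCA purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA


def build_shape_model(shapes, n_components):
    """Fit the linear shape model of Eq. (1): s = s0 + sum_i p_i * s_i.

    shapes       : array (num_samples, 2k) of concatenated landmark
                   coordinates [x1, y1, ..., xk, yk], assumed aligned.
    n_components : number n of retained shape eigenvectors.
    Returns the mean shape s0 and the eigenvector matrix S of shape (n, 2k).
    """
    pca = PCA(n_components=n_components)
    pca.fit(shapes)
    return pca.mean_, pca.components_


def synthesize_shape(s0, S, p):
    """Reconstruct a shape vector from shape parameters p (length n)."""
    return s0 + p @ S
```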
After the definition of a shape model, each training facial image is warped to obtain the mean facial shape $s_0$. The shape-free facial images are transformed into texture vectors $A$. A second PCA is performed and the texture is modeled as a linear combination of the mean facial appearance $A_0$ and $m$ eigenfaces:

$$A = A_0 + \sum_{i=1}^{m} \lambda_i A_i \qquad (2)$$
When the parametric shape and appearance models are used independently, the model is called an independent-AAM. To remove the correlation between shape and texture model parameters and to make the model representation more compact, a third PCA can be performed on the concatenated shape and texture parameters. In this case, a combined-AAM is obtained.
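A minimal sketch of the third PCA that yields a combined-AAM, again using scikit-learn for illustration; the relative weighting between shape and texture parameters, often applied in practice, is omitted here:

```python
import numpy as np
from sklearn.decomposition import PCA


def build_combined_model(shape_params, texture_params, n_components):
    """Third PCA over concatenated shape and texture parameters, as used
    to obtain a combined-AAM. Sketch only.

    shape_params   : array (num_samples, n) of p_i coefficients.
    texture_params : array (num_samples, m) of lambda_i coefficients.
    Returns the fitted PCA and the combined parameters (one row per sample).
    """
    b = np.hstack([shape_params, texture_params])
    pca = PCA(n_components=n_components)
    c = pca.fit_transform(b)
    return pca, c
```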
The study was conducted using a training database of 873 images with 56 semi-automatically labeled feature points (Figure 3). The images in the
database cover all the Brazilian Portuguese visemes
for each OCC emotion, together with neutral expres-
sion samples.
We explored multiple model building scenarios in order to evaluate the videorealism attained by each approach:
VISIGRAPP2014-DoctoralConsortium
22
Figure 3: Feature points used to define the facial shape.
Figure 4: Comparing corresponding synthesized frames with two different AAM building scenarios. Note that in both images the region inside the lips is blurred, but the image on the left has greater definition of the teeth and the tongue than the image on the right. The left image was synthesized using a mixed approach: an independent model was used to synthesize the shape, while a combined model was used to synthesize the appearance.
- full face independent-AAM;
- parts-based independent-AAM;
- full face combined-AAM;
- parts-based combined-AAM.
Multiple mask configurations and a mixed approach (an independent model was used to synthesize the head trajectory and a combined model to generate the facial texture) were also tested (Figure 4).
5.3 Expressive Talking Head Synthesis
Once the facial model is defined, further studies are needed to define the methodology that drives the facial animation synthesis. While some approaches derive the parameters that guide the facial animation from the speech signal, we assume that the speech to be animated could be provided by a TTS (text-to-speech) system, a typical subsystem of an interactive and intelligent computational system.
Based on this assumption, we aim to develop a synthesis system that receives as inputs the timed phonetic transcription of the speech to be animated, a tag indicating the emotion to be animated and a tag indicating the personality or, more simply, an intensity modulator of the emotion to be expressed.
A challenging problem with this approach is that, while the timed phonetic transcription provides information that enables the determination of the dynamics of the speech articulatory movements and, consequently, of the shape of the face, it does not provide information
about the appearance to be assumed. We plan to solve
this problem through regression models that are able
to correlate the animation articulatory targets with the
appearance of samples from the image database.
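As an illustration of this planned step, the sketch below fits a regressor from shape (articulatory) parameters to texture (appearance) parameters. Ridge regression is a hypothetical choice used only for the example; the actual regression model is still under investigation.

```python
from sklearn.linear_model import Ridge


def fit_shape_to_texture_regressor(shape_params, texture_params, alpha=1.0):
    """Sketch of the planned regression step: learn a mapping from shape
    (articulatory) parameters to appearance (texture) parameters.

    shape_params   : array (num_samples, n) of shape coefficients.
    texture_params : array (num_samples, m) of texture coefficients.
    """
    reg = Ridge(alpha=alpha)
    reg.fit(shape_params, texture_params)
    return reg


# Usage sketch: given the shape parameters predicted from the timed
# phonetic transcription, estimate the corresponding texture parameters:
#   texture_hat = reg.predict(shape_hat)
```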
5.4 Generalization and Evaluation
As a part of our research, we aim to derive a synthesis strategy that can be generalized to smaller image databases with fewer samples of facial expressions, possibly captured under uncontrolled conditions. This poses the problem of learning the trajectories of facial expression transitions in the original database and trying to reproduce such trajectories in an image space with missing samples, or “holes”, in the path. Once again, we plan to use regression models to fill in the blanks.
Finally, an essential and challenging step of our
methodology is the subjective evaluation process,
since there are no widely accepted evaluation protocols for expressive talking heads. In this case, we plan to assess two main aspects of the synthesized animations: their level of videorealism and their capability of expressing a range of predefined emotions with fidelity.
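A possible way to organize the collected judgments is a simple mean-opinion-score aggregation per evaluated aspect and emotion, as sketched below; this is only an illustrative bookkeeping scheme, not a validated evaluation protocol.

```python
from collections import defaultdict
from statistics import mean


def mean_opinion_scores(ratings):
    """Aggregate subjective ratings into a mean score per (aspect, emotion).

    ratings : iterable of (aspect, emotion, score) tuples, where aspect is
              e.g. "videorealism" or "emotion_fidelity" and score is the
              value given by one subject on a Likert-type scale.
    """
    buckets = defaultdict(list)
    for aspect, emotion, score in ratings:
        buckets[(aspect, emotion)].append(score)
    return {key: mean(vals) for key, vals in buckets.items()}
```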
6 EXPECTED OUTCOME
As described in the preceding sections, this project expects to propose an innovative synthesis methodology for expressive talking heads. We also aim to develop a Brazilian Portuguese pilot implementation and use it to drive audiovisual expressive speech evaluations.
The implementation to be pursued is illustrated in Figure 5, where the facial animation system receives as inputs: parameters derived from the speech (such as articulatory targets or acoustic features that can drive the animation), a tag of the emotion to be synthesized (which can be provided, for instance, by an intelligent system) and a personality trait parameter (or, alternatively, a mood parameter) that modulates the visual synthesis.
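The interface of such a system could be summarized as follows. The sketch is purely illustrative: the type names, fields and value ranges are hypothetical and do not correspond to a finished implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SynthesisRequest:
    """Hypothetical input bundle for the planned synthesizer (names and
    ranges are illustrative, not a published API)."""
    phonetic_transcription: List[Tuple[str, float, float]]  # (phone, start, end) in seconds
    emotion: str        # e.g. an OCC emotion tag provided by the dialogue system
    personality: float  # personality/mood modulator, e.g. 0.0 (shy) .. 1.0 (extroverted)


def synthesize(request: SynthesisRequest):
    """Placeholder for the planned pipeline: map the timed phones to shape
    (articulatory) trajectories, modulate them by emotion and personality,
    regress the corresponding texture parameters, and render the frames."""
    raise NotImplementedError("pilot implementation under development")
```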
As a secondary outcome, we expect to make
the Brazilian Portuguese expressive corpus publicly available as an organized database of emotions, enabling the development of other prospective works.

Figure 5: Expressive Talking Head System.
Finally, we also expect to propose a generalization of the synthesis methodology that makes possible the creation of simple expressive talking heads from a few images of a real face.
ACKNOWLEDGEMENTS
This work is supported by the Brazilian National
Council for Scientific and Technological Develop-
ment (CNPq) under Grant No. 141366/2010-9.
REFERENCES
Anderson, R., Stenger, B., Wan, V., and Cipolla, R. (2013).
Expressive Visual Text-to-Speech Using Active Ap-
pearance Models. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 3382–
3389.
Beskow, J. and Nordenberg, M. (2005). Data-driven synthe-
sis of expressive visual speech using an MPEG-4 talk-
ing head. In INTERSPEECH, pages 793–796. ISCA.
Brand, M. (1999). Voice puppetry. In SIGGRAPH ’99: Pro-
ceedings of the 26th annual conference on Computer
graphics and interactive techniques, pages 21–28,
New York, NY, USA. ACM Press/Addison-Wesley
Publishing Co.
Bregler, C., Covell, M., and Slaney, M. (1997). Video
Rewrite: driving visual speech with audio. In SIG-
GRAPH, pages 353–360.
Cao, Y., Tien, W. C., Faloutsos, P., and Pighin, F. H.
(2005). Expressive speech-driven facial animation.
ACM Transactions on Graphics, 24(4):1283–1302.
Chuang, E. and Bregler, C. (2005). Mood swings: expres-
sive speech animation. ACM Transactions on Graph-
ics, 24(2):331–347.
Cosatto, E. and Graf, H. P. (2000). Photo-Realistic Talking-
Heads from Image Samples. IEEE Transactions on
Multimedia, 2(3):152–163.
De Martino, J. M., Magalhães, L. P., and Violaro, F. (2006). Facial animation based on context-dependent visemes. Computers & Graphics, 30:971–980.
Deng, Z., Neumann, U., Lewis, J., Kim, T.-Y., Bulut, M.,
and Narayanan, S. (2006). Expressive Facial Anima-
tion Synthesis by Learning Speech Coarticulation and
Expression Spaces. IEEE Transactions on Visualiza-
tion and Computer Graphics, 12(6):1523–1534.
Ekman, P. (1972). Universals and Cultural Differences in
Facial Expressions of Emotion. In Proceedings of
the Nebraska Symposium on Motivation, number 19,
pages 207–282. Lincoln: University of Nebraska Press.
Ezzat, T., Geiger, G., and Poggio, T. (2002). Trainable vide-
orealistic speech animation. In SIGGRAPH, pages
388–398.
Ezzat, T. and Poggio, T. (1998). MikeTalk: A Talking Facial
Display Based on Morphing Visemes. In CA, pages
96–102.
Jia, J., Zhang, S., Meng, F., Wang, Y., and Cai, L. (2011).
Emotional Audio-Visual Speech Synthesis Based on
PAD. IEEE Transactions on Audio, Speech & Lan-
guage Processing, 19(3):570–582.
Ortony, A., Clore, G., and Collins, A. (1988). Cognitive
Structure of Emotions. Cambridge University Press.
Parke, F. I. (1972). Computer generated animation of faces.
In ACM’72: Proceedings of the ACM annual con-
ference, pages 451–457, New York, NY, USA. ACM
Press.
Pasquariello, S. and Pelachaud, C. (2002). Greta: A simple
facial animation engine. Soft Computing and Industry,
pages 511–525.
VISIGRAPP2014-DoctoralConsortium
24