An Anthropomorphic Perspective for Audiovisual Speech Synthesis
Samuel Silva and António Teixeira
DETI Dep. of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal
IEETA Institute of Electronics Engineering and Informatics of Aveiro, University of Aveiro, Aveiro, Portugal
Keywords:
Audiovisual Speech, Articulatory Synthesis, European Portuguese.
Abstract:
In speech communication, both the auditory and visual streams play an important role, ensuring both a certain
level of redundancy (e.g., lip movement) and transmission of complementary information (e.g., to empha-
size a word). The common current approach to audiovisual speech synthesis, generally based on data-driven
methods, yields good results, but relies on models controlled by parameters that do not relate to how humans actually produce speech, which makes them hard to interpret and adds little to our understanding of the human speech production apparatus. Modelling the actual system, adopting an anthropomorphic perspective, would open a myriad of
novel research paths. This article proposes a conceptual framework to support research and development of
an articulatory-based audiovisual speech synthesis system. The core idea is that the speech production sys-
tem is modelled to produce articulatory parameters with anthropomorphic meaning (e.g., lip opening) driving
the synthesis of both the auditory and visual streams. A first instantiation of the framework for European
Portuguese illustrates its viability and constitutes an important tool for research in speech production and the
deployment of audiovisual speech synthesis in multimodal interaction scenarios, of the utmost relevance for
the current and future complex services and applications.
1 INTRODUCTION
Among humans, the most natural form of communi-
cation is speech, which makes it a promising modal-
ity to foster a natural and transparent interaction with
a wide range of devices, from smartphones to intelli-
gent homes. As a result, speech is becoming a com-
mon option as an input and output modality. Speech,
as a form of communication among humans, is not
limited to the auditory signal. In fact, the visual fea-
tures of speech production, such as the movement
of the lips, play an important role in adding to the
auditory signal, whether by providing redundant in-
formation or by complementing it (e.g., conveying a
particular emphasis to particular words). Therefore,
output that jointly explores both the auditory and vi-
sual streams of speech, known as audiovisual speech
synthesis, can bring a wide range of advantages over
auditory-only speech in providing a richer and more
resilient method of communication and adding a more
natural feel to human-machine interaction.
Current technologies allow collecting large
amounts of data regarding particular biomechanical
phenomena (e.g., cerebral, muscular, and facial
activity) and a common approach is to use that data
to train models that enable its use for particular inter-
action tasks such as converting a certain myoelectric
signal pattern into a particular action. Although this
is, indisputably, a very important advance and, in
many aspects, the only possible approach considering
our knowledge of the involved physiology and
the current stage of technology, training a model
commonly results in a ’black box’ converting input
data into specific outputs. This often means that the
parameters controlling the model are not actually related to anatomical or physiological elements and are hard to conceptualize. Therefore, these models
serve their purpose, but provide a limited margin for
customization and a minor contribution to our under-
standing of the phenomena, which continues to be
very partial and limited. However, when technology allows a deeper insight into the biomechanics of a phenomenon, we can move from ’blindly’ modelling the transformation of inputs into outputs to modelling the actual system and its components, adopting a set of more anthropomorphic parameters and, in this manner, learning about the phenomenon and how to replicate it.
While the literature is prolific in providing sev-
eral methods for audiovisual synthesis (Schabus et al.,
2014), yielding good overall results, these often rely
on data-driven/statistical approaches, i.e., the collec-
tion of big data sets to train a model or provide the
grounds for concatenation-based systems. The result
is that we end up with systems that: (1) require the
collection of large amounts of data; (2) are strongly
related to the data they are built from; and (3) do
not relate directly with the production system, i.e., the
vocal tract and facial movements.
Regarding speech production, technological ad-
vances enable studying the static and dynamic as-
pects of the vocal tract providing enough information
to support building first models that can be used to
study speech production and enable advancing artic-
ulatory speech synthesis (e.g., Birkholz, 2013; Rubin
et al., 1981; Teixeira et al., 2008), i.e., speech syn-
thesized by replicating the articulatory patterns per-
formed by humans. The advantage of such a bioin-
spired system is that it depends on anthropomorphic
parameters we can understand, and can be used to as-
sess theoretical aspects, including the importance of
particular vocal tract configurations or timings in the
production of specific sounds, while also providing
a speech synthesis system. Likewise, we consider
that an articulatory-based approach to audiovisual
speech synthesis, where we build a model replicat-
ing the audiovisual speech production system, could
provide several advantages, in two domains: (a) it
can be used as a highly customizable tool support-
ing research (Files et al., 2015), enabling experiment-
ing with the different articulators and timings to un-
derstand how speech is produced, how the auditory
and visual streams relate, and serving educational and
clinical applications (Massaro, 2005); (b) it can pro-
vide a more versatile system for deploying richer out-
puts in interaction scenarios. For example, since it
does not depend on the collection of novel data, in
an articulatory approach the voice can be more eas-
ily customized, the visual part completely changed (ranging from a cartoon-like character to a realistic-looking avatar), and articulatory limitations introduced (for instance, limiting the amplitude of the jaw movement to simulate someone speaking with their jaw movement hindered).
Following this line of thought, we present the con-
ceptualization and first instantiation of a framework
to support articulatory-based audiovisual speech syn-
thesis oriented to support both the research on the au-
diovisual aspects of speech production and its deploy-
ment as an interaction modality.
The remainder of this article is organized as fol-
lows: section 2 briefly reviews current trends in au-
diovisual speech synthesis; section 3 discusses the
main motivations for pursuing an articulatory ap-
proach to audiovisual speech, in the context provided
by the state-of-the-art, and presents the basic concepts
behind articulatory synthesis; our conceptual view on
how to pursue articulatory audiovisual speech synthe-
sis is described in section 4 and section 5 describes a
first implementation of the described framework; fi-
nally, section 6 presents conclusions and major routes
for future work.
2 CURRENT TRENDS IN
AUDIOVISUAL SPEECH
SYNTHESIS
A wide range of approaches to audiovisual speech
synthesis have been presented in the literature. In a
recent review, Mattheyses and Verhelst (2015) highlight four aspects defining an audiovisual speech syn-
thesizer: 1) the properties of the input information
(e.g., text or audio); 2) the properties of the out-
put (e.g., in 2D or 3D); 3) how the visual artic-
ulators are defined (e.g., 3D modelling) and ani-
mated (e.g., anatomy-based or performance-driven
systems (Železný et al., 2015)); and 4) how the dif-
ferent visual configurations that need to be attained
are predicted (e.g., rule-based, concatenative).
When performing visual-only speech synthesis, a
common approach is to use visemes, a visual rep-
resentation of the relevant articulators, to represent
phonemes. Whenever a certain phoneme occurs, the
corresponding viseme is used. Since the visible parts
of the vocal tract (e.g., lips) adopt similar configu-
rations for different sounds, and modelling visemes may require considerable time and the work of a digital artist (Serra et al., 2012), it is common to use the same viseme
for multiple phonemes. Then, these visemes work
as keyframes and are animated by interpolating in-
between. In many cases, even though the proposed systems provide visual speech synthesis that is smooth, they have no underlying mechanism to actually implement visual coarticulation. Many-phonemes-to-one-viseme mappings, for example, do not account for visual coarticulation (Mattheyses and Verhelst, 2015), and alternatives to minimize this issue have been proposed by adding visemes for diphones and triphones.
Additionally, several authors have addressed coar-
ticulation by proposing models that define how the
interpolation between visemes is performed consid-
ering, for example, models of facial biomechanics or
through a model for visual coarticulation in which in-
terpolation between visemes depends on weights as-
sociated with the corresponding phonemes, according
to their dominance (e.g., Cohen and Massaro, 1993).
In recent years, data-driven/statistical approaches
have taken the lead (e.g., Schabus et al., 2014). In
concatenative visual synthesis, a data collection stage
is performed considering a corpus that should be
phonetically rich enough to include all the required
phones/segments and relevant contexts. So, when
synthesizing, the system retrieves the longest possi-
ble segments from the database, to minimize the num-
ber of concatenations, and synthesizes the desired au-
diovisual speech. In the case of visual speech based on 3D models, performance-driven animation is required, entailing data collections that involve complex settings.
For the Portuguese language, just a few audiovi-
sual synthesis systems have been proposed. In a notable example, Serra et al. (2012) propose
an audiovisual speech synthesis system for European
Portuguese (EP) enabling automatic visual speech an-
imation from text or speech audio input. The authors
consider a phoneme-to-viseme strategy for animat-
ing a 3D model and perform a preliminary evaluation
of two different sets of visemes, reaching the conclusion that using a small number of visemes (i.e., the same viseme is attributed to more phonemes) has a negative
impact on the perceived quality of the output.
3 ARTICULATORY SYNTHESIS
In this section, we present the basic concepts provid-
ing the grounds for articulatory synthesis and, based
on the current state-of-the-art, discuss the main moti-
vations to pursue an articulatory approach to audio-
visual speech synthesis.
3.1 Articulatory Phonology
In articulatory phonology (Browman and Goldstein,
1990; Hall, 2010), the basic unit of speech is not the
segment or the feature, but the articulatory gesture.
Gestures are a set of instructions that define how a
constriction is formed and released in the vocal tract,
such as lip closure.
In articulatory phonology, the vocal tract con-
figuration is generally defined by considering five
tract variables: lips (LIP), tongue tip (TT), tongue
body (TB), velum (VEL) and glottis (GLO). Ges-
tures are specified based on these variables, the constriction location (CL), which can be labial, dental, alveolar, postalveolar, palatal, velar, uvular or pharyngeal, and the constriction degree (CD), which can be closed (for stops), critical (for fricatives), or narrow, mid, or wide (for approximants and vowels). In this context, the gestural specification for the alveolar stop [t], for example, would be: tract variable: tongue tip (TT); CD: closed; CL: alveolar.
This defines the goal of the gesture: the target.
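For illustration purposes only, such a gestural specification can be encoded as a small data structure; the class and field names below are illustrative and do not correspond to any particular implementation.

```python
from dataclasses import dataclass

# Tract variables and descriptors follow the labels used in the text;
# the class and field names are illustrative, not the TADA encoding.
@dataclass(frozen=True)
class Gesture:
    tract_variable: str         # LIP, TT, TB, VEL or GLO
    constriction_degree: str    # closed, critical, narrow, mid or wide
    constriction_location: str  # labial, dental, alveolar, ...

# Gestural specification for the alveolar stop [t]
t_gesture = Gesture(tract_variable="TT",
                    constriction_degree="closed",
                    constriction_location="alveolar")
```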
A gesture is a dynamic action characterized by
several phases: the onset of the movement; its
progress until reaching the target; release, when it
starts moving away from the constriction; and offset,
after which the articulator is no longer under the con-
trol of the gesture.
Gestures are combined to form larger elements
(e.g., syllables and words). This combination is not
a simple matter of their sequence in time. Gestures
blend with other gestures according to phasing prin-
ciples: a certain point in the trajectory of one gesture
is phased with respect to the trajectory of other ges-
tures. The specification of the different gestures in-
volved in articulating a particular word, along with
the time intervals defining the regions of active con-
trol for each gesture, is called a gestural score. Given
that gestures need to be combined, a gestural score is just the first step for computing articulator trajectories, which result from its interpretation and modification.
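Continuing the illustration, a gestural score can be sketched as a set of gestures paired with activation intervals; the entries and values below are made up and do not come from an actual gestural model.

```python
# A toy gestural score for a [ta]-like syllable: each entry pairs a gesture
# with its activation interval (normalized time). Values are made up for
# illustration and do not come from a real gestural model.
score = [
    {"tract_variable": "TT", "degree": "closed", "location": "alveolar",
     "onset": 0.00, "offset": 0.20},
    {"tract_variable": "TB", "degree": "wide", "location": "pharyngeal",
     "onset": 0.10, "offset": 0.60},
]

def active_gestures(t: float) -> list:
    """Gestures exerting active control at (normalized) time t."""
    return [g for g in score if g["onset"] <= t <= g["offset"]]
```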
One Matlab implementation of the articulatory
phonology framework described above, named TAsk
Dynamics Application (TADA), has been proposed
by the Haskins Laboratories (Saltzman and Munhall,
1989; Nam et al., 2006). A gestural model considers
the syllables of the input text to generate the gestural
score. This includes the specification of the various
gestures required and their activation intervals based
on models for inter-gestural planning and gestural-
coupling. Gestural scores are considered by the task
dynamic model to generate the final time functions
for articulator trajectories considering the articulators
of the CASY vocal tract model (Rubin et al., 1996).
These trajectories are then used to configure the vocal tract model and provide the basis for computing the acoustic output (synthetic speech).
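To make the task-dynamic step more concrete: in the cited model (Saltzman and Munhall, 1989), each tract variable is driven toward the active gesture's target by a critically damped second-order (point-attractor) system. The minimal numerical sketch below illustrates that idea for a single tract variable; constants, units, and values are arbitrary and do not reproduce TADA.

```python
import numpy as np

# Minimal sketch of the point-attractor idea behind task dynamics: while a
# gesture is active, its tract variable is driven toward the gesture target
# by a critically damped second-order system. The neutral value, target and
# constants below are arbitrary and purely illustrative.
def tract_variable_trajectory(target, active, neutral=10.0, omega=40.0,
                              dt=0.001, steps=600):
    """Simulate one tract variable (e.g., tongue-tip aperture, arbitrary units)."""
    x, v = neutral, 0.0
    trajectory = []
    for k in range(steps):
        goal = target if active(k * dt) else neutral    # goal only while active
        a = -2.0 * omega * v - omega ** 2 * (x - goal)  # critical damping
        v += a * dt
        x += v * dt
        trajectory.append(x)
    return np.array(trajectory)

# Tongue-tip closure ([t]-like) gesture active between 0.1 s and 0.3 s.
tt_aperture = tract_variable_trajectory(target=0.0,
                                        active=lambda t: 0.1 <= t <= 0.3)
```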
3.2 Motivation
Overall, at the current stage of technology, the quality
provided by data-driven/statistical approaches to au-
diovisual speech synthesis surpasses the quality pos-
sible with articulatory approaches. However, an artic-
ulatory approach serves a wider set of goals beyond
the production of the final audiovisual speech output.
An articulatory-based approach to speech synthesis starts with the advantage of not requiring the collection of large datasets for training. Furthermore, in data-
driven/statistical systems the synthesized speech de-
pends on the nature of the collected data, e.g., age
and gender of the speaker, and creating a new voice
involves almost the same amount of effort and re-
sources as the first. An articulatory synthesizer pro-
vides enough versatility to be customized to synthe-
size voices with a wide range of characteristics such
as a specific gender or speech disorder. Therefore,
while a significant effort might be required to attain
a first good quality audiovisual speech synthesis sys-
tem, new voices will not involve as much effort.
The parameters involved in an articulatory-based
speech synthesizer are more intuitive than those con-
trolling a data-driven model since they relate to ac-
tual elements of the real system. This means that
when a particular synthesis goes wrong one can in-
tuitively make sense of what might have happened or
where to perform changes. This ability to experiment with the different parameters, in search of perceptually better outputs, also contributes to improving our knowledge about speech production.
An approach that favours a biomechanical per-
spective of the system can also be valuable for educa-
tional and clinical purposes. The audiovisual speech
synthesis system can be used as an assistive tool for teaching phonetics and for therapy, providing an illustration of the different aspects of speech production.
The consideration of visemes for audiovisual
speech synthesis, although quite common and provid-
ing reasonable results, raises several issues. A partic-
ular viseme assumes that all articulators are under the
influence of a phone, which is not true, since a phone
might be defined by a single gesture (for example,
[b]) without influencing all articulators. Even a weighted approach, explicitly defining the importance of each viseme, does not address this aspect. Additionally, the use of audiovisual concatena-
tive synthesis for clinical purposes (e.g., speech ther-
apy), to provide someone with an illustration of how
to properly articulate a particular token, raises con-
cerns as to whether we are actually providing a proper
reference to the patient. An articulatory-based audio-
visual synthesis approach would provide more solid
ground over which these different aspects can be tack-
led.
Additionally, since the articulatory-based system
models the different articulators and their dynamic
behaviour, one can envisage that, if associated with a technology such as ultrasound imaging, it might provide instant assessment of speaker performance, e.g., during speech therapy exercises, by comparing that performance with the computed trajectories, providing specific clues on what to improve or supporting the choice of particular exercises to overcome the observed difficulties.
Overall, we consider that these aspects support investing in an audiovisual speech synthesis approach from an articulatory perspective, favouring further study and understanding of human anatomy, physiology, and behaviour in this context.
4 CONCEPTUAL FRAMEWORK
FOR ARTICULATORY
AUDIOVISUAL SPEECH
SYNTHESIS
In line with the current state-of-the-art and the differ-
ent motivations presented, establishing the relevance
of an articulatory approach to audiovisual speech syn-
thesis, we start by characterizing the envisaged scenarios and establishing a set of requirements leading
to the conceptualization of a framework supporting
R&D in articulatory audiovisual speech synthesis.
4.1 Scenarios and Requirements
We envisage two distinct scenarios for exploring articulatory audiovisual speech synthesis that profit from each other: (1) improving our knowledge of speech production and studying the synergies between the audio and visual aspects of speech; and (2) using audiovisual speech synthesis as a rich output modality in multimodal interaction.
4.1.1 Scenario 1: Research in Speech Production
A set of technologies (e.g., Scott et al., 2014) and
methods (e.g., Silva et al., 2015) provide an increas-
ing amount of data to characterize different aspects
of speech production, regarding, e.g., the configura-
tion of the vocal tract for producing particular sounds,
coarticulation and synchronization. The collected
data allows a better understanding of the static and
dynamic aspects of speech production, but its value
can be further established if, based on this data, we
are able to postulate rules or even theories that can
be tested (simulated) in an experimental setting. Only then will we have confirmation that we have fully characterized and understood the observed phenomena, and this might trigger our awareness of further aspects that need clarification. Naturally, this sort
of research, as previously emphasized, requires mod-
elling the articulatory system and not just its outputs.
This rationale also applies to the joint consideration of
auditory and visual speech for which an articulatory-
based audiovisual speech synthesis system would be
an important tool.
Overall, this scenario requires: (1) a system that
is able to move from a text input to an audiovi-
sual speech output, modelling all the different stages;
(2) the definition of each language specificity, re-
garding particular sounds and their articulation; (3)
a face model controlled by anthropomorphic parameters, which are easily related to our understanding
of a real speaker; and (4) constant evaluation of re-
sults to assess output (perceptual) quality and guide
future developments.
4.1.2 Scenario 2: Multimodal Interaction
The research in audiovisual speech production should
also enable the deployment of a system that can be
used as an interaction modality. Research in multi-
modal interaction envisages complex contexts such
as ambient assisted living (AAL), where a heterogeneous set of users interacts with a multitude of interactive devices (Almeida et al., 2016; Teixeira
et al., 2016). One concrete example is provided by
the ongoing Marie Curie IAPP project IRIS (Freitas
et al., 2014) focusing on enhancing communication
in a household for a family spanning a wide range of
skills, limitations and motivations. As with text-to-
speech (Almeida et al., 2014), enabling a customiza-
tion of the voice and visual appearance of the avatar,
eventually personalized to a particular context or user,
might also provide improved acceptability. On the
subject of avatar versatility, having a 3D model cov-
ers a broader range of scenarios, from a simple 2D front view up to its insertion in a virtual environment.
Additionally, although the research focus in
speech production might favour a particular language,
for interaction we are also interested in enabling artic-
ulatory audiovisual speech synthesis in multiple lan-
guages. This is in line with recent efforts in the lit-
erature providing a generic speech modality encapsu-
lating support for multiple languages (Almeida et al.,
2014). We also argue that one additional motivation
for evolving an articulatory audiovisual speech syn-
thesizer might come from specific requirements de-
riving from its use in real interaction scenarios. This
is only possible if developers are able to integrate it
in their applications, as made possible, e.g., by a mul-
timodal interaction architecture and framework (e.g.,
Teixeira et al., 2016) in which the audiovisual speech
synthesis modality should be integrated.
This scenario requires, overall: (1) ability to eas-
ily move from our research in audiovisual speech pro-
duction into its deployment as an interaction modal-
ity; (2) methods that exhibit moderate computa-
tional costs and use technologies that are compati-
ble with the large variety of devices envisaged; (3)
language specific modules sufficiently decoupled and
interchangeable to enable support for additional lan-
guages; and (4) customizable visual stream, to sup-
port personalized 3D avatars.
4.2 Conceptual Framework
Considering the authors’ experience in articulatory-
based speech synthesis and the motivations previ-
ously presented, the cornerstone of our proposal is
that articulatory-based audiovisual speech synthesis
can inherit from the articulatory speech synthesis ap-
proach and be driven by the same anthropomorphic
parameters.
In what follows, we present the main conceptual
aspects of a framework to support R&D in articula-
tory audiovisual speech synthesis serving the envis-
aged scenarios and requirements. Our view consid-
ers three main blocks: an audiovisual speech synthe-
sis core supported by an evaluation module and inte-
grated as an interaction modality.
Figure 1 depicts these three blocks and highlights the synergies that should exist among them, contributing to advancing the usefulness and quality of articulatory audiovisual speech synthesis, as de-
tailed in what follows.
Figure 1: Overview of the conceptual framework for per-
forming research and development in articulatory-based au-
diovisual speech synthesis.
4.2.1 Audiovisual Synthesis Core
The audiovisual synthesis core should constitute the
central element of the framework, dealing with the
actual production of the audiovisual output. Figure 2
depicts the main features of the envisaged text-driven
synthesis approach.
The language package deals with all aspects that are specific to each language, such as the gestural model. The vocal tract static and dynamic modelling deals with moving from the text and language-specific data to the definition of the different gestures and their interactions and, finally, to the physiological (anthropomorphic) parameters. These parameters drive a vocal tract model providing the geometry for the speech synthesizer.
Figure 2: Main aspects of the articulatory based audiovisual speech synthesis core. The highlight is a set of anthropomorphic
parameters driving both the auditory and visual synthesis.
Figure 3: Diagram depicting the main aspects of the conceptual integration between the audiovisual speech synthesis core and
an interaction modality made available in a multimodal interaction context.
On the visual synthesis side, context-specific requirements, such as a specific avatar look (e.g., personalized to a particular user or a character in a film), are provided and a custom avatar model is created, supporting a set of controls for animation. One of the notable aspects of this conceptual framework, worth emphasizing, is that we argue that the anthropomorphic parameters (e.g., lip aperture and protrusion), which drive the auditory synthesis, should also serve to configure the model used for visual synthesis. Fi-
nally, both streams are rendered to provide audiovi-
sual speech synthesis.
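The following minimal sketch illustrates this central idea of a single anthropomorphic parameter stream feeding both synthesis paths; the parameter names and the two functions are illustrative placeholders, not the framework's actual interface.

```python
from dataclasses import dataclass

# Hypothetical anthropomorphic parameter frame; the actual parameter set
# depends on the vocal tract and face models adopted (names are illustrative).
@dataclass
class AnthropomorphicFrame:
    lip_aperture: float
    lip_protrusion: float
    jaw_opening: float
    tongue_tip_constriction: float
    velum_aperture: float

def synthesize_audio_frame(p: AnthropomorphicFrame) -> None:
    """Placeholder: configure the vocal tract model with p and compute acoustics."""

def animate_avatar_frame(p: AnthropomorphicFrame) -> None:
    """Placeholder: drive the avatar animation controls with the same parameters."""

def render_audiovisual(frames: list) -> None:
    # The same parameter stream feeds both outputs, which is what yields the
    # inherent synchronization between the auditory and visual streams.
    for p in frames:
        synthesize_audio_frame(p)
        animate_avatar_frame(p)
```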
The dashed lines in figure 2, coming from the evaluation module, emphasize that this module, given certain assessment data, is envisaged as being able to exert its influence at different levels of the audiovisual synthesizer, as further explained ahead.
4.2.2 Interaction Modality
The second aspect we consider relevant is the integra-
tion of the audiovisual speech synthesis system into
an interaction modality, deployed as part of a multi-
modal interaction framework. We argue that, without
this context of use, research in articulatory audiovi-
sual speech synthesis will fall short of its true poten-
tial due to a lack of real application scenarios.
The main aspects regarding this element of the
conceptual framework are depicted in figure 3. The
main components presented for the interaction modal-
ity (i.e., generation, styling and render) derive from
the conceptual view of an output component proposed
by the W3C for a multimodal interaction architec-
ture (W3C Consortium, 2003).
Articulatory audiovisual synthesis entails a con-
siderable level of complexity, particularly regarding
the computation of the articulatory parameters driv-
ing the models. Therefore, since the resulting inter-
action modality can be added to applications running
on any kind of device, the core computations should occur outside of the device, if needed. This is also
aligned with another important requirement. If we
want to rapidly move from the research outcomes into
the interaction scenario, placing most of the core on-
line, e.g., as a webservice, enables fast deployment
of new versions without the need to update the in-
stalled applications. The generation module, depicted
in figure 3, would be responsible for managing the
interface with the online service, providing the input
text and requirements. However, it is important to note that preserving some versatility on the modality side might entail leaving part of the conceptualized audiovisual speech synthesis core local, particularly the decisions about how to style and render the output.
These aspects explain why the interaction modality,
although providing an encapsulation of the features
provided by the audiovisual synthesis core, is not con-
ceptualized as containing the whole system.
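As an illustration of this split, the sketch below outlines the three-stage output component, with the generation stage delegating the heavy computation to the online core; all class and method names are placeholders and do not correspond to the W3C specification or to a concrete implementation of the modality.

```python
# Illustrative sketch of the generation / styling / render split described
# above; everything here is a placeholder, not an actual API.

class Generation:
    """Obtains the articulatory parameters, e.g., by querying the online core."""
    def generate(self, text: str) -> dict:
        # Placeholder: would request anthropomorphic parameter trajectories.
        return {"text": text, "trajectories": {}}

class Styling:
    """Applies context-specific options (e.g., avatar look, voice settings)."""
    def style(self, parameters: dict, options: dict) -> dict:
        return {**parameters, "options": options}

class Render:
    """Produces the final audiovisual output on the client device."""
    def render(self, styled: dict) -> None:
        pass  # placeholder: drive the local renderer

def present(text: str, options: dict) -> None:
    parameters = Generation().generate(text)
    Render().render(Styling().style(parameters, options))
```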
4.2.3 Evaluation
To validate the outcomes of the produced audiovisual
speech synthesis and to provide feedback that guides
research, evaluation is an important part of the envisaged framework; its overall aspects are depicted in figure 4.
Figure 4: Main features of the conceptualized evaluation module, part of the proposed audiovisual speech synthesis conceptual framework.
In this re-
gard, perceptual tests are one of the most used op-
tions in this domain, providing researchers with an
assessment of the adequacy of the outputs regarding,
for example, the synthesis of specific sounds or the
synchronization between audio and visual synthesis.
Nevertheless, even though perceptual tests remain the
most important source of ’gold standards’, the sys-
tematic development of an audiovisual speech syn-
thesis system requires additional evaluation methods
enabling a faster and more frequent assessment of dif-
ferent alternatives, additions, and improvements. For
example, an unsupervised method that could provide
a first evaluation, to guide system evolution before the
much more costly perceptual tests, would be a valu-
able asset. The stream to evaluate at each moment (i.e., auditory, visual, or both), the parameters in focus, and the text to synthesize should be defined by a set of criteria that can include, e.g., a specific corpus. The
unsupervised evaluation might be based on features
computed from the considered streams or might re-
quire support data to serve as reference, e.g., features
extracted from video streams of real speakers.
The perceptual tests can also profit from the de-
ployment of the audiovisual system as an interaction
modality. Adding to the more controlled lab settings, data can also be collected from those using the interaction modality, potentially covering different contexts, a vital step, we argue, in assessing the acceptability and usefulness of the audiovisual synthesis in real scenarios.
Another important aspect regarding evaluation is
how the assessment data might be used, in a feedback
loop, to guide system improvement. The first option
is that the researcher, based on the evaluation out-
comes, makes changes to the system. Another alter-
native, extending the idea of unsupervised assessment
methods, is to enable unsupervised adjustment of sys-
tem parameters based on the evaluation data. Such an
unsupervised evaluation loop would allow automated
parameter tuning (e.g., avatar lip aperture adjustment)
and a systematic assessment of the influence of each parameter on the overall quality.
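A minimal sketch of such an unsupervised tuning loop is given below; the synthesis and feature-extraction functions, the tuned parameter, and the reference data are purely illustrative stand-ins, not components of the described framework.

```python
import numpy as np

# Placeholders for the real system: synthesize a visual stream for a given
# lip-aperture gain and extract comparable features from it.
def synthesize_visual(lip_aperture_gain: float, n_frames: int = 100) -> np.ndarray:
    t = np.linspace(0.0, 1.0, n_frames)
    return lip_aperture_gain * np.abs(np.sin(2 * np.pi * 3 * t))  # toy aperture track

def extract_features(frames: np.ndarray) -> np.ndarray:
    return frames  # in practice: features from rendered video or model state

def evaluate(gain: float, reference: np.ndarray) -> float:
    """Score a candidate parameter value against reference features (lower is better)."""
    return float(np.mean((extract_features(synthesize_visual(gain)) - reference) ** 2))

def tune(reference: np.ndarray, candidates=(0.8, 0.9, 1.0, 1.1, 1.2)) -> float:
    """Pick the candidate gain that best matches the reference data."""
    return min(candidates, key=lambda g: evaluate(g, reference))

# The reference could come, e.g., from video of a real speaker; here it is synthetic.
best_gain = tune(reference=synthesize_visual(1.1))
```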
5 FRAMEWORK
INSTANTIATION
In what follows, we present the first instantiation of
the audiovisual speech synthesis conceptual frame-
work proposed in the previous section.
Figure 5: Diagram of the articulatory audiovisual speech synthesis core depicting details regarding the technical options adopted for each of the main blocks.
At this mo-
ment, we have a full first instantiation of the audio-
visual speech synthesis core, considerable work re-
garding its integration as an interaction modality and
a very initial sketch of the evaluation module.
5.1 Audiovisual Speech Synthesis Core
Figure 5 depicts the audiovisual speech synthesis
core, following the conceptual structure presented
earlier, but now providing additional detail regarding
the concrete implementation options considered for
each of its blocks.
Considering our research interests and previous
work, this first instantiation of the framework sup-
ports audiovisual synthesis in EP. Note that this does not limit the versatility of the instantiation, since changing the language-specific elements is the key requirement for the framework to support another language. In the language-specific
block, the main aspect is the data required to derive gestures from words, and we considered previous contributions by Teixeira et al. (2008) and Oliveira (2009) providing the gestural definition of EP sounds. The language-specific module also includes the syllabification of the input text and its conversion to phonemes, which, at this time, is still not fully automated and relies on a static dictionary.
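As a simple illustration of this dictionary-based step (the entries and notation below are made-up placeholders, not the lexicon actually used in the system):

```python
# Illustrative sketch of the static-dictionary step: words map directly to
# syllabified phonetic transcriptions; there is no automatic conversion yet.
LEXICON = {
    "papa": ["pa", "pa"],   # hypothetical syllabified transcription
    "nota": ["no", "ta"],
}

def to_syllables(word: str) -> list:
    """Look up the syllabified form of a word in the static dictionary."""
    try:
        return LEXICON[word.lower()]
    except KeyError:
        raise ValueError(f"'{word}' is not in the static dictionary") from None
```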
For the static and dynamic modelling of the vocal tract we considered TADA, which provides the composition of the gestural score and the dynamic model computing the articulator trajectories over time, considering coarticulation effects. It is TADA that provides the anthropomorphic parameters concerning lip, velar, and tongue movement that subsequently drive the auditory and visual synthesis.
Regarding auditory speech synthesis, for the sake
of simplicity, we consider, at this stage, the partial
implementation of the CASY synthesizer, provided along with TADA (Rubin et al., 1981), which suffers from a few limitations, particularly regarding nasality, an important aspect for EP. The envisaged alternative
is the consideration of SAPWindows (Teixeira et al.,
2002), which was built for EP, but other alternatives
are possible, e.g., based on the model and synthesizer
proposed by Birkholz (Birkholz, 2013).
As conceptualized, our approach considers the
set of anthropomorphic parameters driving auditory
speech synthesis (e.g., lip and tongue configurations) as input data for the visual speech synthesis, yielding inherent synchronization between the two streams.
The 3D model used (see figure 6) has an advanced animation rig provided by FACEINMOTION (http://www.faceinmotion.com/). It has
two distinctive features that make it interesting in this
context: (1) it enables controlling aspects, such as
lip aperture, with changes propagating smoothly to
the face; (2) the rigging is easily transferable across models, meaning that personalized avatars can be created while keeping the bones, i.e., the elements in the avatar model controlling parts of the face, common among models.
The articulator trajectories are transformed into animation parameters mostly by performing amplitude adaptation or by using the same articulator data to modify a group of bones. For example, lip protrusion data is used to manipulate the bones for the upper and lower lips and for the left and right mouth corners, since the model does not yet support a higher level
control for this gesture.
Figure 6: The photorealistic avatar considered for the instantiation of the conceptual framework.
Since animation relies on the articulator trajectories, it is basically independent of the language or articulatory model considered in TADA, as long as ar-
ticulator trajectories are generated in the same value
ranges.
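The following sketch illustrates this kind of mapping by amplitude adaptation, with one articulatory parameter driving a group of bones; the bone names, gains, and offsets are hypothetical and depend on the actual rig and value ranges.

```python
# Illustrative sketch of mapping one articulator trajectory to a group of
# avatar bones via amplitude adaptation. Bone names, gains and offsets are
# hypothetical; the actual rig and value ranges depend on the avatar model.
LIP_PROTRUSION_MAP = {
    "upper_lip":          {"gain": 1.0, "offset": 0.0},
    "lower_lip":          {"gain": 0.9, "offset": 0.0},
    "mouth_corner_left":  {"gain": 0.4, "offset": 0.0},
    "mouth_corner_right": {"gain": 0.4, "offset": 0.0},
}

def to_animation_frame(lip_protrusion: float) -> dict:
    """Convert one lip-protrusion sample into per-bone animation parameters."""
    return {bone: m["gain"] * lip_protrusion + m["offset"]
            for bone, m in LIP_PROTRUSION_MAP.items()}

def to_animation(trajectory: list) -> list:
    """Turn a trajectory (sequence of samples) into a sequence of animation frames."""
    return [to_animation_frame(sample) for sample in trajectory]
```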
5.2 Interaction Modality
To move into the envisaged scenario of deploying the audiovisual speech synthesis system as an interaction modality, it has been split between local and online services.
First, TADA has been optimized to run faster and relevant parts were converted to the C language, overall reducing the required computation time by half. Then, to move it away from its native Matlab environment, which is not well suited for integration into an interaction modality, it was encapsulated inside a RESTful service returning articulator trajectories for given sentences.
This also fulfils the goal of putting the computational
weight outside of the local device.
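A client for this kind of encapsulation could look like the sketch below; the endpoint URL, payload fields, and response structure are hypothetical and serve only to illustrate how a generation module might request trajectories from the online core.

```python
# Illustrative client for the kind of RESTful encapsulation described above.
import requests

SERVICE_URL = "https://example.org/avsynth/trajectories"  # placeholder endpoint

def request_trajectories(sentence: str, language: str = "pt-PT") -> dict:
    """Send a sentence to the online core and get back articulator trajectories."""
    response = requests.post(SERVICE_URL,
                             json={"text": sentence, "language": language},
                             timeout=30)
    response.raise_for_status()
    return response.json()  # e.g., {"lip_aperture": [...], "tongue_tip": [...]}
```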
Regarding the rendering of the audiovisual syn-
thesis, we considered that at least part of it needed
to be on the client side to provide some level of control and customization (e.g., a personalized avatar) over how the synthesis is integrated and used. Therefore,
the final rendering is performed on the client. Figure 5
depicts the part of the core that is actually performed
by the client device.
For rendering the audiovisual stream, on the client
side, we considered WebGL in conjunction with
Three.js, the latter used to manipulate the model ac-
cording to the received anthropomorphic parame-
ters. Using WebGL has the advantage of not re-
quiring the installation of any additional libraries
since it is natively supported by most web browsers,
allowing for out-of-the-box multi-platform, multi-
device support. To be clear, this does not invalidate the adoption of other forms of rendering, e.g., including the avatar in a complex vir-
tual scene. A simple example video can be found
at http://sweet.ua.pt/sss/resources/visualspeech/files/opapaestanotrabalho2.mp4
5.3 Evaluation
At this time, the audiovisual synthesis core already
provides output streams that can be used for percep-
tual tests. However, regarding the unsupervised as-
sessment, we are still at an early stage, studying how
the reference data should be gathered and used.
6 CONCLUSIONS
This article argues in favour of an anthropomorphic
(articulatory) perspective for audiovisual speech syn-
thesis and proposes the main aspects of a concep-
tual framework that should enable its evolution. This
framework considers articulatory audiovisual speech
synthesis as a research tool in speech production, sup-
ported by an evaluation module and tightly connected
to its use in multimodal interaction systems. A first
instantiation of the conceptual framework is also pre-
sented, demonstrating the viability of the proposal. It already provides audiovisual synthesis of simple sentences in EP and enables moving from groundwork, such as gestural models, up to having a speaking avatar in a web browser.
Considering the proposed conceptual framework,
our first instantiation still does not fully address eval-
uation. While at its current stage the system can en-
able the execution of perceptual tests, we have yet to
deploy a systematic quantitative evaluation module to
cover all stages of the framework.
ACKNOWLEDGEMENTS
Samuel Silva is funded by grant SFRH/BPD/108151/2015 from FCT. Research partially funded by IEETA Research Unit funding (UID/CEC/00127/2013) and
Marie Curie Actions IRIS (ref. 610986, FP7-
PEOPLE-2013-IAPP).
REFERENCES
Almeida, N., Silva, S., and Teixeira, A. (2014). Design and
development of speech interaction: A methodology.
In Proc. of HCII, LNCS 8511, pages 370–381, Crete, Greece.
Almeida, N., Silva, S., Teixeira, A. J. S., and Vieira, D.
(2016). Multi-device applications using the multi-
modal architecture. In Dahl, D., editor, Multimodal
Interaction with W3C Standards: Towards Natural
User Interfaces to Everything, (to appear). Springer,
New York, NY, USA.
Birkholz, P. (2013). Modeling consonant-vowel coarticu-
lation for articulatory speech synthesis. PLoS ONE,
8(4):1–17.
Browman, C. P. and Goldstein, L. (1990). Gestural speci-
fication using dynamically-defined articulatory struc-
tures. Journal of Phonetics, 18:299–320.
Cohen, M. M. and Massaro, D. W. (1993). Modeling coar-
ticulation in synthetic visual speech. In Models and
techniques in computer animation, pages 139–156.
Springer.
Files, B. T., Tjan, B. S., Jiang, J., and Bernstein, L. E.
(2015). Visual speech discrimination and identifica-
tion of natural and synthetic consonant stimuli. Fron-
tiers in Psychology, 6.
Freitas, J., Candeias, S., Dias, M. S., Lleida, E., Ortega, A.,
Teixeira, A., Silva, S., Acarturk, C., and Orvalho, V.
(2014). The IRIS project: A liaison between industry
and academia towards natural multimodal communi-
cation. In Proc. Iberspeech, pages 338–347, Las Palmas de Gran Canária, Spain.
Hall, N. (2010). Articulatory phonology. Language and
Linguistics Compass, 4(9):818–830.
Massaro, D. W. (2005). The Psychology and Technology of
Talking Heads: Applications in Language Learning,
pages 183–214. Springer Netherlands, Dordrecht.
Mattheyses, W. and Verhelst, W. (2015). Audiovisual
speech synthesis: An overview of the state-of-the-art.
Speech Communication, 66:182–217.
Nam, H., Goldstein, L., Browman, C., Rubin, P., Proctor,
M., and Saltzman, E. (2006). TADA manual. New
Haven, CT: Haskins Labs.
Oliveira, C. (2009). From Grapheme to Gesture. Linguis-
tic Contributions for an Articulatory Based Text-To-
Speech System. PhD thesis, University of Aveiro (in
Portuguese).
Rubin, P., Baer, T., and Mermelstein, P. (1981). An articula-
tory synthesizer for perceptual research. The Journal
of the Acoustical Society of America, 70(2):321–328.
Rubin, P., Saltzman, E., Goldstein, L., McGowan, R., Tiede,
M., and Browman, C. (1996). CASY and extensions
to the task-dynamic model. In Proc. Speech Prod.
Seminar, pages 125–128.
Saltzman, E. L. and Munhall, K. G. (1989). A dynamical
approach to gestural patterning in speech production.
Ecological psychology, 1(4):333–382.
Schabus, D., Pucher, M., and Hofer, G. (2014). Joint audio-visual hidden semi-Markov model-based speech synthesis. J. of Selected Topics in Signal Proc., 8(2):336–
347.
Scott, A. D., Wylezinska, M., Birch, M. J., and Miquel,
M. E. (2014). Speech MRI: Morphology and function. Physica Medica, 30(6):604–618.
Serra, J., Ribeiro, M., Freitas, J., Orvalho, V., and Dias,
M. S. (2012). A proposal for a visual speech animation system for European Portuguese. In Proc. Iber-
SPEECH, pages 267–276, Madrid, Spain. Springer.
Silva, S., Almeida, N., Pereira, C., Martins, A. I., Rosa,
A. F., e Silva, M. O., and Teixeira, A. (2015). Design
and development of multimodal applications: A vi-
sion on key issues and methods. In Proc. HCII, LNCS.
Teixeira, A., Oliveira, C., and Barbosa, P. (2008). European
Portuguese articulatory based text-to-speech: First re-
sults. In Proc. PROPOR, LNAI 5190, pages 101–111.
Teixeira, A., Silva, L., Martinez, R., and Vaz, F. (2002).
SAPWindows - towards a versatile modular articu-
latory synthesizer. In Proc. of IEEE Workshop on
Speech Synthesis, pages 31–34.
Teixeira, A. J. S., Almeida, N., Pereira, C., e Silva, M. O.,
Vieira, D., and Silva, S. (2016). Applications of
the multimodal interaction architecture in ambient as-
sisted living. In Dahl, D., editor, Multimodal Inter-
action with W3C Standards: Towards Natural User
Interfaces to Everything, (to appear). Springer, New
York, NY, USA.
W3C Consortium (2003). W3C multimodal interaction
framework - technical note (accessed Oct. 2016).
Železný, M., Krňoul, Z., and Jedlička, P. (2015). Analy-
sis of Facial Motion Capture Data for Visual Speech
Synthesis, pages 81–88. Springer International Pub-
lishing, Cham.