GLOTTAL SOURCE ESTIMATION ROBUSTNESS
A Comparison of Sensitivity of Voice Source Estimation Techniques
Thomas Drugman, Thomas Dubuisson, Alexis Moinet, Nicolas D’Alessandro and Thierry Dutoit
TCTS Lab, Faculté Polytechnique de Mons, 31 Boulevard Dolez, 7000 Mons, Belgium
Keywords:
Speech Processing, Speech Analysis, Voice Source, Glottal Formant.
Abstract:
This paper addresses the problem of estimating the voice source directly from speech waveforms. A novel
principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase. This
technique is compared to two other state-of-the-art methods, namely the Zeros of the Z-Transform
(ZZT) and the Iterative Adaptive Inverse Filtering (IAIF) algorithms. Decomposition quality is assessed on
synthetic signals through two objective measures: the spectral distortion and a glottal formant determination
rate. Technique robustness is tested by analyzing the influence of noise and Glottal Closure Instant (GCI)
location errors. Besides, the impacts of the fundamental frequency and the first formant on the performance are evaluated. Our proposed approach shows a significant improvement in robustness, which could be of great interest when decomposing real speech.
1 INTRODUCTION
Source-filter modeling is one of the most widely used approaches in speech processing. Its success is certainly due to
the physiological interpretation it relies on. In this ap-
proach, speech is considered as the result of a glottal
flow filtered by the vocal tract cavities and radiated
by the lips. Our paper focuses on the glottal source
estimation directly from the speech signal. Typical applications where this issue is of interest are voice quality assessment, statistical parametric speech synthesis, voice pathology detection, and expressive speech production.
The goal of this paper is twofold. First, a simple principle based on anticausality domination is presented. Secondly, different source estimation techniques are compared according to their robustness. Their decomposition quality is assessed in different conditions via two objective criteria: a spectral distortion measure and a glottal formant determination rate. Robust source estimation is of paramount importance since final applications have to face adverse decomposition conditions on real continuous speech.
The paper is structured as follows. In Section 2, theoretical background on source estimation methods is given. The experimental protocol we used for the comparison is defined in Section 3. In Section 4, results are presented and the impact of different factors on the estimation quality is discussed. Section 5 concludes the paper and proposes some guidelines for future work.
2 SOURCE ESTIMATION
TECHNIQUES
We present here two popular voice source estimation methods, namely the Zeros of the Z-Transform decomposition (ZZT) and the Iterative Adaptive Inverse Filtering technique (IAIF). ZZT relies on the observation that speech is a mixed-phase signal (Doval et al., 2003), where the anticausal component corresponds to the vocal folds open phase and the causal component comprises both the glottal closure and the vocal tract contributions (see Figure 1). As for the IAIF method, it isolates the source signal by iteratively estimating the vocal tract and source parts. After
this brief state of the art, our approach based on Anti-
causality Dominated Regions (ACDR) is explained.
2.1 ZZT-based Decomposition of
Speech
For a series of $N$ samples $(x(0), x(1), \ldots, x(N-1))$ taken from a discrete signal $x(n)$, the ZZT representation is defined as the set of roots (zeros) $(Z_1, Z_2, \ldots, Z_{N-1})$ of the corresponding Z-Transform $X(z)$:
$$X(z) = \sum_{n=0}^{N-1} x(n)\, z^{-n} = x(0)\, z^{-N+1} \prod_{m=1}^{N-1} (z - Z_m) \qquad (1)$$

Figure 1: Illustration of the source-filter modeling for one voiced period. The Glottal Closure Instant (GCI) has the particularity of allowing the separation of the glottal open and closed phases, corresponding respectively to anticausal and causal signals.
In order to decompose speech into its causal and anticausal contributions (Bozkurt et al., 2007), the ZZT is computed on frames centered on each Glottal Closure Instant (GCI), with a length of twice the fundamental period at the considered GCI. These instants can be obtained either from electroglottographic (EGG) recordings or by extraction methods applied to the speech signal (see (Kawahara et al., 2000)
for instance). The spectrum of the glottal source
open phase is then computed from zeros outside the
unit circle (anticausal component) while zeros with
modulus lower than 1 give the vocal tract transmit-
tance modulated by the source return phase spectrum
(causal component).
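To make the principle concrete, here is a minimal Python sketch (illustrative helper names and defaults of our own, not the authors' implementation). The zeros of the frame's Z-Transform are the roots of the polynomial in Equation (1), obtained with numpy.roots, and are split by their modulus; the spectrum of each component is then evaluated on the unit circle:

```python
import numpy as np

def zzt_split(frame):
    """Zeros of the Z-Transform of a two-period, GCI-centered frame,
    split into anticausal (|Z| > 1) and causal (|Z| <= 1) sets."""
    zeros = np.roots(frame)                  # roots of x(0)z^(N-1) + ... + x(N-1)
    anticausal = zeros[np.abs(zeros) > 1.0]  # glottal open phase
    causal = zeros[np.abs(zeros) <= 1.0]     # return phase + vocal tract
    return anticausal, causal

def component_log_spectrum(zeros, n_bins=512):
    """Log-amplitude spectrum of one component on the upper unit circle
    (gain and linear-phase terms are ignored: they do not affect the shape)."""
    w = np.exp(1j * np.pi * np.arange(n_bins) / n_bins)
    # accumulate in the log domain to avoid overflow with many zeros
    return sum(np.log(np.abs(w - z)) for z in zeros)
```

Root finding on polynomials of several hundred coefficients is numerically delicate, which is consistent with the noise sensitivity of ZZT reported in Section 4.1.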
2.2 Iterative Adaptive Inverse Filtering
The inverse filtering technique aims at removing the vocal tract contribution from speech by filtering the signal with the inverse of an estimate of the vocal tract transmittance (this estimate being usually obtained by LPC analysis). Many methods implement inverse filtering in an iterative way in order to obtain a reliable glottal source estimate.
One of the most popular iterative methods is the IAIF (Iterative Adaptive Inverse Filtering) algorithm proposed in (Alku et al., 1992). In its first version, this method uses LPC analysis to estimate the vocal tract response and applies this estimate in the inverse filtering procedure. The authors proposed an improvement in (Alku et al., 2000), in which the LPC analysis is replaced by the Discrete All-Pole (DAP) modeling technique (El-Jaroudi and Makhoul, 1991), which is more accurate than LPC for high-pitched voices.
The block diagram of the IAIF method is shown in
Figure 2 where s(n) stands for the speech signal and
g(n) for the glottal source estimation.
Figure 2: Block diagram of the IAIF method (from the doc-
umentation of TKK Aparat (Aparat, 2008)).
The 1st block performs a high-pass filtering in order to reduce the low-frequency fluctuations inherent to the recording step. The 2nd and 3rd blocks compute a first estimation of the vocal tract, which is used in the 4th and 5th blocks to compute a first estimation of the glottal source. This estimation is the basis of the second part of the diagram (7th to 12th blocks), where the same treatment is applied in order to obtain the final glottal source estimation.
Based on this method, TKK Aparat (Airas, 2008) has been developed as a software package providing an estimation of the glottal source and of its model-based parameters. We used the toolbox available on the TKK Aparat website (Aparat, 2008) for our experiments.
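For illustration only, the following heavily simplified sketch follows the spirit of this pipeline with a single refinement pass, plain autocorrelation LPC instead of DAP, and arbitrary model orders; it is not the TKK Aparat implementation.

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LPC; returns A(z) = 1 - sum_k a_k z^-k."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def iaif_like(speech, vt_order=18):
    """One simplified IAIF-style pass: spectral-tilt removal, vocal tract
    estimation, inverse filtering and integration (lip-radiation cancellation)."""
    tilt = lpc(speech, 1)                              # 1st-order glottal/tilt model
    vt = lpc(lfilter(tilt, [1.0], speech), vt_order)   # vocal tract on tilt-free signal
    residual = lfilter(vt, [1.0], speech)              # inverse filter the speech
    return lfilter([1.0], [1.0, -0.99], residual)      # leaky integration
```

The real IAIF algorithm chains several more LPC/integration stages, as shown in Figure 2, but the general flow is the one sketched here.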
2.3 Causality/Anticausality Dominated
Regions
As previously mentioned, analysis is generally performed on two-period long, GCI-centered speech frames. Since the GCI can be interpreted as the starting point of both the causal and anticausal phases, it demarcates the boundary between the causality- and anticausality-dominated regions. As each zone of dominance is confined to the vicinity of the GCI, a sharp window (typically a Hanning-Poisson or Blackman window) is applied to the analysis frame (see Figure 3). Since the causal contribution (comprising the source return phase and the vocal tract components) from the previous period is generally negligible just before the current GCI, the Anticausality Dominated Region (ACDR) is a good approximation of the glottal source
Figure 3: Effect of a sharp GCI-centered windowing on a
two-period long speech frame. The Anticausality Domi-
nated Region (ACDR) approximates the glottal source open
phase.
Figure 4: Table of parameter variation range.
open phase. As long as the window is centered on a GCI and is sufficiently sharp, this simple principle can be applied directly to the speech signal, and even more effectively to a first source estimation (obtained by IAIF, for example). The dependency of both the ZZT and ACDR techniques on GCI detection is discussed in Section 4.2.
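A minimal sketch of the ACDR idea is given below, assuming the GCI positions and the local pitch period are already known and using a plain Blackman window (the variable names are ours):

```python
import numpy as np

def acdr_open_phase_spectrum(signal, gci, period, n_fft=1024):
    """Approximate open-phase spectrum from the anticausality dominated
    region of a two-period, GCI-centered frame."""
    frame = signal[gci - period:gci + period].astype(float)
    frame = frame * np.blackman(len(frame))   # sharp GCI-centered window
    acdr = frame[:period]                     # samples left of the GCI
    return np.abs(np.fft.rfft(acdr, n_fft))
```

The same function can be applied either to the speech signal itself or to a first IAIF source estimate, which gives the two ACDR variants compared in Section 4.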
3 EXPERIMENTAL PROTOCOL
The experimental protocol we opted for is close to
the one presented in (Sturmel et al., 2007). De-
composition is achieved on synthetic speech signals
for different test conditions. The idea is to cover
the diversity of configurations one could find in con-
tinuous speech by varying all parameters over their
whole range. Synthetic speech is produced accord-
ing to the source-filter model by passing a known
train of Liljencrants-Fant glottal waves (Fant et al.,
1985) through an auto-regressive filter extracted by
LPC analysis on real sustained vowels uttered by a male speaker. As the mean pitch during these utterances was about 100 Hz, it is reasonable to assume that the fundamental frequency remains between 60 Hz and 240 Hz in continuous speech. Perturbations are mod-
eled in two ways: by adding a white Gaussian noise
on the speech signal and by making an error on the
GCI location (see Sections 4.1 and 4.2). Figure 4
summarizes all test conditions (59,280 experiments in total).
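To fix ideas, the sketch below builds one such synthetic signal under simplifying assumptions: a crude differentiated-Hanning pulse stands in for the LF waveform, a toy second-order AR filter replaces the LPC envelope of a real vowel, and the 16 kHz sampling rate is arbitrary.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0 = 16000, 100                           # assumed sampling rate and pitch

def glottal_train(f0, fs, n_periods=20):
    """Train of crude open-phase pulses (a stand-in for LF glottal waves)."""
    t0 = int(round(fs / f0))
    pulse = -np.diff(np.hanning(int(0.6 * t0)), prepend=0.0)
    train = np.zeros(n_periods * t0)
    for k in range(n_periods):
        train[k * t0:k * t0 + len(pulse)] = pulse
    return train

source = glottal_train(f0, fs)
ar = [1.0, -1.2, 0.9]                         # toy stable all-pole "vocal tract"
speech = lfilter([1.0], ar, source)           # synthetic voiced speech
```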
Four source estimation techniques are compared here: ZZT, IAIF, and the ACDR principle applied to both speech frames and IAIF source frames. In order to assess the decomposition quality, we used two objective measures:
Figure 5: Histogram of the relative error on the glottal formant determination (SNR = 50 dB).

Spectral Distortion. Many frequency-domain measures for quantifying the distance between two speech frames $x$ and $y$ arise from the speech coding literature. Ideally, the subjective sensitivity of the ear should be formalised by incorporating psychoacoustic effects such as masking or isophone curves. A simple but relevant measure is the spectral distortion (SD), defined as:
$$SD(x,y) = \sqrt{\int_{-\pi}^{\pi}\left(20\log_{10}\left|\frac{X(\omega)}{Y(\omega)}\right|\right)^{2}\,\frac{d\omega}{2\pi}} \qquad (2)$$
where $X(\omega)$ and $Y(\omega)$ denote the spectra of both signals in normalized angular frequency. In (Paliwal and Atal, 1993), the authors argue that a difference of about 1 dB (with a sampling rate of 8 kHz) is rather imperceptible. In order to keep this point of reference between the estimated and target sources, we used the following measure (numerical sketches of both criteria are given at the end of this section):
$$SD(x,y) \approx \sqrt{\frac{2}{8000}\sum_{f=20}^{4000}\left(20\log_{10}\left|\frac{S_{estimated}(f)}{S_{reference}(f)}\right|\right)^{2}} \qquad (3)$$
Glottal Formant Determination Rate. The amplitude spectrum of a voiced source (as shown in Figure 1) generally presents a resonance called the glottal formant. As this parameter is an essential feature, an error on its determination after decomposition should be penalized. An example of the relative error on the glottal formant determination is displayed in Figure 5 for SNR = 50 dB. Many attributes characterizing a histogram could be proposed to evaluate a technique's performance. The one we used for our results is the area under the histogram between ±10% of relative error, which reflects the determination rate within these bounds.
In the next section, results are averaged over all considered frames.
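As an example of how these criteria can be computed, here are hedged numerical sketches of the discrete SD of Equation (3) and of a glottal formant relative error; the FFT length, the 20 Hz to 4 kHz band, and the peak-picking search band are assumptions rather than the authors' exact settings.

```python
import numpy as np

def spectral_distortion(estimated, reference, fs=8000, n_fft=8192):
    """Spectral distortion (in dB) between estimated and reference sources."""
    S_est = np.abs(np.fft.rfft(estimated, n_fft))
    S_ref = np.abs(np.fft.rfft(reference, n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    band = (freqs >= 20) & (freqs <= 4000)
    log_ratio = 20.0 * np.log10(S_est[band] / S_ref[band])
    return np.sqrt(np.mean(log_ratio ** 2))   # mean over the band ~ (2/8000) sum

def glottal_formant_relative_error(estimated, reference, fs=8000, n_fft=8192):
    """Relative error (%) on the glottal formant, taken here as the frequency
    of the spectral maximum below 1 kHz (a hypothetical search band)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    low = freqs <= 1000
    f_est = freqs[low][np.argmax(np.abs(np.fft.rfft(estimated, n_fft))[low])]
    f_ref = freqs[low][np.argmax(np.abs(np.fft.rfft(reference, n_fft))[low])]
    return 100.0 * (f_est - f_ref) / f_ref
```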
Figure 6: Impact of noise on the spectral distortion.
4 RESULTS
A quantitative comparison between the described methods is presented here. More precisely, the results are organized so as to answer the two following questions: “How sensitive are the techniques to perturbations such as noise or GCI location errors?” and “What is the impact of factors such as the fundamental frequency or the first formant on the decomposition quality?”.
4.1 Noise Sensitivity
As a reminder, white Gaussian noise has been added to the speech signal at different SNR levels. This noise models not only recording or production noise, but also any small deviation from the theoretical framework that distinguishes real from synthetic speech. Results according to both the spectral distortion and the glottal formant determination rate are displayed in Figures 6 and 7. Among all techniques, ZZT turns out to be the most sensitive. This can be explained by the fact that even a weak amount of noise may dramatically perturb the positions of the roots in the Z-plane, and consequently the decomposition quality. Interestingly, the utility of the proposed ACDR concept is clearly highlighted (note in particular the improvement when it is applied to the IAIF estimation). Even when applied directly to the speech signal, the ACDR principle clearly yields robust and efficient results.
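The perturbation itself is easy to reproduce; the sketch below scales white Gaussian noise to a target SNR before adding it (the SNR sweep values are hypothetical, and `speech` refers to the synthetic signal from the Section 3 sketch):

```python
import numpy as np

def add_noise(signal, snr_db, rng=np.random.default_rng(0)):
    """Add white Gaussian noise at a target SNR (in dB)."""
    noise = rng.standard_normal(len(signal))
    noise *= np.sqrt(np.var(signal) / (10.0 ** (snr_db / 10.0) * np.var(noise)))
    return signal + noise

# hypothetical SNR sweep, from heavily degraded to nearly clean conditions
degraded = {snr: add_noise(speech, snr) for snr in (10, 20, 30, 40, 50)}
```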
4.2 GCI Location Sensitivity
Another perturbation that could affect a method's accuracy is a possible error on the GCI location. Detecting these particular events directly from speech with reliable precision is still an open problem, although some interesting ideas have been proposed (Kawahara et al., 2000).
Figure 7: Impact of noise on the glottal formant determina-
tion rate.
Figure 8: Impact of a GCI location error on the glottal for-
mant determination rate (clean conditions).
Consequently, detected GCIs rarely match their ideal position exactly when analyzing real speech. To take this effect into account, we have tested the influence of a deviation from the real GCI location (GCIs are known for synthetic signals). Results for the glottal formant determination rate are shown in Figure 8 for clean conditions (no noise added). As mentioned in (Bozkurt et al., 2007), the ZZT technique is strongly sensitive to GCI detection, since this perturbation may affect the whole zero computation. A similar performance degradation is also observed for the ACDR-based methods, due to their inherent way of operating, although to a lesser extent.
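A minimal way to simulate this perturbation (on top of the ACDR sketch of Section 2.3, with hypothetical error values expressed in samples) is simply to shift the analysis center away from the true GCI:

```python
def acdr_with_gci_error(signal, true_gci, period, error_samples):
    """Re-run the ACDR analysis with a deliberately shifted GCI position."""
    shifted_gci = true_gci + error_samples     # e.g. a few samples off
    return acdr_open_phase_spectrum(signal, shifted_gci, period)
```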
4.3 The Influence of Pitch
Female voices are known to be especially difficult to analyze and synthesize. The main reason is their high fundamental frequency, which implies treating shorter periods. As a matter of fact, the vocal tract response does not have time to return freely to its initial state between two glottal excitation periods.
Figure 9: Impact of the fundamental frequency on the spec-
tral distortion (clean conditions).
Consequently, the performance of the ACDR method applied to high-pitched speech will intrinsically degrade, as it relies on the assumption that the vocal tract response is negligible in the ACDR. This hypothesis is acceptable to a certain extent but might have to be reconsidered for high pitch values. Figure 9 presents the evolution of the spectral distortion with respect to the fundamental frequency. Unsurprisingly, all methods degrade as the pitch increases, and in a comparable way.
4.4 The Influence of the First Formant
In (Bozkurt et al., 2004), the authors already reported erroneous glottal formant detection due to an incomplete separation of $F_1$. As argued in the previous subsection, particular configurations may lead to reconsider the assumption behind ACDR applied to speech. More precisely, the decomposition quality mainly depends on the relative values of three parameters: the pitch ($F_0$), the first formant ($F_1$) and the glottal formant ($F_g$). The greater $F_0$ is with respect to $F_1$ and $F_g$, the more severe the decomposition conditions become. Intuitively, this can be interpreted as an ever increasing interference between the causal and anticausal parts.
In our experiments, the filter coefficients were extracted by LPC analysis on four sustained vowels. Even though the whole spectrum may affect the decomposition, it is reasonable to consider that the effect of the first formant is preponderant. To give an idea, the corresponding first formant values are: /a/: 728 Hz, /e/: 520 Hz, /i/: 304 Hz, /u/: 218 Hz. The impact of the vowel on the decomposition accuracy is plotted in Figure 10. As expected, a clear tendency of performance degradation as $F_1$ decreases is observed.
Figure 10: Impact of the first formant on the glottal formant
determination rate (clean conditions).
5 CONCLUSIONS AND FUTURE
WORK
This paper addressed the problem of source estimation robustness. A comparison between four different techniques was carried out on a complete set of synthetic signals. These methods were the Zeros of the Z-Transform (ZZT), the Iterative Adaptive Inverse Filtering (IAIF), and our proposed concept of Anticausality Dominated Regions (ACDR) applied either directly to speech or to a first source estimation (obtained here with IAIF). Two formal criteria
were used to assess their decomposition quality: the spectral distortion and the glottal formant determination rate. Robustness was first evaluated by adding noise to speech; in a general way, this noise modeled every small deviation from the ideal production scheme. Interestingly, both ACDR-derived methods were the most robust and efficient. Another perturbation we considered was a possible error on the GCI location. In a second step, the influence of the pitch ($F_0$) and the first formant ($F_1$) was analyzed. The decomposition quality was interpreted as a trade-off between three quantities: $F_0$, $F_1$ and the glottal formant ($F_g$). In all our experiments, the ACDR-based techniques gave the most promising results.
As future work, we plan to investigate the incorporation of these methods in the following fields:
Statistical Parametric Speech Synthesis. Hidden Markov models (HMM) have recently shown their ability to produce natural-sounding speech (Tokuda et al., 2002). We have already adapted this framework to the French language. A major drawback of such an approach is the “buzziness” of the generated voice. This inconvenience is typically due to the parametric representation of speech. Including a more subtle modeling of the
voice source could lead to enhanced naturalness
and intelligibility.
Expressive Voice. User-friendliness is one of the most important demands from the industry. Since expressivity is mainly managed by the source, an emotional voice synthesis engine should take realistic glottal source model parameters into account. The techniques presented in this paper could be used to estimate these parameters on speech samples extracted from an expressivity-oriented speech database.
Pathological Speech Analysis. Speech pathologies are most of the time due to an irregular behaviour of the vocal folds during phonation. This irregular vibration can be induced by nodules or polyps on the folds and should result in irregular values of the model parameters. The methods presented here could hence be used to estimate the glottal source and its features on pathological speech in order to quantify the pathology level.
ACKNOWLEDGEMENTS
Thomas Drugman is supported by the “Fonds Na-
tional de la Recherche Scientifique” (FNRS) and
Nicolas D’Alessandro by FRIA funding. The authors would also like to thank the Walloon Region for its support (ECLIPSE WALEO II grant #516009 and IRMA RESEAUX II grant #415911).
REFERENCES
Airas, M. (2008). TKK Aparat: An environment for voice inverse filtering and parameterization. Logopedics Phoniatrics Vocology, 33:49–64.
Alku, P., Svec, J., Vilkman, E., and Sram, F. (1992). Glottal
wave analysis with pitch synchronous iterative adap-
tive inverse filtering. Speech Communication, 11(2-
3):109–117.
Alku, P., Svec, J., Vilkman, E., and Sram, F. (2000). Analy-
sis of voice in breathy, normal and pressed phonation
by comparing inverse filtering and videokymography.
In ICSLP 2000, Proceedings of the International Con-
ference on Spoken Language Processing, pages 885–
888.
Aparat (2008). TKK Aparat main page. http://aparat.sourceforge.net/index.php/Main_Page.
Bozkurt, B., Couvreur, L., and Dutoit, T. (2007). Chirp
group delay analysis of speech signals. Speech Com-
munication, 49(3):159–176.
Bozkurt, B., Doval, B., and Dutoit, T. (2004). A method
for glottal formant frequency estimation. In Proc. IC-
SLP, International Conference on Spoken Language
Processing, Jeju Island (Korea).
Doval, B., d’Alessandro, C., and Henrich, N. (2003). The
voice source as a causal/anticausal linear filter. In Pro-
ceedings ISCA ITRW VOQUAL03, Geneva, Switzer-
land.
El-Jaroudi, A. and Makhoul, J. (1991). Discrete all-pole modeling. IEEE Transactions on Signal Processing, 39(2):411–423.
Fant, G., Liljencrants, J., and Lin, Q. (1985). A four-
parameter model of glottal flow. In STL-QPSR4, pages
1–13.
Kawahara, H., Atake, Y., and Zolfaghari, P. (2000). Ac-
curate vocal event detection method based on a fixed-
point analysis of mapping from time to weighted av-
erage group delay. In ICSLP 2000, Proceedings of the
International Conference on Spoken Language Pro-
cessing, volume 4, pages 664–667.
Paliwal, K. and Atal, B. (1993). Efficient vector quantization of LPC parameters at 24 bits/frame. IEEE Trans. Speech Audio Processing, 1(1):3–14.
Sturmel, N., D’Alessandro, C., and Doval, B. (2007). A
comparative evaluation of the zeros of z transform rep-
resentation for voice source estimation. In INTER-
SPEECH 2007, Antwerp, Belgium, pages 558–561.
Tokuda, K., Zen, H., and Black, A. (2002). An HMM-based speech synthesis system applied to English. In Proc. IEEE Workshop on Speech Synthesis 02, Santa Monica, USA, pages 227–230.