PITCH-ASYNCHRONOUS GLOTTAL INVERSE FILTERING OF NORMAL AND PATHOLOGICAL VOICES BASED ON HOMOMORPHIC PREDICTION

Rubén Fraile, Malte Kob, Juana M. Gutiérrez, Nicolás Sáenz-Lechón, Juan Ignacio Godino-Llorente and Víctor Osma-Ruiz

Department of Circuits & Systems Engineering, Universidad Politécnica de Madrid

Carretera de Valencia Km 7, 28031 Madrid, Spain

Erich Thienhaus Institute, Hochschule für Musik Detmold

Neustadt 22, D32756 Detmold, Germany

Keywords:

Speech analysis, Glottal inverse filtering, Vocal tract filter, Cepstral analysis.

Abstract:

Inverse filtering of speech signals for the separation of vocal tract and glottal source effects has a wide variety of potential applications, including the assessment of glottis-related aspects of voice function. Among all existing approaches to inverse filtering, this paper focuses on homomorphic prediction. While not favoured much by researchers in recent literature, such an approach offers two advantages over others: it does not require previous estimation of the fundamental frequency and it does not rely on any assumptions about the spectral envelope of the glottal signal. The performance of homomorphic prediction is herein assessed and compared to that of an adaptive inverse filtering method, making use of synthetic voices produced with a biomechanical voice production model. The reported results indicate that the performance of inverse filtering based on homomorphic prediction is within the range of that of adaptive inverse filtering and, at the same time, it behaves better when the spectral envelope of the glottal signal does not suit an all-pole model of predefined order.

1 INTRODUCTION

In a discrete signal processing framework, the most

basic, yet widely used, voice production models map

the human phonatory system to a set of linear systems

connected in cascade. In their simplest form, such

models consist of three blocks (Rabiner and Schafer,

1978): the voice source (glottal source) G(z), which

may be either a quasi-harmonic or a noise-like sig-

nal, depending on the type of sound being modelled

(voiced or unvoiced), a vocal-tract ﬁlter V (z) that

adds resonances to the signal and a radiation ﬁlter

R(z) that accounts for ﬂow-to-pressure conversion

at the lips. In practice, using non-invasive measuring techniques, only the output of the composite system S(z) = G(z) · V(z) · R(z) can be measured as an audio signal. In this context, glottal inverse filtering is

a blind deconvolution problem consisting in estimat-

ing the glottal source G(z) from the recorded voice

S(z) without any a priori knowledge on the form of

V (z) ·R(z). While the extraction of the glottal source

from recorded voice is of interest for many applica-

tions and several algorithms for such inverse ﬁltering

have been proposed, a standard automatic procedure is still lacking to date (Walker and Murphy, 2007).
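The cascade S(z) = G(z) · V(z) · R(z) can be illustrated with a short synthesis sketch. The filter coefficients used below (a double real pole for G(z), a single resonance near 700 Hz for V(z) and a first-order differentiator for R(z)) are illustrative assumptions and are not taken from the paper; only the sampling frequency and the fundamental frequency reuse values reported in section 2.2.

```python
import numpy as np
from scipy.signal import lfilter

fs = 17640          # sampling frequency in Hz (value reported in section 2.2)
f0 = 206            # fundamental frequency in Hz (value reported in section 2.2)

# Excitation: impulse train with one pulse per (rounded) pitch period, 100 ms long.
period = int(round(fs / f0))
excitation = np.zeros(int(0.1 * fs))
excitation[::period] = 1.0

# G(z): assumed 2-pole glottal shaping filter (double real pole at z = 0.95).
g_den = np.array([1.0, -1.9, 0.9025])
# V(z): assumed all-pole vocal tract with a single resonance near 700 Hz.
r, theta = 0.97, 2 * np.pi * 700 / fs
v_den = np.array([1.0, -2 * r * np.cos(theta), r ** 2])
# R(z): lip radiation approximated by a first-order differentiator.
r_num = np.array([1.0, -1.0])

glottal = lfilter([1.0], g_den, excitation)    # source after G(z)
tract = lfilter([1.0], v_den, glottal)         # signal after V(z)
voice = lfilter(r_num, [1.0], tract)           # signal after R(z): the measurable one

# The blocks are linear and time-invariant, so the cascade equals a single filter.
voice_single = lfilter(r_num, np.convolve(g_den, v_den), excitation)
```

Since only `voice` is observable, an inverse filtering algorithm has to undo V(z) and R(z) without knowing `v_den` or `r_num`.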

The main difficulty in glottal inverse filtering (as in other blind deconvolution problems) is discriminating among the effects of the individual systems on the overall output. In the specific case of glottal inverse

ﬁltering, this problem is approached in a number of

different ways (Walker and Murphy, 2007):

• Pitch-synchronous approaches: They require

identiﬁcation of glottal closure instants. The rea-

son for this need may either be for analysing glot-

tal closed phase (Wong et al., 1979), which is a

time interval in which the vocal tract receives no

input and, therefore, output voice is assumed to

be independent of glottal source, or for isolating

glottal source periods whose waveform may be

estimated either adaptively (Akande and Murphy,

Fraile R., Kob M., M. Gutierrez J., Sáenz-Lechón N., Ignacio Godino-Llorente J. and Osma-Ruiz V. (2010).

PITCH-ASYNCHRONOUS GLOTTAL INVERSE FILTERING OF NORMAL AND PATHOLOGICAL VOICES BASED ON HOMOMORPHIC PREDICTION.

In Proceedings of the Third International Conference on Bio-inspired Systems and Signal Processing, pages 45-52

DOI: 10.5220/0002699300450052

 SciTePress

2005) or jointly with the vocal tract (Fu and Mur-

phy, 2006).

• Pitch-asynchronous approaches: These approaches do not necessarily require either detection of specific time instants or pitch-period calculation, though the performance of some of them may be improved by including pitch synchronism. The best-known scheme in this group is the Iterative Adaptive Inverse Filtering (IAIF) algorithm (Alku, 1992). It assumes a two-pole model for the glottal source and uses such an assumption to refine the all-pole vocal tract estimation. A similar approach based on lattice filters has been reported in (Gómez-Vilda et al., 2008). An alternative asynchronous approach consists in making use of the deconvolution capabilities of the cepstrum for discriminating between glottal source and vocal tract plus radiation. Such an approach was first proposed in (Oppenheim and Schafer, 1968) and was refined afterwards with the addition of pole-zero modelling (Kopec et al., 1977). Recently, its use for the estimation of vocal tract resonances has been reported (Rahman and Shimamura, 2005).

The interest of the glottal source waveform for the

assessment of laryngeal pathologies comes from the

close expected relationship between laryngeal func-

tion and the glottal waveform itself. Some results on

this application have been reported, for instance, in

(de Oliveira-Rosa et al., 2000), (Gómez-Vilda et al., 2007) and (Gómez-Vilda et al., 2008). However, the

application of glottal inverse ﬁltering techniques to

pathological voices has a number of difﬁculties that

should not be disregarded. In the ﬁrst place, patholog-

ical voices may not have a clear harmonic or quasi-

harmonic structure (see type 3 voice segments in

chap. 4 of (Sapienza and Hoffman-Ruddy, 2009)) and

some pathologies may prevent complete glottal clo-

sure (Sapienza and Hoffman-Ruddy, 2009)(chap. 5).

Therefore, the implementation of pitch-synchronous

approaches may be problematic in such cases. In the

second place, assumptions about the spectral enve-

lope of the glottal waveform (e.g. a 12 dB/oct decay

(Walker and Murphy, 2007)) that are inherent to some

approaches, for instance IAIF, may not be valid for

pathological voices. In addition, other unsolved issues of inverse filtering, regardless of its application, also have to be considered. One of the most remarkable of these issues is the evaluation of the inverse filtering algorithms themselves. Although a set of objective measures for this evaluation has been proposed

(Moore and Torres, 2008), these rely on the expected

characteristics of the glottal source waveform, not on

the measured characteristics, as the glottal source is

commonly unknown. One way to solve that problem

is the usage of synthetic voices for the assessment of

the algorithms (Walker and Murphy, 2007), but the

validity of this approach depends on the realism of

the used voice synthesisers.

In the previously described context, this article

reports on the evaluation of two inverse ﬁltering ap-

proaches for pathological voice signal analysis. Due

to the above-mentioned potential characteristics of

pathological voices, pitch-asynchronous approaches

have been preferred. Among these, the performance

of IAIF (Alku, 1992) has been compared to that of

a variant of the homomorphic prediction (HoP) pro-

posed in (Kopec et al., 1977). The performance has

been evaluated using synthetic voice signals produced

with a physical voice model (Kob et al., 1999) (Kob,

2002a). The usage of synthetic voices has allowed

an objective and quantitative performance evaluation

that has been carried out both in temporal and spec-

tral domains. The rest of the paper is organised as fol-

lows: in section 2 a description of the voice simulator

and the voices produced with it is provided, section 3

contains a description of the analysed inverse ﬁltering

algorithms, section 4 presents the results of applying

these algorithms to the synthetic voices and, last, sec-

tion 5 is dedicated to the conclusions.

2 SIMULATED VOICE SIGNALS

2.1 Simulation Model

The materials used for the herein reported experi-

ments have been synthetic voice signals generated

with the VOX simulator (Kob, 2002b). An overview

of the simulation model can be found in (Kob et al.,

1999) and a more thorough description in (Kob,

2002a). As far as this paper is concerned, the simulation

model consists roughly of two blocks: glottis model

and vocal tract model. The glottis model is formed by

a set of vocalis-mucosa pairs connected among them

and with the larynx walls by means of springs of tun-

able stiffness. Within each pair, both the vocalis and

the mucosa are represented by one mass each, the mu-

cosa above the vocalis, and connected between them

also by a spring. For the work reported in this pa-

per, each vocal fold has been modelled by a series of

15 vocalis-mucosa pairs. Two types of glottis have

been simulated: a normal glottis with the vocal folds

having straight edges and uniform stiffness and mass

distribution and a glottis with one pair of nodules simulated by a localised concentration of mass and irregular vocal-fold edges. The specific form of the vocal folds has been chosen so as to mimic the movement of one of the nodular glottis examples provided in (Sapienza and Hoffman-Ruddy, 2009). As for the rest of the parameters, default values suggested in (Kob, 2002a) have been used. A schematic representation of the glottis model is given in figure 1.

BIOSIGNALS 2010 - International Conference on Bio-inspired Systems and Signal Processing

Figure 1: Schematic representation of the vocal fold model. Up: parameters of a single element composed of vocalis mass, mucosa mass and springs. Down: arrangement in the horizontal plane of 30 of the above-described elements (15 on each side) to simulate the glottis (normal glottis on the left and one with nodules on the right).

Figure 2: Schematic representation of the vocal tract model. The plot shows the radius (m) against the distance to the vocal folds (m).

As for the vocal tract, among the possible mod-

elling approaches offered by VOX, the “stairs” model

has been chosen. This corresponds to the Kelly-

Lochbaum model. It considers the vocal tract as a se-

ries of 45 concatenated tubes with different diameters

and the pressure wave propagation along the inside of

such structure is simulated. The simulation approach

is similar to that of (Mathur et al., 2006). Figure 2 de-

picts one simulated conﬁguration of the vocal tract.

Specifically, three different vocal tract shapes have been simulated, which are intended to mimic the articulation of the vowels /a/ (as in “hat”), /o/ (as in “toe”) and /u/ (as in “foot”). These shapes have been obtained from (Story and Titze, 1998).

Figure 3: Simulated normalised glottal flow for normal (up) and nodular (down) voices. Both panels plot the normalised flow amplitude against t (s).

Therefore, on the whole, the experiments have been carried out with six signals, which come from the combination of the two glottis models with the three vocal tract models. For this particular experiment, uncoupled simulation of vocal folds and vocal tract has been chosen. Admittedly, this eases the task of inverse filtering algorithms, since they are fed with signals for which the simplified voice production model mentioned in section 1 is fully valid, which is not the case for real voices. However, the choice is justified because such a model has proven to be valid for a wide variety of applications and the purpose of the herein reported work is to provide a comparative analysis of algorithms and not absolute performance measures.

2.2 Characteristics of Simulated Glottal

Signals

Figure 3 shows part of the simulated glottal ﬂow sig-

nals for normal and nodular voices. In both plots,

the signals have been normalised to have zero mean and a mean square value equal to one ($\mu_g = 0$, $\sigma_g = 1$). The sampling frequency of both signals is $f_s = 17640$ Hz. Both signals have a fundamental frequency of 206 Hz (see the modulus of the corresponding Discrete Fourier Transforms (DFT) plotted in figure 4). However, the normal voice presents a smooth glottal cycle, while the nodular voice has an irregular glottal cycle with oscillations shorter than one cycle.

These correspond to a disjoint oscillation of the an-

terior and posterior parts of the glottis, with the nod-

ules being the limit between those parts. In spectral

domain, this is reﬂected by a non-harmonic structure

(peaks are not repeated at regular frequency intervals)

with spectral peaks higher than that corresponding to

the fundamental frequency (ﬁgure 4, down).
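The normalisation and the DFT modulus mentioned in this section can be sketched as follows. The test signal is a hypothetical stand-in (a 206 Hz tone with a weaker second harmonic and an offset), not the simulated glottal flow, and the function names are illustrative.

```python
import numpy as np

def normalise(x):
    """Shift x to zero mean and scale it to unit mean square value."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return x / np.sqrt(np.mean(x ** 2))

def dft_magnitude_db(x, fs):
    """Return the frequency axis (Hz) and the DFT modulus (dB) of a real signal."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs, 20 * np.log10(np.abs(spectrum) + 1e-12)

fs = 17640
t = np.arange(fs) / fs          # one second of signal, so DFT bins are 1 Hz apart
# Stand-in for a glottal flow: offset + 206 Hz tone + weaker second harmonic.
flow = 3.0 + np.sin(2 * np.pi * 206 * t) + 0.3 * np.sin(2 * np.pi * 412 * t)

g = normalise(flow)
freqs, mag_db = dft_magnitude_db(g, fs)
peak_hz = freqs[np.argmax(mag_db)]   # frequency of the strongest spectral peak
```

For a regular (quasi-harmonic) signal such as this one, the strongest peak coincides with the fundamental frequency; the nodular voice described above is precisely a case where that does not hold.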

Figure 4: DFT of the simulated normalised glottal flow for normal (up) and nodular (down) voices. In both cases, the spectrum of a fitted 2-pole LP model has also been plotted. Both panels are plotted against f (Hz).

In figure 4, the spectrum of a linear prediction (LP) model fitted to the glottal signals has also been depicted. The spectrum of the glottal flow is typically assumed to have an envelope with a 12 dB/oct decay, which can be modelled with a 2-pole LP model

(Walker and Murphy, 2007). In fact, a 2-to-4 pole

LP model for the glottal ﬂow is proposed in (Alku,

1992) and a similar number of poles (1 to 3) is suggested in (Gómez-Vilda et al., 2008). The discrete

all-pole modelling (DAP) algorithm (El-Jaroudi and

Makhoul, 1991) has been used for the ﬁtting of the LP

model. For the normal voice in figure 4, the LP model approximately matches the simulated signal regarding the placement of the highest spectral peak and the smooth decay of the spectral envelope above 1 kHz, although the slope of the decay does not coincide. In the case of the nodular signal, by contrast, the decay slopes of the model and the signal are roughly similar, but the highest peak of the spectral envelope is clearly different and the behaviour of the signal at high frequencies (above 3 kHz) does not match the model. These divergences between the signals and the all-pole LP models motivate the use of inverse filtering algorithms that make as few prior assumptions as possible regarding the spectral envelope of the glottal signal.
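Fitting a low-order all-pole model to a signal and inspecting its spectral envelope can be sketched as below. Note that the paper uses the DAP algorithm for this fit; the sketch swaps in the classical autocorrelation method, which is simpler and not equivalent. The test signal and the function names are assumptions made for illustration.

```python
import numpy as np

def lpc_autocorr(x, order):
    """LP coefficients [1, a_1, ..., a_p] by the autocorrelation method (not DAP)."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation lags 0..order.
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations: R a = -r[1:].
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], -r[1:])
    return np.concatenate(([1.0], a))

def lp_envelope_db(a, fs, n_points=512):
    """Magnitude response (dB) of the all-pole model 1/A(z) on [0, fs/2]."""
    freqs = np.linspace(0, fs / 2, n_points)
    w = np.exp(-2j * np.pi * freqs / fs)           # w = z^-1 on the unit circle
    A = np.polyval(a[::-1], w)                     # A = 1 + a_1 w + ... + a_p w^p
    return freqs, -20 * np.log10(np.abs(A))

# Hypothetical test signal: impulse response of a known 2-pole filter.
a_true = np.array([1.0, -1.2, 0.81])
h = np.zeros(400)
h[0] = 1.0
h[1] = -a_true[1] * h[0]
for n in range(2, len(h)):
    h[n] = -a_true[1] * h[n - 1] - a_true[2] * h[n - 2]

a_hat = lpc_autocorr(h, order=2)                   # should recover a_true
freqs, env_db = lp_envelope_db(a_hat, fs=17640)
```

A 2-pole model can only place one spectral peak and one decay slope, which is why it cannot follow both features of the nodular glottal signal at once.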

3 INVERSE FILTERING

ALGORITHMS

3.1 Iterative Adaptive Inverse Filtering

(IAIF)

As mentioned in section 1, the IAIF algorithm (Alku,

1992) is a pitch-asynchronous scheme that has be-

come prototypical for adaptive approaches to inverse

ﬁltering. The ﬂow diagram of the IAIF algorithm is

represented in ﬁgure 5. Within this algorithm, the

voice signal is processed by three blocks:

1. A high-pass filter with cut-off frequency near 30 Hz that removes slow variations of the glottal flow.

2. A FIR filter that removes the effect of the vocal tract from the voice signal. This block is called inverse filtering because its response $H_V(z)$ is sought to be the inverse of the vocal tract response:

$$H_V(z) = \frac{1}{V(z)}. \tag{1}$$

Since $V(z)$ corresponds to an all-pole LP model:

$$V(z) = \frac{1}{1 + \sum_{j=1}^{t} a_j \cdot z^{-j}} \;\Rightarrow\; H_V(z) = 1 + \sum_{j=1}^{t} a_j \cdot z^{-j}. \tag{2}$$

3. An integrator whose response $H_R(z)$ compensates for the radiation effect of the lips. Such radiation effect is, approximately, a differentiation, thus:

$$H_R(z) = \frac{1}{1 - \rho \cdot z^{-1}} \approx \frac{1}{1 - z^{-1}} \approx \frac{1}{R(z)}, \tag{3}$$

where ρ is a positive real number close to 1 but lower, so as to ensure filter stability.

Figure 5: Flow diagram of the IAIF algorithm.
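The last two blocks (FIR inverse filter and leaky integrator) can be sketched with standard filtering calls. The function names, the 2-pole "vocal tract" and the use of 1 − ρ·z⁻¹ as radiation model are assumptions chosen so that the round trip is exact; real voices would only satisfy these relations approximately.

```python
import numpy as np
from scipy.signal import lfilter

def inverse_filter(x, a):
    """FIR inverse filter 1 + sum_j a_j z^-j, with a = [1, a_1, ..., a_t]."""
    return lfilter(a, [1.0], x)

def leaky_integrator(x, rho=0.92):
    """Integrator 1 / (1 - rho z^-1); rho < 1 keeps the pole inside the unit circle."""
    return lfilter([1.0], [1.0, -rho], x)

rng = np.random.default_rng(0)
source = rng.standard_normal(1000)                  # stand-in for the glottal source
a = np.array([1.0, -1.2, 0.81])                     # hypothetical all-pole vocal tract
voiced = lfilter([1.0], a, source)                  # apply V(z) = 1 / A(z)
radiated = lfilter([1.0, -0.92], [1.0], voiced)     # radiation modelled as 1 - rho z^-1

# Undo radiation first, then the vocal tract: the source is recovered.
recovered = inverse_filter(leaky_integrator(radiated, rho=0.92), a)
```

In practice the coefficients `a` are unknown, and estimating them is the purpose of the remaining blocks of the algorithm.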

The purpose of the white blocks in ﬁgure 5 is to

provide a good estimate for V (z). For this purpose,


a two-step procedure is repeated twice. In the first pass, a single-pole LP model is assumed for the glottal waveform G(z) (G1) and, based on this assumption, a t-pole LP model is fitted (LPC analysis) to the result of inverse-filtering the voice signal with that glottal model (T1). Such a t-pole LP model is a first estimate of V(z), which is used to produce a refined 2-to-4 pole LP model of G(z) (G2). This allows, again, an improved estimation of V(z) using an LP model with t poles (T2). The LPC analysis is done with the DAP algorithm (El-Jaroudi and Makhoul, 1991), which has been shown to provide better model estimates than classical autocorrelation methods. A free implementation of the IAIF algorithm can be found in (Arias and Bäckström, 2008).
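The two-pass structure (G1, T1, G2, T2) can be sketched as follows. This is a simplified illustration under several assumptions: the LP fits use the autocorrelation method with a Hamming window instead of DAP, the initial high-pass filter is omitted, and the exact placement of the integrations follows a common reading of IAIF rather than anything stated in this section. The synthetic input is hypothetical.

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """Autocorrelation-method LP coefficients [1, a_1, ..., a_p] (stand-in for DAP)."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.concatenate(([1.0], np.linalg.solve(r[idx], -r[1:])))

def iaif_sketch(s, t=18, g=2, rho=0.92):
    """Simplified two-pass IAIF: returns (glottal flow estimate, vocal tract LP)."""
    w = np.hamming(len(s))
    integrate = lambda x: lfilter([1.0], [1.0, -rho], x)

    g1 = lpc(s * w, 1)                                   # G1: single-pole glottal model
    v1 = lpc(lfilter(g1, [1.0], s) * w, t)               # T1: first t-pole tract model
    flow1 = integrate(lfilter(v1, [1.0], s))             # first glottal flow estimate
    g2 = lpc(flow1 * w, g)                               # G2: refined g-pole glottal model
    v2 = lpc(integrate(lfilter(g2, [1.0], s)) * w, t)    # T2: refined tract model
    flow = integrate(lfilter(v2, [1.0], s))              # final inverse filtering
    return flow, v2

# Hypothetical synthetic voice: pulse train through assumed G(z), V(z) and R(z).
fs = 17640
rng = np.random.default_rng(0)
excitation = np.zeros(fs // 10)
excitation[::86] = 1.0
den = np.convolve([1.0, -1.9, 0.9025], [1.0, -1.8, 0.94])
voice = lfilter([1.0, -1.0], den, excitation)
voice += 1e-2 * rng.standard_normal(voice.size)          # small noise keeps the fit well posed

flow_est, tract_lp = iaif_sketch(voice)
```

The glottal model order `g` is where the assumption about the spectral envelope of the source enters the algorithm.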

3.2 Homomorphic Prediction (HoP)

An alternative, pitch-asynchronous approach for in-

verse ﬁltering is that of homomorphic (or cepstral)

analysis. This was ﬁrst proposed in (Oppenheim and

Schafer, 1968) and it is based on two ideas:

• The introduction of a logarithm operation allows the convolution operation to be transformed into a sum. That is, the speech production model can be expressed in the “log-z” domain as:

$$\log S(z) = \log G(z) + \log V(z) + \log R(z) \tag{4}$$

• The vocal tract affects the spectral envelope of voice, while the glottal source is responsible for the fine quasi-harmonic structure; therefore, the inverse transform of (4) should result in a signal in which G(z) and V(z) are highly separable.
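The first idea, that the logarithm turns a convolution into a sum, can be checked numerically on the real cepstrum. The two random sequences below are arbitrary stand-ins for a source and a filter impulse response; the function name is illustrative.

```python
import numpy as np

def real_cepstrum(x, n_fft):
    """Real cepstrum: inverse DFT of the log-magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.fft(x, n_fft)))
    return np.fft.ifft(log_mag).real

rng = np.random.default_rng(0)
g = rng.standard_normal(64)      # stand-in for the source
v = rng.standard_normal(32)      # stand-in for the filter impulse response
s = np.convolve(g, v)            # time-domain convolution, length 95

n_fft = 256                      # >= len(s), so the DFT product matches linear convolution
c_s = real_cepstrum(s, n_fft)
c_sum = real_cepstrum(g, n_fft) + real_cepstrum(v, n_fft)
# c_s and c_sum agree up to rounding: the cepstrum of a convolution
# is the sum of the cepstra of its factors.
```

Separating the two terms then reduces to deciding which quefrency range belongs to which factor, which is what liftering does.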

A combination of these two ideas with the LP modelling approach was introduced in (Kopec et al., 1977) under the name of homomorphic prediction (HoP). This combination consisted in calculating the real cepstrum of the voice signal:

$$\tilde{S}[q] = \mathcal{F}^{-1}\left\{ \log \left| S\left(e^{j\omega}\right) \right| \right\} = \tilde{G}[q] + \tilde{V}[q] + \tilde{R}[q], \tag{5}$$

eliminating the part of the cepstrum corresponding to the harmonic structure of the voice signal by short-pass liftering:

$$\tilde{S}^{SP}[q] = \begin{cases} \tilde{S}[q] & \text{if } q < q_{th} \\ 0 & \text{if } q \geq q_{th} \end{cases} \tag{6}$$

and fitting an all-pole LP model to the remaining minimum-phase signal, which can be recovered as:

$$V(z)\big|_{z=e^{j\omega}} \approx e^{\mathcal{F}\left\{ \tilde{S}^{SP}[q] \right\}}. \tag{7}$$

The Fourier transform $\mathcal{F}$ and its inverse $\mathcal{F}^{-1}$ can be implemented with the DFT if a sufficient number of zeroes is appended to the original voice signal (Childers et al., 1977). A very similar approach has more recently been used in (Rahman and Shimamura, 2005) to estimate the resonances of the vocal tract for high-pitched voices.
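The chain of real cepstrum, short-pass liftering and exponentiation in (5) to (7) can be sketched with the DFT. Two assumptions are made here: the lifter is applied symmetrically (quefrency q and its DFT mirror), and the result is returned as a magnitude envelope rather than as a minimum-phase signal. The function name, the zero-padding factor and the random test frame are illustrative.

```python
import numpy as np

def short_pass_envelope(x, q_th, n_fft=None):
    """Spectral envelope obtained by keeping only quefrencies below q_th."""
    n_fft = n_fft or 4 * len(x)                       # zero-padding, cf. Childers et al.
    log_mag = np.log(np.abs(np.fft.fft(x, n_fft)) + 1e-12)
    cep = np.fft.ifft(log_mag).real                   # real cepstrum, as in (5)

    lifter = np.zeros(n_fft)
    lifter[:q_th] = 1.0                               # keep q < q_th, as in (6)
    lifter[n_fft - q_th + 1:] = 1.0                   # and the mirrored quefrencies
    log_env = np.fft.fft(cep * lifter).real           # back to the log-spectral domain
    return np.exp(log_env)                            # magnitude envelope, cf. (7)

rng = np.random.default_rng(0)
frame = rng.standard_normal(1024)                     # arbitrary test frame
envelope = short_pass_envelope(frame, q_th=59)
```

Two limiting cases help to check the behaviour: with `q_th=1` only the zeroth cepstral coefficient survives and the envelope is flat, while a threshold above half the DFT length keeps everything and returns the original magnitude spectrum.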

While such an approach for separating the effects

of glottis and vocal tract has been applied to the iden-

tiﬁcation of the vocal tract response, both in (Kopec

et al., 1977) and (Rahman and Shimamura, 2005), it

has not been tested yet for glottal-source recovery. In

fact, the liftering operation in (6) separates spectral envelope from harmonic structure, provided that the liftering threshold $q_{th}$ is chosen to be less than, but close to, the fundamental period of the signal. However, it does not differentiate between the component of the spectral envelope of the signal due to the vocal tract and that due to the glottal waveform itself. The glottal waveform is expected to provide a smooth decay of the spectral envelope (as illustrated in figure 4), while the vocal tract is expected to produce peaks associated with its resonant frequencies. Therefore, in the cepstrum, the lowest values of q in (5) should be associated with the glottal waveform and the highest values below the fundamental period should be associated with the effect of the vocal tract.

Thus, the following algorithm, based on HoP, is proposed for inverse filtering:

1. Apply an integrating filter (3) to the voice signal so as to compensate for the lip radiation R(z). This produces a radiation-compensated signal $S_R(z)$.

2. Calculate the cepstrum of $S_R(z)$ as in (5):

$$\tilde{S}_R[q] = \mathcal{F}^{-1}\left\{ \log \left| S_R\left(e^{j\omega}\right) \right| \right\} \approx \tilde{G}[q] + \tilde{V}[q]. \tag{8}$$

3. Choose an appropriate value of $q_{th}$ to perform long-pass liftering:

$$\tilde{S}_R^{LP}[q] = \begin{cases} 0 & \text{if } q \leq q_{th} \\ \tilde{S}_R[q] & \text{if } q > q_{th} \end{cases}. \tag{9}$$

4. Recover the minimum-phase signal:

$$S_R^{LP}(z)\big|_{z=e^{j\omega}} = e^{\mathcal{F}\left\{ \tilde{S}_R^{LP}[q] \right\}}. \tag{10}$$

5. Fit a discrete all-pole LP model to $S_R^{LP}(z)$. Such a model is an estimate of V(z).

6. Use the estimate of V(z) to perform inverse filtering on $S_R(z)$, hence obtaining an estimate of the glottal flow derivative G(z).
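The six steps above can be sketched end to end as follows. This is an illustration under stated assumptions and not the authors' implementation: the all-pole fit is obtained from the autocorrelation of the liftered power spectrum instead of DAP, the lifter is applied symmetrically in the DFT domain, and the default parameter values reuse those listed in table 1. The function names and the synthetic input are hypothetical.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_from_power_spectrum(power, order):
    """All-pole fit to a sampled power spectrum via its autocorrelation (not DAP)."""
    r = np.fft.ifft(power).real[:order + 1]
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.concatenate(([1.0], np.linalg.solve(r[idx], -r[1:])))

def hop_inverse_filter(s, fs, rho=0.92, t=None, q_th=None):
    """Sketch of HoP-based inverse filtering; returns (source estimate, tract LP)."""
    t = t or int(np.ceil(fs / 1000))          # vocal tract model order
    q_th = q_th or int(np.ceil(fs / 300))     # liftering threshold in samples
    n_fft = 1 << int(np.ceil(np.log2(4 * len(s))))

    s_r = lfilter([1.0], [1.0, -rho], s)                      # step 1: integrate
    log_mag = np.log(np.abs(np.fft.fft(s_r, n_fft)) + 1e-12)
    cep = np.fft.ifft(log_mag).real                           # step 2: real cepstrum
    lifter = np.zeros(n_fft)
    lifter[q_th + 1:n_fft - q_th] = 1.0                       # step 3: keep q > q_th
    log_env = np.fft.fft(cep * lifter).real
    power = np.exp(2.0 * log_env)                             # step 4: squared magnitude
    a = lpc_from_power_spectrum(power, t)                     # step 5: all-pole fit
    source = lfilter(a, [1.0], s_r)                           # step 6: inverse filter
    return source, a

# Hypothetical synthetic voice: pulse train through assumed G(z), V(z) and R(z).
fs = 17640
rng = np.random.default_rng(0)
excitation = np.zeros(fs // 10)
excitation[::86] = 1.0
den = np.convolve([1.0, -1.9, 0.9025], [1.0, -1.8, 0.94])
voice = lfilter([1.0, -1.0], den, excitation) + 1e-2 * rng.standard_normal(fs // 10)

source_est, tract_lp = hop_inverse_filter(voice, fs)
```

Unlike the IAIF sketch, no glottal model order appears among the parameters: the only tuning values are ρ, the vocal tract order and the liftering threshold.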

4 RESULTS

The two inverse filtering algorithms described in section 3, namely IAIF and HoP, have been applied to the six synthetic voice signals mentioned in section 2. In all six cases, signal intervals with lengths equal to 717 ms have been chosen, skipping the initial transient phase of the simulation. The outputs of both IAIF and HoP have undergone an additional integration to obtain the normalised glottal flow from its derivative. Such integration has been performed with the integrating filter $H_R(z)$ in (3). The values for the rest of the simulation parameters are summarised in table 1. The value of ρ has been tuned manually between 0.9 and 0.99, values of t and g are as recommended in (Alku, 1992) and (Gómez-Vilda et al., 2008) and the value of the liftering threshold $q_{th}$ has been chosen considering the widest vocal tract formant bandwidths (around 300 Hz) reported in (Akande and Murphy, 2005). As for the input signals, all of them have been normalised to have zero mean and unit standard deviation.

Figure 6: Simulated and estimated vocal tract frequency responses for normal (up) and nodular (down) voices. Each panel plots the simulated vocal tract, the HoP vocal tract estimate and the IAIF vocal tract estimate against f (Hz).

Table 1: Values for simulation parameters.

Parameter                               Value
ρ                                       0.92
Vocal-tract model order t               $\lceil f_s\,(\mathrm{kHz}) \rceil = 18$
Glottal model order g (IAIF)            2
Liftering threshold $q_{th}$ (HoP)      $\lceil f_s / 300 \rceil = 59$
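The two derived values in table 1 follow directly from the sampling frequency of 17640 Hz given in section 2.2. The short check below also expresses the liftering threshold in time units and compares it with the pitch period of the 206 Hz test signals; variable names are illustrative.

```python
import math

fs = 17640                                  # sampling frequency in Hz (section 2.2)
f0 = 206                                    # fundamental frequency in Hz (section 2.2)

t_order = math.ceil(fs / 1000)              # ceil(17.64) = 18 poles
q_th = math.ceil(fs / 300)                  # ceil(58.8)  = 59 samples

q_th_ms = 1000 * q_th / fs                  # threshold expressed in milliseconds
pitch_period_samples = fs / f0              # about 85.6 samples

# The threshold lies below the pitch period of the test signals.
threshold_below_pitch = q_th < pitch_period_samples
```

A threshold of 59 samples corresponds to roughly 3.3 ms, which is the reciprocal of the 300 Hz bandwidth used to select it.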

Figure 6 shows the amplitude of the frequency response of the simulated vocal tract corresponding to the vowel /a/ and its estimates provided by the IAIF and HoP algorithms for both normal and nodular voices. Qualitatively, it can be noticed that while the IAIF algorithm allows a better identification of the resonant frequencies of the vocal tract, HoP provides a better fit to its locally averaged shape. This is especially true for frequencies below 2000 Hz.

Figure 7: Simulated and estimated glottal flows for normal (up) and nodular (down) voices. Each panel plots the simulated glottal flow, the HoP estimate and the IAIF estimate (normalised amplitude) against t (s).

The simulated and estimated glottal flows for 30 ms segments of the signals corresponding to the vocal tract /o/ are plotted in figure 7. Qualitatively, both for normal and nodular voices, HoP tends to provide a better fit to the original signal than IAIF. In order to provide a quantitative comparison of the performance of both approaches, the available signals have been randomly split into segments with durations uniformly distributed between 70 ms and 210 ms. For each segment, both the IAIF and HoP algorithms have been applied to produce glottal flow estimates $\hat{g}_j[n]$, where $j = 1 \ldots J$ stands for the processed voice segment. Afterwards, such estimates have also been normalised and time aligned with the corresponding simulated glottal flows $g_j[n]$. The experiment has been repeated 60 times with each signal, with different starting points for the voice segments, and an average signal-to-error measure has then been computed as follows:

$$\mathrm{SER} = -20 \cdot \log \left( \frac{1}{J} \sum_{j=1}^{J} \sqrt{ \frac{E\left\{ \left( \hat{g}_j[n] - g_j[n] \right)^2 \right\}}{E\left\{ g_j^2[n] \right\}} } \right) \tag{11}$$

where E{·} means time averaging and J is the total number of segments along the 60 iterations. The obtained values of SER for the six test signals are compiled in table 2. The performance of HoP for this particular set-up and the used test signals is significantly better than that of IAIF for the normal voices and it is roughly similar for the pathological voices.
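The random segmentation and a signal-to-error measure of the kind defined in (11) can be sketched as follows. The measure is read here as the average over segments of the ratio between the RMS error and the RMS of the reference, which is an assumption about the exact form of (11). Normalisation and time alignment are left to the caller, and all names are illustrative.

```python
import numpy as np

def random_segments(n_samples, fs, rng, min_s=0.07, max_s=0.21):
    """Split [0, n_samples) into consecutive segments of random duration."""
    bounds, start = [], 0
    while True:
        length = int(rng.uniform(min_s, max_s) * fs)   # duration in samples
        if start + length > n_samples:
            break
        bounds.append((start, start + length))
        start += length
    return bounds

def ser_db(estimates, references):
    """Average signal-to-error ratio (dB) over pairs of aligned segments."""
    ratios = [
        np.sqrt(np.mean((e - g) ** 2) / np.mean(g ** 2))
        for e, g in zip(estimates, references)
    ]
    return -20 * np.log10(np.mean(ratios))

fs = 17640
rng = np.random.default_rng(0)
n_samples = int(0.717 * fs)                      # 717 ms of signal, as in the paper
reference = np.sin(2 * np.pi * 206 * np.arange(n_samples) / fs)
estimate = 1.1 * reference                       # hypothetical estimate with 10 % error

bounds = random_segments(n_samples, fs, rng)
ser = ser_db([estimate[a:b] for a, b in bounds],
             [reference[a:b] for a, b in bounds])
```

With an error amplitude of one tenth of the reference in every segment, the measure evaluates to 20 dB, which gives a sense of scale for the values in table 2.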

Table 2: Values in dB of the SER, as defined in (11).

Vocal tract model         /a/      /o/      /u/
Normal voice    HoP       13.09    11.97    12.29
                IAIF      10.69     7.40     9.29
Nodular voice   HoP        1.25     5.44     5.44
                IAIF       2.37     4.37     3.72

An insight into the reasons for the reported results can be found by analysing the spectral behaviour of the glottal waveform estimation error. Specifically, figure 8 shows the plot of:

$$E(\omega) = 20 \cdot \log \left( \frac{1}{J} \sum_{j=1}^{J} \left| \hat{G}_j\left(e^{j\omega}\right) - G_j\left(e^{j\omega}\right) \right| \right) \tag{12}$$

for both normal and nodular voices and for the vocal tract /a/, for which HoP has the worst comparative performance. For normal voice, the HoP estimation error is lower than that of IAIF for all frequencies except for an interval around the first harmonic, which is very close to the peak provided by the 2-pole LP model of the glottal signal (recall figure 4). It should be remembered that the 2-pole LP model in this case was not able to match the decay of the spectral envelope of the signal, hence the wide interval over which HoP provides a better estimate of the glottal flow. In contrast, the fit between the decay of the 2-pole LP model and the signal was much better for the simulated pathological signal (recall figure 4, again). This results in a better performance of IAIF except around the position of the spectral peak of the 2-pole model (500 to 1100 Hz) and for very low frequencies (less than 200 Hz). Therefore, the comparative performance of HoP versus IAIF seems to be directly affected by the goodness of the fit between the spectral envelope of the glottal flow and the all-pole LP model used to model that flow. In this case, a good fit, as is the case for the nodular voice, results in a similar performance of IAIF and HoP, while a divergence results in HoP outperforming IAIF.
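A spectral error curve of the kind plotted in figure 8 can be sketched as below. The averaging over segments (division by the number of segments) is an assumption about the exact form of (12), and the test signals are hypothetical: the estimate differs from the reference only by an added 1 kHz tone, so the error should concentrate at that frequency.

```python
import numpy as np

def spectral_error_db(estimates, references, n_fft):
    """Average modulus (dB) of the DFT difference between estimates and references."""
    diffs = [
        np.abs(np.fft.rfft(e, n_fft) - np.fft.rfft(g, n_fft))
        for e, g in zip(estimates, references)
    ]
    return 20 * np.log10(np.mean(diffs, axis=0) + 1e-12)

fs = 17640
n = fs                                           # one second, so DFT bins are 1 Hz apart
t = np.arange(n) / fs
rng = np.random.default_rng(0)
reference = rng.standard_normal(n)               # stand-in for a simulated glottal flow
estimate = reference + 0.5 * np.sin(2 * np.pi * 1000 * t)

err_db = spectral_error_db([estimate], [reference], n_fft=n)
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
worst_hz = freqs[np.argmax(err_db)]              # frequency with the largest error
```

Reading the error per frequency, rather than as a single SER figure, is what allows the discrepancies to be related to the spectral peak and decay of the 2-pole glottal model.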

Figure 8: Estimation error in spectral domain for normal (up) and nodular (down) voices. Each panel plots the average spectral estimation error of HoP and IAIF against frequency (Hz).

5 CONCLUSIONS

Within this paper, the applicability of homomorphic prediction to inverse filtering has been analysed. This approach, while it has not been favoured much by researchers in recent years, was first proposed in (Kopec et al., 1977) for the estimation of the resonance structure of the vocal tract. Herein, it has been shown that a modification of such an algorithm, using long-pass liftering in the cepstral domain instead of short-pass liftering, provides an alternative method for the identification of the glottal source in voice signals. This method shares with iterative adaptive inverse filtering its capability for working pitch-asynchronously, which is a relevant issue for the processing of pathological voices. Moreover, though not directly studied in this paper, previously reported analyses indicate that HoP may outperform IAIF for high-pitched voices (Rahman and Shimamura, 2005) (Walker and Murphy, 2007) in pitch-asynchronous schemes.

The performance of both HoP and IAIF has been analysed quantitatively by using synthetic voice signals provided by a biomechanical simulator of voice production. The reported results show that both algorithms provide a similar performance when the glottal signal fits well the all-pole LP model assumed in IAIF, while HoP performs better when this is not the case. Therefore, it can be concluded that HoP provides a scheme for inverse filtering that is adequate for pathological voices, since it works pitch-asynchronously, and that is more robust against variability in the spectral envelope of the glottal signal, since it does not impose assumptions related to its spectral decay.


ACKNOWLEDGEMENTS

This research was carried out within a project funded

by the Ministry of Science and Technology of

Spain (TEC2006-12887-C02) and within the “José

Castillejo” program for the mobility of researchers.

The work has also been done in the context of Euro-

pean COST action 2103.

REFERENCES

Akande, O. O. and Murphy, P. J. (2005). Estimation of the

vocal tract transfer function with application to glottal

wave analysis. Speech Communication, 46(1):15 – 36.

Alku, P. (1992). An automatic method to estimate the time-

based parameters of the glottal pulseform. In IEEE In-

ternational Conference on Acoustics, Speech and Sig-

nal Processing, volume 2, pages 29–32.

Arias, M. and Bäckström, T. (2008). TKK aparat. http://aparat.sourceforge.net (visited May 2009).

Childers, D., Skinner, D., and Kemerait, R. (1977). The

cepstrum: A guide to processing. Proceedings of the

IEEE, 65(10):1428–1443.

de Oliveira-Rosa, M., Pereira, J., and Grellet, M. (2000).

Adaptive estimation of residue signal for voice pathol-

ogy diagnosis. IEEE Transactions on Biomedical En-

gineering, 47(1):96–104.

El-Jaroudi, A. and Makhoul, J. (1991). Discrete all-pole

modeling. IEEE Transactions on Signal Processing,

39(2):411–423.

Fu, Q. and Murphy, P. (2006). Robust glottal source estima-

tion based on joint source-ﬁlter model optimization.

IEEE Transactions on Audio, Speech and Language

Processing, 14(2):492–501.

Gómez-Vilda, P., Fernández-Baillo, R., Nieto, A., Díaz, F., Fernández-Camacho, F. J., Rodellar, V., Álvarez, A., and Martínez, R. (2007). Evaluation of voice pathology based on the estimation of vocal fold biomechanical parameters. Journal of Voice, 21(4):450–476.

Gómez-Vilda, P., Fernández-Baillo, R., Rodellar-Biarge, V., Nieto-Lluis, V., Álvarez-Marquina, A., Mazaira-Fernández, L. M., Martínez-Olalla, R., and Godino-Llorente, J. I. (2008). Glottal source biometrical signature for voice pathology detection. Speech Communication, In Press.

Kob, M. (2002a). Physical Modeling of the Singing Voice. PhD thesis, Fakultät für Elektrotechnik und Informationstechnik - RWTH Aachen. Logos-Verlag.

Kob, M. (2002b). Vox - a time-domain model for the singing voice. http://www.akustik.rwth-aachen.de/~malte/vox/index.html.en (visited May 2009). Computer software.

Kob, M., Alhuser, N., and Reiter, U. (1999). Time-domain model of the singing voice. In Proceedings of the 2nd COST G-6 Workshop on Digital Audio Effects, Trondheim (Norway).

Kopec, G., Oppenheim, A., and Tribolet, J. (1977). Speech analysis by homomorphic prediction. IEEE Transactions on Acoustics, Speech and Signal Processing, 25(1):40–49.

Mathur, S., Story, B. H., and Rodriguez, J. J. (2006).

Vocal-tract modeling: fractional elongation of seg-

ment lengths in a waveguide model with half-sample

delays. IEEE Transactions on Audio Speech and Lan-

guage Processing, 14(5):1754–1762.

Moore, E. and Torres, J. (2008). A performance assess-

ment of objective measures for evaluating the quality

of glottal waveform estimates. Speech Communica-

tion, 50(1):56–66.

Oppenheim, A. and Schafer, R. W. (1968). Homomorphic

analysis of speech. IEEE Transactions on Audio and

Electroacoustics, 16(2):221–226.

Rabiner, L. R. and Schafer, R. W. (1978). Digital processing

of speech signals. Prentice-Hall.

Rahman, M. S. and Shimamura, T. (2005). Formant fre-

quency estimation of high-pitched speech by homo-

morphic prediction. Acoustical science and technol-

ogy, 26(6):502–510.

Sapienza, C. and Hoffman-Ruddy, B. (2009). Voice Disor-

ders. Plural Publishing.

Story, B. H. and Titze, I. R. (1998). Parameterization

of vocal tract area functions by empirical orthogonal

modes. Journal of Phonetics, 26(3):223–260.

Walker, J. and Murphy, P. (2007). A review of glottal

waveform analysis. In Stylianou, Y., Faundez-Zanuy,

M., and Esposito, A., editors, Progress in Nonlin-

ear Speech Processing, volume 4391/2007 of Lecture

Notes in Computer Science, pages 1–21. Springer-

Verlag.

Wong, D., Markel, J., and Jr., A. G. (1979). Least squares

glottal inverse ﬁltering from the acoustic speech wave-

form. IEEE Transactions Acoustics, Speech and Sig-

nal Processing, 27(4):350–355.
