BIO-INSPIRED AUDITORY PROCESSING
FOR SPEECH FEATURE ENHANCEMENT
HariKrishna Maganti and Marco Matassoni
Fondazione Bruno Kessler, Center for Information Technology, IRST, via Sommarive 18, 38123 Povo, Trento, Italy
Keywords:
Bio-inspired auditory processing, Gammatone filtering, Modulation spectrum, Reverberation, Automatic
speech recognition.
Abstract:
Mel-frequency cepstrum based features have traditionally been used for speech recognition in a number of applications, as they naturally provide high recognition accuracies. However, these features are not very robust in noisy acoustic conditions. In this article, we investigate the use of bio-inspired auditory features emulating the processing performed by the cochlea to improve robustness, particularly against environmental reverberation. Our methodology first extracts noise-resistant features by gammatone filtering, which emulates the frequency resolution of the cochlea, and then applies long-term modulation spectral processing, which preserves the speech intelligibility in the signal. We compare and discuss the features based upon their performance on the Aurora-5 meeting recorder digit task, recorded with four different microphones in hands-free mode in a real meeting room. The experimental results show that the proposed features provide considerable improvements with respect to state-of-the-art feature extraction techniques.
1 INTRODUCTION
A significant trend in ubiquitous computing is to enable the user to communicate and interact naturally with applications. Speech is an appealing mode of communication for such applications. Human-machine interaction using automatic speech processing technologies is a diversified research area, which has been investigated actively (Kellermann, 2006; Droppo and Acero, 2008).
Speech acquisition, processing and recognition in non-ideal acoustic environments are complex tasks due to the presence of unknown additive noise, reverberation and interfering speakers. Additive noise from interfering noise sources, and convolutive noise arising from the acoustic environment and transmission channel characteristics, contribute to a degradation of performance in speech recognition systems. This article addresses the robustness of automatic speech recognition (ASR) systems against convolutive noise by modeling the processing performed by the cochlea in the human auditory system.
The influence of additive background noise on the
speech signal can be expressed as
y(n) = x(n) + n(n) (1)
where y(n) is the degraded speech signal, x(n) rep-
resents the clean signal, n(n) is the additive noise,
which is uncorrelated with the speech signal and un-
known. Different techniques have been proposed based on voice activity detection based noise estimation, minimum statistics noise estimation, histogram and quantile based methods, and estimation of the a posteriori and a priori signal-to-noise ratio (Woelfel and McDonough, 2009). In (Ephraim and Cohen, 2006), various approaches to speech enhancement based on noise estimation and spectral subtraction are discussed. Apart from stationary background noise, another important source of degradation is the reverberation produced by the acoustic environment. The speech signal acquired in a reverberant room can be modeled as the convolution of the speech signal with the room impulse response,

y(n) = x(n) * h(n) (2)

where y(n) is the degraded speech signal, x(n) represents the clean signal and h(n) is the impulse response of the room. The impulse response depends upon the distance between the speaker and the microphone, and on room conditions such as movement of people in the room, clapping, opening or closing doors, etc.
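As a concrete illustration, the following sketch applies both degradation models of Eqs. (1) and (2) to a synthetic signal. The exponentially decaying random sequence standing in for h(n) is an assumption, a common toy surrogate for a measured room impulse response.

```python
# Minimal sketch of the additive (Eq. 1) and convolutive (Eq. 2) models.
import numpy as np
from scipy.signal import fftconvolve

fs = 8000                                  # sampling rate (Hz)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)            # stand-in for the clean signal x(n)

# Eq. (1): additive noise n(n), uncorrelated with the speech signal
y_additive = x + 0.05 * np.random.randn(len(x))

# Eq. (2): convolution with a toy room impulse response h(n)
# (white noise with an exponential decay reaching -60 dB at rt60 seconds)
rt60 = 0.4
n_h = int(rt60 * fs)
h = np.random.randn(n_h) * np.exp(-6.9 * np.arange(n_h) / n_h)
h /= np.abs(h).sum()                       # normalize to avoid clipping
y_reverb = fftconvolve(x, h)[:len(x)]
```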
Thus extracting robust features which can handle var-
ious room impulse responses is a complex and chal-
lenging task. A variant of spectral subtraction has
been proposed in (Habets, 2004) to enhance speech
degraded by reverberation.
In general, to improve robustness to noisy speech, processing can be performed at the signal, feature or model level. Speech enhancement techniques aim at improving the quality of the speech signal captured through a single microphone or a microphone array (Omologo et al., 1998; Martin, 2001). Robust acoustic features attempt to represent parameters less sensitive to noise by modifying the extracted features; common techniques include cepstral mean normalization (CMN), cepstral mean subtraction and variance normalization (CMSVN) and relative spectral (RASTA) filtering (Droppo and Acero, 2008; Hermansky and Morgan, 1994). Model adaptation approaches modify the acoustic model parameters to better fit the observed speech features (Omologo et al., 1998; Gales and Young, 1995).
The human auditory system is far more adept at recognizing speech in noise. Auditory modeling, which simulates some properties of the human auditory system, has been applied to speech recognition systems to enhance their robustness. The information coded in auditory spike trains and the information transfer principles found in the auditory pathway are used in (Holmberg et al., 2005; Deng and Sheikhzadeh, 2006). Neural synchrony is used for creating noise-robust representations of speech (Deng and Sheikhzadeh, 2006): the model parameters are fine-tuned to conform to the population discharge patterns in the auditory nerve, which are then used to derive estimates of the spectrum on a frame-by-frame basis. This was extremely effective in noise and improved ASR performance dramatically. Various auditory processing based approaches have been proposed to improve robustness (Ghitza, 1988; Seneff, 1988; Dau et al., 1996); in particular, the works described in (Deng and Sheikhzadeh, 2006; Flynn and Jones, 2006) focused on the additive noise problem. Further, in (Kleinschmidt et al., 2001) a model of auditory perception (PEMO) developed by Dau et al. (Dau et al., 1996) is used as a front end for ASR, performing better than standard MFCC on an isolated word recognition task. Principles and models relating to auditory processing, which attempt to model human hearing to some extent, have been applied to speech recognition in (Hermansky and Morgan, 1994; Hermansky, 1997).
An important aspect of a speech recognition system is to obtain an abstract representation of the highly redundant speech signal, which is achieved by frequency analysis. The cochlea and the hair cells of the inner ear perform spectral analysis to extract relevant features. Models for auditory spectrum analysis are based on filterbank designs, usually characterized by non-uniform frequency resolution and non-uniform bandwidth on a linear scale. Examples include popular speech analysis techniques, namely Mel-frequency cepstrum and perceptual linear prediction, which try to emulate human auditory perception. Another important approach is based on the gammatone filter bank, which is designed to model human cochlear filtering and has been shown to provide robustness in adverse noise conditions for speech recognition tasks (Flynn and Jones, 2006; Schlueter et al., 2006). In (Flynn and Jones, 2006), a gammatone based auditory front-end exhibited robust performance compared to traditional front-ends based on MFCC, PLP and the standard ETSI front-end. For large vocabulary speech recognition tasks, the performance of these features has been competitive with standard features like MFCC and PLP (Schlueter et al., 2006). Another important psychoacoustic property is the modulation spectrum of speech, which is important for speech intelligibility (Dau et al., 1996; Drullman et al., 1994). The relative prominence of slow temporal modulations differs across frequencies, mirroring the perceptual ability of the human auditory system. In particular, most of the useful linguistic information lies in the modulation frequency components in the range between 2 and 16 Hz, with a dominant component at around 4 Hz (Drullman et al., 1994; Kanedera et al., 1999; Hermansky, 1997). Modulation spectrum based features computed over longer windows have been effective in measuring speech intelligibility in noisy environments (Houtgast et al., 1980; Kingsbury, 1998).
In this work, an alternative approach to feature extraction based on psychoacoustic properties, combining gammatone filtering and the modulation spectrum of speech to preserve both quality and intelligibility, is presented. The gammatone frequency resolution reduces the ASR system's sensitivity to environmental reverberant signal attributes and improves the speech signal characteristics. Further, long-term modulation preserves the linguistic information in the speech signal, improving the accuracy of the system. The features derived from this combination are used to provide robustness, particularly in the context of a mismatch between training and testing reverberant environments. The studied features are shown to be reliable and robust to the effects of hands-free recordings in a reverberant meeting room. The effectiveness of the proposed features is demonstrated with experiments using real reverberant speech acquired through four different microphones. For comparison purposes, recognition results obtained using conventional features are also reported, and the proposed features proved to be efficient.
The paper is organized as follows: Section 2 gives an overview of the auditory inspired features, including gammatone filter bank processing and modulation spectrum processing. Section 3 describes the methodology for feature extraction. Section 4 presents the database description, experiments and results. Section 5 discusses the results. Finally, Section 6 concludes the paper.
2 FEATURE DESCRIPTION
In this section, a brief introduction and general overview of auditory features based on the gammatone filter bank and the modulation spectrum are presented.
2.1 Gammatone Filter Bank
The gammatone filter was first conceptualized by
Flanagan as a model of the basilar membrane dis-
placement in the human ear (Flanagan, 1960). Jo-
hannesma used it to approximate responses recorded
from the cochlear nucleus in the cat (Johannesma,
1972). de Boer and de Jongh used a gammatone func-
tion to model impulse responses from auditory nerve
fiber recordings, which have been estimated using a
linear reverse-correlation technique (de Boer, 1973).
Patterson et al. showed that the gammatone filter
also delineates psychoacoustically determined audi-
tory filters in humans (Patterson et al., 1987).
Gammatone filters are a linear approximation of the physiologically motivated processing performed by the cochlea (Slaney, 1993). They comprise a series of bandpass filters, whose impulse response is defined by:

g(t) = a t^(n-1) cos(2π f_c t + φ) e^(-2π b t) (3)

where n is the order of the filter, b is the bandwidth of the filter, a is the amplitude, f_c is the filter center frequency and φ is the phase.
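A minimal sketch of Eq. (3) follows. The fourth-order filter and the bandwidth setting b = 1.019 ERB(f_c), anticipating Eq. (4) below, are assumptions reflecting common practice rather than the exact parameterization used here.

```python
# Sketch of the gammatone impulse response of Eq. (3), assuming n = 4
# and the common bandwidth scaling b = 1.019 * ERB(fc).
import numpy as np

def gammatone_ir(fc, fs, n=4, duration=0.025, phase=0.0):
    """g(t) = a t^(n-1) cos(2*pi*fc*t + phi) exp(-2*pi*b*t)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)        # Eq. (4), Glasberg & Moore
    b = 1.019 * erb                                 # assumed bandwidth scaling
    g = t ** (n - 1) * np.cos(2 * np.pi * fc * t + phase) * np.exp(-2 * np.pi * b * t)
    return g / np.max(np.abs(g))                    # amplitude a folded into normalization

g = gammatone_ir(fc=1000.0, fs=8000)                # 25 ms response at 1 kHz
```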
Figure 1: Frequency response for the 32-channel gammatone filterbank (magnitude in dB over 0 - 4000 Hz).
The filter center frequencies and bandwidths are derived from the filter's Equivalent Rectangular Bandwidth (ERB) as detailed in (Slaney, 1993). In (Glasberg and Moore, 1990), Glasberg and Moore relate the center frequency and the ERB of an auditory filter as

ERB(f_c) = 24.7 (4.37 f_c / 1000 + 1) (4)
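The sketch below evaluates Eq. (4) and spaces filter center frequencies evenly on the ERB-rate scale. The ERB-rate formula used to invert the scale is the Glasberg and Moore form; its use here is an assumption consistent with Eq. (4), not necessarily the exact spacing of (Slaney, 1993).

```python
# Eq. (4) and ERB-rate-spaced center frequencies for a gammatone bank.
import numpy as np

def erb(fc):
    """Equivalent Rectangular Bandwidth of Eq. (4); fc in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def erb_space(f_low, f_high, num=32):
    """num center frequencies equally spaced on the ERB-rate scale
    E(f) = 21.4 * log10(4.37 * f / 1000 + 1)."""
    e_low = 21.4 * np.log10(4.37 * f_low / 1000.0 + 1.0)
    e_high = 21.4 * np.log10(4.37 * f_high / 1000.0 + 1.0)
    e = np.linspace(e_low, e_high, num)
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

centers = erb_space(100.0, 4000.0)   # 32 channels up to the Nyquist limit of 8 kHz audio
bandwidths = erb(centers)
```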
The output of the m-th gammatone filter, X_m, can be expressed as

X_m(n) = x(n) * h_m(n) (5)

where h_m(n) is the impulse response of the filter.
The frequency response of the 32-channel gammatone filterbank is shown in Fig. 1.
2.2 Modulation Spectrum
The temporal evolution of speech spectral parameters, which describes slow variations in energy, carries important information associated with phonetic segments (Greenberg, 1997). The low-frequency modulations encode information pertaining to syllables, by virtue of the variation in the modulation pattern across the acoustic spectrum. Dudley showed that the essential information in speech is embedded in modulation patterns below 25 Hz distributed over as few as 10 discrete spectral channels (Dudley, 1939). Further, studies by Drullman et al. confirmed the importance of amplitude modulation frequencies for speech intelligibility, with modulation frequencies below 16 Hz in particular contributing to intelligibility (Drullman et al., 1994). Houtgast and Steeneken demonstrated that modulation frequencies between 2 and 10 Hz can be used as an objective measure of speech intelligibility for assessing speech quality over a wide range of acoustic environments (Houtgast et al., 1980).

Long-term modulations examine the slow temporal evolution of the speech energy with time windows in the range of 160 - 800 ms, contrary to conventional short-term analysis with time windows of 10 - 30 ms, which captures rapid changes in the speech signal. Generally, the modulation spectrum is computed as follows: the speech signal X(k) is segmented into frames by a window function w(k, t), where t is the frame number. The short-time Fourier transform of the windowed speech signal is calculated as

Y(t, f) = Σ_{i=-∞}^{∞} X(f - i) W(i, t) (6)
Figure 2: Processing stages of the gammatone modulation spectral feature.

The modulation spectrum Y_m(f, g) is obtained by applying a Fourier transform to the running spectra, obtained by taking the absolute values |Y(t, f)| at each frequency, expressed as

Y_m(f, g) = FT[ |Y(t, f)| ], t = 1, ..., T (7)

where T is the total number of frames and g is the modulation frequency. The relative prominence of slow temporal modulations differs across frequencies, mirroring the perceptual ability of the human auditory system. Most of the useful linguistic information is in the modulation frequency components in the range between 2 and 16 Hz, with a dominant component at around 4 Hz (Drullman et al., 1994; Kanedera et al., 1999). In (Kanedera et al., 1999), it was shown that for noisy environments the components of the modulation spectrum below 2 Hz and above 10 Hz are less important for speech intelligibility; in particular, the band below 1 Hz contains mostly information about the environment. Therefore the recognition performance can be improved by suppressing this band during feature extraction.
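A minimal sketch of Eqs. (6) and (7) for a single frequency band follows. The 4 Hz sinusoidal envelope is a synthetic stand-in for a syllabic energy trajectory, and the window and frame-rate choices are assumptions.

```python
# Modulation spectrum of one band: the Fourier transform of the
# per-frame magnitude trajectory |Y(t, f)|, as in Eq. (7).
import numpy as np

frame_rate = 100.0                           # frames per second (10 ms hop)
t = np.arange(200) / frame_rate              # 2 s of frame-level trajectory
band_mag = 1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)   # |Y(t, f)| with 4 Hz modulation

spectrum = np.abs(np.fft.rfft(band_mag * np.hamming(len(band_mag))))
mod_freqs = np.fft.rfftfreq(len(band_mag), d=1.0 / frame_rate)

# keep only the perceptually important 2-16 Hz modulation components
keep = (mod_freqs >= 2.0) & (mod_freqs <= 16.0)
print(mod_freqs[keep][np.argmax(spectrum[keep])])    # peaks near 4 Hz
```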
Comparative waveform, spectrogram, gammatonegram and modulation spectrum density plots of the same speech utterance, clean and corrupted with convolutive and additive noise, are shown in Fig. 3. From the modulation spectrum density plots, some of the important characteristics of the modulation spectrum can be observed. The important information of speech is concentrated in the region between 2 Hz and 16 Hz; in particular, the components between 2 Hz and 4 Hz contain crucial information related to the variation of phonemes.
3 METHODOLOGY
The block schematic of the gammatone modulation spectrum based feature extraction technique is shown in Fig. 2. The speech signal first undergoes pre-emphasis, which flattens the frequency characteristics of the speech signal. The signal is then processed by a gammatone filterbank with 32 frequency channels equally spaced on the ERB scale, as shown in Fig. 1. The impulse responses of the gammatone filterbank are similar to the impulse responses of the auditory system found in physiological measurements (de Boer, 1973). The filterbank is linear and does not consider nonlinear effects such as the level-dependent upward spread of masking and combination tones. The computationally efficient gammatone filter bank implementation described in (Ellis, 2010) is used. The gammatone filter bank transform is computed over L ms segments shifted by n ms. The resulting log-magnitude coefficients are then decorrelated by applying a discrete cosine transform (DCT). The computations are made over the entire incoming signal, resulting in a sequence of energy magnitudes for each band sampled at 1/n Hz. Then frame-by-frame analysis is performed and an N-dimensional parameter vector is obtained for each frame. The modulation spectrum of each coefficient, defined as the Fourier transform of its temporal evolution, is computed: in each band, the modulations of the signal are analyzed by computing an FFT over a P ms Hamming window shifted by p ms. The energies for the frequencies between 2 and 16 Hz, which represent the important components of the speech signal, are computed.
Figure 3: Waveform, spectrogram, gammatonegram, and modulation spectrum density plots for (a) clean, (b) reverberant and (c) additive noise corrupted speech.

For example, if the given signal x(t) is sampled at
8 kHz, a first-order high-pass pre-emphasis filter is applied and short segments of speech are extracted with a 25 ms rectangular window. The window is shifted by 10 ms, which corresponds to a frame rate of 100 Hz. Each speech frame is then processed by a 32-channel gammatone filterbank. The 32 logarithmic gammatone spectral values are transformed to the cepstral domain by means of a DCT. Thirteen cepstral coefficients C0 to C12 are calculated, with C0 replaced by the logarithm of the energy computed from the speech samples. The modulation spectrum of each coefficient (sampled at 100 Hz) is calculated with a 160 ms window, shifted by 10 ms. Thirteen further coefficients, C13 to C26, which are first-order derivatives, are extracted. The features are named gammatone filterbank modulation cepstral (GFMC) features.
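A compact end-to-end sketch of this pipeline is given below, under stated assumptions: fourth-order FIR gammatones with b = 1.019 ERB(f_c), ERB-rate-spaced channels, log frame energies per channel, and modulation bins covering roughly 2 - 16 Hz at the 100 Hz frame rate. The paper's actual filterbank follows (Ellis, 2010), so implementation details will differ.

```python
# Illustrative GFMC-style front-end: pre-emphasis, 32-channel gammatone
# filterbank, per-frame log energies, DCT to 13 cepstra, then a 160 ms
# modulation-spectrum analysis of each cepstral trajectory.
import numpy as np
from scipy.signal import lfilter, fftconvolve
from scipy.fftpack import dct

def gammatone_fir(fc, fs, n=4, duration=0.025):
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)     # assumed b = 1.019*ERB(fc)
    g = t ** (n - 1) * np.cos(2 * np.pi * fc * t) * np.exp(-2 * np.pi * b * t)
    return g / np.sqrt(np.sum(g ** 2))

def gfmc(x, fs=8000, n_chan=32, frame=0.025, hop=0.010, n_ceps=13, mod_win=16):
    x = lfilter([1.0, -0.97], [1.0], x)               # first-order pre-emphasis
    # center frequencies equally spaced on the ERB-rate scale, 100 Hz to Nyquist
    e_lo, e_hi = (21.4 * np.log10(4.37 * f / 1000.0 + 1.0) for f in (100.0, fs / 2.0))
    fcs = (10.0 ** (np.linspace(e_lo, e_hi, n_chan) / 21.4) - 1.0) * 1000.0 / 4.37
    flen, fhop = int(frame * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    logeng = np.empty((n_frames, n_chan))
    for c, fc in enumerate(fcs):                      # per-channel filtering, Eq. (5)
        y = fftconvolve(x, gammatone_fir(fc, fs))[:len(x)]
        logeng[:, c] = [np.log(np.sum(y[i * fhop:i * fhop + flen] ** 2) + 1e-10)
                        for i in range(n_frames)]
    ceps = dct(logeng, type=2, norm='ortho', axis=1)[:, :n_ceps]  # decorrelation
    # modulation spectrum of each cepstral trajectory over a 160 ms window
    ham = np.hamming(mod_win)
    mods = np.empty((n_frames - mod_win, n_ceps))
    for i in range(n_frames - mod_win):
        spec = np.abs(np.fft.rfft(ceps[i:i + mod_win] * ham[:, None], axis=0))
        mods[i] = spec[1:3].sum(axis=0)   # bins at 6.25 and 12.5 Hz, inside 2-16 Hz
    return ceps, mods

ceps, mods = gfmc(np.random.randn(2 * 8000))          # stand-in for a waveform
features = np.hstack([ceps[:len(mods)], mods])        # 13 cepstra + 13 modulation energies
```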
The same processing is also performed with the gammatone filterbank in Figure 2 replaced by a Mel filterbank, resulting in Mel-frequency modulation cepstral (MFMC) features. The performance of these features in comparison to the GFMC features is discussed in Section 4.
4 EXPERIMENTS AND RESULTS
To evaluate the performance, a full HTK based recognition system is used, with the HMM-based recognizer architecture specified for use with the Aurora-5 database (Hirsch, 2007). The training data is a down-sampled version of the clean TIDigits corpus at a sampling frequency of 8 kHz, comprising 8623 utterances. There are eleven whole-word HMMs, each with 16 states and four Gaussian mixtures per state. The sil model has three states, each with four mixtures.
4.1 Convolutive Noise
The experiments are conducted on a subset of the Aurora-5 corpus: the meeting recorder digits. The data comprise real recordings made in hands-free mode in a meeting room at the International Computer Science Institute in Berkeley. The dataset consists of 2400 utterances from 24 speakers, with 7800 digits in total. The speech was captured with four different microphones placed in the middle of the table in the meeting room. The recordings contain only a small amount of additive noise, but exhibit the effects of hands-free recording in the reverberant room. There are four versions of each utterance, recorded with the four different microphones, with recording levels kept low.
Table 1 shows the results as % word accuracy for the meeting recorder digits recorded with the four different microphones, labeled 6, 7, E and F. The average performance over the four microphones is shown in the last column of the table. ETSI-2 corresponds to the standard advanced front-end described in (Hirsch, 2007). PLP and MFCC are the standard 39-dimensional perceptual linear prediction and Mel-frequency features along with their delta and acceleration derivatives. MFMC indicates the Mel frequency modulation spectral based cepstral features, where the first thirteen features are extracted in the traditional way and the rest are the modulation features (13) and their derivatives (13), derived as discussed in Section 3 except that the gammatone filterbank is replaced with a Mel filterbank. The GFCC features are extracted as reported in Section 3, with C0 to C12 being the corresponding cepstral coefficients. GFMC indicates the gammatone frequency modulation spectral based cepstral features, derived in the same way as GFCC but appended with the modulation spectral features corresponding to C13 to C26 and their corresponding derivatives, as discussed in Section 3.
Table 1: Word recognition accuracies (%) for different feature extraction techniques on four different microphones.

Channel    6     7     E     F   Average
ETSI-2   64.3  47.6  58.1  62.7    58.1
PLP      73.8  63.8  68.1  71.4    69.2
MFCC     75.8  64.7  67.3  75.9    70.9
MFMC     75.6  61.0  70.8  77.9    71.3
GFCC     86.0  79.0  78.3  84.2    81.9
GFMC     87.8  82.7  82.2  86.9    84.9
From Table 1, it is evident that the advanced ETSI front-end has the highest error rates compared to MFCC and PLP. This demonstrates that, in reverberant environments, the advanced ETSI front-end is not as effective as it is in the presence of additive background noise; it can be inferred that techniques designed for additive background noise removal are not suitable for handling reverberant conditions. The MFMC features perform better than MFCC, which in turn performs better than PLP. It can also be seen that the GFCC features were effective, performing better than all of the baseline systems (ETSI-2, PLP, MFCC). This is consistent with earlier studies which have shown that gammatone based features exhibit robust performance compared to MFCC, PLP and the ETSI front-end (Flynn and Jones, 2006; Schlueter et al., 2006).

It can also be observed that the performance of GFMC is the best among all the baselines and features compared, and is consistent across all the channels. Moreover, the combination of Mel filtering and modulation spectral features is not as beneficial as that of gammatone filtering with modulation spectral features. This clearly demonstrates the effectiveness of this combination of features in reverberant conditions.
4.2 Additive Noise
Further, to test the effectiveness in practical conditions containing additive noise, preliminary experiments were conducted on close-talk, hands-free office and hands-free living room data, with clean and 15 dB SNR additive-noise-corrupted signals. The data are from the Aurora-5 database, where each condition is simulated as a combination of additive noise and reverberation (Hirsch, 2007). Aurora-5 covers the effects of noise as it occurs in realistic application scenarios. In these experiments, hands-free speech input in an office and in a living room is considered. In Table 2, Ctalk, HFOffice, HFLroom, -, and A represent close-talk, hands-free office, hands-free living room, no additive noise, and additive noise at 15 dB SNR, respectively.
Table 2: Word recognition accuracies (%) for clean, hands-free office and hands-free living room conditions.

Feature   Ctalk        HFOffice     HFLroom
            -     A      -     A      -     A
MFCC      99.4  94.3   90.1  61.6   75.8  40.2
PLP       99.3  85.2   88.6  65.1   74.3  46.9
GFCC      99.5  88.1   89.1  65.6   73.8  48.9
GFMC      99.5  92.8   92.2  73.3   78.6  57.4
From Table 2, it can be observed that for all features the performance degrades significantly with additive noise compared to the noise-free case. Also, GFCC performs better than MFCC and PLP in the hands-free office and hands-free living room conditions. Moreover, in almost all cases GFMC performs better than GFCC, MFCC and PLP, indicating the effectiveness of these features in additive noise conditions.
5 DISCUSSION
The results in both Table 1 and Table 2 indicate that the gammatone frequency resolution was effective in reducing system sensitivity to reverberation and additive noise, and improved the speech signal characteristics. It can also be observed from Table 1 that the combination of gammatone filtering with modulation spectral features is more beneficial than the combination of Mel filtering with modulation spectral features. The emphasis on slow temporal changes in the spectral structure of long-term modulations preserved the required speech intelligibility information in the signal, which further improved the accuracy of the system. Thus, extracting features that model human hearing to some extent, mimicking the processing performed by the cochlea and particularly emulating its frequency resolution, proved beneficial for speech feature enhancement.
6 CONCLUSIONS
This paper has presented auditory inspired modulation spectral features for improving ASR performance in the presence of room reverberation. The proposed features were derived by emulating the processing performed by the cochlea to improve robustness, specifically gammatone frequency filtering and long-term modulations of the speech signal. The features were evaluated on the Aurora-5 meeting recorder digit task, recorded with four different microphones in hands-free mode in a real meeting room. The results were compared with the standard ETSI advanced front-end and conventional features, and show that the proposed features perform consistently better in terms of both robustness and reliability. The work also presented some preliminary results in an additive noise scenario, where the speech signal was corrupted with noise at 15 dB SNR under simulated hands-free office and hands-free living room conditions. The results are promising, performing better than the conventional features and indicating the effectiveness of these features in the additive noise scenario.

Our study raised a number of issues, including the study of auditory inspired techniques for improving standard additive noise removal techniques to deal with reverberant conditions. The gammatone filter implemented in this work is linear and does not consider nonlinear effects such as the level-dependent upward spread of masking and combination tones. In the future, we would like to investigate these issues to efficiently deal with real-world noisy speech, and to evaluate these features on large vocabulary tasks.
REFERENCES
Dau, T., Pueschel, D., and Kohlrausch, A. (1996). A quan-
titative model of the effective signal processing in the
auditory system. The Journal of the Acoustical Society
of America, 99:3615–3622.
de Boer, E. (1973). On the principle of specific coding.
Journal of Dynamic Systems, Measurement and Con-
trol, (Trans. ASME), 95:265– 273.
Deng, L. and Sheikhzadeh, H. (2006). Use of Temporal
Codes Computed From a Cochlear Model for Speech
Recognition. Psychology Press.
Droppo, J. and Acero, A. (2008). Environmental Robust-
ness. Springer Handbook of Speech Processing.
Drullman, R., Festen, J. M., and Plomp, R. (1994). Effect of
temporal envelope smearing on speech reception. The
Journal of the Acoustical Society of America, pages
1053–1064.
Dudley, H. (1939). Remarking speech. The Journal of the
Acoustical Society of America, 11:169–177.
Ellis, D. P. W. (2010). Gammatone-like spectrograms. http://www.ee.columbia.edu/~dpwe/resources/matlab/gammatonegram.
Ephraim, Y. and Cohen, I. (2006). Recent Advances in
Speech Enhancement. CRC Press.
Flanagan, J. (1960). Models for approximating basilar
membrane displacement. The Journal of the Acous-
tical Society of America, 32:937.
Flynn, R. and Jones, E. (2006). A comparative study of
auditory-based front-ends for robust speech recogni-
tion using the aurora 2 database. In IISC, IET Irish
Signals and Systems Conference.
Gales, M. J. F. and Young, S. (1995). A fast and flex-
ible implementation of parallel model combination.
In ICASSP’95, IEEE International Conference on
Acoustics, Speech, and Signal Processing, pages 133–
136. IEEE.
Ghitza, O. (1988). Temporal non-place information in the
auditory-nerve firing patterns as a front-end for speech
recognition in a noisy environment. Journal of Pho-
netics.
Glasberg, B. and Moore, B. (1990). Derivation of audi-
tory filter shapes from notched-noise data. Hearing
Research, 47:103–108.
Greenberg, S. (1997). On the origins of speech intelligi-
bility in the real world. In ESCA Workshop on Ro-
bust Speech Recognition for Unknown Communica-
tion Channels.
Habets, E. A. P. (2004). Single-channel speech dereverber-
ation based on spectral subtraction. In ProRISC’04,
15th Annual Workshop on Circuits, Systems and Sig-
nal Processing.
Hermansky, H. (1997). Auditory modeling in automatic
recognition of speech. In ECSAP.
Hermansky, H. and Morgan, N. (1994). Rasta processing
of speech. IEEE Transactions on Speech and Audio
Processing, 2(4):578–589.
Hirsch, H. G. (2007). Aurora-5 experimental framework
for the performance evaluation of speech recognition
in case of a hands-free speech input in noisy environ-
ments. http://aurora.hsnr.de/aurora-5/reports.html.
Holmberg, M., Gelbart, D., Ramacher, U., and Hemmert,
W. (2005). Automatic speech recognition with neural
spike trains. In Interspeech’05, 9th European Con-
ference on Speech Communication and Technology,
pages 1253–1256.
Houtgast, T., Steeneken, H. J. M., and Plomp, R. (1980).
Predicting speech intelligibility in rooms from the
modulation transfer function. Acustica, 46(1):60 –72.
Johannesma, P. I. (1972). The pre-response stimulus en-
semble of neurons in the cochlear nucleus. In Sym-
posium on Hearing Theory (Institute for Perception
Research), Eindhoven, Holland, pages 58 – 69.
Kanedera, N., Arai, T., Hermansky, H., and Pavel, M.
(1999). On the relative importance of various com-
ponents of the modulation spectrum for automatic
speech recognition. Speech Comm., 28:43–55.
Kellermann, W. (2006). Some current challenges in multi-
channel acoustic signal processing. The Journal of the
Acoustical Society of America, 120(5):3177–3178.
Kingsbury, B. E. D. (1998). Perceptually Inspired Signal-
processing Strategies for Robust Speech Recognition
in Reverberant Environments. PhD Thesis.
Kleinschmidt, M., Tchorz, J., and Kollmeier, B. (2001).
Combining speech enhancement and auditory fea-
ture extraction for robust speech recognition. Speech
Comm., 34:75–91.
Martin, R. (2001). Noise power spectral density estimation
based on optimal smoothing and minimum statistics.
IEEE Transactions on Speech and Audio Processing,
9(5):504–512.
Omologo, M., Svaizer, P., and Matassoni, M. (1998). En-
vironmental conditions and acoustic transduction in
hands-free speech recognition. Speech Comm., 25:75–
95.
Patterson, R. D., Nimmo-Smith, I., Holdsworth, J., and
Rice, P. (1987). An efficient auditory filterbank based
on the gammatone function. In meeting of the IOC
Speech Group on Auditory Modelling at RSRE.
Drullman, R., Festen, J. M., and Plomp, R. (1994). Effect of reducing slow temporal modulations on speech reception. The Journal of the Acoustical Society of America, 95:2670–2680.
Schlueter, R., Bezrukov, I., Wagner, H., and Ney, H. (2006).
Gammatone features and feature combination for
large vocabulary speech recognition. In ICASSP’06,
IEEE International Conference on Acoustics, Speech,
and Signal Processing.
Seneff, S. (1988). A joint synchrony/mean-rate model of
auditory speech processing. Journal of Phonetics,
16:55–76.
Slaney, M. (1993). An efficient implementation of the Patterson-Holdsworth auditory filterbank. In Apple Computers, Perception Group.
Woelfel, J. and McDonough, J. (2009). Distant Speech
Recognition. John Wiley, 1st edition.