BIO-INSPIRED AUDITORY PROCESSING FOR SPEECH FEATURE ENHANCEMENT

HariKrishna Maganti, Marco Matassoni

Abstract

Mel-frequency cepstrum based features have traditionally been used for speech recognition in a number of applications, as they generally provide high recognition accuracy. However, these features are not very robust in noisy acoustic conditions. In this article, we investigate the use of bio-inspired auditory features emulating the processing performed by the cochlea to improve robustness, particularly against environmental reverberation. Our method first extracts noise-resistant features by gammatone filtering, which emulates the frequency resolution of the cochlea, and then applies long-term modulation spectral processing, which preserves the speech intelligibility in the signal. We compare and discuss the features based on their performance on the Aurora-5 meeting recorder digit task, recorded with four different microphones in hands-free mode in a real meeting room. The experimental results show that the proposed features provide considerable improvements over state-of-the-art feature extraction techniques.
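The two stages described above can be sketched in a few lines of Python. This is an illustrative approximation, not the authors' implementation: it uses the standard 4th-order gammatone impulse response with Glasberg-Moore ERB bandwidths, and approximates the long-term modulation processing by low-pass smoothing each band envelope to retain only the slow modulations (below roughly 16 Hz) that carry speech intelligibility. The function names and the 16 Hz cutoff are assumptions for the sketch.

```python
# Hedged sketch of a gammatone + modulation-envelope front-end.
# Not the paper's exact pipeline: filter order, ERB formula, and the
# 16 Hz modulation cutoff are conventional choices assumed here.
import numpy as np
from scipy.signal import fftconvolve, butter, filtfilt

def erb(fc):
    """Equivalent rectangular bandwidth (Hz) of an auditory filter at fc."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.064, order=4):
    """Impulse response of a gammatone filter centred at fc (Hz)."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(fc)  # bandwidth scaling for a 4th-order filter
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gammatone_envelopes(x, fs, centre_freqs, mod_cutoff=16.0):
    """Per-band envelopes smoothed to keep only slow modulations
    (below mod_cutoff Hz), roughly the intelligibility-bearing range."""
    lp_b, lp_a = butter(2, mod_cutoff / (fs / 2))
    envs = []
    for fc in centre_freqs:
        band = fftconvolve(x, gammatone_ir(fc, fs), mode="same")
        envs.append(filtfilt(lp_b, lp_a, np.abs(band)))
    return np.array(envs)

if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    # 500 Hz carrier with a 4 Hz amplitude modulation (speech-like rate)
    x = np.sin(2 * np.pi * 500 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
    E = gammatone_envelopes(x, fs, [250, 500, 1000, 2000])
    print(E.shape)  # one smoothed envelope per band
```

In a full front-end these smoothed envelopes would be framed, compressed, and converted to cepstral-style coefficients before recognition; the sketch stops at the envelope stage, which is where the robustness to reverberation is argued to arise.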



Paper Citation


in Harvard Style

Maganti H. and Matassoni M. (2011). BIO-INSPIRED AUDITORY PROCESSING FOR SPEECH FEATURE ENHANCEMENT. In Proceedings of the International Conference on Bio-inspired Systems and Signal Processing - Volume 1: BIOSIGNALS, (BIOSTEC 2011) ISBN 978-989-8425-35-5, pages 51-58. DOI: 10.5220/0003145800510058


in Bibtex Style

@conference{biosignals11,
author={HariKrishna Maganti and Marco Matassoni},
title={BIO-INSPIRED AUDITORY PROCESSING FOR SPEECH FEATURE ENHANCEMENT},
booktitle={Proceedings of the International Conference on Bio-inspired Systems and Signal Processing - Volume 1: BIOSIGNALS, (BIOSTEC 2011)},
year={2011},
pages={51-58},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003145800510058},
isbn={978-989-8425-35-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bio-inspired Systems and Signal Processing - Volume 1: BIOSIGNALS, (BIOSTEC 2011)
TI - BIO-INSPIRED AUDITORY PROCESSING FOR SPEECH FEATURE ENHANCEMENT
SN - 978-989-8425-35-5
AU - Maganti H.
AU - Matassoni M.
PY - 2011
SP - 51
EP - 58
DO - 10.5220/0003145800510058