Study on the Use and Adaptation of Bottleneck Features for Robust Speech Recognition of Nonlinearly Distorted Speech

Jiri Malek, Petr Cerva, Ladislav Seps, Jan Nouza

Abstract

This paper focuses on the robust recognition of nonlinearly distorted speech. We previously reported (Seps et al., 2014) that hybrid acoustic models combining Hidden Markov Models and Deep Neural Networks (HMM-DNNs) are better suited to this task than conventional HMMs based on Gaussian Mixture Models (HMM-GMMs). To further improve recognition accuracy, this paper investigates combining the modeling power of deep neural networks with adaptation to the given acoustic conditions. For this purpose, deep neural networks are used to produce bottleneck coefficients / features (BNCs). The BNCs then serve to train HMM-GMM acoustic models, which are subsequently adapted using Constrained Maximum Likelihood Linear Regression (CMLLR). Our results for three types of nonlinear distortion and three types of input features show that the adapted BNC-based system (a) outperforms HMM-DNN acoustic models in the case of strong compression and (b) yields comparable performance for speech affected by nonlinear amplification in the analog domain.
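The BNC extraction described above can be sketched as follows: frames of input features are forward-propagated through a trained DNN, and the linear activations of the narrow (bottleneck) hidden layer are taken as the new feature vectors for HMM-GMM training. This is a minimal illustrative sketch, not the authors' implementation: the layer sizes, the ReLU nonlinearity, and the random weights standing in for a trained network are all assumptions.

```python
import numpy as np

def extract_bnc(frames, weights, bottleneck_idx):
    """Forward-propagate feature frames through a (trained) DNN and
    return the linear activations of the bottleneck layer.

    frames: (n_frames, n_inputs) array of input feature vectors.
    weights: list of (W, b) tuples, one per layer.
    bottleneck_idx: index of the narrow (bottleneck) layer.
    """
    h = frames
    for i, (W, b) in enumerate(weights):
        h = h @ W + b
        if i == bottleneck_idx:
            # BNCs are commonly taken before the nonlinearity.
            return h
        h = np.maximum(h, 0.0)  # ReLU stands in for the hidden nonlinearity
    raise ValueError("bottleneck_idx exceeds network depth")

# Illustration with random weights standing in for a trained network;
# the 39-dim input and 40-unit bottleneck are hypothetical sizes.
rng = np.random.default_rng(0)
dims = [39, 1024, 40, 1024, 3000]
weights = [(rng.standard_normal((a, b)) * 0.01, np.zeros(b))
           for a, b in zip(dims[:-1], dims[1:])]
bnc = extract_bnc(rng.standard_normal((10, 39)), weights, bottleneck_idx=1)
print(bnc.shape)  # (10, 40): one 40-dimensional BNC vector per frame
```

In practice the extracted BNCs are typically decorrelated (e.g. via PCA, cf. Jolliffe, 2002) before being modeled by diagonal-covariance GMMs.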

References

  1. Dahl, G. E., Yu, D., Deng, L., and Acero, A. (2012). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42.
  2. Davis, S. B. and Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(4):357-366.
  3. Delcroix, M., Kubo, Y., Nakatani, T., and Nakamura, A. (2013). Is speech enhancement pre-processing still relevant when using deep neural networks for acoustic modeling? In INTERSPEECH, pages 2992-2996.
  4. Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M., Zweig, G., He, X., Williams, J., et al. (2013). Recent advances in deep learning for speech research at Microsoft. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8604-8608. IEEE.
  5. Deng, L., Seltzer, M. L., Yu, D., Acero, A., Mohamed, A.-R., and Hinton, G. E. (2010). Binary coding of speech spectrograms using a deep auto-encoder. In INTERSPEECH, pages 1692-1695.
  6. Eaton, J. and Naylor, P. A. (2013). Detection of clipping in coded speech signals. In Signal Processing Conference (EUSIPCO), 2013 Proceedings of the 21st European, pages 1-5. IEEE.
  7. Gales, M. J. (1998). Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75-98.
  8. Grézl, F. and Fousek, P. (2008). Optimizing bottle-neck features for LVCSR. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 4729-4732. IEEE.
  9. Grézl, F., Karafiát, M., Kontár, S., and Cernocky, J. (2007). Probabilistic and bottle-neck features for LVCSR of meetings. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4, pages IV-757. IEEE.
  10. Heigold, G., Vanhoucke, V., Senior, A., Nguyen, P., Ranzato, M., Devin, M., and Dean, J. (2013). Multilingual acoustic models using distributed deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8619-8623. IEEE.
  11. Jolliffe, I. (2002). Principal component analysis. Wiley Online Library.
  12. Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, volume 1, pages 181-184. IEEE.
  13. Li, J., Deng, L., Gong, Y., and Haeb-Umbach, R. (2014). An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):745-777.
  14. Licklider, J. C. R. and Pollack, I. (1948). Effects of differentiation, integration, and infinite peak clipping upon the intelligibility of speech. The Journal of the Acoustical Society of America, 20(1):42-51.
  15. Parihar, N. and Picone, J. (2002). Aurora working group: DSR front end LVCSR evaluation AU/384/02. Inst. for Signal and Information Process, Mississippi State University, Tech. Rep, 40:94.
  16. Pollak, P. and Behunek, M. (2011). Accuracy of MP3 speech recognition under real-world conditions: Experimental study. In Signal Processing and Multimedia Applications (SIGMAP), 2011 Proceedings of the International Conference on, pages 1-6. IEEE.
  17. Saon, G., Soltau, H., Nahamoo, D., and Picheny, M. (2013). Speaker adaptation of neural network acoustic models using i-vectors. In ASRU, pages 55-59.
  18. Seltzer, M. L., Yu, D., and Wang, Y. (2013). An investigation of deep neural networks for noise robust speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 7398-7402. IEEE.
  19. Seps, L., Malek, J., Cerva, P., and Nouza, J. (2014). Investigation of deep neural networks for robust recognition of nonlinearly distorted speech. In INTERSPEECH, pages 363-367.
  20. Vaseghi, S. V. (2008). Advanced digital signal processing and noise reduction. John Wiley & Sons.
  21. Yu, D. and Seltzer, M. L. (2011). Improved bottleneck features using pretrained deep neural networks. In INTERSPEECH, volume 237, page 240.


Paper Citation


in Harvard Style

Malek J., Cerva P., Seps L. and Nouza J. (2016). Study on the Use and Adaptation of Bottleneck Features for Robust Speech Recognition of Nonlinearly Distorted Speech. In Proceedings of the 13th International Joint Conference on e-Business and Telecommunications - Volume 5: SIGMAP, (ICETE 2016) ISBN 978-989-758-196-0, pages 65-71. DOI: 10.5220/0005955500650071


in Bibtex Style

@conference{sigmap16,
author={Jiri Malek and Petr Cerva and Ladislav Seps and Jan Nouza},
title={Study on the Use and Adaptation of Bottleneck Features for Robust Speech Recognition of Nonlinearly Distorted Speech},
booktitle={Proceedings of the 13th International Joint Conference on e-Business and Telecommunications - Volume 5: SIGMAP, (ICETE 2016)},
year={2016},
pages={65-71},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005955500650071},
isbn={978-989-758-196-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Joint Conference on e-Business and Telecommunications - Volume 5: SIGMAP, (ICETE 2016)
TI - Study on the Use and Adaptation of Bottleneck Features for Robust Speech Recognition of Nonlinearly Distorted Speech
SN - 978-989-758-196-0
AU - Malek J.
AU - Cerva P.
AU - Seps L.
AU - Nouza J.
PY - 2016
SP - 65
EP - 71
DO - 10.5220/0005955500650071