DNN-based Models for Speaker Age and Gender Classification

Zakariya Qawaqneh; Arafat Abu Mallouh; Buket D. Barkana

doi:10.5220/0006096401060111

2017

Abstract

Automatic speaker age and gender classification is an active research field, driven by the continuous and rapid development of applications related to human life and health. In this paper, we propose a new method for speaker age and gender classification that uses deep neural networks (DNNs) as both feature extractor and classifier. The proposed method creates a model for each speaker. For each test speech utterance, the similarity between the test model and each speaker class model is computed. Two feature sets are used: Mel-frequency cepstral coefficients (MFCCs) and shifted delta cepstral (SDC) coefficients. With the SDC feature set, the proposed model achieved better classification results than with MFCCs. The experimental results showed that the proposed SDC speaker model + SDC class model outperformed all the other systems, achieving 57.21% overall classification accuracy.
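For readers unfamiliar with SDC coefficients, the following is a minimal sketch of the standard shifted-delta computation applied to a matrix of cepstral frames such as MFCCs. It is not taken from the paper: the function name `shifted_delta_cepstra`, the default parameters (d=1, P=3, k=7, a configuration commonly used in language recognition), and the edge-padding at utterance boundaries are all assumptions made for illustration. The abstract does not state the settings the authors used.

```python
import numpy as np

def shifted_delta_cepstra(cepstra, d=1, p=3, k=7):
    """Illustrative SDC computation (hypothetical helper, not the authors' code).

    cepstra: array of shape (num_frames, num_coeffs), e.g. MFCC frames.
    For each frame t, stacks k delta vectors
        c(t + i*p + d) - c(t + i*p - d),  i = 0 .. k-1.
    Returns an array of shape (num_frames, num_coeffs * k).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    num_frames = cepstra.shape[0]

    # Assumption: repeat the first/last frame so every shifted index is valid.
    # Other implementations drop the boundary frames instead.
    pad_before = d
    pad_after = (k - 1) * p + d
    padded = np.pad(cepstra, ((pad_before, pad_after), (0, 0)), mode="edge")

    blocks = []
    for i in range(k):
        # Original frame t sits at padded index t + d.
        ahead = padded[i * p + 2 * d : i * p + 2 * d + num_frames]   # c(t + i*p + d)
        behind = padded[i * p : i * p + num_frames]                  # c(t + i*p - d)
        blocks.append(ahead - behind)

    # Concatenate the k delta blocks along the feature axis.
    return np.hstack(blocks)


if __name__ == "__main__":
    # Synthetic stand-in for 40 frames of 7 cepstral coefficients.
    frames = np.random.default_rng(0).normal(size=(40, 7))
    sdc = shifted_delta_cepstra(frames)
    print(sdc.shape)
```

Each output row spans a longer temporal context than a single MFCC frame, which is the usual motivation for preferring SDC features over static cepstra.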

Paper Citation

in Harvard Style

Qawaqneh Z., Abu Mallouh A. and Barkana B. (2017). DNN-based Models for Speaker Age and Gender Classification. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 4: BIOSIGNALS, (BIOSTEC 2017), ISBN 978-989-758-212-7, pages 106-111. DOI: 10.5220/0006096401060111

in Bibtex Style

@conference{biosignals17,
author={Zakariya Qawaqneh and Arafat Abu Mallouh and Buket D. Barkana},
title={DNN-based Models for Speaker Age and Gender Classification},
booktitle={Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 4: BIOSIGNALS, (BIOSTEC 2017)},
year={2017},
pages={106-111},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006096401060111},
isbn={978-989-758-212-7},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 4: BIOSIGNALS, (BIOSTEC 2017)
TI - DNN-based Models for Speaker Age and Gender Classification
SN - 978-989-758-212-7
AU - Qawaqneh Z.
AU - Abu Mallouh A.
AU - Barkana B.
PY - 2017
SP - 106
EP - 111
DO - 10.5220/0006096401060111