Study on the Use of Deep Neural Networks for Speech Activity Detection in Broadcast Recordings

Lukas Mateju, Petr Cerva, Jindrich Zdansky

Abstract

This paper addresses the task of Speech Activity Detection (SAD). Our goal is to develop a SAD module suitable for a broadcast data transcription system. Various Deep Neural Networks (DNNs) are evaluated for this purpose. The DNNs are trained on speech and non-speech data as well as on artificial data created by mixing the two data types at a desired Signal-to-Noise Ratio (SNR). The output of each DNN is smoothed using a decoder based on Weighted Finite State Transducers (WFSTs). The experimental results show that the resulting SAD module leads to a) a slight improvement in transcription accuracy and b) a significant reduction in the computation time needed for transcription.
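
The abstract does not specify how the artificial training data are mixed, so the following is only a minimal sketch of one common way to combine a speech signal with a non-speech signal at a target SNR. It assumes both signals are equal-length lists of samples and defines SNR as the ratio of average powers; the function names `mix_at_snr` and `measure_snr_db` and the toy signals are hypothetical and not taken from the paper.

```python
import math


def average_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` after scaling it to reach `snr_db` (assumed power-ratio SNR)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = average_power(speech)
    p_noise = average_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("both signals must have non-zero power")
    # SNR_dB = 10 * log10(p_speech / (gain**2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measure_snr_db(speech, mixture):
    """SNR of `mixture` relative to the clean `speech` it was built from."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(average_power(speech) / average_power(residual))


# Toy signals standing in for real recordings (illustrative only).
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [((i * 7919) % 13 - 6) / 6 for i in range(1000)]
mixed = mix_at_snr(speech, noise, 10.0)
```

Measuring the SNR of `mixed` against `speech` with `measure_snr_db` gives a quick check that the scaling reached the requested level. In practice the power would typically be computed over speech-active regions only, which this sketch does not attempt.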


Paper Citation


in Harvard Style

Mateju L., Cerva P. and Zdansky J. (2016). Study on the Use of Deep Neural Networks for Speech Activity Detection in Broadcast Recordings. In Proceedings of the 13th International Joint Conference on e-Business and Telecommunications - Volume 5: SIGMAP, (ICETE 2016) ISBN 978-989-758-196-0, pages 45-51. DOI: 10.5220/0005952700450051


in Bibtex Style

@conference{sigmap16,
author={Lukas Mateju and Petr Cerva and Jindrich Zdansky},
title={Study on the Use of Deep Neural Networks for Speech Activity Detection in Broadcast Recordings},
booktitle={Proceedings of the 13th International Joint Conference on e-Business and Telecommunications - Volume 5: SIGMAP, (ICETE 2016)},
year={2016},
pages={45-51},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005952700450051},
isbn={978-989-758-196-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 13th International Joint Conference on e-Business and Telecommunications - Volume 5: SIGMAP, (ICETE 2016)
TI - Study on the Use of Deep Neural Networks for Speech Activity Detection in Broadcast Recordings
SN - 978-989-758-196-0
AU - Mateju L.
AU - Cerva P.
AU - Zdansky J.
PY - 2016
SP - 45
EP - 51
DO - 10.5220/0005952700450051