BROADCAST NEWS PHONEME RECOGNITION BY SPARSE CODING

Joseph Razik, Sébastien Paris, Hervé Glotin

2012

Abstract

This paper presents a novel approach to phoneme recognition, which we intend to extend to a full automatic speech recognition (ASR) system. Conventional ASR systems rely on a GMM-HMM combination, which is a fully generative approach. Current discriminative methods are not tractable on large-scale data sets, especially with non-linear kernels. Our system introduces a new scheme that combines sparse coding with an approximate additive kernel to enable fast SVM training for phoneme recognition. On a broadcast news corpus, it outperforms GMMs by around 2.5%, and its computational cost is linear in the number of samples.
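
As a rough illustration of why an approximate additive kernel keeps training linear in the number of samples, the sketch below builds an explicit finite-dimensional feature map whose plain dot product approximates an additive kernel, so that a linear SVM can be trained on the mapped features. The abstract does not say which additive kernel is used; the chi-squared kernel, the function names, and the `order` and `step` values are all assumptions made for this example, not the authors' implementation.

```python
import math


def chi2_kernel(x, y):
    """Exact additive chi-squared kernel on non-negative vectors (assumed kernel)."""
    return sum(2.0 * a * b / (a + b) for a, b in zip(x, y) if a + b > 0)


def chi2_feature_map(x, order=10, step=0.5):
    """Hypothetical explicit map psi with dot(psi(x), psi(y)) ~ chi2_kernel(x, y).

    Each input dimension expands to 2 * order + 1 features obtained by
    sampling the kernel spectrum sech(pi * freq) at multiples of `step`.
    """
    features = []
    for value in x:
        if value <= 0.0:
            # A zero entry contributes nothing to the kernel.
            features.extend([0.0] * (2 * order + 1))
            continue
        log_value = math.log(value)
        # Zero-frequency term: the spectrum equals 1 at freq = 0.
        features.append(math.sqrt(value * step))
        for j in range(1, order + 1):
            freq = j * step
            spectrum = 1.0 / math.cosh(math.pi * freq)
            scale = math.sqrt(2.0 * value * step * spectrum)
            features.append(scale * math.cos(freq * log_value))
            features.append(scale * math.sin(freq * log_value))
    return features


def dot(u, v):
    """Plain inner product, the only operation a linear SVM needs."""
    return sum(a * b for a, b in zip(u, v))


# Two toy non-negative vectors standing in for coded feature vectors.
x = [0.2, 0.5, 0.3]
y = [0.6, 0.1, 0.3]
exact = chi2_kernel(x, y)
approx = dot(chi2_feature_map(x), chi2_feature_map(y))
print(exact, approx)
```

Because the map is applied once per sample and the classifier is then linear, no kernel matrix over all sample pairs is ever formed, which is where the linear scaling comes from.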



Paper Citation


in Harvard Style

Razik J., Paris S. and Glotin H. (2012). BROADCAST NEWS PHONEME RECOGNITION BY SPARSE CODING. In Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM, ISBN 978-989-8425-99-7, pages 191-197. DOI: 10.5220/0003778201910197


in Bibtex Style

@conference{icpram12,
author={Joseph Razik and Sébastien Paris and Hervé Glotin},
title={BROADCAST NEWS PHONEME RECOGNITION BY SPARSE CODING},
booktitle={Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM},
year={2012},
pages={191-197},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003778201910197},
isbn={978-989-8425-99-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM
TI - BROADCAST NEWS PHONEME RECOGNITION BY SPARSE CODING
SN - 978-989-8425-99-7
AU - Razik J.
AU - Paris S.
AU - Glotin H.
PY - 2012
SP - 191
EP - 197
DO - 10.5220/0003778201910197