Automatic Viseme Vocabulary Construction to Enhance Continuous Lip-reading

Adriana Fernandez-Lopez, Federico M. Sukno


Speech is the most common communication method between humans and involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signal, but it has been demonstrated that video can provide information that is complementary to the audio. Thus, the study of automatic lip-reading is important and remains an open problem. One of the key challenges is the definition of the elementary visual units (the visemes) and their vocabulary. Many researchers have analyzed the importance of the phoneme-to-viseme mapping and have proposed viseme vocabularies containing between 11 and 15 visemes. These viseme vocabularies have usually been defined manually from their linguistic properties and, in some cases, using decision trees or clustering techniques. In this work, we focus on the automatic construction of an optimal viseme vocabulary based on the association of phonemes with similar appearance. To this end, we construct an automatic system that uses local appearance descriptors to extract the main characteristics of the mouth region and hidden Markov models (HMMs) to model the statistical relations of both viseme and phoneme sequences. To compare the performance of the system, we analyze different descriptors (PCA, DCT and SIFT). We test our system on a Spanish corpus of continuous speech. Our results indicate that we are able to recognize approximately 58% of the visemes, 47% of the phonemes and 23% of the words in a continuous speech scenario, and that the optimal viseme vocabulary for Spanish is composed of 20 visemes.
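
The idea of associating phonemes with similar appearance can be illustrated with a minimal sketch. It is not the procedure used in the paper: it assumes, for illustration only, that a phoneme confusion matrix from a visual recognizer is available and that phoneme groups are merged greedily by their mean mutual confusion. The function name `build_viseme_vocabulary`, the merging criterion and the toy confusion counts are all hypothetical.

```python
from itertools import combinations


def build_viseme_vocabulary(phonemes, confusion, target_size):
    """Greedily merge phoneme groups until `target_size` groups remain.

    confusion[i][j] counts how often phoneme i was recognized as phoneme j.
    Hypothetical criterion: merge the two groups with the highest mean
    symmetric confusion between their members.
    """
    groups = [[i] for i in range(len(phonemes))]

    def cross_confusion(a, b):
        # Mean symmetric confusion between the members of groups a and b.
        total = sum(confusion[i][j] + confusion[j][i] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(groups) > target_size:
        # combinations() yields index pairs with x < y, so deleting
        # groups[y] after the merge leaves groups[x] at the same position.
        x, y = max(
            combinations(range(len(groups)), 2),
            key=lambda p: cross_confusion(groups[p[0]], groups[p[1]]),
        )
        groups[x] = groups[x] + groups[y]
        del groups[y]

    return [[phonemes[i] for i in g] for g in groups]


# Toy data (invented counts, not results from the paper).
phonemes = ["p", "b", "m", "f", "a"]
confusion = [
    [10, 8, 7, 1, 0],   # true /p/
    [9, 10, 6, 0, 1],   # true /b/
    [7, 5, 12, 1, 0],   # true /m/
    [1, 0, 1, 15, 2],   # true /f/
    [0, 1, 0, 1, 20],   # true /a/
]

vocab = build_viseme_vocabulary(phonemes, confusion, target_size=3)
print(vocab)  # [['p', 'b', 'm'], ['f'], ['a']]
```

In this toy case the bilabials /p/, /b/ and /m/, which look alike on the lips, end up in one viseme class. In practice the target vocabulary size would have to be chosen by evaluating recognition performance for each candidate size.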

Paper Citation

in Harvard Style

Fernandez-Lopez A. and Sukno F. M. (2017). Automatic Viseme Vocabulary Construction to Enhance Continuous Lip-reading. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, (VISIGRAPP 2017), ISBN 978-989-758-226-4, pages 52-63. DOI: 10.5220/0006102100520063