An Anthropomorphic Perspective for Audiovisual Speech Synthesis

Samuel Silva, António Teixeira

Abstract

In speech communication, both the auditory and visual streams play an important role, providing a certain level of redundancy (e.g., lip movement) as well as complementary information (e.g., to emphasize a word). The current common approach to audiovisual speech synthesis, generally based on data-driven methods, yields good results, but relies on models controlled by parameters that do not relate to how humans produce speech. These parameters are hard to interpret and add little to our understanding of the human speech production apparatus. Modelling the actual system, by adopting an anthropomorphic perspective, would open a myriad of novel research paths. This article proposes a conceptual framework to support research and development of an articulatory-based audiovisual speech synthesis system. The core idea is that the speech production system is modelled to produce articulatory parameters with anthropomorphic meaning (e.g., lip opening), which drive the synthesis of both the auditory and visual streams. A first instantiation of the framework for European Portuguese illustrates its viability and constitutes an important tool for research in speech production and for the deployment of audiovisual speech synthesis in multimodal interaction scenarios, which are of the utmost relevance for current and future complex services and applications.



Paper Citation


in Harvard Style

Silva S. and Teixeira A. (2017). An Anthropomorphic Perspective for Audiovisual Speech Synthesis. In Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 4: BIOSIGNALS, (BIOSTEC 2017) ISBN 978-989-758-212-7, pages 163-172. DOI: 10.5220/0006150201630172


in Bibtex Style

@conference{biosignals17,
author={Samuel Silva and António Teixeira},
title={An Anthropomorphic Perspective for Audiovisual Speech Synthesis},
booktitle={Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 4: BIOSIGNALS, (BIOSTEC 2017)},
year={2017},
pages={163-172},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006150201630172},
isbn={978-989-758-212-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 4: BIOSIGNALS, (BIOSTEC 2017)
TI - An Anthropomorphic Perspective for Audiovisual Speech Synthesis
SN - 978-989-758-212-7
AU - Silva S.
AU - Teixeira A.
PY - 2017
SP - 163
EP - 172
DO - 10.5220/0006150201630172