Diffusion Transformer Framework for Speech-Driven Stylized Gesture Generation
Nada Elmasry, Yanbo Cheng, Yingying Wang
2025
Abstract
Gestures are a vital component of human expression, playing a pivotal role in conveying information and emotions. Generating co-speech gestures remains challenging in human-computer interaction due to the intricate relationship between speech and gestures. While recent learning-based methods have made progress, they still face limitations, such as a lack of diversity and a mismatch between the generated gestures and the semantic and emotional context of the speech, which reduce the effectiveness of communication. In this work, we propose a novel gesture generation framework that takes speech audio and a target-style gesture example as inputs and automatically synthesizes new gesture performances that align with the speech in the desired style. Specifically, our framework comprises four main components: a dual-stream audio encoder, a gesture-style encoder, a cross-attention modality fusion module, and a latent diffusion generation module. The dual-stream audio encoder and the gesture-style encoder extract modality-specific embeddings from the audio and motion inputs; the cross-attention fusion module maps the multi-modal embeddings into a unified latent space; and the diffusion module produces expressive and stylized gestures. The results demonstrate that our method generates natural and diversified gestures that accurately and coherently convey the intended information, surpassing the benchmarks established by traditional methods. Finally, we discuss future directions for our research.
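To make the four-component pipeline described above concrete, the following is a minimal, illustrative PyTorch sketch. The module names, feature choices (mel-spectrogram and semantic audio streams, a 165-dimensional pose vector, a 64-dimensional gesture latent), and all dimensions are assumptions made for illustration only; they do not reflect the paper's exact architecture or training procedure.

import torch
import torch.nn as nn

class DualStreamAudioEncoder(nn.Module):
    # Hypothetical two-branch audio encoder: one branch for low-level acoustic
    # features (e.g. mel spectrogram) and one for higher-level semantic features;
    # the two streams are concatenated along the channel axis.
    def __init__(self, mel_dim=80, sem_dim=768, d_model=256):
        super().__init__()
        self.acoustic = nn.Sequential(nn.Linear(mel_dim, d_model), nn.GELU())
        self.semantic = nn.Sequential(nn.Linear(sem_dim, d_model), nn.GELU())

    def forward(self, mel, sem):
        # mel: (B, T, mel_dim), sem: (B, T, sem_dim) -> (B, T, 2 * d_model)
        return torch.cat([self.acoustic(mel), self.semantic(sem)], dim=-1)

class GestureStyleEncoder(nn.Module):
    # Pools an example motion clip into a single style embedding.
    def __init__(self, pose_dim=165, d_model=256):
        super().__init__()
        self.proj = nn.Linear(pose_dim, d_model)

    def forward(self, motion):
        # motion: (B, T, pose_dim) -> (B, d_model) via temporal mean pooling
        return self.proj(motion).mean(dim=1)

class CrossAttentionFusion(nn.Module):
    # Audio tokens act as queries and attend to the style embedding,
    # producing speech-aligned tokens modulated by the target style.
    def __init__(self, d_model=512, n_heads=8, style_dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.style_proj = nn.Linear(style_dim, d_model)

    def forward(self, audio_tokens, style_emb):
        kv = self.style_proj(style_emb).unsqueeze(1)   # (B, 1, d_model)
        fused, _ = self.attn(audio_tokens, kv, kv)     # (B, T, d_model)
        return fused + audio_tokens                    # residual connection

class GestureDenoiser(nn.Module):
    # Transformer denoiser that predicts the noise added to latent gesture
    # sequences, conditioned on the fused speech/style tokens and the timestep.
    def __init__(self, latent_dim=64, d_model=512, n_layers=4, n_steps=1000):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, d_model)
        self.time_emb = nn.Embedding(n_steps, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, latent_dim)

    def forward(self, noisy_latent, t, cond):
        # noisy_latent: (B, T, latent_dim), t: (B,), cond: (B, T, d_model)
        x = self.in_proj(noisy_latent) + cond + self.time_emb(t).unsqueeze(1)
        return self.out_proj(self.backbone(x))

A shape check with random tensors (illustrative values only):

if __name__ == "__main__":
    B, T = 2, 100
    audio_enc, style_enc = DualStreamAudioEncoder(), GestureStyleEncoder()
    fusion, denoiser = CrossAttentionFusion(), GestureDenoiser()
    tokens = audio_enc(torch.randn(B, T, 80), torch.randn(B, T, 768))
    cond = fusion(tokens, style_enc(torch.randn(B, T, 165)))
    eps = denoiser(torch.randn(B, T, 64), torch.randint(0, 1000, (B,)), cond)
    print(eps.shape)  # torch.Size([2, 100, 64])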
Paper Citation
in Harvard Style
Elmasry N., Cheng Y. and Wang Y. (2025). Diffusion Transformer Framework for Speech-Driven Stylized Gesture Generation. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 1: GRAPP; ISBN 978-989-758-728-3, SciTePress, pages 355-362. DOI: 10.5220/0013318400003912
in Bibtex Style
@conference{grapp25,
author={Nada Elmasry and Yanbo Cheng and Yingying Wang},
title={Diffusion Transformer Framework for Speech-Driven Stylized Gesture Generation},
booktitle={Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 1: GRAPP},
year={2025},
pages={355-362},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013318400003912},
isbn={978-989-758-728-3},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 1: GRAPP
TI - Diffusion Transformer Framework for Speech-Driven Stylized Gesture Generation
SN - 978-989-758-728-3
AU - Elmasry N.
AU - Cheng Y.
AU - Wang Y.
PY - 2025
SP - 355
EP - 362
DO - 10.5220/0013318400003912
PB - SciTePress