approximately 362 distinct videos, the HDTF dataset
has a total duration of 15.8 hours. All videos in the
dataset have a resolution of either 720p or 1080p.
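As a rough sanity check on the figures above, the mean clip length implied by about 362 videos and 15.8 hours can be worked out directly. The following is a minimal sketch: the constant and function names are illustrative and are not part of any HDTF tooling, and because the video count is approximate the result is only an estimate.

```python
# Figures quoted in the text; the video count is approximate.
NUM_VIDEOS = 362
TOTAL_HOURS = 15.8


def mean_clip_minutes(num_videos: int, total_hours: float) -> float:
    """Return the mean duration per video, in minutes."""
    if num_videos <= 0:
        raise ValueError("num_videos must be positive")
    return total_hours * 60.0 / num_videos


avg_minutes = mean_clip_minutes(NUM_VIDEOS, TOTAL_HOURS)
print(f"Mean clip length: {avg_minutes:.1f} minutes")
```

On these figures the mean comes out at roughly 2.6 minutes per video.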
The ObamaSet is a specialized audiovisual dataset
built around speeches delivered by former US
President Barack Obama. It serves as a dedicated
database for a specific speaking digital human
generation task. All video material in this dataset is
sourced from Obama's weekly addresses.
4 CONCLUSIONS
This paper reviews recent developments in speaking
digital human generation from two perspectives: the
core technical components and the datasets involved.
Driven by rapid progress in deep-learning-based
artificial intelligence, video generation for speaking
digital humans has advanced significantly. However,
it still faces many difficult challenges that hinder its
widespread deployment.
This paper examines the current state and likely
evolution of voice-driven speech systems. As the
technology has advanced, the fidelity and naturalness
of digital humans have improved markedly, owing to
generative methods including generative adversarial
networks, diffusion models, and neural radiance
fields. The steady growth and diversification of
datasets has provided richer resources for digital
human creation.
In sectors such as entertainment, healthcare, and
education, digital avatars are expected to become
important tools, giving users a more lifelike,
intuitive, and immersive experience. At the same
time, as the technology progresses and application
scenarios continue to expand, digital human
generation faces both numerous challenges and
corresponding opportunities.