4.7 Summary
This chapter presented a technical exploration of speech-driven, high-fidelity digital humans. We proposed a fairly general machine-learning solution and analyzed and validated how different acoustic features affect the quality of speech-driven animation. Based on deep learning, we obtained two effective models: a speaker-independent, multilingual speech-driven model and a multi-emotion speech-driven model. In addition, building on speech-driven facial animation for digital humans, we proposed applications in two scenarios: an interactive digital human that combines AI products with real-time rendering, and an animation-production tool for game development.
As the relevant technologies continue to advance, speech-driven facial animation is being applied in more and more fields, particularly artificial intelligence, games, and virtual idols. We believe that this kind of multimodal technology holds even greater technical and application potential.