Research on Few-Shot Talking Head Video Generation Algorithm Guided by Sketches

2024, 32(10): 236-242
Wei Qingyang (魏清杨), Xu Shugong (徐树公)
School of Communication and Information Engineering, Shanghai University
Abstract: Talking head video generation requires precise joint modeling of facial texture and the driving audio. To this end, this work studies semantically guided deformation of texture features and proposes a sketch-guided few-shot talking head video generation framework that uses two-stage generation to align the modalities. In the first stage, target facial landmarks (keypoints) are generated from the audio using real prior landmark information. In the second stage, the landmarks are converted into sketches, which serve as an intermediate representation that is semantically aligned with the reference images. Introducing sketches effectively resolves the modality mismatch between audio and images. In experiments, the algorithm achieves FID scores of 15.676 and 8.618 on the public datasets HDTF and MEAD, respectively. These results show that the proposed algorithm can effectively model facial texture driven by the target audio through the intermediate representation, with generation quality comparable to state-of-the-art algorithms.
Key words: high-fidelity generation; talking face generation; keypoint generation; multimodal learning; lip synchronization
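
The second stage described in the abstract turns facial landmarks into a sketch image. The abstract does not say how the sketch is drawn, so the following is only a minimal illustration under stated assumptions: landmarks are (x, y) pixel coordinates, each facial contour is a polyline over a group of landmark indices, and the names `draw_segment`, `landmarks_to_sketch`, and the toy four-point "mouth" are hypothetical, not taken from the paper.

```python
import numpy as np

def draw_segment(canvas, p0, p1):
    """Rasterize one line segment by sampling points densely along it."""
    h, w = canvas.shape
    # One sample per pixel along the longer axis, so the line has no gaps.
    n = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    xs = np.rint(np.linspace(p0[0], p1[0], n)).astype(int)
    ys = np.rint(np.linspace(p0[1], p1[1], n)).astype(int)
    # Discard samples that fall outside the canvas.
    keep = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    canvas[ys[keep], xs[keep]] = 255

def landmarks_to_sketch(landmarks, groups, size=64):
    """Draw each landmark group as a polyline on a blank grayscale canvas.

    landmarks: (N, 2) array of (x, y) pixel coordinates.
    groups: list of (indices, closed) pairs; this layout is an assumption.
    """
    canvas = np.zeros((size, size), dtype=np.uint8)
    for indices, closed in groups:
        pts = landmarks[list(indices)]
        if closed:
            # Repeat the first point so the contour closes on itself.
            pts = np.vstack([pts, pts[:1]])
        for p0, p1 in zip(pts[:-1], pts[1:]):
            draw_segment(canvas, p0, p1)
    return canvas

# Toy example: four hypothetical landmarks forming a closed mouth contour.
landmarks = np.array([[20, 40], [32, 36], [44, 40], [32, 46]], dtype=float)
groups = [((0, 1, 2, 3), True)]
sketch = landmarks_to_sketch(landmarks, groups, size=64)
```

A single-channel sketch of this kind has the same spatial layout as the reference image, which is what allows it to act as an intermediate representation between the audio and image modalities.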
Received: 2024-04-28
Funding: National Natural Science Foundation of China (61871262)