Awesome-Talking-Head-Synthesis

Recent audio-driven and neural radiance field techniques for digital human avatar generation

This project collects research on generative adversarial networks (GANs) and neural radiance fields (NeRF) in the field of talking head synthesis. It covers image- and audio-driven talking head generation techniques, datasets, surveys, and representative work. Spanning 2D to 3D and single-modal to multi-modal approaches, it presents the technical development of talking head generation and serves as a reference for related research.

This repository organizes papers, codes and resources related to generative adversarial networks (GANs) 🤗 and neural radiance fields (NeRF) 🎨, with a main focus on image-driven and audio-driven talking head synthesis papers and released codes. 👤

A collection of papers on talking head synthesis, together with their released code. ✍️

Most papers are linked to PDFs on arXiv or journal/conference websites 📚. However, some papers require an academic license to view 🔐.

🔆 This project Awesome-Talking-Head-Synthesis is ongoing - pull requests are welcome! If you have any suggestions (missing papers, new papers, key researchers or typos), please feel free to edit and submit a PR. You can also open an issue or contact me directly via email. 📩

⭐ If you find this repo useful, please give it a star! 🤩

2023.12 Update 📆

Thanks to https://github.com/Curated-Awesome-Lists/awesome-ai-talking-heads, from which I have added some content, such as Tools & Software and Slides & Presentations. 🙏 I hope this is helpful. 😊

If you have any feedback or ideas on extending this aggregated resource, please open an issue or PR - community contributions are vital to advancing this shared knowledge. 🤝

Let's keep pushing forward to recreate ever more realistic digital human faces! 💪 We've come so far but still have a long way to go. With continued research 🔬 and collaboration, I'm sure we'll get there! 🤗

Please feel free to star ⭐ and share this repo if you find it a valuable resource. Your support helps motivate me to keep maintaining and improving it. 🥰 Let me know if you have any other questions!

Datasets

| Dataset | Download Link | Description |
| --- | --- | --- |
| FaceForensics++ | Download link | |
| CelebV | Download link | |
| VoxCeleb | Download link | VoxCeleb, a comprehensive audio-visual dataset for speaker recognition, encompasses both the VoxCeleb1 and VoxCeleb2 datasets. |
| VoxCeleb1 | Download link | VoxCeleb1 contains over 100,000 utterances from 1,251 celebrities, extracted from videos uploaded to YouTube. |
| VoxCeleb2 | Download link | Extracted from YouTube videos, VoxCeleb2 includes video URLs and utterance timestamps. As the largest public audio-visual dataset, it is primarily used for speaker recognition, but it can also be used to train talking-head generation models (see the loading sketch after this table). To obtain download permission and access the dataset, apply here. Requires 300 GB+ of storage. |
| ObamaSet | Download link | ObamaSet is a specialized audio-visual dataset focused on analyzing the visual speech of former US President Barack Obama. All video samples are collected from his weekly address footage. Unlike previous datasets, it centers exclusively on Barack Obama and provides no human annotations. |
| TalkingHead-1KH | Download link | The dataset consists of 500k video clips, of which about 80k exceed 512x512 resolution. Only videos under permissive licenses are included. Note that the number of videos differs from that in the original paper because a more robust preprocessing script was used to split the videos. |
| LRW (Lip Reading in the Wild) | Download link | LRW, a diverse English-speaking video dataset from BBC programs, features over 1,000 speakers with various speaking styles and head poses. Each video is 1.16 seconds long (29 frames) and contains the target word along with context. |
| MEAD 2020 | Download link | MEAD 2020 is a talking-head dataset annotated with emotion and intensity labels. The dataset focuses on facial generation for natural emotional speech, covering eight emotions at three intensity levels. |
| CelebV-HQ | Download link | CelebV-HQ is a high-quality video dataset comprising 35,666 clips with a resolution of at least 512x512. It includes 15,653 identities, and each clip is manually labeled with 83 facial attributes spanning appearance, action, and emotion. The dataset's diversity and temporal coherence make it a valuable resource for tasks like unconditional video generation and video facial attribute editing. |
| HDTF | Download link | HDTF, the High-Definition Talking-Face dataset, is a large in-the-wild high-resolution audio-visual dataset of approximately 362 different videos totaling 15.8 hours. Original video resolutions are 720p or 1080p, and each cropped video is resized to 512x512. |
| CREMA-D | Download link | CREMA-D is a diverse dataset with 7,442 original clips featuring 91 actors (48 male, 43 female) aged 20 to 74 and representing various races and ethnicities. Actors speak from a set of 12 sentences, expressing six emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) at four levels (Low, Medium, High, and Unspecified). Emotion and intensity ratings were gathered through crowd-sourcing, with 2,443 participants each rating 90 unique clips (30 audio, 30 visual, and 30 audio-visual). Over 95% of the clips have more than 7 ratings. For additional details on CREMA-D, refer to the paper link. |
| LRS2 | Download link | LRS2 is a lip reading dataset of videos recorded in diverse settings, suitable for studying lip reading and visual speech recognition. |
| GRID | Download link | The GRID dataset was recorded in a laboratory setting with 34 volunteers, each speaking 1,000 phrases, for a total of 34,000 utterances. Phrases follow a fixed rule: six words drawn from the six categories "command," "color," "preposition," "letter," "number," and "adverb." Access the dataset here. |
| SAVEE | Download link | The SAVEE (Surrey Audio-Visual Expressed Emotion) database was built for developing automatic emotion recognition systems. It features recordings of 4 male actors expressing 7 emotions, totaling 480 British English utterances. The sentences, selected from the standard TIMIT corpus, are phonetically balanced for each emotion. Recorded in a high-quality visual media lab, the data is processed and labeled. In evaluation, 10 subjects rated the recordings under audio, visual, and audio-visual conditions; classifiers achieved speaker-independent recognition rates of 61%, 65%, and 84%, respectively. |
| BIWI (3D) | Download link | The Biwi 3D Audiovisual Corpus of Affective Communication serves as a compromise between data authenticity and quality, acquired at ETHZ in collaboration with SYNVO GmbH. |
| VOCA | Download link | VOCA is a 4D-face dataset with approximately 29 minutes of 4D face scans and synchronized audio from 12 speakers. It greatly facilitates research in 3D visual speech generation (VSG). |
| Multiface (3D) | Download link | The Multiface dataset consists of high-quality multi-view video recordings of 13 people displaying various facial expressions. It contains approximately 12,200 to 23,000 frames per subject, captured at 30 fps from around 40 to 160 camera views under uniform lighting. The dataset is 65TB in size and includes raw images (2048x1334 resolution), tracked and meshed heads, 1024x1024 unwrapped face textures, camera calibration metadata, and audio. The accompanying repository provides code for downloading the dataset and building a codec avatar using a deep appearance model. |
| MMFace4D | Download link | MMFace4D is a large-scale multi-modal dataset for audio-driven 3D facial animation research. It contains over 35,000 sequences captured from 431 subjects aged 15 to 68. Sentences from scenarios such as news broadcasting, conversation, and storytelling were recorded, totaling around 11,000 utterances. High-fidelity data was captured with three synchronized RGB-D cameras to obtain high-resolution 3D meshes and textures, and a reconstruction pipeline fuses the multi-view data into topology-consistent 3D mesh sequences. Synchronized speech audio is provided alongside the 3D facial motions. With its large scale, data quality, and diversity of subjects and utterances, MMFace4D is a strong benchmark for developing and evaluating audio-driven 3D facial animation models. |
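
The audio-visual sets above (VoxCeleb2, HDTF, and similar) are typically consumed during training as pairs of a video frame and a short audio window centered on that frame. The following is a minimal sketch of that alignment step, assuming 16 kHz audio, an 80-bin mel spectrogram at 100 Hz, and the HDTF-style 512x512 crop; the function name and parameters are illustrative, not code from any listed dataset.

```python
# Hypothetical sketch: pair each video frame with an aligned mel-spectrogram
# window, the common input format for audio-driven talking-head models.
import cv2       # pip install opencv-python
import librosa   # pip install librosa
import numpy as np

def frame_audio_pairs(video_path, audio_path, mel_window=16):
    """Yield (frame, mel_chunk) pairs with audio aligned to each video frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back to 25 fps if unknown

    # 16 kHz mono audio -> 80-bin mel spectrogram; a hop of 160 samples
    # makes the mel frames tick at exactly 100 Hz.
    wav, sr = librosa.load(audio_path, sr=16000, mono=True)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=800,
                                         hop_length=160, n_mels=80)
    mels_per_frame = 100.0 / fps  # mel frames elapsed per video frame

    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (512, 512))    # HDTF-style crop size
        center = int(idx * mels_per_frame)
        lo = max(0, center - mel_window // 2)
        chunk = mel[:, lo:lo + mel_window]
        if chunk.shape[1] == mel_window:         # skip ragged clip edges
            yield frame, chunk.astype(np.float32)
        idx += 1
    cap.release()
```

At 25 fps each video frame spans four 100 Hz mel frames, so the 16-frame window above provides roughly 160 ms of audio context per frame, on the order of what lip-sync models commonly use.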

Survey

| Year | Title | Conference/Journal |
| --- | --- | --- |
| 2024 | A Survey on 3D Human Avatar Modeling — From Reconstruction to Generation | arXiv 2024 |
| 2024 | Deepfake Generation and Detection: A Benchmark and Survey (Github) | arXiv 2024 |
| 2024 | A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos (Code) | arXiv 2024 |
| 2024 | How NeRFs and 3D Gaussian Splatting are Reshaping SLAM: a Survey (3DGS+SLAM 🔥🔥🔥) | arXiv 2024 |
| 2024 | 3D Gaussian as a New Vision Era: A Survey (3DGS 🔥🔥🔥) | arXiv 2024 |
| 2024 | Advances in 3D Generation: A Survey | arXiv 2024 |
| 2024 | A Survey on 3D Gaussian Splatting (3DGS 🔥🔥🔥, ongoing) | arXiv 2024 |
| 2024 | Neural Radiance Fields: Past, Present, and Future (NeRF 🔥🔥🔥, an amazing 413 pages) | arXiv 2024 |
| 2023 | From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications | arXiv 2023 |
| 2023 | Human-Computer Interaction System: A Survey of Talking-Head Generation | IEEE |
| 2023 | Talking human face generation: A survey | ACM |
| 2022 | Deep Learning for Visual Speech Analysis: A Survey | arXiv 2022 |
| 2020 | What comprises a good talking-head video generation?: A Survey and Benchmark | arXiv 2020 |

Funny Work

| Year | Title | Code | Project | Keywords |
| --- | --- | --- | --- | --- |
| 2024 | [Audio2Photoreal] From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations | Code | Project | Photoreal |
| 2024 | [Animate Anyone] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation | Code | Project | 🔥 Animate (Alibaba; known for the viral "Subject Three" dance demo) |
| 2024 | [3DGAN] What You See Is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs | | Project | 🔥 Nvidia |
| 2024 | [LivePortrait] LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control | Code | Project | 🔥 Kuaishou |

Audio-driven

| Year | Title | Conference/Journal | Code | Project | Keywords |
| --- | --- | --- | --- | --- | --- |
| 2024 | [GLDiTalker] GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer | arXiv 2024 | | | |
| 2024 | Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation | arXiv 2024 | | | |
| 2024 | [JambaTalk] JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model | arXiv 2024 | | | 3D |
| 2024 | [Talk Less, Interact Better] Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs | COLM 2024 | Code | | LLM |
| 2024 | [Digital Avatars] Digital Avatars: Framework Development and Their Evaluation | arXiv 2024 | | | |
| 2024 | [EmoTalk3D] EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head | ECCV 2024 | | Project | |
