This repository organizes papers, code, and resources related to generative adversarial networks (GANs) 🤗 and neural radiance fields (NeRF) 🎨, with a main focus on image-driven and audio-driven talking head synthesis papers and their released code. 👤
A curated collection of talking head synthesis papers and released code. ✍️
Most papers link to PDFs on arXiv or journal/conference websites 📚. However, some papers require an academic license to view 🔐.
🔆 This project Awesome-Talking-Head-Synthesis is ongoing - pull requests are welcome! If you have any suggestions (missing papers, new papers, key researchers or typos), please feel free to edit and submit a PR. You can also open an issue or contact me directly via email. 📩
⭐ If you find this repo useful, please give it a star! 🤩
2023.12 Update 📆
Thanks to https://github.com/Curated-Awesome-Lists/awesome-ai-talking-heads, from which I have added some content, such as Tools & Software and Slides & Presentations. 🙏 I hope this is helpful. 😊
If you have any feedback or ideas on extending this aggregated resource, please open an issue or PR - community contributions are vital to advancing this shared knowledge. 🤝
Let's keep pushing forward to recreate ever more realistic digital human faces! 💪 We've come so far but still have a long way to go. With continued research 🔬 and collaboration, I'm sure we'll get there! 🤗
Please feel free to star ⭐ and share this repo if you find it a valuable resource. Your support helps motivate me to keep maintaining and improving it. 🥰 Let me know if you have any other questions!

| Dataset | Download Link | Description |
|---|---|---|
| FaceForensics++ | Download link | FaceForensics++ is a forensics dataset of 1,000 original video sequences manipulated with automated face-manipulation methods (e.g., DeepFakes, Face2Face, FaceSwap, NeuralTextures), widely used for face forgery detection research. |
| CelebV | Download link | |
| VoxCeleb | Download link | VoxCeleb, a comprehensive audio-visual dataset for speaker recognition, encompasses both VoxCeleb1 and VoxCeleb2 datasets. |
| VoxCeleb1 | Download link | VoxCeleb1 contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube. |
| VoxCeleb2 | Download link | Extracted from YouTube videos, VoxCeleb2 includes video URLs and utterance timestamps. As the largest public audio-visual dataset, it is primarily used for speaker recognition tasks, but it can also be used to train talking-head generation models. Downloading requires applying for access permission and 300 GB+ of storage space. |
| ObamaSet | Download link | ObamaSet is a specialized audio-visual dataset focused on analyzing the visual speech of former US President Barack Obama. All video samples are collected from his weekly address footage. Unlike previous datasets, it exclusively centers on Barack Obama and does not provide any human annotations. |
| TalkingHead-1KH | Download link | The dataset consists of 500k video clips, of which about 80k are at a resolution greater than 512x512. Only videos under permissive licenses are included. Note that the number of videos differs from that in the original paper because a more robust preprocessing script was used to split the videos. |
| LRW (Lip Reading in the Wild) | Download link | LRW, a diverse English-speaking video dataset from BBC broadcasts, features over 1,000 speakers with various speaking styles and head poses. Each video is 1.16 seconds long (29 frames) and contains the target word along with context. |
| MEAD 2020 | Download link | MEAD 2020 is a talking-head dataset annotated with emotion and intensity labels. The dataset focuses on facial generation for natural emotional speech, covering eight different emotions at three intensity levels. |
| CelebV-HQ | Download link | CelebV-HQ is a high-quality video dataset comprising 35,666 clips with a resolution of at least 512x512. It includes 15,653 identities, and each clip is manually labeled with 83 facial attributes, spanning appearance, action, and emotion. The dataset's diversity and temporal coherence make it a valuable resource for tasks like unconditional video generation and video facial attribute editing. |
| HDTF | Download link | HDTF, the High-definition Talking-Face Dataset, is a large in-the-wild high-resolution audio-visual dataset consisting of approximately 362 different videos totaling 15.8 hours. Original video resolutions are 720p or 1080p, and each cropped video is resized to 512x512 (a minimal version of this crop-and-resize step is sketched below the table). |
| CREMA-D | Download link | CREMA-D is a diverse dataset with 7,442 original clips featuring 91 actors, including 48 male and 43 female actors aged 20 to 74, representing various races and ethnicities. The dataset includes recordings of actors speaking from a set of 12 sentences, expressing six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) at four emotion levels (Low, Medium, High, and Unspecified). Emotion and intensity ratings were gathered through crowd-sourcing, with 2,443 participants rating 90 unique clips each (30 audio, 30 visual, and 30 audio-visual). Over 95% of the clips have more than 7 ratings. For additional details on CREMA-D, refer to the paper. |
| LRS2 | Download link | LRS2 is a lip reading dataset that includes videos recorded in diverse settings, suitable for studying lip reading and visual speech recognition. |
| GRID | Download link | The GRID dataset was recorded in a laboratory setting with 34 volunteers, each speaking 1000 phrases, totaling 34,000 utterance instances. Phrases follow specific rules, with six words randomly selected from six categories: "command," "color," "preposition," "letter," "number," and "adverb." |
| SAVEE | Download link | The SAVEE (Surrey Audio-Visual Expressed Emotion) database is a crucial component for developing an automatic emotion recognition system. It features recordings from 4 male actors expressing 7 different emotions, totaling 480 British English utterances. These sentences, selected from the standard TIMIT corpus, are phonetically balanced for each emotion. Recorded in a high-quality visual media lab, the data undergoes processing and labeling. Performance evaluation involves 10 subjects rating recordings under audio, visual, and audio-visual conditions. Classification systems for each modality achieve speaker-independent recognition rates of 61%, 65%, and 84% for audio, visual, and audio-visual, respectively. |
| BIWI(3D) | Download link | The Biwi 3D Audiovisual Corpus of Affective Communication serves as a compromise between data authenticity and quality, acquired at ETHZ in collaboration with SYNVO GmbH. |
| VOCA | Download link | VOCA is a 4D-face dataset with approximately 29 minutes of 4D face scans and synchronized audio from 12 speakers. It greatly facilitates research in 3D visual speech generation (VSG). |
| Multiface(3D) | Download link | The Multiface Dataset consists of high-quality multi-view video recordings of 13 people displaying various facial expressions. It contains approximately 12,200 to 23,000 frames per subject, captured at 30 fps from around 40 to 160 camera views with uniform lighting. The dataset's size is 65TB and includes raw images (2048x1334 resolution), tracked and meshed heads, 1024x1024 unwrapped face textures, camera calibration metadata, and audio. This repository provides code for downloading the dataset and building a codec avatar using a deep appearance model. |
| MMFace4D | Download link | The MMFace4D dataset is a large-scale multi-modal dataset for audio-driven 3D facial animation research. It contains over 35,000 sequences captured from 431 subjects ranging in age from 15 to 68 years old. Various sentences from scenarios such as news broadcasting, conversations and storytelling were recorded, totaling around 11,000 utterances. High-fidelity data was captured using three synchronized RGB-D cameras to obtain high-resolution 3D meshes and textures. A reconstruction pipeline was developed to fuse the multi-view data and generate topology-consistent 3D mesh sequences. In addition to the 3D facial motions, synchronized speech audio is also provided. The final dataset covers a wide range of expressive talking styles and facial expressions through a diverse set of subjects and utterances. With its large scale, high quality of data and strong diversity, the MMFace4D dataset provides an ideal benchmark for developing and evaluating audio-driven 3D facial animation models. |
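
Most of the in-the-wild video corpora above (e.g., HDTF, TalkingHead-1KH, CelebV-HQ) are distributed as raw footage that must be cropped around the face and resampled before training. Below is a minimal sketch of that crop-and-resize step using ffmpeg from Python; the file paths, fixed crop box, and 25 fps target are illustrative assumptions, not any dataset's official preprocessing pipeline (which typically adds face detection, shot splitting, and quality filtering).

```python
# Minimal sketch of HDTF-style clip preprocessing: crop a (hypothetical,
# fixed) square face region and resize it to 512x512 at 25 fps.
# Assumes ffmpeg is installed and on PATH; real pipelines detect the
# face box per clip instead of hard-coding it.
import subprocess
from pathlib import Path

def preprocess_clip(src: Path, dst: Path, x: int, y: int, size: int) -> None:
    """Crop a square region (top-left x, y; side `size`) and rescale to 512x512."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    cmd = [
        "ffmpeg", "-y", "-i", str(src),
        "-vf", f"crop={size}:{size}:{x}:{y},scale=512:512",
        "-r", "25",        # common training frame rate; datasets vary
        "-c:a", "copy",    # keep the original audio for audio-driven models
        str(dst),
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Hypothetical paths and crop box, for illustration only.
    preprocess_clip(Path("raw/clip_0001.mp4"), Path("processed/clip_0001.mp4"),
                    x=320, y=60, size=720)
```
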
| Year | Title | Code | Project | Keywords |
|---|---|---|---|---|
| 2024 | [Audio2Photoreal] From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations | Code | Project | Photoreal |
| 2024 | [Animate Anyone] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation | Code | Project | 🔥Animate (Alibaba; known for driving the viral "Subject Three" dance) |
| 2024 | [3DGAN] What You See Is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs | | Project | 🔥Nvidia |
| 2024 | [LivePortrait] LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control | Code | Project | 🔥Kuaishou |

| Year | Title | Conference/Journal | Code | Project | Keywords |
|---|---|---|---|---|---|
| 2024 | [GLDiTalker] GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer | arXiv 2024 | | | |
| 2024 | Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation | arXiv 2024 | | | |
| 2024 | [JambaTalk] JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model | arXiv 2024 | | | 3D |
| 2024 | [Talk Less, Interact Better] Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs | COLM 2024 | Code | | LLM |
| 2024 | [Digital Avatars] Digital Avatars: Framework Development and Their Evaluation | arXiv 2024 | | | |
| 2024 | [EmoTalk3D] EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head | ECCV 2024 | | Project | |

