A Survey on Video Diffusion Models

<div style="text-align:center; font-size: 18px;">
<a href="https://chenhsing.github.io">Zhen Xing</a>, Qijun Feng, Haoran Chen, <a href="https://scholar.google.com/citations?user=NSJY12IAAAAJ&hl=zh-CN" >Qi Dai</a>, <a href="https://scholar.google.com/citations?user=Jkss014AAAAJ&hl=zh-CN&oi=ao" >Han Hu</a>, <a href="https://scholar.google.com/citations?user=J_8TX6sAAAAJ&hl=zh-CN&oi=ao" >Hang Xu</a>, <a href="https://scholar.google.com/citations?user=7t12hVkAAAAJ&hl=en" >Zuxuan Wu</a>, <a href="https://scholar.google.com/citations?user=f3_FP8AAAAAJ&hl=en" >Yu-Gang Jiang</a>
</div>
<img src="asset/fish.webp" width="160px"/> <img src="asset/tree.gif" width="160px"/> <img src="asset/raccoon.gif" width="160px"/>
<img src="asset/svd.gif" width="240px"/> <img src="asset/fly3.gif" width="240px"/>
<img src="asset/1.gif" width="120px"/> <img src="asset/2.gif" width="120px"/> <img src="asset/3.gif" width="120px"/> <img src="asset/4.gif" width="120px"/>
(Source: <a href="https://makeavideo.studio/">Make-A-Video</a>, <a href="https://chenhsing.github.io/SimDA/">SimDA</a>, <a href="https://research.nvidia.com/labs/dir/pyoco/">PYoCo</a>, <a href="https://img.shields.io/badge/Website-9cf">SVD</a>, <a href="https://research.nvidia.com/labs/toronto-ai/VideoLDM/">Video LDM</a> and <a href="https://tuneavideo.github.io/">Tune-A-Video</a>)

[News] We are planning to update the survey soon to encompass the latest work. If you have any suggestions, please feel free to contact us.
[News] The Chinese translation is available on Zhihu. Special thanks to Dai-Wenxun for this.

Open-source Toolboxes and Foundation Models

Methods	Task	Github
Open-Sora-Plan	T2V Generation
Open-Sora	T2V Generation
Morph Studio	T2V Generation	-
Genie	T2V Generation	-
Sora	T2V Generation & Editing	-
VideoPoet	T2V Generation & Editing	-
Stable Video Diffusion	T2V Generation
NeverEnds	T2V Generation	-
Pika	T2V Generation	-
EMU-Video	T2V Generation	-
GEN-2	T2V Generation & Editing	-
ModelScope	T2V Generation
ZeroScope	T2V Generation	-
T2V Synthesis Colab	T2V Generation
VideoCraft	T2V Generation & Editing
Diffusers (T2V synthesis)	T2V Generation	-
AnimateDiff	Personalized T2V Generation
Text2Video-Zero	T2V Generation
HotShot-XL	T2V Generation
Genmo	T2V Generation	-
Fliki	T2V Generation	-

Video Generation
- Data
- - Caption-level
- - Category-level
- T2V Generation
- - Training-based
- - Training-free
- Video Generation with Other Conditions
- - Pose-guided
- - Instruct-guided
- - Sound-guided
- - Brain-guided
- - Multi-Modal guided
- Unconditional Video Generation
- - U-Net based
- - Transformer-based
- Video Completion
- - Video Enhance and Restoration
- - Video Prediction
Video Editing
- Text-guided Video Editing
- - Training-based Editing
- - One-shot Editing
- - Training-free
- Modality-guided Video Editing
- - Motion-guided
- - Instruct-guided
- - Sound-guided
- - Multi-Modal Control
- Domain-specific editing
- Non-diffusion editing
Video Understanding
Contact

Video Generation

Data

Caption-level

Title	Github	WebSite	Pub. & Date
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation			Jun., 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers			CVPR, 2024
CelebV-Text: A Large-Scale Facial Text-Video Dataset		-	CVPR, 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation		-	May, 2023
VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation	-	-	May, 2023
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions	-	-	Nov, 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval	-	-	ICCV, 2021
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language	-	-	CVPR, 2016

Category-level

Title	Github	WebSite	Pub. & Date
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild	-	-	Dec., 2012
First Order Motion Model for Image Animation	-	-	NeurIPS, 2019
Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks	-	-	CVPR, 2018

Metric and Benchmark

Title	WebSite	Pub. & Date
Fréchet Video Motion Distance: A Metric for Evaluating Motion Consistency in Videos	-	Jul., 2024
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation		Jun., 2024
[STREAM: Spatio-TempoRal Evaluation and Analysis Metric for Video Generative