Notes on Video Diffusion Models
Image-Based Video Editing
Let’s recall the diffusion-based approach to image editing:
- Real image -> invert into a latent -> sample with the target text condition -> edited image
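This invert-then-resample loop can be sketched with a toy numpy implementation of deterministic DDIM, where `eps_model` is a stand-in for the real noise-prediction UNet (in practice it also takes the text embedding, and inversion uses the source prompt while sampling uses the edit prompt):

```python
import numpy as np

def ddim_step(x, eps, abar_from, abar_to):
    # deterministic DDIM update: move x from noise level abar_from to abar_to
    x0_pred = (x - np.sqrt(1.0 - abar_from) * eps) / np.sqrt(abar_from)
    return np.sqrt(abar_to) * x0_pred + np.sqrt(1.0 - abar_to) * eps

def ddim_invert(x0, eps_model, abars):
    # run the update "backwards": clean latent -> noisy latent (abars[0] = 1.0)
    x = x0
    for t in range(len(abars) - 1):
        x = ddim_step(x, eps_model(x, t), abars[t], abars[t + 1])
    return x

def ddim_sample(x_T, eps_model, abars):
    # standard deterministic sampling: noisy latent -> clean latent
    x = x_T
    for t in reversed(range(len(abars) - 1)):
        x = ddim_step(x, eps_model(x, t + 1), abars[t + 1], abars[t])
    return x
```

With the same conditioning in both passes the round trip reconstructs the input; editing works by swapping the text condition (and hence the predicted noise) in the sampling pass.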
General equations for Image-based Video Editing methods:
Revisit the T2I Stable Diffusion Model
Stable Diffusion Notes borrowed from my labmate Weiming Ren.
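One detail worth recalling: at each denoising step, Stable Diffusion combines a conditional and an unconditional noise prediction via classifier-free guidance. A minimal sketch (the two predictions would come from the UNet with and without the text prompt):

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, guidance_scale):
    # classifier-free guidance: extrapolate away from the unconditional
    # prediction toward the text-conditional one
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Scale 1.0 recovers the pure conditional prediction; the usual SD default is around 7.5.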
TokenFlow
Consistent Diffusion Features for Consistent Video Editing
NeurIPS 2023
Keywords: Image-based, Nearest Neighbor Field, Extended Attention, No fine-tuning/training
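TokenFlow's core idea: edit only a few keyframes, then propagate the edited diffusion features to all other frames via a nearest-neighbor field computed on the *source* features. A toy numpy sketch under simplifying assumptions (cosine similarity over flattened token features; the names are mine, not the paper's code):

```python
import numpy as np

def nearest_neighbor_field(frame_feats, key_feats):
    # for each token of a frame, the index of its closest keyframe token
    f = frame_feats / np.linalg.norm(frame_feats, axis=-1, keepdims=True)
    k = key_feats / np.linalg.norm(key_feats, axis=-1, keepdims=True)
    return np.argmax(f @ k.T, axis=-1)            # (n_tokens,)

def propagate(nnf, edited_key_feats):
    # replace each frame token with the *edited* feature of its nearest key token
    return edited_key_feats[nnf]
```

Because the correspondences are computed on the unedited features, the propagation enforces the same temporal consistency the source video had.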
PNP (Plug-and-Play) is a special way to inject source-image features into the sampling pass for image editing.
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
CVPR 2023
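Concretely, PNP caches the UNet's spatial features and self-attention maps from the source (inversion) pass and injects them into the edit pass for high-noise timesteps, so the layout comes from the source while the prompt drives appearance. A hypothetical sketch of that cache-and-inject logic (names and threshold are illustrative, not the paper's code):

```python
class FeatureCache:
    """Cache features from the source pass; inject them into the edit pass
    for high-noise timesteps (t >= tau), where the layout is decided."""

    def __init__(self, tau):
        self.tau = tau
        self.store = {}                      # (layer, t) -> source feature

    def save(self, layer, t, feat):
        self.store[(layer, t)] = feat

    def maybe_inject(self, layer, t, edit_feat):
        # early steps: keep the source structure; late steps: let the prompt act
        if t >= self.tau:
            return self.store.get((layer, t), edit_feat)
        return edit_feat
```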
Fairy
Fast Parallelized Instruction-Guided Video-to-Video Synthesis
Keywords: Image-based, Cache anchor frames, Requires fine-tuning/training
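Fairy's anchor-based mechanism in a nutshell: keys and values are cached from a small set of anchor frames, and every frame's queries attend to those shared anchors, which keeps edits consistent across frames and lets frames be processed in parallel. A toy single-head sketch (a simplification of full multi-head attention):

```python
import numpy as np

def cross_frame_attention(q, anchor_k, anchor_v):
    # each frame's queries attend to keys/values cached from the anchor frames
    scores = q @ anchor_k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # softmax stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ anchor_v                            # (n_queries, d_v)
```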
Text-to-Video Generation
LaVie
Misc Info
Available data sources:
- WebVid-10M (captioned, but videos carry a watermark)
- HD-VILA-100M (metadata only)
- Pexels (high quality, but requires crawling)
- DAVIS (use as a reference)
- Vimeo 25M dataset (not released yet, but we can try asking the LaVie team)
Required video format:
- Training: square video
- Inference: square video, or 1.75 aspect ratio (672 x 384)
- Length: 2–15 seconds
- No cutscenes
All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.