Image-Based Video Editing

Let’s recall diffusion-based approaches for image editing:

  • Real image -> invert into a latent -> sample with the text condition -> edited image

The general equation shared by image-based video editing methods:

$$\mathcal{V}^* = \mathcal{D}\big(\text{DDIM-samp}\big(\text{DDIM-inv}(\mathcal{E}(\mathcal{V})),\ \mathcal{T}^*\big)\big)$$

where $\mathcal{E}$ and $\mathcal{D}$ are the VAE encoder and decoder, $\mathcal{V}$ is the source video, and $\mathcal{T}^*$ is the target text prompt.
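
In code, that per-frame pipeline looks roughly like the sketch below, written with Hugging Face diffusers (a minimal sketch: no classifier-free guidance, and frame-by-frame, so no temporal consistency yet; `edit_frame`, `source_prompt`, and `edit_prompt` are illustrative names, not from any paper):

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, DDIMInverseScheduler

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
inv_scheduler = DDIMInverseScheduler.from_config(pipe.scheduler.config)

def embed(prompt):
    """CLIP text embedding for a single prompt."""
    ids = pipe.tokenizer(
        prompt, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    return pipe.text_encoder(ids)[0]

@torch.no_grad()
def edit_frame(frame, source_prompt, edit_prompt, num_steps=50):
    """frame: (1, 3, H, W) tensor in [-1, 1], same dtype/device as the pipe."""
    # E(V): encode the frame into the VAE latent space.
    z = pipe.vae.encode(frame).latent_dist.mean * pipe.vae.config.scaling_factor

    # DDIM-inv: deterministically map the clean latent back to a noisy one,
    # conditioned on a caption describing the source frame.
    src_emb = embed(source_prompt)
    inv_scheduler.set_timesteps(num_steps, device=device)
    for t in inv_scheduler.timesteps:
        eps = pipe.unet(z, t, encoder_hidden_states=src_emb).sample
        z = inv_scheduler.step(eps, t, z).prev_sample

    # DDIM-samp with T*: denoise the inverted latent under the edit prompt.
    edit_emb = embed(edit_prompt)
    pipe.scheduler.set_timesteps(num_steps, device=device)
    for t in pipe.scheduler.timesteps:
        eps = pipe.unet(z, t, encoder_hidden_states=edit_emb).sample
        z = pipe.scheduler.step(eps, t, z).prev_sample

    # D(.): decode the edited latent back to pixels.
    return pipe.vae.decode(z / pipe.vae.config.scaling_factor).sample
```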

Revisiting the T2I Stable Diffusion Model

Stable Diffusion Notes borrowed from my labmate Weiming Ren.

TokenFlow

Consistent Diffusion Features for Consistent Video Editing
ICLR 2024

Keywords: Image-based, Nearest Neighbor Field, Extended Attention, No fine-tuning/training
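
A toy sketch of the nearest-neighbor-field idea (my reading of the paper, not the authors' code): the correspondence is computed on the *source* video's diffusion features, then used to pull *edited* keyframe features into every frame, so per-frame edits inherit the source video's temporal correspondences. TokenFlow actually blends the two nearest keyframes; I simplify to a hard nearest neighbor here:

```python
import torch
import torch.nn.functional as F

def nn_field(src_feats, key_feats):
    """For each token in src_feats (N, C), the index of its nearest neighbor
    in key_feats (M, C) under cosine similarity -> (N,) indices."""
    src = F.normalize(src_feats, dim=-1)
    key = F.normalize(key_feats, dim=-1)
    return (src @ key.T).argmax(dim=-1)

def propagate(frame_feats_src, key_feats_src, key_feats_edited):
    """Compute the NN field on the *source* features, then pull the
    corresponding *edited* keyframe features through it."""
    idx = nn_field(frame_feats_src, key_feats_src)
    return key_feats_edited[idx]

# Tokens are flattened spatial features from a UNet decoder layer.
frame_tokens = torch.randn(4096, 320)         # current frame: 64x64 tokens, 320 channels
key_tokens_src = torch.randn(2 * 4096, 320)   # two neighboring keyframes, source pass
key_tokens_edit = torch.randn(2 * 4096, 320)  # same keyframes after the editing pass
edited_frame_tokens = propagate(frame_tokens, key_tokens_src, key_tokens_edit)
```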

PnP (Plug-and-Play) is a special way of injecting the source image's diffusion features (spatial features and self-attention maps) during sampling, so the edit preserves the source structure.
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
CVPR 2023
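
A rough sketch of how PnP-style injection can be wired up with PyTorch forward hooks (my own scaffolding; the layer choice and timestep threshold `tau` are assumptions): features from the source pass are cached per (layer, timestep), then substituted into the edit pass at the early, high-noise steps:

```python
import torch

feature_cache = {}     # (layer_name, timestep) -> tensor from the source pass
current = {"t": None}  # set by the sampling loop before each UNet call

def save_hook(name):
    def hook(module, args, output):
        feature_cache[(name, current["t"])] = output.detach()
    return hook

def inject_hook(name, tau=600):
    def hook(module, args, output):
        t = current["t"]
        # PnP injects only at the early, high-noise timesteps (t >= tau);
        # a non-None return value replaces the module's output.
        if t is not None and t >= tau and (name, t) in feature_cache:
            return feature_cache[(name, t)]
        return output
    return hook

# Source pass: cache a mid-decoder block's features.
#   handle = unet.up_blocks[1].register_forward_hook(save_hook("up_block_1"))
# Edit pass: remove that hook, register inject_hook("up_block_1") on the same
# block, and set current["t"] = int(t) in the denoising loop before each step.
```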

Fairy

Fast Parallelized Instruction-Guided Video-to-Video Synthesis

Keywords: Image-based, Cache Anchor frames, Yes fine-tuning/training
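
A hedged sketch of the anchor-frame caching idea as I understand Fairy (class and variable names are my assumptions, not the paper's code): keys and values are computed once from a few anchor frames, and every frame's tokens attend to that shared cache, which is what lets all frames be edited in one parallel batch while staying consistent:

```python
import torch
import torch.nn.functional as F

class AnchorCrossFrameAttention(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(dim, dim, bias=False)
        self.to_v = torch.nn.Linear(dim, dim, bias=False)
        self.cached_kv = None  # shared keys/values from the anchor frames

    def cache_anchors(self, anchor_tokens):
        # anchor_tokens: (num_anchors * tokens, dim) -- computed once, reused everywhere.
        self.cached_kv = (self.to_k(anchor_tokens), self.to_v(anchor_tokens))

    def forward(self, frame_tokens):
        # frame_tokens: (frames, tokens, dim). All frames attend to the same
        # anchor keys/values, so they can be processed in one parallel batch.
        q = self.to_q(frame_tokens)
        k, v = self.cached_kv
        k = k.unsqueeze(0).expand(q.shape[0], -1, -1)
        v = v.unsqueeze(0).expand(q.shape[0], -1, -1)
        return F.scaled_dot_product_attention(q, k, v)

attn = AnchorCrossFrameAttention(dim=320)
anchors = torch.randn(3 * 256, 320)  # 3 anchor frames, 256 tokens each
attn.cache_anchors(anchors)
frames = torch.randn(16, 256, 320)   # 16 frames in one parallel batch
out = attn(frames)                   # (16, 256, 320), anchor-consistent features
```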


Misc Info

Available data sources:

  • WebVid-10M (captions; videos carry a watermark)
  • HD-VILA-100M (metadata)
  • Pexels (high quality, but requires crawling)
  • DAVIS (use as a reference benchmark)
  • Vimeo 25M (not released yet, but we can try asking the LaVie team)

Required video format:

  • Training: square video
  • Inference: square video, or 1.75 aspect ratio (672 x 384)
  • Length: 2-15 seconds
  • No cutscenes (shot changes)