Image-Based Video Editing

Let’s recall diffusion-based approaches for image editing:

  • Real image -> invert into a latent -> sample with the text condition -> edited image

The general equation for image-based video editing methods:

$$\mathcal{V}^* = \mathcal{D}\Big(\text{DDIM-samp}\big(\text{DDIM-inv}(\mathcal{E}(\mathcal{V})),\ \mathcal{T}^*\big)\Big)$$

where $\mathcal{E}$/$\mathcal{D}$ are the VAE encoder/decoder, $\mathcal{V}$ is the source video, and $\mathcal{T}^*$ is the edit prompt.
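To make the recipe concrete, here is a minimal per-frame sketch with Hugging Face diffusers. The model ID, 50 steps, and 512x512 preprocessing are my assumptions, and classifier-free guidance is omitted for brevity; note this edits each frame independently, which is exactly the inconsistency the methods below address.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, DDIMInverseScheduler

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
sampler = DDIMScheduler.from_config(pipe.scheduler.config)
inverter = DDIMInverseScheduler.from_config(pipe.scheduler.config)

def embed(prompt):
    # CLIP text encoding of the prompt (the T / T* conditioning).
    ids = pipe.tokenizer(
        prompt, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    return pipe.text_encoder(ids)[0]

@torch.no_grad()
def ddim_run(latents, prompt_embeds, scheduler, num_steps=50):
    # One deterministic DDIM pass; the scheduler decides the direction
    # (DDIMInverseScheduler: z_0 -> z_T, DDIMScheduler: z_T -> z_0).
    scheduler.set_timesteps(num_steps, device=device)
    for t in scheduler.timesteps:
        eps = pipe.unet(latents, t, encoder_hidden_states=prompt_embeds).sample
        latents = scheduler.step(eps, t, latents).prev_sample
    return latents

@torch.no_grad()
def edit_frame(frame, source_prompt, edit_prompt):
    # frame: [1, 3, 512, 512] tensor in [-1, 1] (preprocessing assumed).
    z0 = pipe.vae.encode(frame.to(device, torch.float16)).latent_dist.mean
    z0 = z0 * pipe.vae.config.scaling_factor              # E(V)
    zT = ddim_run(z0, embed(source_prompt), inverter)      # DDIM-inv
    z0_edit = ddim_run(zT, embed(edit_prompt), sampler)    # DDIM-samp with T*
    return pipe.vae.decode(z0_edit / pipe.vae.config.scaling_factor).sample  # D(.)
```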

Revisit T2I Stable Diffusion Model

Stable Diffusion Notes borrowed from my labmate Weiming Ren.
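As a quick reference, everything below builds on the latent-diffusion training objective (standard LDM notation; $\mathcal{E}$ is the VAE encoder as above, $\tau_\theta$ the CLIP text encoder conditioning the denoising UNet $\epsilon_\theta$):

$$\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, I),\, t}\Big[\big\|\epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y))\big\|_2^2\Big]$$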





TokenFlow

Consistent Diffusion Features for Consistent Video Editing
ICLR 2024

Keywords: Image-based, Nearest Neighbor Field, Extended Attention, No fine-tuning/training
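A minimal sketch of the nearest-neighbor-field propagation as I read the paper (tensor names and shapes are mine, not the authors' code): each frame's diffusion-feature tokens are matched against the *source* features of a keyframe, and the matched *edited* keyframe features are copied back. The paper actually blends the two temporally adjacent keyframes weighted by distance; a single keyframe is shown here for brevity.

```python
import torch
import torch.nn.functional as F

def nearest_neighbor_field(frame_feats, key_src_feats):
    # frame_feats:   [N, C] diffusion-feature tokens of one source frame
    # key_src_feats: [M, C] tokens of a *source* keyframe
    # Returns the index of the cosine-nearest keyframe token per frame token.
    f = F.normalize(frame_feats, dim=-1)
    k = F.normalize(key_src_feats, dim=-1)
    return (f @ k.t()).argmax(dim=-1)            # [N]

def propagate_edit(frame_feats, key_src_feats, key_edit_feats):
    # The NN field is computed on SOURCE features (which are temporally
    # consistent by construction) and then used to index EDITED keyframe
    # features, so every frame inherits a consistent edit with no training.
    nn_idx = nearest_neighbor_field(frame_feats, key_src_feats)
    return key_edit_feats[nn_idx]                # [N, C]
```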





PNP (Plug-and-Play) is a special way of injecting source features during sampling for image editing.
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
CVPR 2023
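Roughly, PNP runs two sampling passes from the same inverted latent: a reconstruction pass whose intermediate UNet features are cached, and an edit pass where those features are injected back at matching timesteps, preserving the source layout under the new text. A hook-based sketch, reusing `pipe`, `ddim_run`, `embed`, and an inverted latent `zT` from the inversion sketch above; the layer choice is illustrative, and the paper injects several decoder residual features plus self-attention maps:

```python
import torch

feature_cache = []

def save_features(module, inputs, output):
    feature_cache.append(output.detach())

def inject_features(module, inputs, output):
    # Pop the feature cached at the same timestep of the source pass.
    return feature_cache.pop(0)

layer = pipe.unet.up_blocks[1].resnets[1]      # illustrative decoder block

# Pass 1: reconstruct the source from z_T and cache this layer's outputs.
handle = layer.register_forward_hook(save_features)
_ = ddim_run(zT, embed(source_prompt), sampler)
handle.remove()

# Pass 2: sample with the edit prompt while injecting the cached features.
handle = layer.register_forward_hook(inject_features)
edited_latents = ddim_run(zT, embed(edit_prompt), sampler)
handle.remove()
```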

Fairy

Fast Parallelized Instruction-Guided Video-to-Video Synthesis

Keywords: Image-based, Cached anchor frames, Requires fine-tuning/training
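The core trick, as I understand it, is anchor-based cross-frame attention: keys/values from a few anchor frames are computed once and cached, and every frame's queries attend to its own tokens plus the anchors'. Because the anchor cache is fixed, all frames can be edited in parallel, which is where the speed comes from. A shape-level sketch (names are illustrative):

```python
import torch
import torch.nn.functional as F

def cross_frame_attention(q, k, v, anchor_k, anchor_v):
    # q, k, v:   [B, heads, N, d]   self-attention tensors of one frame
    # anchor_k,
    # anchor_v:  [B, heads, A*N, d] cached once from the A anchor frames
    k = torch.cat([k, anchor_k], dim=2)   # attend over own + anchor tokens
    v = torch.cat([v, anchor_v], dim=2)
    return F.scaled_dot_product_attention(q, k, v)
```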


Text-to-Video Generation

LaVie

High-Quality Video Generation with Cascaded Latent Diffusion Models

In brief, LaVie is a cascade of three latent diffusion models: a base T2V model (Stable Diffusion inflated with temporal self-attention and rotary position embeddings), a temporal frame-interpolation model, and a video super-resolution model, trained with joint image-video fine-tuning on the Vimeo25M dataset.


Misc Info

Available data sources:

  • WebVid-10M (captioned, but videos carry a watermark)
  • HD-VILA-100M (metadata)
  • Pexels (high quality, but requires crawling)
  • DAVIS (use as a reference benchmark)
  • Vimeo 25M (not released yet, but we can try asking the LaVie team)

Video format requirements (a preprocessing sketch follows the list):

  • Training: square videos
  • Inference: square videos, or 1.75 aspect ratio (672 x 384)
  • Length: 2-15 seconds
  • No scene cuts
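A small ffmpeg-based helper sketching the preprocessing these requirements imply: centered square crop, resize, and trimming to at most 15 s. The 512x512 output size and the paths are assumptions; scene-cut filtering (e.g., with PySceneDetect) would be a separate pass.

```python
import subprocess

def preprocess(src, dst, size=512, max_seconds=15):
    # Build the filter chain: ffmpeg's crop defaults to a centered crop,
    # and quoting min(iw,ih) keeps its comma out of the filter parser.
    vf = (
        "crop='min(iw,ih)':'min(iw,ih)',"   # centered square crop
        f"scale={size}:{size}"              # resize to training resolution
    )
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-t", str(max_seconds),             # keep at most 15 seconds
        "-vf", vf,
        "-an",                              # drop audio
        dst,
    ], check=True)

preprocess("raw/clip.mp4", "processed/clip.mp4")
```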