References

Wan: Open and Advanced Large-Scale Video Generative Models https://arxiv.org/abs/2503.20314

Improved Video VAE for Latent Video Diffusion Model https://arxiv.org/abs/2411.06449

Perception Prioritized Training of Diffusion Models https://arxiv.org/abs/2204.00227

Wan 2.1 and Wan 2.2

Model Spec

Variational Autoencoder (VAE)

  • Wan-VAE (127M): 4 × 8 × 8 compression (time × height × width) with a 16-dim latent
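
A quick shape sanity check for that compression ratio. The causal convention below (first frame encoded on its own, so T frames map to 1 + (T − 1)/4 latent frames) is an assumption based on common causal video VAEs, not spelled out in these notes:

```python
def latent_shape(frames: int, height: int, width: int) -> tuple:
    # 16 latent channels; 8x spatial downsampling per axis; 4x temporal
    # downsampling with the first frame encoded alone (causal assumption).
    return (16, 1 + (frames - 1) // 4, height // 8, width // 8)

print(latent_shape(81, 480, 832))  # -> (16, 21, 60, 104)
```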

The key training trick:

  1. Train a good image VAE first
  2. Jump-start on short low-res video (128×128, T=5)
  3. Train on actual high-res video with a GAN loss (720×720, T=?); a minimal loss sketch follows this list
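
A hedged sketch of what the stage-3 objective could look like (reconstruction + KL + adversarial term). The loss weights, the L1 reconstruction choice, and the PatchGAN-style discriminator logits are assumptions for illustration, not Wan-VAE's actual recipe:

```python
import torch
import torch.nn.functional as F

def vae_gan_loss(recon, target, mu, logvar, disc_fake_logits,
                 kl_weight=1e-6, gan_weight=0.5):
    # Pixel reconstruction term (L1 is a common choice for video VAEs).
    rec = F.l1_loss(recon, target)
    # KL divergence of the diagonal-Gaussian posterior vs. N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Non-saturating generator loss: push the discriminator to call recon "real".
    gan = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return rec + kl_weight * kl + gan_weight * gan
```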

Diffusion model

  • DiT-based transformer (14B)

The same staged trick applies in pre-training:

  1. Train a good text-to-image model first
  2. Add in short low-res video (128p) + images
  3. Train on high-res video (480p, 720p @ 5 secs) + images

The training objective follows Rectified Flow (RF): the model is trained to predict the velocity, so the loss is the mean squared error (MSE) between the model output and the target velocity v_t. Under the standard linear interpolation path x_t = (1 − t)·x_0 + t·ε between data x_0 and noise ε, that target is v_t = ε − x_0.
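
A minimal training-step sketch under that convention; `model` and its (x_t, t, text_emb) signature are placeholders, not Wan's actual interface:

```python
import torch

def rf_loss(model, x0, text_emb):
    # Sample noise and a uniform timestep per example.
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over (C, T, H, W)
    # Linear interpolation path between data and noise.
    x_t = (1 - t_) * x0 + t_ * noise
    v_target = noise - x0                      # constant velocity along the path
    v_pred = model(x_t, t, text_emb)           # placeholder signature
    return torch.mean((v_pred - v_target) ** 2)
```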

Wan2.2 is a major upgrade to the Wan foundation video models, focused on incorporating the following innovations:

  • 👍 Effective MoE Architecture: Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into video diffusion models. Separating the denoising process across timesteps with specialized, powerful expert models enlarges the overall model capacity while keeping the per-step computational cost the same.

    • Experts are disentangled by denoising step
    • Expert switching is guided by SNR (see the sketch after the SNR breakdown below)
  • The observation (from the Perception Prioritized Training paper referenced above): diffusion models primarily learn perceptually rich content at higher (noisier) timesteps, whereas at lower timesteps they focus on straightforward noise removal

    • Large SNR (> 10^0): images are almost clean → the model learns only imperceptible details.

    • Medium SNR (10^−2 – 10^0): content is partially corrupted → the model learns perceptually rich content.

    • Small SNR (< 10^−2): almost pure noise → the model learns only coarse global features.
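A minimal sketch of the two-expert switching. The `boundary` value, the expert names, and the calling convention are illustrative; per the notes above, Wan2.2 derives the actual switch point from an SNR threshold:

```python
import torch

def denoise_step(x_t, t, text_emb, high_noise_expert, low_noise_expert,
                 boundary=0.9):
    # t in [0, 1] with t = 1 being pure noise (rectified-flow convention),
    # so large t <=> low SNR <=> the high-noise expert's regime.
    expert = high_noise_expert if t >= boundary else low_noise_expert
    return expert(x_t, torch.tensor([t]), text_emb)
```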

Text encoder

  • Text encoder: umT5-xxl (13B)
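
A minimal encoding sketch using the public google/umt5-xxl checkpoint via Hugging Face transformers; Wan's exact tokenization, padding, and precision setup may differ:

```python
from transformers import AutoTokenizer, UMT5EncoderModel

tok = AutoTokenizer.from_pretrained("google/umt5-xxl")
enc = UMT5EncoderModel.from_pretrained("google/umt5-xxl")

inputs = tok("a corgi surfing a wave at sunset", return_tensors="pt")
text_emb = enc(**inputs).last_hidden_state  # (1, seq_len, hidden) conditioning
```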

Pre-training / Post-training data

  1. Curriculum learning is important
  2. So is data quality (clean and diverse)
  3. Careful observation of the data is important

Basic filtering that removed ~50% of the initial data (an illustrative filter chain is sketched after this list):

  • Text detection (OCR to limit excessive text)
  • Aesthetic evaluation using a LAION-5B classifier
  • NSFW content filtering
  • Watermark/logo detection and cropping
  • Black border removal
  • Overexposure detection
  • Synthetic image filtering (crucial, since even <10% contamination significantly degrades performance)
  • Blur detection
  • Duration (>4 seconds) and resolution constraints
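
An illustrative filter chain for this curation step. Every attribute and threshold below is a stand-in for the real model or heuristic (OCR, aesthetic scorer, watermark detector), not the paper's actual values:

```python
def keep_clip(clip) -> bool:
    # Each lambda stands in for a learned filter or heuristic; thresholds
    # are placeholders, not the paper's reported settings.
    checks = [
        lambda c: c.ocr_text_ratio < 0.05,   # limit excessive on-screen text
        lambda c: c.aesthetic_score > 4.5,   # aesthetic-classifier threshold
        lambda c: not c.is_nsfw,
        lambda c: not c.has_watermark,       # detect, then crop or drop
        lambda c: not c.is_synthetic,        # even small contamination hurts
        lambda c: c.blur_score < 0.3,
        lambda c: c.duration_s > 4.0,        # duration constraint from the notes
    ]
    return all(check(clip) for check in checks)
```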

All text prompts are recaptioned.