References

Wan: Open and Advanced Large-Scale Video Generative Models https://arxiv.org/abs/2503.20314

Improved Video VAE for Latent Video Diffusion Model https://arxiv.org/abs/2411.06449

Perception Prioritized Training of Diffusion Models https://arxiv.org/abs/2204.00227

Wan 2.1 and Wan 2.2

Model Spec

Variational Autoencoder (VAE)

  • Wan-VAE (127M): 4 × 8 × 8 compression (time × height × width) with a 16-dim latent
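
A quick shape sanity check for that compression ratio. The causal convention below (first frame encoded on its own, so T frames map to 1 + (T − 1)/4 latent frames) is an assumption based on common causal video VAEs, not spelled out in these notes:

```python
def latent_shape(frames: int, height: int, width: int) -> tuple:
    # 16 latent channels; 8x spatial downsampling per axis; 4x temporal
    # downsampling with the first frame encoded alone (causal assumption).
    return (16, 1 + (frames - 1) // 4, height // 8, width // 8)

print(latent_shape(81, 480, 832))  # -> (16, 21, 60, 104)
```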

The key training trick:

  1. Train a good image VAE first
  2. Jump-start on short low-res video (128×128, T=5)
  3. Train on actual high-res video with a GAN loss (720×720, T=?); a minimal loss sketch follows this list
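
A hedged sketch of what the stage-3 objective could look like (reconstruction + KL + adversarial term). The loss weights, the L1 reconstruction choice, and the PatchGAN-style discriminator logits are assumptions for illustration, not Wan-VAE's actual recipe:

```python
import torch
import torch.nn.functional as F

def vae_gan_loss(recon, target, mu, logvar, disc_fake_logits,
                 kl_weight=1e-6, gan_weight=0.5):
    # Pixel reconstruction term (L1 is a common choice for video VAEs).
    rec = F.l1_loss(recon, target)
    # KL divergence of the diagonal-Gaussian posterior vs. N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Non-saturating generator loss: push the discriminator to call recon "real".
    gan = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return rec + kl_weight * kl + gan_weight * gan
```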

Diffusion model

  • DiT-based transformer (14B)

The same staged trick applies in pre-training:

  1. Train a good text-to-image model first
  2. Add in short low-res video (128p) + images
  3. Train on high-res video (480p, 720p @ 5 secs) + images

The training objective follows Rectified Flow (RF): the model is trained to predict the velocity, so the loss is the mean squared error (MSE) between the model output and the target velocity v_t. Under the standard linear interpolation path x_t = (1 − t)·x_0 + t·ε between data x_0 and noise ε, that target is v_t = ε − x_0.
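
A minimal training-step sketch under that convention; `model` and its (x_t, t, text_emb) signature are placeholders, not Wan's actual interface:

```python
import torch

def rf_loss(model, x0, text_emb):
    # Sample noise and a uniform timestep per example.
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))  # broadcast over (C, T, H, W)
    # Linear interpolation path between data and noise.
    x_t = (1 - t_) * x0 + t_ * noise
    v_target = noise - x0                      # constant velocity along the path
    v_pred = model(x_t, t, text_emb)           # placeholder signature
    return torch.mean((v_pred - v_target) ** 2)
```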

Wan2.2 is a major upgrade to the Wan foundation video models, focused on incorporating the following innovations:

  • 👍 Effective MoE Architecture: Wan2.2 introduces a Mixture-of-Experts (MoE) architecture into video diffusion models. Separating the denoising process across timesteps with specialized, powerful expert models enlarges the overall model capacity while keeping the per-step computational cost the same.

    • Experts are disentangled by denoising step
    • Expert switching is guided by SNR (see the sketch after the SNR breakdown below)
  • The observation (from the Perception Prioritized Training paper referenced above): diffusion models primarily learn perceptually rich content at higher (noisier) timesteps, whereas at lower timesteps they focus on straightforward noise removal

    • Large SNR (> 10^0): images are almost clean → the model learns only imperceptible details.

    • Medium SNR (10^−2 – 10^0): content is partially corrupted → the model learns perceptually rich content.

    • Small SNR (< 10^−2): almost pure noise → the model learns only coarse global features.
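A minimal sketch of the two-expert switching. The `boundary` value, the expert names, and the calling convention are illustrative; per the notes above, Wan2.2 derives the actual switch point from an SNR threshold:

```python
import torch

def denoise_step(x_t, t, text_emb, high_noise_expert, low_noise_expert,
                 boundary=0.9):
    # t in [0, 1] with t = 1 being pure noise (rectified-flow convention),
    # so large t <=> low SNR <=> the high-noise expert's regime.
    expert = high_noise_expert if t >= boundary else low_noise_expert
    return expert(x_t, torch.tensor([t]), text_emb)
```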

Text encoder

  • Text encoder: umT5-xxl (13B)
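
A minimal encoding sketch using the public google/umt5-xxl checkpoint via Hugging Face transformers; Wan's exact tokenization, padding, and precision setup may differ:

```python
from transformers import AutoTokenizer, UMT5EncoderModel

tok = AutoTokenizer.from_pretrained("google/umt5-xxl")
enc = UMT5EncoderModel.from_pretrained("google/umt5-xxl")

inputs = tok("a corgi surfing a wave at sunset", return_tensors="pt")
text_emb = enc(**inputs).last_hidden_state  # (1, seq_len, hidden) conditioning
```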

Pre-training / Post-training data

  1. Curriculum learning is important
  2. So is data quality (clean and diverse)
  3. Careful observation of the data is important

Basic filtering that removed ~50% of the initial data (an illustrative filter chain is sketched after this list):

  • Text detection (OCR to limit excessive text)
  • Aesthetic evaluation using a LAION-5B classifier
  • NSFW content filtering
  • Watermark/logo detection and cropping
  • Black border removal
  • Overexposure detection
  • Synthetic image filtering (crucial, since even <10% contamination significantly degrades performance)
  • Blur detection
  • Duration (>4 seconds) and resolution constraints
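
An illustrative filter chain for this curation step. Every attribute and threshold below is a stand-in for the real model or heuristic (OCR, aesthetic scorer, watermark detector), not the paper's actual values:

```python
def keep_clip(clip) -> bool:
    # Each lambda stands in for a learned filter or heuristic; thresholds
    # are placeholders, not the paper's reported settings.
    checks = [
        lambda c: c.ocr_text_ratio < 0.05,   # limit excessive on-screen text
        lambda c: c.aesthetic_score > 4.5,   # aesthetic-classifier threshold
        lambda c: not c.is_nsfw,
        lambda c: not c.has_watermark,       # detect, then crop or drop
        lambda c: not c.is_synthetic,        # even small contamination hurts
        lambda c: c.blur_score < 0.3,
        lambda c: c.duration_s > 4.0,        # duration constraint from the notes
    ]
    return all(check(clip) for check in checks)
```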

All text prompts are recaptioned.