FastVideo

FastVideo is a unified post-training and inference framework for accelerated video generation. GitHub

Context Forcing (Feb 26’)

Context Forcing: Consistent Autoregressive Video Generation with Long Context

LongLive (Sep 25’)

LongLive: Real-time Interactive Long Video Generation

LongLive is a 1.3B-parameter frame-level autoregressive video generation framework that produces real-time, interactive long videos (up to 240 seconds) from sequential text prompts while maintaining high visual quality and temporal coherence.

Core idea

  • Uses causal attention with KV caching to get diffusion-like visual quality but with real-time throughput (20.7 FPS on a single H100 at 832×480), enabling interactive long video generation instead of only static single-prompt clips.
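
A minimal sketch of what this interactive loop can look like, assuming a hypothetical chunk-level generation interface (`new_kv_cache`, `encode_prompt`, `generate_chunk`, and `kv_for` are illustrative stand-ins, not LongLive's actual API):

```python
# Hypothetical interactive generation loop for a frame-level causal AR video model.
def generate_interactive(model, prompt_stream, chunks_per_prompt=10):
    """Stream video chunks in real time, switching prompts between chunks."""
    kv_cache = model.new_kv_cache()            # persistent causal KV cache
    for prompt in prompt_stream:               # user supplies prompts sequentially
        # Note: on a prompt switch LongLive additionally rebuilds the cache
        # from the already generated frames (KV recache, see Method below).
        text_emb = model.encode_prompt(prompt)
        for _ in range(chunks_per_prompt):
            # Few-step diffusion for the next chunk, attending causally to the
            # cached keys/values of previously generated frames.
            chunk = model.generate_chunk(text_emb, kv_cache)
            kv_cache.append(model.kv_for(chunk, text_emb))
            yield chunk                        # frames can be displayed immediately
```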

Method


  • KV recache: When the user changes the prompt, the model rebuilds the KV cache from the already generated frames plus the new prompt, removing residual semantics of the old prompt while preserving motion and scene continuity; this resolves the trade-off between abrupt cuts (clearing the cache) and prompt inertia (keeping the cache). See the sketch after this list.
  • Streaming long tuning (train-long–test-long): The model is trained by rolling out long sequences as a series of short clips (e.g., 5 s each), always conditioning on its own previous generations and supervising only the newly generated clip; this aligns training with inference while avoiding OOM and lets a short-clip teacher reliably supervise long horizons.
  • Short-window attention + frame sink: At inference and during tuning, attention is restricted to a local temporal window to reduce compute, while a small set of sink tokens (the first-frame chunk) stays globally attendable, restoring long-range consistency even with a short window and yielding roughly 28% compute and 17% memory savings over full-window attention (see the mask sketch after this list).
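
A rough sketch of the KV recache and sink-window mechanisms above, assuming a hypothetical per-frame KV interface (`model.kv_for`, a list-like `kv_cache`); window and sink sizes are tuned hyperparameters, and none of this is the released code:

```python
import torch

def kv_recache(model, kv_cache, generated_frames, new_text_emb):
    """On a prompt switch, rebuild the KV cache from the already generated frames
    plus the NEW prompt: old-prompt semantics are dropped, while motion and scene
    continuity (carried by the frames themselves) are preserved."""
    kv_cache.clear()
    for frame in generated_frames:
        kv_cache.append(model.kv_for(frame, new_text_emb))   # re-encode under the new prompt
    return kv_cache

def sink_window_mask(num_frames: int, window: int, sink: int) -> torch.Tensor:
    """Boolean causal mask: query frame i may attend to the first `sink` frames
    (frame sink) plus the most recent `window` frames up to and including itself."""
    i = torch.arange(num_frames).unsqueeze(1)   # query frame index
    j = torch.arange(num_frames).unsqueeze(0)   # key frame index
    causal = j <= i
    local = (i - j) < window                    # short temporal window
    is_sink = j < sink                          # globally attendable first-frame chunk
    return causal & (local | is_sink)
```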

Training and data strategy

  • Built on Wan2.1-T2V-1.3B, first adapted to a few-step causal AR model with self-forcing DMD on VidProM, then long-tuned on synthetic 60-second sequences with exactly one prompt switch per sequence and KV recache applied at the switch.
  • Uses self-supervised distillation from a short-clip teacher and LLM-generated follow-up prompts (Qwen2-72B-Instruct) rather than additional real video datasets.

Ablations show that KV recache gives the best combination of subject/background consistency and semantic (CLIP) scores at prompt switches, and that frame sink plus a smaller window can match the consistency of larger windows at lower cost.

Limitations

  • Training and modeling assumptions
    • Streaming long tuning assumes a fixed clip length (e.g., 5 s), a fixed max training horizon (60 s), and one switch; anything beyond 60 s or with more complex interaction patterns is an extrapolation.
    • The teacher is only reliable on short clips; streaming long tuning explicitly avoids supervising long sequences end-to-end, which limits the correctness of very long-horizon supervision.
  • Efficiency–quality trade-offs
    • Short-window attention plus frame sink is a deliberate approximation: small windows are more efficient but would harm long-range consistency without the sink, so quality still depends on hand-chosen window and sink sizes.
    • KV recache is only applied once per training sample and is designed for sparse switches; frequent recaching at inference adds overhead and could become costly if switches are very dense.

Rolling Forcing (Sep 25’)

Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

Self Forcing (Jun 25’)

Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Self Forcing is a post-training paradigm for autoregressive video diffusion models that closes the exposure-bias gap by training the model under the same autoregressive rollout procedure used at inference, and supervising it with holistic video-level distribution matching losses.

Core idea

  • Standard Teacher Forcing and Diffusion Forcing denoise each frame in parallel using ground-truth or noisy context frames, which creates exposure bias and error accumulation at inference when the model conditions on its own imperfect outputs.
  • Self Forcing instead unrolls the full causal generation process during training: each frame is generated by few-step diffusion conditioned on previously self-generated frames, using KV caching just as in inference.
  • Since the training outputs truly follow the inference-time model distribution, the method can apply holistic video-level distribution-matching losses (DMD, SiD, or GAN) between generated and real/noised videos, rather than frame-wise denoising objectives.
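
In pseudocode, the difference is where the conditioning context comes from and at what granularity the loss is applied (all helper names below are illustrative, not the paper's code):

```python
def teacher_forcing_loss(model, real_video, text):
    # Frames are denoised in parallel, each conditioned on ground-truth context
    # frames -- a distribution the model never sees at inference time.
    noisy = add_noise(real_video)                       # hypothetical helper
    pred = model.denoise(noisy, context=real_video, text=text)
    return frame_wise_denoising_loss(pred, real_video)  # per-frame objective

def self_forcing_loss(model, critic, text, num_frames):
    # Frames are generated sequentially with few-step diffusion, each conditioned
    # on the model's OWN previous outputs via the KV cache -- exactly as at inference.
    video = model.rollout(text, num_frames)
    # Since the rollout follows the inference-time distribution, a holistic
    # video-level distribution-matching loss (DMD / SiD / GAN) can be applied
    # to the noised generated video as a whole.
    return video_level_distribution_loss(critic, add_noise(video), text)
```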

Algorithmic details

  • The model is a causal DiT-based few-step video diffusion model (4 denoising steps) over latent 3D-VAE codes, with the chain-rule factorization p(x_{1:N}) = ∏_{i=1}^{N} p(x_i | x_{<i}) over frames.
  • During Self Forcing training, they:
    • Autoregressively roll out all frames with KV caching.
    • Use gradient truncation: for each frame, a randomly chosen number of denoising steps s∈{1,…,T} is run and only the final step is backpropagated, while earlier steps and past frames are detached (no gradients through the KV cache).
    • Optimize a video-level distribution-matching objective (DMD, SiD, or GAN) on the generated videos after applying a forward noising process.
  • They also introduce a rolling KV cache for extrapolation: keep KV entries for a fixed window of recent frames and evict the oldest, enabling infinite-length autoregressive generation with O(TL) complexity and no KV recomputation.
  • To avoid artifacts when the initial “image latent” frame leaves the cache, they train with a restricted attention window that prevents attending to the first chunk when denoising the last chunk, matching the long-context setting.
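
A minimal sketch of the gradient-truncation and rolling-cache mechanics, again with a hypothetical model interface (`sample_noise`, `denoise_step`) and a plain Python list standing in for the KV cache:

```python
import random
import torch

def rollout_frame(model, kv_cache, text_emb, T=4):
    """Few-step diffusion for one frame; only the final denoising step carries gradients."""
    s = random.randint(1, T)                    # randomly chosen number of steps s ∈ {1, …, T}
    x = model.sample_noise()
    with torch.no_grad():                       # earlier steps: no gradient
        for step in range(1, s):
            x = model.denoise_step(x, step, kv_cache, text_emb)
    return model.denoise_step(x.detach(), s, kv_cache, text_emb)   # backprop this step only

def append_rolling(kv_cache, new_kv, window):
    """Rolling KV cache: detach past frames and keep only the most recent `window` entries."""
    kv_cache.append(new_kv.detach())            # no gradients flow through the KV cache
    if len(kv_cache) > window:
        kv_cache.pop(0)                         # evict the oldest frame's keys/values
```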

Experimental setup

  • Base model: Wan2.1-T2V-1.3B, a 5s, 16 FPS, 832×480 Flow Matching video diffusion model.
  • They implement both chunk-wise AR (3-frame chunks) and frame-wise AR variants with 4-step diffusion.
  • Training uses distribution matching against either a large teacher (Wan 14B) for DMD, a 1.3B model for SiD, or a relativistic GAN critic; training converges in ~1.5–3 hours on 64 H100s.

Limitations

  • Error accumulation can still appear on videos far longer than the training context, and gradient truncation may limit the learning of very long-range dependencies; future work might use more recurrent/state-space architectures to extend context.

FramePack (Apr 25’)

https://vinesmsuic.github.io/paper-framepack

CausVid (Dec 24’)

From Slow Bidirectional to Fast Autoregressive Video Diffusion Models