Notes on Video Models Events
WorldCanvas (Dec 25’)
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
A trajectory–text–reference driven framework for controllable “promptable world events” in video generation, built by post-training Wan 2.2 I2V on curated multimodal triplets.
- Lets a user specify the what, when, where, and who of an event by combining three inputs: trajectories encoding motion, timing, and visibility; reference images grounding object identity; and motion-focused text describing interactions.
Data and interface
- Built a 280k-sample dataset of trajectory–video–text triplets by tracking keypoints with YOLO+SAM+CoTracker, cropping clips to simulate object entry/exit, and running Qwen2.5-VL on trajectory-visualized clips to produce motion-centric captions aligned per trajectory; reference images are cropped from first-frame objects and mildly transformed to support drag-and-drop-style control (see the cropping sketch below).
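
The entry/exit simulation is essentially bookkeeping; here is a minimal sketch (my own illustration, not the paper's released pipeline) of how a spatial crop turns a full-frame trajectory into cropped coordinates plus a per-frame visibility mask:

```python
import numpy as np

def crop_trajectory(points, crop_box):
    """Re-express a tracked trajectory relative to a spatial crop and derive
    per-frame visibility (simulating object entry/exit).

    points:   (T, 2) array of (x, y) keypoint positions in full-frame coordinates
    crop_box: (x0, y0, x1, y1) crop window in full-frame coordinates
    returns:  cropped (T, 2) positions and a (T,) boolean visibility mask
    """
    points = np.asarray(points, dtype=np.float32)
    x0, y0, x1, y1 = crop_box
    visible = (
        (points[:, 0] >= x0) & (points[:, 0] < x1) &
        (points[:, 1] >= y0) & (points[:, 1] < y1)
    )
    cropped = points - np.array([x0, y0], dtype=np.float32)
    return cropped, visible

# Example: a point drifting rightward exits a 256-wide crop partway through.
traj = np.stack([np.linspace(100, 400, 16), np.full(16, 128)], axis=1)
pts, vis = crop_trajectory(traj, crop_box=(0, 0, 256, 256))
print(vis)  # True for early frames, False once x >= 256
```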
Model changes
- They modify Wan 2.2 I2V in two ways (both sketched in code below):
  - (1) injecting trajectories as Gaussian heatmaps plus a “point VAE map” concatenated to the DiT inputs via a 3D conv, and
  - (2) introducing Spatial-Aware Weighted Cross-Attention, which biases each caption’s text tokens toward the visual tokens spatially overlapping that trajectory’s bounding-box-defined region, improving per-agent text–motion binding.
At inference, users can set trajectory timing, shape (via point spacing), visibility masks, and per-trajectory captions, and can place multiple reference images, enabling multi-agent interactions, object entry/exit, and reference-guided appearance control.
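
A rough sketch of the Spatial-Aware Weighted Cross-Attention in (2): the exact weighting scheme and direction are assumptions, but the core idea is an additive bias on (visual token, text token) pairs so that visual tokens inside a trajectory's bounding-box region attend preferentially to that trajectory's caption tokens:

```python
import torch
import torch.nn.functional as F

def spatial_weighted_cross_attention(q_vis, k_txt, v_txt,
                                     region_masks, caption_slices, weight=2.0):
    """Cross-attention from visual tokens to text tokens with a per-trajectory
    spatial bias (a simplified sketch, not the paper's exact formulation).

    q_vis:          (N_vis, D) visual queries (flattened video tokens)
    k_txt, v_txt:   (N_txt, D) text keys/values for all concatenated captions
    region_masks:   list of (N_vis,) bool masks, one per trajectory's bbox region
    caption_slices: list of slices into the text tokens, one per trajectory
    """
    d = q_vis.shape[-1]
    logits = q_vis @ k_txt.T / d ** 0.5          # (N_vis, N_txt)
    bias = torch.zeros_like(logits)
    for mask, cap in zip(region_masks, caption_slices):
        bias[mask, cap] = weight                 # boost matching region-caption pairs
    attn = F.softmax(logits + bias, dim=-1)
    return attn @ v_txt

# Toy usage: 6 visual tokens, two captions of 3 text tokens each.
q, k, v = torch.randn(6, 8), torch.randn(6, 8), torch.randn(6, 8)
regions = [torch.tensor([1, 1, 0, 0, 0, 0], dtype=torch.bool),
           torch.tensor([0, 0, 0, 1, 1, 0], dtype=torch.bool)]
caps = [slice(0, 3), slice(3, 6)]
out = spatial_weighted_cross_attention(q, k, v, regions, caps)
print(out.shape)  # torch.Size([6, 8])
```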
Technical and modeling limits
- Struggles with long-horizon, multi-step reasoning and complex causal chains, leading to inconsistent or implausible outcomes for intricate event sequences.
- Fails under complicated camera motions or large viewpoint changes, with trajectories and agents sometimes becoming de-synchronized from the background or from each other.
- Limited robustness to dense multi-agent interactions and heavy occlusions; identities or motions can drift when agents frequently overlap.
MotionStream (Nov 25’)
MotionStream: Real-Time Video Generation with Interactive Motion Controls
Time-to-Move (Nov 25’)
Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
Frame Guidance (Jun 25’)
Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
A training-free, model-agnostic guidance framework that enables fine-grained, frame-level control over large video diffusion models (VDMs) using signals like keyframes, style images, sketches, depth maps, and color blocks, without any finetuning.
- Applies gradient-based guidance only on selected frames during sampling, steering the whole video toward frame-level conditions while keeping temporal coherence.
- Targets two desiderata for controllable video generation: (1) training-free, model-agnostic control and (2) a general-purpose mechanism that supports many input types instead of task-specific models.
Technical method
- Observes temporal locality in CausalVAE latents: modifying one frame affects only a small temporal neighborhood of latents, enabling efficient local decoding instead of decoding the full sequence.
- Proposes latent slicing: to guide a given frame, only a short window of latents around its corresponding latent index is decoded, optionally combined with spatial downsampling, yielding up to a 60× reduction in GPU memory and making gradient-based guidance feasible for large VDMs on a single GPU (a simplified guidance-step sketch follows this list).
- Introduces Video Latent Optimization (VLO), a hybrid update schedule reflecting two phases of generation:
- Layout stage (early steps): deterministic gradient updates on latents without renoising to enforce globally coherent layout aligned with conditions.
- Detail stage (later steps): time-travel–style updates with renoising to refine details while preserving the layout and correcting accumulated errors.
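
A simplified sketch of one guided denoising step combining latent slicing with the VLO schedule; `denoiser`, `decode_local`, and `loss_fn` are stand-ins for the model's x0 prediction, windowed VAE decode, and frame-level loss, not Frame Guidance's actual API, and the detail-stage renoising is reduced here to adding scaled Gaussian noise:

```python
import torch

def guided_step(latents, denoiser, decode_local, loss_fn, t, step_size,
                frame_idx, window=2, layout_stage=True, renoise_sigma=0.0):
    """One gradient-guidance update around a diffusion step (VLO-style sketch).

    latents:      current noisy video latents, (B, C, T_lat, H, W)
    denoiser:     callable predicting clean latents x0 from (latents, t)
    decode_local: callable decoding a short latent window to pixel frames
                  (the latent-slicing trick: only `window` latents around the
                  guided frame are decoded instead of the full sequence)
    loss_fn:      frame-level loss on the decoded frames (keyframe L2, style, ...)
    """
    latents = latents.detach().requires_grad_(True)
    x0 = denoiser(latents, t)                     # predicted clean latents
    lo, hi = max(0, frame_idx - window), frame_idx + window + 1
    frames = decode_local(x0[:, :, lo:hi])        # local decode only
    loss = loss_fn(frames)
    grad = torch.autograd.grad(loss, latents)[0]
    with torch.no_grad():
        latents = latents - step_size * grad      # deterministic update (layout stage)
        if not layout_stage:
            # detail stage: time-travel-style renoising after the update
            latents = latents + renoise_sigma * torch.randn_like(latents)
    return latents.detach()
```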
The framework is instantiated with simple task-specific losses over predicted clean frames (a few are sketched in code after the list):
- Keyframe-guided generation (I2V): L2 loss between guided frames and user keyframes, with arbitrary keyframe positions and tunable strength via step size and repeat count.
- Stylized video generation (T2V): cosine similarity loss in a style embedding space (CSD) between style reference and selected frames, guiding only a few frames while style propagates temporally.
- Looped video generation (T2V): L2 loss between first and last frame with stop-gradient on the first frame, forcing the last frame to match the first without oversaturation.
- General input guidance: encoder-aligned L2 losses using differentiable encoders (e.g., depth estimator, edge predictor) for depth maps, sketches, or other structural inputs.
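
A few of these losses are simple enough to sketch directly; the frame tensor layout (B, T, C, H, W) and the `style_embed_fn` hook are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def keyframe_loss(decoded_frames, keyframes, positions):
    """L2 between guided frames and user keyframes at arbitrary positions.
    decoded_frames: (B, T, C, H, W); keyframes: list of (B, C, H, W) targets."""
    return sum(F.mse_loss(decoded_frames[:, p], kf)
               for p, kf in zip(positions, keyframes))

def loop_loss(decoded_frames):
    """Force the last frame to match the first; stop-gradient on the first frame
    so only the last frame is pulled toward it (avoids oversaturation)."""
    first = decoded_frames[:, 0].detach()
    return F.mse_loss(decoded_frames[:, -1], first)

def style_loss(style_embed_fn, decoded_frames, style_image, frame_ids=(0, -1)):
    """Cosine-similarity loss in a style embedding space (e.g., CSD) between a
    style reference and a few selected frames; style propagates temporally."""
    ref = style_embed_fn(style_image)
    losses = [1 - F.cosine_similarity(style_embed_fn(decoded_frames[:, i]),
                                      ref, dim=-1).mean()
              for i in frame_ids]
    return sum(losses) / len(losses)
```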
Ablations and limitations
- Ablation on latent optimization shows VLO outperforms using only time-travel or only deterministic updates; the former hurts layout coherence and the latter risks oversaturation and temporal disconnection.
- The method is empirically model-agnostic but still relies on access to latent VAEs and gradients, and can increase inference time by up to roughly 4x, so practical setups cap the number of guidance steps.
Mind the Time (Dec 24’)
Mind the Time: Temporally-Controlled Multi-Event Video Generation





