Notes on Improving Physics in Visual World Generation
Self-Refining Video Sampling (Jan 26)
- An inference-time method that lets a pre-trained flow-matching video generator act as its own self-refiner to improve motion coherence, physics alignment, and spatial consistency without extra training or external verifiers.
Method
- The authors reinterpret flow matching in latent space as a time-conditioned denoising autoencoder (DAE), where the model predicts a clean latent x_0 from a noisy latent x_t.
- They define a Predict-and-Perturb (P&P) inner loop at a fixed noise level t: predict a clean latent via the flow model (Predict), then re-add noise to the prediction at the same level (Perturb), forming a pseudo-Gibbs refinement chain on x_t.
- Each P&P step is a small MCMC-like move that pushes the latent toward high-density regions of the data distribution, where realistic, temporally smooth trajectories live, and away from low-density, artifact-heavy ones.
- If the current latent encodes non-physical motion (e.g., sand teleporting, jittery hands), that configuration lies in a lower-density region; repeated predict→re-noise at the same t tends to “snap” it toward smoother, more typical trajectories that the model has seen often in training, which are more physically plausible.
- Only a small number of inner iterations (typically 2–3) are applied, mainly at early timesteps (t < 0.2); the result is plugged into a standard ODE solver by replacing the state with the refined latent before the next step (see the sketch below).
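A minimal PyTorch sketch of the inner loop, assuming a rectified-flow parameterization x_t = t·x_0 + (1−t)·ε (so t ≈ 0 is pure noise and t ≈ 1 is clean data); the `model(x_t, t, cond)` velocity interface, the latent shape, and the default step counts are assumptions, not the paper's actual code.

```python
import torch

def predict_clean(model, x_t, t, cond):
    # "Predict": one call to the flow model; convert its velocity prediction
    # into a clean-latent estimate. Under x_t = t*x0 + (1-t)*eps with
    # v = x0 - eps, we have x0_hat = x_t + (1 - t) * v.
    v = model(x_t, t, cond)
    return x_t + (1.0 - t) * v

def predict_and_perturb(model, x_t, t, cond, n_inner=3):
    # Pseudo-Gibbs inner loop at a FIXED noise level t: predict a clean
    # latent, then re-noise it back to the same level, a few times.
    for _ in range(n_inner):
        x0_hat = predict_clean(model, x_t, t, cond)            # Predict
        x_t = t * x0_hat + (1.0 - t) * torch.randn_like(x_t)   # Perturb
    return x_t

def sample(model, cond, latent_shape, steps=50, refine_below=0.2):
    # Standard Euler ODE sampling from noise (t=0) to data (t=1), with P&P
    # refinement applied only at early timesteps (t < refine_below).
    x = torch.randn(latent_shape)
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        if t < refine_below:
            x = predict_and_perturb(model, x, t, cond)
        x = x + (t_next - t) * model(x, t, cond)   # Euler step on dx/dt = v
    return x
```

The outer solver is untouched; the only extra network calls are the inner Predict evaluations.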
Uncertainty-aware refinement
- Repeated P&P with classifier-free guidance can over-saturate static regions; to avoid this, they introduce Uncertainty-aware P&P that refines only spatio-temporal locations where predictions are self-inconsistent across successive P&P steps.
- An uncertainty map is computed as the per-location L1 difference of two consecutive predicted clean latents, averaged over channels, then thresholded to get a binary mask indicating regions to refine (typically moving objects) vs. regions to preserve (background).
- The mask is used inside the ODE update so that uncertain regions follow the refined latent while certain regions keep the previous trajectory, adding no extra NFEs beyond the extra P&P denoising calls.
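A sketch of the uncertainty map and masked update, assuming [B, C, T, H, W] latents; the threshold `tau` is a placeholder, not the paper's value.

```python
import torch

def uncertainty_mask(x0_prev, x0_curr, tau=0.05):
    # Per-location L1 difference between two consecutive predicted clean
    # latents, averaged over channels, then thresholded into a binary mask.
    u = (x0_curr - x0_prev).abs().mean(dim=1, keepdim=True)   # [B, 1, T, H, W]
    return (u > tau).float()                                   # 1 = refine, 0 = keep

def masked_ode_state(x_refined, x_previous, mask):
    # Uncertain locations (typically moving objects) follow the refined
    # latent; certain locations (typically background) keep the original
    # trajectory, so static regions are not over-saturated.
    return mask * x_refined + (1.0 - mask) * x_previous
```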
Experimental setup and gains
- They apply the method to state-of-the-art flow-based video models Wan2.1, Wan2.2 (T2V/I2V) and Cosmos-2.5, across text-to-video, image-to-video, robotics, and physics benchmarks.
Physics and consistency results
- On VideoPhy2 and PhyWorldBench, their sampler improves physical commonsense (PC) scores and is strongly preferred in human evaluation (e.g., 84% preference over default Wan2.2 in physics alignment), reducing non-physical artifacts like objects appearing without causal contact.
- On PisaBench free-fall videos, trajectories become more physically plausible, with lower L2/Chamfer distance and higher IoU between generated and ground-truth paths, and fewer obviously implausible trajectories across 32 seeds.
- For camera-motion scenes where viewpoints are revisited, P&P improves spatial consistency (higher SSIM, PSNR and lower L1 after warping between revisited views), reducing background drift across large camera rotations.
Analysis, limitations, and scope
- The authors analyze why the method works better for videos than images: cross-frame consistency makes iterative P&P act as a local temporal refinement rather than causing large semantic jumps, and the chain exhibits mode-seeking behavior that suppresses temporally inconsistent low-density samples.
- On visual reasoning tasks, P&P helps when success depends on smooth motion or temporal coherence (e.g., a graph traversal task’s success rate jumps from 0.1 to 0.8), but gives little benefit when the base model lacks semantic competence (e.g., maze solving remains near zero success).
- They relate P&P to ALD, Restart, and FreeInit, emphasizing that their method refines intermediate latents at fixed noise levels within one sampling trajectory, making it training-free and more NFE-efficient than restarting full denoising; they also note limitations such as mode-seeking and reliance on the base model’s learned priors, with further discussion in the appendix.
Why this affects physics specifically
- In videos, physical violations like teleporting objects, broken contact, or inconsistent velocities show up as temporally incoherent patterns across frames, and such patterns are rare in real data, so they sit in low probability regions that P&P’s mode-seeking behavior moves away from.
- Cross-frame consistency in videos makes P&P behave like a local temporal smoother: perturbations are constrained by neighboring frames, so refinement reduces jitter/flicker instead of changing semantics wholesale, effectively enforcing more continuous motion and thus better approximate physics.
Role of “simply” re-noising
- The re-noise step alone would just randomize; what matters is the alternation: denoise with the learned prior (toward data manifold) then re-noise at the same level (local exploration), repeated a few times, which approximates a short pseudo-Gibbs chain that concentrates on stable, consistent modes.
- That chain empirically shows mode-seeking: for videos, this reduces temporal variance and removes low-density inconsistent motions (e.g., weird free-fall trajectories, mid-air object pops), which shows up as improved free-fall paths on PisaBench and fewer non-causal events on VideoPhy2/PhyWorldBench.
WMReward (Jan 26)
It follows an earlier tech report on distillation with a VJEPA-2 reward:
- “Improving the Physics of Video Generation with VJEPA-2 Reward Signal” (2025), which won the ICCV 2025 PhysicsIQ Challenge.
Inference-time Physics Alignment of Video Generative Models with Latent World Models
Proposes WMReward, an inference-time method that steers existing video generative models toward producing videos that better respect basic physical laws, using a latent world model (VJEPA-2) as a physics-aware reward signal.
The work demonstrates that a pretrained latent world model can act as an effective physics prior at inference time: by rewarding videos that are easy for it to predict, WMReward aligns general-purpose video generators toward more physically plausible outputs without retraining, and this alignment scales with additional test-time compute.
- Modern video generators look good visually but often break intuitive physics (e.g., objects teleport, ignore gravity, fluids behave unrealistically). Instead of retraining these models, the authors treat “being physically plausible” as an alignment problem at inference time: among many possible generations a model can produce, search for and guide toward those that a physics-aware world model finds predictable.
Method
- The authors define a surprise-based reward: slide a temporal window over a generated video, let VJEPA-2 predict the latent representation of future frames from context, and compare this prediction to the actual encoded future frames via cosine similarity. Lower prediction error (less “surprise”) → higher physics plausibility reward.
- This reward is differentiable with respect to the video, so it can both rank complete samples and provide gradients to guide generation steps (a sketch of the reward appears after the list below).
- Sampling strategies. To sample from a “tilted” distribution that prefers high-reward (more physical) videos, the paper uses three schemes:
- Guidance (∇): adds a gradient term proportional to the reward to the diffusion model’s score, nudging each denoising step toward videos that VJEPA-2 considers predictable.
- Best-of-N (BoN): generate N candidate videos with the base model and select the one with the highest WMReward score.
- ∇ + BoN: combine both—use guidance to generate N candidates, then pick the best by reward—achieving stronger improvements with the same search budget.
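A hedged sketch of the surprise-based reward described above; the `encoder`/`predictor` calls stand in for the VJEPA-2 encoder and predictor, and the window/context/stride arguments are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def wm_reward(video, encoder, predictor, window=16, context=8, stride=4):
    # video: [B, C, T, H, W]. Slide a temporal window; predict future-frame
    # latents from the context frames and compare with the actual encoded
    # future frames via cosine similarity (higher = less surprise).
    scores = []
    T = video.shape[2]
    for k in range(0, T - window + 1, stride):
        clip = video[:, :, k:k + window]
        ctx, fut = clip[:, :, :context], clip[:, :, context:]
        z_pred = predictor(encoder(ctx))     # predicted future latents
        z_true = encoder(fut)                # actual future latents
        scores.append(F.cosine_similarity(
            z_pred.flatten(1), z_true.flatten(1), dim=-1))
    return torch.stack(scores, dim=0).mean(dim=0)   # [B], differentiable in `video`
```

Because everything here is differentiable, the same function can both rank finished samples (BoN) and supply ∇ guidance during denoising.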
Scaling and computational trade-offs
- Increasing the number of particles (N) in BoN / ∇+BoN improves average physics scores and concentrates score distributions in the high-physics region, with diminishing but still positive returns as N grows.
- WMReward adds inference-time cost: BoN scales linearly with N, while guidance adds a roughly 5× time and 2–4× memory overhead per trajectory, but can be tuned to available compute.
Empirical results
- The method is applied to two strong video generators: MAGI-1 (autoregressive) and a holistic video latent diffusion model (vLDM), across text-to-video (T2V), image-and-text-to-video (I2V), and video-and-text-to-video (V2V) setups.
- On the PhysicsIQ benchmark (I2V and V2V), WMReward substantially improves physics plausibility metrics; with MAGI-1 V2V and ∇+BoN, it reaches a PhysicsIQ score of about 62%, surpassing the previous state-of-the-art by roughly 6–7 percentage points and winning the ICCV 2025 PhysicsIQ Challenge.
- On VideoPhy (T2V), using WMReward improves physics consistency pass rates by several percentage points for both MAGI-1 and vLDM, while semantic adherence to prompts drops slightly, reflecting that the physics reward is text-agnostic.
- Human studies show that videos generated with WMReward are preferred for physics plausibility, and often also for visual quality, with modest trade-offs in prompt alignment for some settings.
Additional
- Comparisons with alternative reward signals (VideoMAE reconstruction error and VLM-based physics judgments using Qwen2.5-VL / Qwen3-VL) show WMReward gives stronger physics improvements, supporting the claim that latent world models encode better physical structure.
- VBench-based evaluations indicate that improved physics plausibility does not degrade standard perceptual metrics (subject/background consistency, motion smoothness, temporal flicker, image/aesthetic quality), and sometimes slightly improves them.
- WMReward remains robust under different VJEPA architectures and reward hyperparameters, and tends to perform better with larger VJEPA backbones.
Sampling loop (per particle)
Step A – Diffusion step
- Start from noise x_T.
- Run one denoising step with the base model to get the new latent x_{t-1} and an estimated clean video x̂_0 via Tweedie’s formula.
Step B – Compute the physics reward on x̂_0
- Slide a temporal window of length W over the current video estimate x̂_0.
- For each window position k:
  - Context frames: x̂_0[k : k+c]
  - Future frames: x̂_0[k+c : k+W]
  - Encode the context with the VJEPA encoder: z_ctx = E(context).
  - Feed z_ctx into the VJEPA predictor with masked future tokens to get predicted future latents ẑ_fut.
  - Encode the full window to get the actual future latents z_fut.
- Surprise-based reward: r(x̂_0) = average over windows of cos(ẑ_fut, z_fut) (lower surprise → higher reward → more physical).
Step C – Use reward in one of two ways
- Guidance (∇):
- Treat the tilted distribution p_r(x) ∝ p(x)·exp(λ·r(x)) as the target and approximate its score as the base score plus λ·∇_{x_t} r(x̂_0(x_t)).
- Backprop through VJEPA to compute ∇_{x_t} r(x̂_0), add it to the diffusion score, and take the next denoising step with this modified score.
- No guidance (pure BoN):
- Ignore gradients; just continue vanilla denoising and only use the reward r(x̂_0) after the full video is generated.
Search over multiple particles
- Run N independent trajectories (particles) with either:
- BoN: plain diffusion for each particle; at the end, compute r(·) on the final videos and select the highest-reward one.
- ∇+BoN: use guided denoising (Step C, ∇) for each particle, then still do Best-of-N on the final VJEPA reward scores.
Output
- Return the single highest-reward video (or the N scored videos, if desired) as the physics-aligned generation, with no training or fine-tuning of either the video model or VJEPA-2.
AI generated pseudocode
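A sketch of the per-particle loop with optional reward-gradient guidance and Best-of-N selection. All callables (score_fn, tweedie_fn, step_fn, decode) are assumed interfaces, and `wm_reward` refers to the reward sketch earlier in these notes; none of this is the authors' actual code.

```python
import torch

def sample_with_wmreward(score_fn, tweedie_fn, step_fn, decode, wm_reward,
                         timesteps, latent_shape, cond,
                         n_particles=4, lam=1.0, use_guidance=True):
    # score_fn(x, t, cond): base diffusion score; tweedie_fn(x, t, cond): clean
    # estimate x̂_0; step_fn(x, t, score): one denoising update; decode: latent
    # -> pixel video; wm_reward(video) -> per-video physics reward [B].
    finished = []
    for _ in range(n_particles):                       # N particles
        x = torch.randn(latent_shape)                  # Step A: start from noise
        for t in timesteps:
            s = score_fn(x, t, cond)
            if use_guidance:                           # Step C (∇): tilt the score
                x_in = x.detach().requires_grad_(True)
                r = wm_reward(decode(tweedie_fn(x_in, t, cond))).sum()  # Step B
                s = s + lam * torch.autograd.grad(r, x_in)[0]
            x = step_fn(x, t, s)
        finished.append(decode(x))
    rewards = torch.stack([wm_reward(v).mean() for v in finished])
    return finished[int(rewards.argmax())]             # Best-of-N output
```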
ProPhy (Dec 25’)
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
- The method adds a dedicated Physical branch to existing latent video diffusion backbones such as Wan2.1 and CogVideoX, operating alongside the standard denoising transformer.
- A Semantic Expert Block (SEB) extracts video-level, physics-specific priors from the text prompt using a Mixture-of-Physics-Experts over learnable basis maps that match the latent’s shape.
- Multiple Physical Blocks, initialized from backbone transformer layers, progressively refine and inject these priors into the video latent, preserving pretrained semantic/rendering capacity while accumulating physical information.
- A Refinement Expert Block (REB) at the last Physical Block routes each latent token to a small set of experts, providing token-level physics priors for spatially anisotropic responses (different regions follow different physical laws).
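The paper's implementation details are not reproduced here; the following is a rough, assumption-heavy skeleton of the SEB idea only (a prompt-conditioned router mixing a bank of learnable basis maps shaped like the video latent), with all dimensions and the routing scheme invented for illustration.

```python
import torch
import torch.nn as nn

class SemanticExpertBlockSketch(nn.Module):
    # Hypothetical stand-in for ProPhy's Semantic Expert Block: a
    # Mixture-of-Physics-Experts over learnable basis maps matching the
    # latent's shape, weighted by a router over the prompt embedding.
    def __init__(self, text_dim, n_experts, latent_shape):  # latent_shape = (C, T, H, W)
        super().__init__()
        self.basis = nn.Parameter(torch.randn(n_experts, *latent_shape) * 0.02)
        self.router = nn.Linear(text_dim, n_experts)

    def forward(self, text_emb):                      # text_emb: [B, text_dim]
        w = self.router(text_emb).softmax(dim=-1)     # mixture weights per video
        # Weighted combination of basis maps -> video-level physics prior,
        # which the Physical Blocks would then refine and inject.
        return torch.einsum("be,ecthw->bcthw", w, self.basis)
```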
VideoREPA (May 25’)
VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
A novel framework that distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations.
- Physics Understanding Gap Identification: We identify and quantify the significant physics understanding gap between self-supervised video foundation models and text-to-video diffusion models.
- Token Relation Distillation (TRD) Loss: We introduce a novel TRD loss that effectively distills physics knowledge through spatio-temporal token relation alignment, overcoming key limitations of direct REPA application.
- VideoREPA Framework: We present the first representation alignment method specifically designed for finetuning video diffusion models and injecting physical knowledge into T2V generation.
About REPA
- From the REPA paper “Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think” (ICLR 2025)
- Similar in spirit to a perceptual loss, but applied inside the network: it aligns the diffusion model’s internal hidden states (as it processes noisy inputs at each timestep) directly with the features of the clean, original image from the external encoder (DINOv2).
- injects supervision directly into the diffusion model’s internal representations during the denoising process.
- Shows improved FID at the same number of training iterations (i.e., more training-efficient).
- In implementation, since the hidden states from the diffusion transformer and the features from DINOv2 don’t naturally live in the same space, an extra projection head (a three-layer MLP with SiLU activations) maps the hidden states into the DINOv2 feature space.
- REPA operates at the token-to-token level, where each token corresponds to a patch in the latent image space.
- As the paper shows, this simple trick allows the model to use alignment primarily in the early layers, leaving later layers free to focus on refining fine details.
- The MLP (projection head) is applied at the layer where REPA is active. The paper finds that aligning the hidden state after an early block is sufficient and optimal (depth 8 for SiT-XL/2): a single projection MLP is attached to that block’s output and trained jointly with the model to project its hidden state into alignment with DINOv2’s patch representations, which implicitly shapes the first 8 blocks.
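A compact sketch of what the REPA regularizer looks like in code, assuming illustrative dimensions (SiT-XL hidden size 1152, a 768-dim DINOv2 feature space); the projector widths and loss form are a plausible reading, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaProjector(nn.Module):
    # Three-layer MLP with SiLU activations, mapping DiT hidden states into
    # the external encoder's (DINOv2) feature space.
    def __init__(self, dit_dim=1152, dino_dim=768, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dit_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dino_dim))

    def forward(self, h):
        return self.net(h)

def repa_loss(hidden_tokens, dino_tokens, projector):
    # Token-to-token alignment: maximize cosine similarity between projected
    # DiT tokens (computed from the noisy input at the chosen layer) and
    # DINOv2 patch features of the clean image. Shapes: [B, N, D].
    z = F.normalize(projector(hidden_tokens), dim=-1)
    y = F.normalize(dino_tokens, dim=-1)
    return -(z * y).sum(dim=-1).mean()
```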
Now back to VideoREPA.
- The goal of REPA is to accelerate from-scratch training by helping diffusion models converge faster through direct alignment of feature representations, essentially acting as a training-efficiency booster.
- The goal of VideoREPA is to inject specific, high-fidelity physics understanding into pre-trained text-to-video models through fine-tuning.
VideoREPA introduced:
- a novel variant called Token Relation Distillation (TRD) loss, which moves beyond standard REPA by aligning relational structures (i.e., pairwise similarities between tokens) across both spatial and temporal dimensions — rather than directly aligning feature vectors — to better suit the challenges of fine-tuning pre-trained video models and capturing physical dynamics.
- Token Relation Distillation computes pairwise token similarities within frames (spatial) and across frames (temporal) and matches those relations with an L1 “soft” objective, which is more stable for finetuning.
- Similar to REPA, VideoREPA injects supervision directly into the diffusion model’s internal representations during the denoising process.
- REPA “hard-aligns” token features with a cosine-similarity objective to speed up from-scratch image DiT training.
- VideoREPA “soft-aligns” relations between tokens (pairwise similarity matrices) with an L1 + margin objective, across space and time, for finetuning.
Why assume the teacher (e.g., VideoMAEv2) already knows the physics, so that distilling from it transfers physics?
- Why a video SSL teacher has physics signal: predicting missing frames/patches in real videos is easier if the model encodes who-is-where, how things move across time, when/where contacts happen, and how fluids deform. Those cues are embedded in token relations (similarities within and across frames).
- Why distillation can transfer it: aligning pairwise token–token similarities forces the student’s mid-layer to organize space–time tokens the way the teacher does. That makes the denoiser’s score field prefer trajectories with smoother motion, consistent contacts, and fewer shape glitches, which shows up as more physics-plausible generations.
- Evidence and limits: in the paper’s results, the teacher’s features outperform the T2V model’s features on a physics-understanding probe, and after alignment the gap narrows and generations improve. This does not yield exact simulators or guarantees; gains depend on the teacher’s pretraining and cover the “commonsense physics” seen in natural videos, not precise quantitative laws.
- If we align representations against VideoMAE / V-JEPA 2, we need to make sure the alignment dataset remains general enough.
Diving into the code
https://github.com/aHapBean/VideoREPA
The paper uses full-model supervised fine-tuning (SFT) for the smaller CogVideoX-2B and LoRA (Low-Rank Adaptation) for the larger CogVideoX-5B:
- CogVideoX-2B: full-parameter finetuning of the transformer (classic SFT); the other components (VAE, text encoder) stay frozen.
- CogVideoX-5B: LoRA-based efficient finetuning with rank 128 and alpha 64; adapters are added to the transformer, and the VideoREPA projector heads and the downsampler are trained as well.
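For reference, a minimal peft configuration matching the reported rank and alpha; the target module names are a guess at typical attention projections, not the repo's exact list.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,                # LoRA rank reported for CogVideoX-5B
    lora_alpha=64,        # LoRA alpha reported for CogVideoX-5B
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed attention projections
)
# transformer.add_adapter(lora_config)  # e.g., via diffusers' PEFT integration
```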
- The training objective is assembled in the trainer’s compute_loss(self, batch) method, combining the standard denoising loss with the TRD alignment term (sketched below).
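A hedged sketch of what a TRD-style loss can look like, assuming student tokens are mid-layer diffusion features passed through a projector (and, per the notes above, a downsampler to match the teacher's token grid) and teacher tokens come from VideoMAEv2; shapes and the margin value are illustrative, not the repo's exact code.

```python
import torch
import torch.nn.functional as F

def trd_loss(student_tokens, teacher_tokens, margin=0.1):
    # Tokens: [B, T, N, D] (T frames, N tokens per frame). Align pairwise
    # token-relation matrices rather than raw features, with an L1 objective
    # softened by a margin.
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)

    # Spatial relations: token-token similarities within each frame.
    sim_s_sp = torch.einsum("btnd,btmd->btnm", s, s)
    sim_t_sp = torch.einsum("btnd,btmd->btnm", t, t)

    # Temporal relations: similarities across all space-time tokens.
    s_flat, t_flat = s.flatten(1, 2), t.flatten(1, 2)        # [B, T*N, D]
    sim_s_tm = torch.einsum("bnd,bmd->bnm", s_flat, s_flat)
    sim_t_tm = torch.einsum("bnd,bmd->bnm", t_flat, t_flat)

    def soft_l1(a, b):
        # L1 distance with a margin: small relational mismatches are tolerated,
        # which is gentler when finetuning a pretrained T2V model.
        return F.relu((a - b).abs() - margin).mean()

    return soft_l1(sim_s_sp, sim_t_sp) + soft_l1(sim_s_tm, sim_t_tm)
```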
PhysWorld (Nov 24’)
How Far is Video Generation from World Model: A Physical Law Perspective
https://youtu.be/yWSn9Xyrkso?si=YJPPePMqpPm6VdD9
- When referencing training data for generalization, the models prioritize different attributes in a specific order: color > size > velocity > shape. This explains why video generation models often struggle with maintaining object consistency.
- Simply scaling video generation models and data is insufficient for them to discover fundamental physical laws and generalize robustly to out-of-distribution scenarios.
- Instead, improving combinatorial diversity in training data is crucial for better physical video modeling. The models’ generalization mechanism relies more on memorization and case-based imitation than on learning universal rules.