Self-Refining Video Sampling (Jan 26)

Self-Refining Video Sampling

  • An inference-time method that lets a pre-trained flow-matching video generator act as its own self-refiner to improve motion coherence, physics alignment, and spatial consistency without extra training or external verifiers.

Method

  • The authors reinterpret flow matching in latent space as a time-conditioned denoising autoencoder (DAE), where the model predicts a clean latent z_1 from a noisy latent z_t.
  • They define a Predict-and-Perturb (P&P) inner loop at a fixed noise level t: predict a clean latent via the flow model (Predict), then re-add noise to the prediction (Perturb), forming a pseudo-Gibbs refinement chain on z_t.
    • Each P&P step is a small MCMC-like move that pushes the latent toward high-density regions of the data distribution, where realistic, temporally smooth trajectories live, and away from low-density, artifact-heavy ones.
    • If a current latent encodes non-physical motion (e.g., sand teleporting, jittery hands), that configuration lies in a lower-density region; repeated predict→re-noise at the same t tends to “snap” toward smoother, more typical trajectories that the model has seen often in training, which are more physically plausible.
  • Only a small number of inner iterations (typically 2–3) are applied, mainly at early timesteps (t < 0.2); the refined latent then replaces the state in a standard ODE solver before the next step (a minimal sketch follows this list).
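
A minimal sketch of how the P&P inner loop could plug into an Euler sampler, assuming the rectified-flow convention z_t = t·z_1 + (1−t)·ε (t = 0 noise, t = 1 data) and a model that returns a velocity; names such as flow_model, num_pnp, and pnp_cutoff are illustrative, not the authors' code:

import torch

@torch.no_grad()
def pnp_euler_sampling(flow_model, z, timesteps, num_pnp=3, pnp_cutoff=0.2):
    # Euler sampling with a Predict-and-Perturb (P&P) inner loop at early timesteps.
    # Assumes z_t = t*z1 + (1-t)*eps, so the velocity v = flow_model(z, t) ≈ z1 - eps
    # and the clean-latent estimate is z1_hat = z + (1 - t) * v.
    for i, t in enumerate(timesteps[:-1]):
        dt = timesteps[i + 1] - t

        # inner P&P refinement, applied only at early (noisy) timesteps, e.g. t < 0.2
        if t < pnp_cutoff:
            for _ in range(num_pnp):
                v = flow_model(z, t)
                z1_hat = z + (1.0 - t) * v           # Predict: estimate of the clean latent
                eps = torch.randn_like(z)
                z = t * z1_hat + (1.0 - t) * eps     # Perturb: re-noise back to level t

        # standard Euler ODE step from the (possibly refined) latent
        v = flow_model(z, t)
        z = z + dt * v
    return z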

Uncertainty-aware refinement

  • Repeated P&P with classifier-free guidance can over-saturate static regions; to avoid this, they introduce Uncertainty-aware P&P that refines only spatio-temporal locations where predictions are self-inconsistent across successive P&P steps.
  • An uncertainty map is computed as the per-location L1 difference of two consecutive predicted clean latents, averaged over channels, then thresholded to get a binary mask indicating regions to refine (typically moving objects) vs. regions to preserve (background).
  • The mask is used inside the ODE update so that uncertain regions follow the refined latent while certain regions keep the previous trajectory; the masking itself adds no NFEs beyond the extra P&P denoising calls (see the sketch below).
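
A hedged sketch of the uncertainty mask and the masked blend, assuming latents shaped (B, C, F, H, W); the threshold value tau and the exact blend form are assumptions based on the description above:

import torch

def uncertainty_mask(z1_prev, z1_curr, tau=0.1):
    # Binary mask of self-inconsistent (uncertain) spatio-temporal locations.
    # z1_prev, z1_curr: consecutive clean-latent predictions, shape (B, C, F, H, W).
    # Returns a (B, 1, F, H, W) mask; tau is an assumed threshold, tuned in practice.
    diff = (z1_curr - z1_prev).abs().mean(dim=1, keepdim=True)  # per-location L1, averaged over channels
    return (diff > tau).float()

def masked_update(z_refined, z_original, mask):
    # uncertain regions (mask = 1) follow the refined latent; certain regions keep the previous trajectory
    return mask * z_refined + (1.0 - mask) * z_original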

Experimental setup and gains

  • They apply the method to state-of-the-art flow-based video models Wan2.1, Wan2.2 (T2V/I2V) and Cosmos-2.5, across text-to-video, image-to-video, robotics, and physics benchmarks.

Physics and consistency results

  • On VideoPhy2 and PhyWorldBench, their sampler improves physical commonsense (PC) scores and is strongly preferred in human evaluation (e.g., 84% preference over default Wan2.2 in physics alignment), reducing non-physical artifacts like objects appearing without causal contact.
  • On PisaBench free-fall videos, trajectories become more physically plausible, with lower L2/Chamfer distance and higher IoU between generated and ground-truth paths, and fewer obviously implausible trajectories across 32 seeds.
  • For camera-motion scenes where viewpoints are revisited, P&P improves spatial consistency (higher SSIM, PSNR and lower L1 after warping between revisited views), reducing background drift across large camera rotations.

Analysis, limitations, and scope

  • The authors analyze why the method works better for videos than images: cross-frame consistency makes iterative P&P act as a local temporal refinement rather than causing large semantic jumps, and the chain exhibits mode-seeking behavior that suppresses temporally inconsistent low-density samples.
  • On visual reasoning tasks, P&P helps when success depends on smooth motion or temporal coherence (e.g., a graph traversal task’s success rate jumps from 0.1 to 0.8), but gives little benefit when the base model lacks semantic competence (e.g., maze solving remains near zero success).
  • They relate P&P to ALD, Restart, and FreeInit, emphasizing that their method refines intermediate latents at fixed noise levels within one sampling trajectory, making it training-free and more NFE-efficient than restarting full denoising; they also note limitations such as mode-seeking and reliance on the base model’s learned priors, with further discussion in the appendix.

Why this affects physics specifically

  • In videos, physical violations like teleporting objects, broken contact, or inconsistent velocities show up as temporally incoherent patterns across frames, and such patterns are rare in real data, so they sit in low probability regions that P&P’s mode-seeking behavior moves away from.
  • Cross-frame consistency in videos makes P&P behave like a local temporal smoother: perturbations are constrained by neighboring frames, so refinement reduces jitter/flicker instead of changing semantics wholesale, effectively enforcing more continuous motion and thus better approximate physics.

Role of “simply” re-noising

  • The re-noise step alone would just randomize; what matters is the alternation: denoising with the learned prior (toward the data manifold), then re-noising at the same level (local exploration), repeated a few times, which approximates a short pseudo-Gibbs chain that concentrates on stable, consistent modes (written out symbolically at the end of this section).
  • That chain empirically shows mode-seeking: for videos, this reduces temporal variance and removes low-density inconsistent motions (e.g., weird free-fall trajectories, mid-air object pops), which shows up as improved free-fall paths on PisaBench and fewer non-causal events on VideoPhy2/PhyWorldBench.
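
In symbols, assuming the same t = 0 noise, t = 1 data interpolation as in the sketch above and writing D_\theta for the flow model's clean-latent prediction, one P&P round at a fixed level t is

  \hat{z}_1^{(k)} = D_\theta\!\left(z_t^{(k)}, t\right), \qquad z_t^{(k+1)} = t\,\hat{z}_1^{(k)} + (1-t)\,\varepsilon^{(k)}, \quad \varepsilon^{(k)} \sim \mathcal{N}(0, I),

i.e., a deterministic projection toward the model's data manifold alternating with stochastic re-noising at the same level: a short pseudo-Gibbs chain on z_t.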

WMReward (Jan 26)

It builds on an earlier tech report:

Distillation with VJEPA-2 Reward

Improving the Physics of Video Generation with VJEPA-2 Reward Signal (2025)

  • It won the ICCV 2025 PhysicsIQ Challenge.

Inference-time Physics Alignment of Video Generative Models with Latent World Models

Proposes WMReward, an inference-time method that steers existing video generative models toward producing videos that better respect basic physical laws, using a latent world model (VJEPA-2) as a physics-aware reward signal.

The work demonstrates that a pretrained latent world model can act as an effective physics prior at inference time: by rewarding videos that are easy for it to predict, WMReward aligns general-purpose video generators toward more physically plausible outputs without retraining, and this alignment scales with additional test-time compute.

  • Modern video generators look good visually but often break intuitive physics (e.g., objects teleport, ignore gravity, fluids behave unrealistically). Instead of retraining these models, the authors treat “being physically plausible” as an alignment problem at inference time: among many possible generations a model can produce, search for and guide toward those that a physics-aware world model finds predictable.

Method

  • The authors define a surprise-based reward: slide a temporal window over a generated video, let VJEPA-2 predict the latent representation of future frames from context, and compare this prediction to the actual encoded future frames via cosine similarity. Lower prediction error (less “surprise”) → higher physics plausibility reward.
  • This reward is differentiable with respect to the video, so it can both rank complete samples and provide gradients to guide generation steps.
  • Sampling strategies. To sample from a “tilted” distribution that prefers high-reward (more physical) videos, the paper uses three schemes:
    • Guidance (∇): adds a gradient term proportional to the reward to the diffusion model’s score, nudging each denoising step toward videos that VJEPA-2 considers predictable.
    • Best-of-N (BoN): generate N candidate videos with the base model and select the one with the highest WMReward score.
    • ∇ + BoN: combine both—use guidance to generate N candidates, then pick the best by reward—achieving stronger improvements with the same search budget.

Scaling and computational trade-offs

  • Increasing the number of particles (N) in BoN / ∇+BoN improves average physics scores and concentrates score distributions in the high-physics region, with diminishing but still positive returns as N grows.
  • WMReward adds inference-time cost: BoN scales linearly with N, while guidance adds a roughly 5× time and 2–4× memory overhead per trajectory, but can be tuned to available compute.

Empirical results

  • The method is applied to two strong video generators: MAGI-1 (autoregressive) and a holistic video latent diffusion model (vLDM), across text-to-video (T2V), image-and-text-to-video (I2V), and video-and-text-to-video (V2V) setups.
  • On the PhysicsIQ benchmark (I2V and V2V), WMReward substantially improves physics plausibility metrics; with MAGI-1 V2V and ∇+BoN, it reaches a PhysicsIQ score of about 62%, surpassing the previous state-of-the-art by roughly 6–7 percentage points and winning the ICCV 2025 PhysicsIQ Challenge.
  • On VideoPhy (T2V), using WMReward improves physics consistency pass rates by several percentage points for both MAGI-1 and vLDM, while semantic adherence to prompts drops slightly, reflecting that the physics reward is text-agnostic.
  • Human studies show that videos generated with WMReward are preferred for physics plausibility, and often also for visual quality, with modest trade-offs in prompt alignment for some settings.

Additional

  • Comparisons with alternative reward signals (VideoMAE reconstruction error and VLM-based physics judgments using Qwen2.5-VL / Qwen3-VL) show WMReward gives stronger physics improvements, supporting the claim that latent world models encode better physical structure.
  • VBench-based evaluations indicate that improved physics plausibility does not degrade standard perceptual metrics (subject/background consistency, motion smoothness, temporal flicker, image/aesthetic quality), and sometimes slightly improves them.
  • WMReward remains robust under different VJEPA architectures and reward hyperparameters, and tends to perform better with larger VJEPA backbones.

Sampling loop (per particle)

  1. Step A – Diffusion step

    • Start from noise x_T.
    • Run one denoising step with the base model to get the new x_t and an estimated clean video x_{0|t} via Tweedie’s formula.

    Step B – Compute physics reward on x_{0|t}

    • Slide a temporal window of length C+M over the current video estimate x_{0|t}.

    • For each window position k:

      • Context frames: x_{k-C+1:k}
      • Future frames: x_{k+1:k+M}
      • Encode context with the VJEPA encoder: E_\theta(x_{k-C+1:k}).
      • Feed into the VJEPA predictor with masked future tokens to get predicted future latents \hat{z}^{\text{fut}}_k.
      • Encode the full window to get the actual future latents z^{\text{fut}}_k.
    • Surprise-based reward:

      r(x) = \frac{1}{|K|} \sum_{k \in K} \left( 1 - \cos\!\left(\hat{z}^{\text{fut}}_k, z^{\text{fut}}_k\right) \right)

      (lower surprise → more physical).

    Step C – Use reward in one of two ways

    • Guidance (∇):
      • Treat w(x) = \exp(\lambda r(x)) and approximate the tilted score \nabla_{x_t} \log p_t^*(x_t) \approx \nabla_{x_t} \log p_t(x_t) + \lambda \nabla_{x_t} r(x_{0|t}).
      • Backprop through VJEPA to compute \nabla_{x_t} r(x_{0|t}), add it to the diffusion score, and take the next denoising step with this modified score.
    • No guidance (pure BoN):
      • Ignore gradients; just continue vanilla denoising and only use r(x) after the full video is generated.
  2. Search over multiple particles

    • Run N independent trajectories (particles) with either:
      • BoN: plain diffusion for each particle; at the end, compute r(x^{(i)}) on the final videos and select x^* = \arg\max_i r(x^{(i)}).
      • ∇+BoN: use guided denoising (Step C, ∇) for each particle, then still do Best-of-N on the final VJEPA reward scores.
  3. Output

    • Return the single highest-reward video (or the N scored videos, if desired) as the physics-aligned generation, with no training or fine-tuning of either the video model or VJEPA-2.

AI-generated pseudocode

# Inputs:
# cond : text / image / video condition
# video_model : pretrained diffusion / flow video generator
# vjepa_enc : VJEPA-2 encoder E_theta
# vjepa_pred : VJEPA-2 predictor P_phi
# N : number of particles (trajectories)
# T, timesteps : diffusion schedule
# lambda_w : guidance strength λ
# use_grad : True for ∇ or ∇+BoN, False for pure BoN

def wmreward_video_generation(cond, N, use_grad, lambda_w):
    # initialize N noisy trajectories
    x_T = [sample_gaussian_like_video() for _ in range(N)]

    # denoising loop (same timesteps for all particles)
    x_t = x_T
    for t in reversed(timesteps): # t: T → 0
        new_x_t = []
        for i in range(N):
            # (1) base model score and denoising step
            score = video_model.score(x_t[i], t, cond) # ∇_{x_t} log p_t(x_t)
            x0_est = tweedie_estimate(x_t[i], score, t) # x_{0|t}

            if use_grad:
                # (2) compute WMReward r(x0_est)
                r = wmreward_vjepa(x0_est, vjepa_enc, vjepa_pred) # surprise-based reward

                # (3) gradient of reward w.r.t. x_t via x0_est
                #     r(x0_est(x_t)) ⇒ ∇_{x_t} r(...)
                grad_r = grad_wrt_xt(r, x_t[i]) # backprop through VJEPA & Tweedie

                # (4) guided score for tilted distribution p*(x)
                guided_score = score + lambda_w * grad_r # Eq. (9)
                step_score = guided_score
            else:
                step_score = score

            # (5) one sampler step (e.g., Euler, Heun) using step_score
            x_next = diffusion_step(x_t[i], step_score, t)
            new_x_t.append(x_next)

        x_t = new_x_t

    # after loop: x_t are N complete videos at t=0
    videos = x_t

    # compute final rewards for BoN selection
    rewards = [wmreward_vjepa(v, vjepa_enc, vjepa_pred) for v in videos] # r(x)
    best_idx = argmax(rewards)
    return videos[best_idx], rewards[best_idx]


def wmreward_vjepa(video, enc, pred, C=8, M=8, stride=8):
    """
    Sliding-window VJEPA surprise reward r(x):
    average (1 - cosine) between predicted and actual future latents,
    returned with a negative sign so that higher reward = less surprise
    (sign convention assumed here so that argmax selection and the +λ
    guidance term both prefer more-physical videos).
    """
    T = video.num_frames()
    surprises = []

    # window: length C + M, context C, future M
    for k in range(C, T - M, stride):
        # context frames x_{k-C+1:k}
        ctx_frames = video.frames[k-C : k]
        # context encoding
        ctx_latents = enc(ctx_frames) # E_theta(x_{k-C+1:k})

        # predict all C+M latents with masked future positions
        z_hat_all = pred(mask_tokens_for_future(M), ctx_latents) # P_phi(Δ_m, E_theta(...))

        # encode full window x_{k-C+1:k+M}
        full_frames = video.frames[k-C : k+M]
        z_all = enc(full_frames) # E_theta(x_{k-C+1:k+M})

        # extract future parts (positions C..C+M-1)
        z_hat_fut = z_hat_all[C : C+M]
        z_fut = z_all[C : C+M]

        # surprise for this window
        s_k = 1.0 - cosine_similarity(z_hat_fut, z_fut) # Eq. (6)
        surprises.append(s_k)

    # reward = negative mean surprise (lower surprise → higher reward)
    return -mean(surprises)

ProPhy (Dec 25’)

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

  • The method adds a dedicated Physical branch to existing latent video diffusion backbones such as Wan2.1 and CogVideoX, operating alongside the standard denoising transformer.
  • A Semantic Expert Block (SEB) extracts video-level, physics-specific priors from the text prompt using a Mixture-of-Physics-Experts over learnable basis maps that match the latent’s shape (a rough sketch of this gating idea follows the list).
  • Multiple Physical Blocks, initialized from backbone transformer layers, progressively refine and inject these priors into the video latent, preserving pretrained semantic/rendering capacity while accumulating physical information.
  • A Refinement Expert Block (REB) at the last Physical Block routes each latent token to a small set of experts, providing token-level physics priors for spatially anisotropic responses (different regions follow different physical laws).
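
A rough, hypothetical sketch of the Semantic-Expert-Block gating idea only (a prompt-conditioned mixture over learnable basis maps shaped like the video latent); layer sizes, pooling, and gating details are guesses, not ProPhy's implementation:

import torch
import torch.nn as nn

class SemanticExpertBlockSketch(nn.Module):
    # Hypothetical sketch: a gating MLP over pooled text features mixes learnable
    # basis maps shaped like the video latent, producing a video-level physics prior.
    def __init__(self, num_experts, text_dim, latent_shape):  # latent_shape = (C, F, H, W)
        super().__init__()
        self.basis_maps = nn.Parameter(torch.randn(num_experts, *latent_shape) * 0.02)
        self.gate = nn.Sequential(nn.Linear(text_dim, 256), nn.SiLU(), nn.Linear(256, num_experts))

    def forward(self, text_emb):  # text_emb: (B, L, text_dim)
        weights = self.gate(text_emb.mean(dim=1)).softmax(dim=-1)        # (B, num_experts)
        # weighted sum of basis maps -> prior matching the latent shape, to be injected by the Physical Blocks
        prior = torch.einsum("be,ecfhw->bcfhw", weights, self.basis_maps)
        return prior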

VideoREPA (May 25’)

VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

A novel framework that distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations.

  1. Physics Understanding Gap Identification: We identify and quantify the significant physics understanding gap between self-supervised video foundation models and text-to-video diffusion models.
  2. Token Relation Distillation (TRD) Loss: We introduce a novel TRD loss that effectively distills physics knowledge through spatio-temporal token relation alignment, overcoming key limitations of direct REPA application.
  3. VideoREPA Framework: We present the first representation alignment method specifically designed for finetuning video diffusion models and injecting physical knowledge into T2V generation.

About REPA

  • From the paper “Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think” (REPA, ICLR 2025)
  • Similar in spirit to a perceptual loss, but applied during training: it aligns the internal hidden states of the diffusion model (as it processes noisy inputs at each timestep) directly with the features of the clean, original image from the external encoder (DINOv2).
    • injects supervision directly into the diffusion model’s internal representations during the denoising process.
  • Shows improved FID for the same number of training iterations (i.e., more training-efficient).
  • In implementation, since the hidden states from the diffusion transformer and the features from DINOv2 don’t naturally live in the same space, an extra projection head (a three-layer MLP with SiLU activations) is needed.
    • REPA operates at the token-to-token level, where each token corresponds to a patch in the latent image space.
    • As the paper shows, this simple trick allows the model to use alignment primarily in the early layers, leaving later layers free to focus on refining fine details.
    • Alignment is applied at a single intermediate layer rather than at every layer: the paper finds that aligning the hidden state of an early block (depth 8 in SiT-XL/2) works best, with one MLP projection head attached at that layer to map its output into DINOv2’s patch-feature space (a minimal sketch of this objective follows the list).
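
A minimal sketch of this REPA-style objective, assuming hidden states h from the chosen diffusion-transformer block and patch features y from a frozen DINOv2, both flattened to (B, N, ·); the MLP width and names are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaProjector(nn.Module):
    # Three-layer MLP with SiLU, projecting diffusion hidden states into the
    # external encoder's feature space (dimensions are placeholders).
    def __init__(self, hidden_dim, target_dim, width=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, width), nn.SiLU(),
            nn.Linear(width, width), nn.SiLU(),
            nn.Linear(width, target_dim),
        )

    def forward(self, h):
        return self.mlp(h)

def repa_loss(h, y, projector):
    # Negative cosine similarity between projected noisy-input hidden states h
    # and clean-image DINOv2 patch features y, averaged over tokens.
    # h: (B, N, hidden_dim), y: (B, N, target_dim).
    z = F.normalize(projector(h), dim=-1)
    y = F.normalize(y, dim=-1)
    return -(z * y).sum(dim=-1).mean()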

Now come back to VideoREPA.

  • goal of REPA is to accelerate training from scratch by helping diffusion models converge faster through direct alignment of feature representations—essentially acting as a training efficiency booster.
  • goal of VideoREPA is to inject specific, high-fidelity physics understanding into pre-trained text-to-video models through fine-tuning.

VideoREPA introduced:

  • a novel variant called Token Relation Distillation (TRD) loss, which moves beyond standard REPA by aligning relational structures (i.e., pairwise similarities between tokens) across both spatial and temporal dimensions — rather than directly aligning feature vectors — to better suit the challenges of fine-tuning pre-trained video models and capturing physical dynamics.
    • Token Relation Distillation computes pairwise token similarities within frames (spatial) and across frames (temporal) and matches those relations with an L1 “soft” objective, which is more stable for finetuning.
  • Similar to REPA, VideoREPA injects supervision directly into the diffusion model’s internal representations during the denoising process.
    • REPA “hard-aligns” token features with a cosine-similarity objective to speed up from-scratch image DiT training.
    • VideoREPA “soft-aligns” relations between tokens (pairwise similarity matrices) with an L1 + margin objective, across space and time, for finetuning.
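
In equation form (notation mine, following the description above): with h_i the projected generator tokens and f_i the video-foundation-model tokens on the same space–time grid, both L2-normalized,

  S^{\text{gen}}_{ij} = h_i^\top h_j, \qquad S^{\text{vfm}}_{ij} = f_i^\top f_j, \qquad \mathcal{L}_{\text{TRD}} = \frac{1}{N^2}\sum_{i,j} \max\!\left(\left|S^{\text{gen}}_{ij} - S^{\text{vfm}}_{ij}\right| - m,\ 0\right),

where m is the margin that tolerates small relational deviations and i, j range over all spatial and temporal token pairs.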

Why assume VideoMAEv2 already encodes physics, such that distilling from it transfers physical understanding?

  • Why a video SSL teacher has physics signal: predicting missing frames/patches in real videos is easier if the model encodes who-is-where, how things move across time, when/where contacts happen, and how fluids deform. Those cues are embedded in token relations (similarities within and across frames).
  • Why distillation can transfer it: aligning pairwise token–token similarities forces the student’s mid-layer to organize space–time tokens the way the teacher does. That makes the denoiser’s score field prefer trajectories with smoother motion, consistent contacts, and fewer shape glitches, which shows up as more physics-plausible generations.
  • Evidence and limits: in the paper’s results, the teacher’s features outperform the T2V model’s features on a physics-understanding probe, and after alignment the gap narrows and generations improve. This doesn’t give exact simulators or guarantees; gains depend on the teacher’s pretraining and cover “commonsense physics” seen in natural videos, not precise quantitative laws.

If we align representations to VideoMAE/V-JEPA2 ourselves, we need to make sure the alignment dataset is still general enough.

Diving into the code

https://github.com/aHapBean/VideoREPA

The paper used full model supervised fine-tuning (SFT) for the smaller CogVideoX-2B model, and LoRA (Low-Rank Adaptation) training for the larger CogVideoX-5B model:

  • CogVideoX-2B: “full-parameter finetuning” (all weights updated, classic SFT).
    • SFT: only the transformer is trainable; others frozen.
  • CogVideoX-5B: LoRA-based efficient finetuning, with explicit mention of LoRA rank and alpha (rank 128, alpha 64).
    • LoRA: adds adapters to the transformer; additionally trains VideoREPA projector heads and the downsampler.
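
A hedged sketch of the 5B LoRA setup using the peft library with the rank/alpha values quoted above; the target_modules list and parameter-name filters are illustrative guesses, not necessarily the repo's exact configuration:

from peft import LoraConfig, get_peft_model

def add_videorepa_lora(transformer):
    # Attach LoRA adapters (rank 128, alpha 64, as reported for CogVideoX-5B);
    # target_modules are illustrative attention projections, not the repo's exact list.
    lora_config = LoraConfig(
        r=128,
        lora_alpha=64,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],
        lora_dropout=0.0,
    )
    transformer = get_peft_model(transformer, lora_config)

    # keep the VideoREPA projector heads and the downsampler trainable alongside the adapters
    for name, param in transformer.named_parameters():
        if "projector" in name or "downsampler" in name:
            param.requires_grad = True
    return transformer

The repo’s compute_loss (reproduced below) then combines the diffusion loss with either the REPA cosine loss or the TRD loss.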
def compute_loss(self, batch) -> torch.Tensor:
    prompt_embedding = batch["prompt_embedding"]
    latent = batch["encoded_videos"]
    raw_frames = batch["raw_frames"] # [B, C, F, H, W] whose value range from -1 to 1, e.g. torch.Size([Batch_size, 3, 49, 480, 720])

    # pre-process for vision encoder
    B, C, F, H, W = raw_frames.shape
    raw_frames = raw_frames.transpose(1, 2).flatten(0, 1) # B * F, C, H, W
    if self.args.align_models[0] in ['VideoMAEv2', 'VideoMAE', 'OminiMAE', "VJEPA", "VJEPA2"]:
        raw_frames = (raw_frames + 1.0) / 2.0
        raw_frames = Normalize([0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])(raw_frames) # should be NCHW
    elif self.args.align_models[0] == 'DINOv2':
        raw_frames = (raw_frames + 1.0) / 2.0
        raw_frames = Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)(raw_frames)
    else:
        raise NotImplementedError
    raw_frames = raw_frames.reshape(B, F, C, H, W).transpose(1, 2)

    # pre-process frames for Video Foundation Models
    assert len(self.args.align_models) == 1, "Support only align one model currently"
    if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
        repa_raw_frames = raw_frames[:, :, 1:] # remove the first frames
        B, C, F, H, W = repa_raw_frames.shape

        repa_raw_frames = repa_raw_frames.transpose(1, 2).flatten(0, 1)
        # 480x720 -> 160x240
        repa_raw_frames = torch.nn.functional.interpolate(repa_raw_frames, (H // 3, W // 3), mode='bicubic') # hard coded
        repa_raw_frames = repa_raw_frames.reshape(B, F, C, H // 3, W // 3).transpose(1, 2) # B, C, F, H, W
    elif self.args.align_models[0] == 'DINOv2':
        repa_raw_frames = raw_frames
        B, C, F, H, W = repa_raw_frames.shape
        repa_raw_frames = repa_raw_frames.transpose(1, 2).flatten(0, 1) # B * F, C, H, W
        input_resolution = (420, 630) # to fit the patch size 14 in DINOv2
        repa_raw_frames = torch.nn.functional.interpolate(repa_raw_frames, input_resolution, mode='bicubic')
        repa_raw_frames = repa_raw_frames.reshape(B, F, C, input_resolution[0], input_resolution[1]).transpose(1, 2) # B, C, F, H, W

    # encode the frames with vision encoders
    with torch.no_grad():
        if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
            B, C, F, H, W = repa_raw_frames.shape
            # encoding the frames with vision encoder: B, 3, 48, 160, 240 -> B, 24x10x15, C
            align_target = self.vision_encoder(repa_raw_frames)
            # B, 24x10x15, D -> B, D, 24, 10, 15
            align_target = align_target.transpose(1, 2).reshape(B, -1, F // self.vision_encoder.tubelet_size, H // self.vision_encoder.patch_size, W // self.vision_encoder.patch_size)
        elif self.args.align_models[0] == 'DINOv2':
            B, C, F, H, W = repa_raw_frames.shape
            repa_raw_frames = repa_raw_frames.transpose(1, 2).flatten(0, 1)
            group_size = 128 # 32 / 64 / 128 to avoid OOM
            chunked = repa_raw_frames.chunk((B * F) // group_size, dim=0)

            features = []
            for frames in chunked:
                group, C, H, W = frames.shape
                output = self.vision_encoder.forward_features(frames)['x_norm_patchtokens'].reshape(group, input_resolution[0] // self.vision_encoder.patch_size, input_resolution[1] // self.vision_encoder.patch_size, self.vision_encoder.embed_dim)
                features.append(output)
            features = torch.cat(features, dim=0)
            features = features.reshape(B, F, input_resolution[0] // self.vision_encoder.patch_size, input_resolution[1] // self.vision_encoder.patch_size, self.vision_encoder.embed_dim)

        align_targets = []
        if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
            align_target = align_target.flatten(2).transpose(1, 2) # B, 24x10x15, C
            align_targets.append(align_target)
        elif self.args.align_models[0] == 'DINOv2':
            first_frame_feature = features[:, :1].permute(0, 4, 1, 2, 3) # B, 1, H, W, C -> B, C, 1, H, W
            features = features[:, 1:]
            B, F, H, W, C = features.shape
            align_target = features.permute(0, 2, 3, 4, 1).flatten(0, 2)
            # To align with the features from CogVideoX, the encoded features are avg pooled to 1/4
            align_target = torch.nn.functional.avg_pool1d(align_target, kernel_size=4, stride=4)
            align_target = align_target.reshape(B, H, W, C, F // 4).permute(0, 3, 4, 1, 2)
            align_target = torch.cat([first_frame_feature, align_target], dim=2)
            align_target = align_target.flatten(2).transpose(1, 2) # B, 13x30x45, C
            align_targets.append(align_target)

    patch_size_t = self.state.transformer_config.patch_size_t
    if patch_size_t is not None:
        raise NotImplementedError("This is for CogVideoX1.5 but the 1.5 is not used in VideoREPA")
        ncopy = latent.shape[2] % patch_size_t
        # Copy the first frame ncopy times to match patch_size_t
        first_frame = latent[:, :, :1, :, :]
        latent = torch.cat([first_frame.repeat(1, 1, ncopy, 1, 1), latent], dim=2)
        assert latent.shape[2] % patch_size_t == 0

    batch_size, num_channels, num_frames, height, width = latent.shape

    # Get prompt embeddings
    _, seq_len, _ = prompt_embedding.shape
    prompt_embedding = prompt_embedding.view(batch_size, seq_len, -1).to(dtype=latent.dtype)

    # Sample a random timestep for each sample
    timesteps = torch.randint(
        0, self.components.scheduler.config.num_train_timesteps, (batch_size,), device=self.accelerator.device
    )
    timesteps = timesteps.long()

    # Add noise to latent
    latent = latent.permute(0, 2, 1, 3, 4) # from [B, C, F, H, W] to [B, F, C, H, W]
    noise = torch.randn_like(latent)
    latent_added_noise = self.components.scheduler.add_noise(latent, noise, timesteps)

    # Prepare rotary embeds
    vae_scale_factor_spatial = 2 ** (len(self.components.vae.config.block_out_channels) - 1)
    transformer_config = self.state.transformer_config
    rotary_emb = (
        self.prepare_rotary_positional_embeddings(
            height=height * vae_scale_factor_spatial,
            width=width * vae_scale_factor_spatial,
            num_frames=num_frames,
            transformer_config=transformer_config,
            vae_scale_factor_spatial=vae_scale_factor_spatial,
            device=self.accelerator.device,
        )
        if transformer_config.use_rotary_positional_embeddings
        else None
    )

    # Predict noise
    # aligns is a list of intermediate transformer features captured at the specified align_layer, after passing through small projector MLPs. One entry per projector/align dimension.
    predicted_noises, aligns = self.components.transformer(
        hidden_states=latent_added_noise,
        encoder_hidden_states=prompt_embedding,
        timestep=timesteps,
        image_rotary_emb=rotary_emb,
        return_dict=False,
    )
    predicted_noise = predicted_noises[0]

    # Aligning features from CogVideoX to pre-trained frozen vision encoders
    align = aligns[0]
    align = align.reshape(B, 13, 60 // 2, 90 // 2, -1) # TODO: remove hard coded
    if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
        # remove the first frame (Because the VFM targets (VideoMAEv2/VideoMAE/VJEPA/OminiMAE) are computed after dropping the first frame, the transformer aligns drop their first temporal slice to keep time indices matched.)
        align = align[:, 1:]
        aligns = [align]
    if self.args.align_models[0] == 'DINOv2':
        # DINOv2 is per-frame; they explicitly keep the first frame and then pool the rest, so no removal on that path.
        # Only able to perform REPA loss when using DINOv2
        assert self.args.loss == 'cosine_similarity'

    if self.args.loss == 'cosine_similarity':
        # REPA loss
        proj_loss = 0
        align = aligns[0].permute(0, 4, 1, 2, 3) # B, C, F, H, W
        if self.args.align_models[0] != "DINOv2":
            align = torch.nn.functional.interpolate(align, scale_factor=(2.0, 1.0, 1.0), mode='trilinear') # interpolate with scale_factor=(2,1,1): doubles frames F to match the VFM token frame rate used for targets (e.g., VideoMAEv2 path uses 24 temporal tokens). Skipped for DINOv2 because its path handles frames differently.

        if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
            # reshape to [B×F, C, H, W] and pass a stride-3 conv downsampler to reduce the generator’s spatial grid from 30×45 to 10×15, then reshape back. This matches the precomputed VFM token grid size.
            B, C, F, H, W = align.shape
            align = align.permute(0, 2, 1, 3, 4).reshape(B * F, C, H, W)
            align = self.components.transformer.downsampler_cogvideo_output(align.to(torch.bfloat16)) # 30x45 -> 10x15
            align = align.reshape(B, F, C, H // 3, W // 3).permute(0, 2, 1, 3, 4) # B, C, F, H, W

        # Flatten to token list, flatten(2) → HW tokens, transpose to [B, HW, C], then flatten batch and frames to [BFHW, C] to pair each generator token with a corresponding target token.
        align = align.flatten(2).transpose(1, 2).flatten(0, 1) # BFHW, C
        align_target = align_targets[0].flatten(0, 1)

        # L2-normalize both sides along channel so dot product equals cosine.
        align = torch.nn.functional.normalize(align, dim=-1)
        align_target = torch.nn.functional.normalize(align_target, dim=-1)
        assert align_target.shape[-1] == align.shape[-1] == self.args.align_dims[0] # NOTE here

        # compute negative cosine: -(target · align) per token, average over tokens → scalar. Negative sign means minimizing the loss maximizes cosine similarity.
        proj_loss += (-(align_target * align)).sum(dim=-1).mean(dim=0)

    elif self.args.loss == 'token_relation_distillation':
        # TRD loss in VideoREPA
        assert len(aligns) == 1
        # enable standard interpolation/convolution ops.
        align = aligns[0].permute(0, 4, 1, 2, 3) # B, F, H, W, C -> B, C, F, H, W (e.g. B, 768, 12, 30, 45)
        # upsample the temporal dimension (from 12 to 24) to match the dimension in VideoMAEv2
        align = torch.nn.functional.interpolate(align, scale_factor=(2.0, 1.0, 1.0), mode='trilinear')

        # downsample the representation of VDM to match the VFM’s spatial token grid size (10×15), then treat each spatial token per frame as a token sequence, and ensure one-to-one token correspondence with VFM tokens.
        B, C, F, H, W = align.shape
        align = align.permute(0, 2, 1, 3, 4).reshape(B * F, C, H, W)
        align = self.components.transformer.downsampler_cogvideo_output(align.to(torch.bfloat16))
        align = align.reshape(B, F, C, H // 3, W // 3)

        align = align.permute(0, 1, 3, 4, 2) # B, F, H, W, C
        token_relation_distillation_loss = 0
        align = align.flatten(2, 3) # B, F, H*W, C
        align_target = align_targets[0].reshape(B, F, 10 * 15, -1) # B, 12, 10 * 15, D

        # L2 normalize before calculate Gram matrix
        align = torch.nn.functional.normalize(align, dim=-1)
        align_target = torch.nn.functional.normalize(align_target, dim=-1)
        assert align.shape[-1] == align_target.shape[-1] == self.args.align_dims[0]

        # Compute token–token similarity (Gram) matrices for model and target. TRD matches relational structure (pairwise cosine) across all tokens (spatial and temporal).
        # BF, HW, C @ BF, C, FHW -> BF, HW, FHW
        align_sim = torch.bmm(align.flatten(0, 1), align.flatten(1, 2).unsqueeze(1).expand(-1, F, -1, -1).flatten(0, 1).transpose(1, 2))
        align_target_sim = torch.bmm(align_target.flatten(0, 1), align_target.flatten(1, 2).unsqueeze(1).expand(-1, F, -1, -1).flatten(0, 1).transpose(1, 2))
        assert align_sim.shape == align_target_sim.shape
        # or refer to more concise implementation: B, FHW, C @ B, C, FHW -> B, FHW, FHW
        # align_sim = torch.bmm(align.flatten(1, 2), align.flatten(1, 2).transpose(1, 2))
        # align_target_sim = torch.bmm(align_target.flatten(1, 2), align_target.flatten(1, 2).transpose(1, 2))
        # penalize when relational discrepancies exceed margin; tolerate small deviations.
        token_relation_distillation_loss = nn.functional.relu((align_sim - align_target_sim).abs() - self.args.margin).mean()
    else:
        raise NotImplementedError

    # Denoise
    latent_pred = self.components.scheduler.get_velocity(predicted_noise, latent_added_noise, timesteps)

    alphas_cumprod = self.components.scheduler.alphas_cumprod[timesteps]
    weights = 1 / (1 - alphas_cumprod)
    while len(weights.shape) < len(latent_pred.shape):
        weights = weights.unsqueeze(-1)

    loss = torch.mean((weights * (latent_pred - latent) ** 2).reshape(batch_size, -1), dim=1)
    loss = loss.mean()

    if self.args.loss == 'token_relation_distillation' or self.args.loss == 'token_relation_distillation_only_spatial' or self.args.loss == 'token_relation_distillation_only_temporal':
        return [loss, None, token_relation_distillation_loss]
    return [loss, proj_loss]

PhysWorld (Nov 24’)

How Far is Video Generation from World Model: A Physical Law Perspective

https://youtu.be/yWSn9Xyrkso?si=YJPPePMqpPm6VdD9

When retrieving training cases to generalize from, the models prioritize attributes in a specific order: color > size > velocity > shape. Because shape ranks last, this helps explain why video generation models often struggle to maintain object (shape) consistency.

  • Simply scaling video generation models and data is insufficient for them to discover fundamental physical laws and generalize robustly to out-of-distribution scenarios.

  • Instead, improving combinatorial diversity in training data is crucial for better physical video modeling. The models’ generalization mechanism relies more on memorization and case-based imitation than on learning universal rules.