Literature Review (Position Papers)

PhysWorld

How Far is Video Generation from World Model: A Physical Law Perspective

https://youtu.be/yWSn9Xyrkso?si=YJPPePMqpPm6VdD9

When generalizing from the training data, the models prioritize attributes in a specific order: color > size > velocity > shape. Since shape has the lowest priority, this helps explain why video generation models often struggle to maintain object/shape consistency.

  • Simply scaling video generation models and data is insufficient for them to discover fundamental physical laws and generalize robustly to out-of-distribution scenarios.
  • Instead, improving combinatorial diversity in training data is crucial for better physical video modeling. The models’ generalization mechanism relies more on memorization and case-based imitation than on learning universal rules.

Literature Review (Methods)

VideoREPA

VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

A novel framework that distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations.

  1. Physics Understanding Gap Identification: We identify and quantify the significant physics understanding gap between self-supervised video foundation models and text-to-video diffusion models.
  2. Token Relation Distillation (TRD) Loss: We introduce a novel TRD loss that effectively distills physics knowledge through spatio-temporal token relation alignment, overcoming key limitations of direct REPA application.
  3. VideoREPA Framework: We present the first representation alignment method specifically designed for finetuning video diffusion models and injecting physical knowledge into T2V generation.

About REPA

  • From the paper “Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think” (ICLR 2025)
  • Similar to a perceptual loss, but it aligns the internal hidden states of the diffusion model (as it processes noisy inputs at each timestep) directly with features of the clean, original image from an external encoder (DINOv2).
    • injects supervision directly into the diffusion model’s internal representations during the denoising process.
  • Shows improved FID at the same number of training iterations (i.e., more training-efficient)
  • In implementation, since the hidden states from the diffusion transformer and the features from DINOv2 don’t naturally live in the same space, an extra projection head (a three-layer MLP with SiLU activations) is needed.
    • REPA operates at the token-to-token level, where each token corresponds to a patch in the latent image space.
    • As the paper shows, this simple trick allows the model to use alignment primarily in the early layers, leaving later layers free to focus on refining fine details.
    • The MLP projection head is attached at the single alignment depth where REPA is applied, not at every layer. The paper finds that aligning the output of the 8th transformer block (e.g., in SiT-XL/2) is sufficient and optimal; because the alignment gradient only flows through blocks up to that depth, it effectively regularizes the first 8 blocks while later blocks stay free to refine details. (A minimal sketch of the alignment objective follows this list.)
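
A minimal sketch of the REPA-style alignment term, assuming a hidden state h taken at the alignment depth, DINOv2 patch features y of the clean image, and a three-layer SiLU projector; the names and dimensions here are illustrative, not the paper's code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaProjector(nn.Module):
    # Three-layer MLP with SiLU, projecting the DiT hidden dim into the DINOv2 feature dim.
    def __init__(self, hidden_dim: int, target_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, target_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.mlp(h)

def repa_loss(h: torch.Tensor, y: torch.Tensor, projector: RepaProjector) -> torch.Tensor:
    # h: [B, N, hidden_dim] hidden states computed from the noisy input at the alignment layer.
    # y: [B, N, target_dim] frozen DINOv2 patch features of the clean image.
    # Returns the negative mean patch-wise cosine similarity (minimizing it maximizes alignment).
    z = F.normalize(projector(h), dim=-1)
    y = F.normalize(y, dim=-1)
    return -(z * y).sum(dim=-1).mean()

This term is added to the usual diffusion objective with a weighting coefficient.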

Now come back to VideoREPA.

  • The goal of REPA is to accelerate training from scratch by helping diffusion models converge faster through direct alignment of feature representations, essentially acting as a training-efficiency booster.
  • The goal of VideoREPA is to inject specific, high-fidelity physics understanding into pre-trained text-to-video models through finetuning.

VideoREPA introduced:

  • a novel variant called Token Relation Distillation (TRD) loss. Rather than directly aligning feature vectors as standard REPA does, TRD aligns relational structures (pairwise similarities between tokens) across both spatial and temporal dimensions, which better suits the challenges of finetuning pre-trained video models and capturing physical dynamics.
    • Token Relation Distillation computes pairwise token similarities within frames (spatial) and across frames (temporal) and matches those relations with an L1 “soft” objective, which is more stable for finetuning (a minimal sketch follows this list).
  • Similar to REPA, VideoREPA injects supervision directly into the diffusion model’s internal representations during the denoising process.
    • REPA “hard-aligns” token features with a cosine-similarity objective to speed up from-scratch image DiT training.
    • VideoREPA “soft-aligns” relations between tokens (pairwise similarity matrices) with an L1 + margin objective, across space and time, for finetuning.
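
A minimal sketch of the TRD objective under simplified shapes, assuming student and teacher tokens have already been projected/downsampled into one-to-one correspondence; the margin value is illustrative (the full shape handling is in the repo code further below):

import torch
import torch.nn.functional as F

def trd_loss(student: torch.Tensor, teacher: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    # student, teacher: [B, F, N, C] tokens (F frames, N spatial tokens per frame).
    # Align pairwise token relations (spatial and temporal) instead of raw features.
    s = F.normalize(student, dim=-1).flatten(1, 2)   # [B, F*N, C]
    t = F.normalize(teacher, dim=-1).flatten(1, 2)   # [B, F*N, C]
    sim_s = torch.bmm(s, s.transpose(1, 2))          # [B, F*N, F*N] pairwise cosine (Gram matrix)
    sim_t = torch.bmm(t, t.transpose(1, 2))
    # L1 difference with a margin: small relational deviations are tolerated ("soft" alignment).
    return F.relu((sim_s - sim_t).abs() - margin).mean()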

Why does VideoREPA assume VideoMAEv2 already knows the physics, such that distilling from it will transfer physics understanding?

  • Why a video SSL teacher has physics signal: predicting missing frames/patches in real videos is easier if the model encodes who-is-where, how things move across time, when/where contacts happen, and how fluids deform. Those cues are embedded in token relations (similarities within and across frames).
  • Why distillation can transfer it: aligning pairwise token–token similarities forces the student’s mid-layer to organize space–time tokens the way the teacher does. That makes the denoiser’s score field prefer trajectories with smoother motion, consistent contacts, and fewer shape glitches, which shows up as more physics-plausible generations.
  • Evidence and limits: in the paper’s results, the teacher’s features outperform the T2V model’s features on a physics-understanding probe, and after alignment the gap narrows and generations improve. This doesn’t give exact simulators or guarantees; gains depend on the teacher’s pretraining and cover “commonsense physics” seen in natural videos, not precise quantitative laws.

Diving into the code

https://github.com/aHapBean/VideoREPA

The paper used full model supervised fine-tuning (SFT) for the smaller CogVideoX-2B model, and LoRA (Low-Rank Adaptation) training for the larger CogVideoX-5B model:

  • CogVideoX-2B: “full-parameter finetuning” (all weights updated, classic SFT).
    • SFT here: the full transformer is trainable; the VAE and text encoder stay frozen.
  • CogVideoX-5B: LoRA-based parameter-efficient finetuning, with explicit LoRA rank and alpha (rank 128, alpha 64).
    • LoRA: adds adapters to the transformer; additionally trains the VideoREPA projector heads and the downsampler (a rough configuration sketch follows).
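
For the CogVideoX-5B setting, the LoRA configuration would look roughly like the sketch below (a peft-style illustration using the reported rank/alpha; the target module names and model id are assumptions, not taken from the repo):

import torch
from diffusers import CogVideoXTransformer3DModel
from peft import LoraConfig, get_peft_model

# Reported hyperparameters: LoRA rank 128, alpha 64. Target modules are illustrative attention projections.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
lora_config = LoraConfig(
    r=128,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
transformer = get_peft_model(transformer, lora_config)
# The VideoREPA projector heads and the spatial downsampler remain fully trainable alongside the adapters.

The repo's compute_loss, reproduced below, then wires the REPA / TRD losses into the CogVideoX training step.
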
def compute_loss(self, batch) -> torch.Tensor:
    prompt_embedding = batch["prompt_embedding"]
    latent = batch["encoded_videos"]
    raw_frames = batch["raw_frames"]  # [B, C, F, H, W] whose values range from -1 to 1, e.g. torch.Size([Batch_size, 3, 49, 480, 720])

    # pre-process for vision encoder
    B, C, F, H, W = raw_frames.shape
    raw_frames = raw_frames.transpose(1, 2).flatten(0, 1)  # B * F, C, H, W
    if self.args.align_models[0] in ['VideoMAEv2', 'VideoMAE', 'OminiMAE', "VJEPA", "VJEPA2"]:
        raw_frames = (raw_frames + 1.0) / 2.0
        raw_frames = Normalize([0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])(raw_frames)  # should be NCHW
    elif self.args.align_models[0] == 'DINOv2':
        raw_frames = (raw_frames + 1.0) / 2.0
        raw_frames = Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)(raw_frames)
    else:
        raise NotImplementedError
    raw_frames = raw_frames.reshape(B, F, C, H, W).transpose(1, 2)

    # pre-process frames for Video Foundation Models
    assert len(self.args.align_models) == 1, "Support only align one model currently"
    if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
        repa_raw_frames = raw_frames[:, :, 1:]  # remove the first frame
        B, C, F, H, W = repa_raw_frames.shape

        repa_raw_frames = repa_raw_frames.transpose(1, 2).flatten(0, 1)
        # 480x720 -> 160x240
        repa_raw_frames = torch.nn.functional.interpolate(repa_raw_frames, (H // 3, W // 3), mode='bicubic')  # hard coded
        repa_raw_frames = repa_raw_frames.reshape(B, F, C, H // 3, W // 3).transpose(1, 2)  # B, C, F, H, W
    elif self.args.align_models[0] == 'DINOv2':
        repa_raw_frames = raw_frames
        B, C, F, H, W = repa_raw_frames.shape
        repa_raw_frames = repa_raw_frames.transpose(1, 2).flatten(0, 1)  # B * F, C, H, W
        input_resolution = (420, 630)  # to fit the patch size 14 in DINOv2
        repa_raw_frames = torch.nn.functional.interpolate(repa_raw_frames, input_resolution, mode='bicubic')
        repa_raw_frames = repa_raw_frames.reshape(B, F, C, input_resolution[0], input_resolution[1]).transpose(1, 2)  # B, C, F, H, W

    # encode the frames with vision encoders
    with torch.no_grad():
        if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
            B, C, F, H, W = repa_raw_frames.shape
            # encoding the frames with vision encoder: B, 3, 48, 160, 240 -> B, 24x10x15, C
            align_target = self.vision_encoder(repa_raw_frames)
            # B, 24x10x15, D -> B, D, 24, 10, 15
            align_target = align_target.transpose(1, 2).reshape(B, -1, F // self.vision_encoder.tubelet_size, H // self.vision_encoder.patch_size, W // self.vision_encoder.patch_size)
        elif self.args.align_models[0] == 'DINOv2':
            B, C, F, H, W = repa_raw_frames.shape
            repa_raw_frames = repa_raw_frames.transpose(1, 2).flatten(0, 1)
            group_size = 128  # 32 / 64 / 128 to avoid OOM
            chunked = repa_raw_frames.chunk((B * F) // group_size, dim=0)

            features = []
            for frames in chunked:
                group, C, H, W = frames.shape
                output = self.vision_encoder.forward_features(frames)['x_norm_patchtokens'].reshape(group, input_resolution[0] // self.vision_encoder.patch_size, input_resolution[1] // self.vision_encoder.patch_size, self.vision_encoder.embed_dim)
                features.append(output)
            features = torch.cat(features, dim=0)
            features = features.reshape(B, F, input_resolution[0] // self.vision_encoder.patch_size, input_resolution[1] // self.vision_encoder.patch_size, self.vision_encoder.embed_dim)

        align_targets = []
        if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
            align_target = align_target.flatten(2).transpose(1, 2)  # B, 24x10x15, C
            align_targets.append(align_target)
        elif self.args.align_models[0] == 'DINOv2':
            first_frame_feature = features[:, :1].permute(0, 4, 1, 2, 3)  # B, 1, H, W, C -> B, C, 1, H, W
            features = features[:, 1:]
            B, F, H, W, C = features.shape
            align_target = features.permute(0, 2, 3, 4, 1).flatten(0, 2)
            # To align with the features from CogVideoX, the encoded features are avg pooled to 1/4
            align_target = torch.nn.functional.avg_pool1d(align_target, kernel_size=4, stride=4)
            align_target = align_target.reshape(B, H, W, C, F // 4).permute(0, 3, 4, 1, 2)
            align_target = torch.cat([first_frame_feature, align_target], dim=2)
            align_target = align_target.flatten(2).transpose(1, 2)  # B, 13x30x45, C
            align_targets.append(align_target)

    patch_size_t = self.state.transformer_config.patch_size_t
    if patch_size_t is not None:
        raise NotImplementedError("This is for CogVideoX1.5 but the 1.5 is not used in VideoREPA")
        ncopy = latent.shape[2] % patch_size_t
        # Copy the first frame ncopy times to match patch_size_t
        first_frame = latent[:, :, :1, :, :]
        latent = torch.cat([first_frame.repeat(1, 1, ncopy, 1, 1), latent], dim=2)
        assert latent.shape[2] % patch_size_t == 0

    batch_size, num_channels, num_frames, height, width = latent.shape

    # Get prompt embeddings
    _, seq_len, _ = prompt_embedding.shape
    prompt_embedding = prompt_embedding.view(batch_size, seq_len, -1).to(dtype=latent.dtype)

    # Sample a random timestep for each sample
    timesteps = torch.randint(
        0, self.components.scheduler.config.num_train_timesteps, (batch_size,), device=self.accelerator.device
    )
    timesteps = timesteps.long()

    # Add noise to latent
    latent = latent.permute(0, 2, 1, 3, 4)  # from [B, C, F, H, W] to [B, F, C, H, W]
    noise = torch.randn_like(latent)
    latent_added_noise = self.components.scheduler.add_noise(latent, noise, timesteps)

    # Prepare rotary embeds
    vae_scale_factor_spatial = 2 ** (len(self.components.vae.config.block_out_channels) - 1)
    transformer_config = self.state.transformer_config
    rotary_emb = (
        self.prepare_rotary_positional_embeddings(
            height=height * vae_scale_factor_spatial,
            width=width * vae_scale_factor_spatial,
            num_frames=num_frames,
            transformer_config=transformer_config,
            vae_scale_factor_spatial=vae_scale_factor_spatial,
            device=self.accelerator.device,
        )
        if transformer_config.use_rotary_positional_embeddings
        else None
    )

    # Predict noise
    # aligns is a list of intermediate transformer features captured at the specified align_layer, after passing through small projector MLPs. One entry per projector/align dimension.
    predicted_noises, aligns = self.components.transformer(
        hidden_states=latent_added_noise,
        encoder_hidden_states=prompt_embedding,
        timestep=timesteps,
        image_rotary_emb=rotary_emb,
        return_dict=False,
    )
    predicted_noise = predicted_noises[0]

    # Aligning features from CogVideoX to pre-trained frozen vision encoders
    align = aligns[0]
    align = align.reshape(B, 13, 60 // 2, 90 // 2, -1)  # TODO: remove hard coded
    if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
        # remove the first frame (because the VFM targets (VideoMAEv2/VideoMAE/VJEPA/OminiMAE) are computed after dropping the first frame, the transformer aligns drop their first temporal slice to keep time indices matched)
        align = align[:, 1:]
        aligns = [align]
    if self.args.align_models[0] == 'DINOv2':
        # DINOv2 is per-frame; the first frame is explicitly kept and the rest are pooled, so no removal on that path.
        # Only able to perform REPA loss when using DINOv2
        assert self.args.loss == 'cosine_similarity'

    if self.args.loss == 'cosine_similarity':
        # REPA loss
        proj_loss = 0
        align = aligns[0].permute(0, 4, 1, 2, 3)  # B, C, F, H, W
        if self.args.align_models[0] != "DINOv2":
            # interpolate with scale_factor=(2,1,1): doubles frames F to match the VFM token frame rate used for targets (e.g., VideoMAEv2 path uses 24 temporal tokens). Skipped for DINOv2 because its path handles frames differently.
            align = torch.nn.functional.interpolate(align, scale_factor=(2.0, 1.0, 1.0), mode='trilinear')

        if self.args.align_models[0] in ['VideoMAEv2', 'VJEPA', 'VJEPA2', 'VideoMAE', 'OminiMAE']:
            # reshape to [B×F, C, H, W] and pass a stride-3 conv downsampler to reduce the generator’s spatial grid from 30×45 to 10×15, then reshape back. This matches the precomputed VFM token grid size.
            B, C, F, H, W = align.shape
            align = align.permute(0, 2, 1, 3, 4).reshape(B * F, C, H, W)
            align = self.components.transformer.downsampler_cogvideo_output(align.to(torch.bfloat16))  # 30x45 -> 10x15
            align = align.reshape(B, F, C, H // 3, W // 3).permute(0, 2, 1, 3, 4)  # B, C, F, H, W

        # Flatten to a token list: flatten(2) → HW tokens, transpose to [B, HW, C], then flatten batch and frames to [BFHW, C] to pair each generator token with a corresponding target token.
        align = align.flatten(2).transpose(1, 2).flatten(0, 1)  # BFHW, C
        align_target = align_targets[0].flatten(0, 1)

        # L2-normalize both sides along channel so dot product equals cosine.
        align = torch.nn.functional.normalize(align, dim=-1)
        align_target = torch.nn.functional.normalize(align_target, dim=-1)
        assert align_target.shape[-1] == align.shape[-1] == self.args.align_dims[0]  # NOTE here

        # compute negative cosine: -(target · align) per token, average over tokens → scalar. Negative sign means minimizing the loss maximizes cosine similarity.
        proj_loss += (-(align_target * align)).sum(dim=-1).mean(dim=0)

    elif self.args.loss == 'token_relation_distillation':
        # TRD loss in VideoREPA
        assert len(aligns) == 1
        # enable standard interpolation/convolution ops.
        align = aligns[0].permute(0, 4, 1, 2, 3)  # B, F, H, W, C -> B, C, F, H, W (e.g. B, 768, 12, 30, 45)
        # upsample the temporal dimension (from 12 to 24) to match the dimension in VideoMAEv2
        align = torch.nn.functional.interpolate(align, scale_factor=(2.0, 1.0, 1.0), mode='trilinear')

        # downsample the representation of the VDM to match the VFM’s spatial token grid size (10×15), then treat each spatial token per frame as a token sequence, and ensure one-to-one token correspondence with VFM tokens.
        B, C, F, H, W = align.shape
        align = align.permute(0, 2, 1, 3, 4).reshape(B * F, C, H, W)
        align = self.components.transformer.downsampler_cogvideo_output(align.to(torch.bfloat16))
        align = align.reshape(B, F, C, H // 3, W // 3)

        align = align.permute(0, 1, 3, 4, 2)  # B, F, H, W, C
        token_relation_distillation_loss = 0
        align = align.flatten(2, 3)  # B, F, H*W, C
        align_target = align_targets[0].reshape(B, F, 10 * 15, -1)  # B, 12, 10 * 15, D

        # L2 normalize before calculating the Gram matrix
        align = torch.nn.functional.normalize(align, dim=-1)
        align_target = torch.nn.functional.normalize(align_target, dim=-1)
        assert align.shape[-1] == align_target.shape[-1] == self.args.align_dims[0]

        # Compute token–token similarity (Gram) matrices for model and target. TRD matches relational structure (pairwise cosine) across all tokens (spatial and temporal).
        # BF, HW, C @ BF, C, FHW -> BF, HW, FHW
        align_sim = torch.bmm(align.flatten(0, 1), align.flatten(1, 2).unsqueeze(1).expand(-1, F, -1, -1).flatten(0, 1).transpose(1, 2))
        align_target_sim = torch.bmm(align_target.flatten(0, 1), align_target.flatten(1, 2).unsqueeze(1).expand(-1, F, -1, -1).flatten(0, 1).transpose(1, 2))
        assert align_sim.shape == align_target_sim.shape
        # or refer to the more concise implementation: B, FHW, C @ B, C, FHW -> B, FHW, FHW
        # align_sim = torch.bmm(align.flatten(1, 2), align.flatten(1, 2).transpose(1, 2))
        # align_target_sim = torch.bmm(align_target.flatten(1, 2), align_target.flatten(1, 2).transpose(1, 2))
        # penalize when relational discrepancies exceed the margin; tolerate small deviations.
        token_relation_distillation_loss = nn.functional.relu((align_sim - align_target_sim).abs() - self.args.margin).mean()
    else:
        raise NotImplementedError

    # Denoise
    latent_pred = self.components.scheduler.get_velocity(predicted_noise, latent_added_noise, timesteps)

    alphas_cumprod = self.components.scheduler.alphas_cumprod[timesteps]
    weights = 1 / (1 - alphas_cumprod)
    while len(weights.shape) < len(latent_pred.shape):
        weights = weights.unsqueeze(-1)

    loss = torch.mean((weights * (latent_pred - latent) ** 2).reshape(batch_size, -1), dim=1)
    loss = loss.mean()

    if self.args.loss == 'token_relation_distillation' or self.args.loss == 'token_relation_distillation_only_spatial' or self.args.loss == 'token_relation_distillation_only_temporal':
        return [loss, None, token_relation_distillation_loss]
    return [loss, proj_loss]

Literature Review (Benchmarks and Metrics)

Embodied AI Agents: Modeling the World (‘25July)

Type of Contributions (only what we care):
Key Takeaway:

Physics Properties to look for:

WM-ABench (‘25June)

Type of Contributions (only what we care):
Key Takeaway:

Physics Properties to look for:

IntPhys 2 (‘25 June)

Type of Contributions (only what we care):

● (Benchmark)

○ Debug set: 60 scenes (each with 3 additional videos), static camera, for model calibration and for evaluating sensitivity to noise (checking whether the model is affected by subtle pixel-level variations and is robust to imperceptible noise)

○ Main set: 1012 videos, static and moving camera, used as the evaluation set (zero-shot eval)

○ Test set: 344 videos, moving camera (not released, with GT)

● Probably the only benchmark with a moving camera, whereas older ones (GRASP, InfLevel, and IntPhys 1) are static? But VideoPhy2, WISA, and WorldModelBench might also have moving videos

Key Takeaway:

● UE4 can be used to its full potential to populate photorealistic scenes for benchmarking

● Dynamic camera movement makes for a harder benchmark

● Improved realism and variety compared to IntPhys 1

● Increased short-term memory demand: important for prediction accuracy. Objects go out of frame due to camera movement and reappear later, testing short-term memory

● SOTA performance on this benchmark is close to chance (50%), and surprisingly models sometimes perform better on the moving-camera setting

Physics Properties to look for:

● Permanence

● Immutability

● Spatio-Temporal Continuity

● Solidity

DiffPhy (‘25May)

Type of Contributions (only what we care):

● (Method) Use an MLLM to analyze the prompt and tune the diffusion model in ControlNet/LoRA style with:

○ MLLM-based Physical phenomena loss: physical principles (are they matching YES/NO)

○ MLLM-based Commonsense loss: physical plausibility (how well are they matching 1 to 5 scale)

○ MLLM-based Semantic consistency loss: ensure the generated content still follows the input prompt

● (Dataset) 8000 vids from VIDGEN-1M and extracted physical phenomena labels

Key Takeaway:

● User prompts are often simple and incomplete

● Chain-of-Thought prompting can reason whether a prompt contains proper physical context and generate a list of physical phenomena

● A “Cookbook” on a list of physical attributes can be used as prompt to obtain a list of physical phenomena associated with the target event.

● It assumes that MLLM has the ability to effectively detect physical principles/phenomena?

○ Limitations show MLLMs struggle to interpret videos, particularly in determining the physical commonsense of complex scenarios.

Physics Properties to look for:

● Underlying forces (e.g. gravity, friction)

● Kinematic relationships (e.g. constant velocity, acceleration)

● Interaction rules (e.g. elastic vs inelastic collisions)

TRAJAN/Direct Motion Models (‘25May)

Type of Contributions (only what we care):

● (Representation) Better measures plausible object interactions and motion by using TRAJAN to obtain latent representations

● (Metric) For distribution-level comparison, Fréchet distance (a generic sketch follows this list)

● (Metric) For pairwise-level, L2 distance

● (Metric) For single video, ordinal number (reconstruction score)
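
A generic sketch of the distribution-level Fréchet distance over embedding sets (the standard FID-style formula; this is not TRAJAN's code, and the embeddings are assumed to be per-video latents):

import numpy as np
from scipy import linalg

def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    # x: [N, D], y: [M, D] embedding sets (e.g., TRAJAN latents for real vs. generated videos).
    # Frechet distance between Gaussian fits of the two sets.
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_x @ cov_y, disp=False)   # matrix square root of the covariance product
    covmean = covmean.real                                  # discard tiny imaginary parts from numerical error
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))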

Key Takeaway:

● TRAJAN embeddings capture similarities in motion even when appearance-based pixel-level metrics suggest that they are different.

● TRAJAN shows highest temporal sensitivity in Frechet and Reconstruction.

Physics Properties to look for:

● Consistency (Appearance and motion)

● Interactions

● Speed (slow/normal/fast)

VideoPhy2(‘25 March)

Type of Contributions (only what we care):

● (Dataset) 197 actions, 3940 prompts

● (Metric) VideoPhy2 for Semantic adherence score (1-5), physical commonsense (1-5), and physical rule classification(0-2)

Key Takeaway:

● VideoPhy2 finetuned from VideoPhy1 VLM on 50k human annotations

● Claimed at least 1.5X performance on VideoPhy1, on unseen prompts and unseen video models

Physics Properties to look for:

● Physical Commonsense

● Physical Rule classification

ImpossibleVideo(‘25 March)

Type of Contributions (only what we care):

● (Dataset) 2,600 synthetic videos generated from 260 text prompts in the IPV-TXT suite.

● (Dataset) IPV-VID contains 902 filtered videos (from generated+internet+OpenVid)

Key Takeaway:

● Heavily relied on human evaluation to find out what

● Designed Tasks for Video-LLMs:

○ Judgment Task: Models classify videos as synthetic or real. This is a binary classification problem.

○ Multi-Choice QA (MCQA) Task: Models choose the best description of the impossible phenomenon from multiple options, including carefully designed distractors.

○ Open-Ended QA (OpenQA) Task: Models independently describe what makes the video impossible or unusual. This is evaluated by an LLM (like GPT-4o) comparing model responses to human annotations and assigning a semantic alignment score.

Physics Properties to look for:

● Physical: Conservation Laws, Mechanics, Thermal, Optics, Fluid, Material Properties

● Biological: Morphology

WorldModelBench(‘25 Feb)

Type of Contributions (only what we care):

● (dataset) 350 image and text condition pairs

Key Takeaway:

Physics Properties to look for:

Physics-IQ (‘25 Jan)

Type of Contributions (only what we care):

● (Benchmark) 396 videos, 8 s long, 66 different physical scenarios, 30 fps, 3840x2160, filmed from 3 perspectives (left, center, right)

● (Metric) Spatial IoU (a rough sketch follows this list)

● (Metric) Spatiotemporal IoU

● (Metric) Weighted Spatial IoU

● (Metric) MSE
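
A rough sketch of a Spatial-IoU-style metric, assuming it is computed over binarized motion masks aggregated over time (my reading of the metric; the threshold value and grayscale input are illustrative simplifications):

import torch

def spatial_iou(real: torch.Tensor, gen: torch.Tensor, thresh: float = 0.05) -> torch.Tensor:
    # real, gen: [T, H, W] grayscale frames in [0, 1] from the same static camera view.
    def motion_mask(video: torch.Tensor) -> torch.Tensor:
        diff = (video[1:] - video[:-1]).abs() > thresh   # [T-1, H, W] per-frame motion
        return diff.any(dim=0)                           # [H, W] union over time: where motion happened
    m_real, m_gen = motion_mask(real), motion_mask(gen)
    inter = (m_real & m_gen).sum().float()
    union = (m_real | m_gen).sum().float().clamp_min(1.0)
    return inter / union                                 # overlap of "where action happened"

The spatiotemporal variant would presumably keep the per-frame masks instead of collapsing them, and the weighted variant would weight pixels by how much motion occurs there.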

Key Takeaway:

● Metrics only work on static camera views

● 2AFC paradigm on Visual realism

Physics Properties to look for:

● Fluid dynamics

● Optics

● Solid Mechanics

● Magnetism

● Thermodynamics

VideoScore (‘24 June)

Type of Contributions (only what we care):

● Dataset (Human score on 37.6K videos)

● (Metric) VideoScore-Factual Correctness

Key Takeaway:

● Best-of-k sampling shows general improvement

● Not working very well according to multiple researchers who tried it

Physics Properties to look for:

● Factual Correctness (FC)

PhyGenBench(‘24 October)

Type of Contributions (only what we care):

● (Benchmark) 160 prompts across 27 distinct physical laws, in 4 domains

Key Takeaway:

● The evaluation framework is dated because it treats videos frame by frame

● Scaling up models can solve some issues but still fails to handle dynamic physical phenomena.

● Prompt engineering only solves a few simple issues (e.g., flame color).

Physics Properties to look for:

● Mechanics (Gravity, Buoyancy, Elasticity, Friction)

● Optics (Reflection, Refraction, Interference & Diffraction, Tyndall Effect)

● Thermal (Sublimation, Melting, Boiling, Liquefaction)

● Material Properties (Hardness, Solubility, Dehydration property, Flame Reaction)

VideoPhy(‘24 June)

Type of Contributions (only what we care):

● (Dataset) 688 prompts for generating video and human-labeled annotations for physical commonsense

● (Metric) VLM “VideoCon-Physics” proven to beat Gemini-1.5-Pro on semantic adherence and physical commonsense evaluation

Key Takeaway:

● Finetune VideoCon-7B (originally trained on real videos for robust semantic-adherence evaluation) with 12,000 human annotations (half for semantic adherence, half for physical commonsense) to maximize the log-likelihood of the Yes/No answer (a generic sketch of this objective follows below).

● Gemini-1.5-Pro-Vision achieves up to 58 ROC-AUC on physical commonsense (PC), while VideoCon-Physics achieves up to 73.
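
A generic sketch of a Yes/No log-likelihood objective of this kind (illustrative only; not the VideoCon-Physics code, and the handling of the answer position/tokens is an assumption):

import torch
import torch.nn.functional as F

def yes_no_loss(answer_logits: torch.Tensor, yes_id: int, no_id: int, label_is_yes: torch.Tensor) -> torch.Tensor:
    # answer_logits: [B, vocab_size] VLM logits at the answer position.
    # Maximizes the log-likelihood of the correct "Yes"/"No" token by minimizing
    # cross-entropy restricted to those two tokens.
    pair = answer_logits[:, [yes_id, no_id]]   # [B, 2]
    target = (~label_is_yes).long()            # 0 -> Yes, 1 -> No
    return F.cross_entropy(pair, target)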

Physics Properties to look for:

● Physical Common (PC) Sense

IntPhys(‘20 Feb)

Type of Contributions (only what we care):

● (Benchmark) 3 types of physical reasoning blocks (object permanence, shape constancy, spatio-temporal continuity), each contrasting possible vs. impossible events

Key Takeaway:

● Constructed 15K videos (7 s, 15 fps) of possible events as a training set, using UE4

● Diagnoses how much a given system understands about physics by testing its ability to tell apart well-matched videos of possible versus impossible events

Physics Properties to look for:

● Intrinsic properties

○ Object permanence (O1) - objects continuously exist and do not pop in/out,

■ as this turns into the computational challenge of tracking objects through occlusion

○ Shape constancy (O2) - tendency of rigid objects to preserve their shape through time

■ Rigid objects can undergo a change in appearance due to other factors

● Spatio-temporal continuity

○ Continuous motions and trajectories

- CRAFT

- IntPhys

- Physion

- ESPRIT

- Physion++

- CoPhy

- CLEVRER

- PhyWorld

Eyeballing

Archive on previous attempts


Common Mistakes in Videos (L1/L2)? We need something very simple:

● If something is easy to detect, then it’s not what we should focus on

● Subtle, also task complexity

Attempt #2 on classifying the categories

Derive a better version from Attempt #1

Quality, Objectness, Time-Flow, Material Properties

L1 is easier to spot, while L2 is subtle/harder to spot

1. Perception Abnormalities

This category encompasses anomalies that primarily affect the visual fidelity, rendering, and overall aesthetic appearance of the scene or its elements. These glitches often manifest as visual corruption such as noise, lack of detail, or incorrect coloration, affecting how things look without necessarily breaking fundamental physical laws, body rigidity, object proportion, or the flow of time.

Visual Artifacting (VA): Random blocks, noises, pixelation, or static-like distortions momentarily appear on surfaces or around objects. (This directly impacts visual fidelity.)

Lighting Flickering/Shifting (LS): Light sources or ambient lighting unnaturally flicker, change intensity, or cast shadows in illogical ways. (Affects the visual illumination and overall scene quality.)

Repeating Scenery (RS): Identical sequences of buildings, trees, or patterns in the background visibly repeat themselves where they shouldn’t. (An aesthetic and structural anomaly of the environment’s visual design.)

Missing or Simplified Detail (MD): Areas that should be complex (e.g., distant crowds, intricate architecture, foliage) appear unnaturally simplified, flat, blurry, or lacking expected detail. (Directly concerns the level of visual fidelity and completeness.)

Unnatural Color Palette (UP): The overall scene or specific elements display a color scheme that is jarringly artificial, overly uniform, or illogical compared to natural reality. (Aesthetic issue related to color accuracy.)

2. Objectness Abnormalities

This category covers anomalies related to the presence, form, and spatial integrity of objects within the scene.

Object Pop-in/Out (OP): Objects or environmental elements suddenly appear or disappear from the scene, especially at the edges of your vision. (Concerns the very presence/existence of objects.)

Unnatural Disproportion (UD): Body parts, objects, or features are unnaturally stretched, shrunken, or warped in their dimensions. (Directly alters an object’s expected form.)

Impossible Rigidity/Angles (IA): Objects or body parts hold physically impossible or extremely uncomfortable static positions or angles for too long. (Affects an object’s physical state and form.)

Sudden Form Morphing (SFM): An object or person visibly changes its fundamental shape or identity into something entirely different. (A drastic change in an object’s core visual/physical identity.)

Impossible Spatial Layout (SL): The physical arrangement of an environment defies logic (e.g., a hallway leads back to itself with no turns, a room has no discernible entrance/exit). (Breaks the logical “objectness” and spatial coherence of the environment as a large object.)

3. Time Abnormalities

This category is exclusively dedicated to aberrations directly affecting the perceived flow, speed, or sequence of time within the scene. These glitches distort the temporal reality of the observed events.

Desynchronized Time (DT): Different elements within the same field of view move at demonstrably different, unnatural speeds, are unreasonably fast or slow, or exhibit time cuts or sudden time freezes.

Instantaneous “Correction” (IC): After a minor glitch or displacement, an object or person immediately snaps back to its “correct” or initial state without fluid movement. (Abrupt, unnatural state/position changes for objects.) As if someone froze time and corrected the scene.

Instantaneous Stop/Start (ISS): Moving objects or people visibly halt from full speed to a complete stop, or instantly accelerate to full speed, with no visible period of deceleration or acceleration. (Abnormalities in object motion dynamics.)

Preceding Effect (PE): A visual consequence (e.g., a light turning on) is seen before its obvious physical cause (e.g., the switch being flipped). (Impacts how an object’s state or actions relate to cause and effect.)

4. Behavioural Abnormalities

This category focuses on contradictions related to the inherent physical characteristics of materials (solids, liquids, gases, etc.) and how they interact, deform, or react to forces in ways that defy their expected properties.

Clipping/Interpenetration (I): Solid objects or characters visibly pass through each other without collision or displacement. (While also a quality issue, it fundamentally breaks the physical integrity and interaction of objects as distinct entities.)

Unnatural Fluid Dynamics (UFD): Liquids visibly flow uphill, form impossible stable shapes, or fail to produce natural splashes or ripples. (Concerns the behavior of liquids.)

Impossible Material Resilience (IMR): Solid objects visibly bend like rubber, shatter into perfectly geometric pieces, or instantly reform after being broken, defying their inherent material properties. (Concerns the structural integrity and deformation of solids.)

Erratic Movement (EM): describes a scenario where moving objects or people visibly change direction at sharp, impossible angles without any arc or loss of speed, and show no realistic deformation, recoil, or transfer of force upon impact. This concerns the abnormal dynamics of object motion and the unexpected physical response of materials and objects during interactions.

Attempt #1 on classifying the categories

In summary, for visual cues focusing on physical properties, this list covers:

● How objects behave in space (gravity, trajectory, levitation).

● How objects move (momentum, acceleration, deceleration, direction change).

● How objects interact (collisions, friction, energy transfer).

● How objects are composed and deform (material properties, fluid dynamics, rigidity, elasticity).

● The fundamental order of events in the physical world (causality).

Time Flow Aberrations (T):

Sudden Time Freeze (STF): All motion in the scene abruptly halts for an unnatural duration, then potentially resumes.

Temporal Loop (TL): A specific, short sequence of events or movements repeats perfectly, like a video loop.

Desynchronized Movement (DM): Different elements within the same field of view move at demonstrably different, unnatural speeds, or unreasonably fast or slow (e.g., a car speeds by while falling rain appears unnaturally slow).

Reverse Motion (RM): Objects or people briefly move backward in time before continuing forward or stopping.

Rendering & Visual Glitches (G):

Visual Artifacting (VA): Random blocks, pixelation, or static-like distortions momentarily appear on surfaces or around objects.

Texture Popping (TP): The visual detail or quality of surfaces abruptly changes (e.g., a blurry texture suddenly sharpens, or vice versa) without a logical cause.

Lighting Flickering/Shifting (LS): Light sources or ambient lighting unnaturally flicker, change intensity, or cast shadows in illogical ways.

Object Pop-in/Out (OP): Objects or environmental elements suddenly appear or disappear from the scene, especially at the edges of your vision.

Clipping/Interpenetration (I): Solid objects or characters visibly pass through each other without collision or displacement.

Object & Form Distortions (O):

Unnatural Disproportion (UD): Body parts, objects, or features are unnaturally stretched, shrunken, or warped in their dimensions.

Impossible Rigidity/Angles (IA): Objects or body parts hold physically impossible or extremely uncomfortable static positions or angles for too long.

Rubber-banding/Snapping (RS): Objects or body parts visibly stretch and then abruptly snap back to their original form or position.

Sudden Form Morphing (SFM): An object or person visibly changes its fundamental shape or identity into something entirely different.

Environmental Inconsistencies (E):

Repeating Scenery (RS): Identical sequences of buildings, trees, or patterns in the background visibly repeat themselves where they shouldn’t.

Missing or Simplified Detail (MD): Areas that should be complex (e.g., distant crowds, intricate architecture, foliage) appear unnaturally simplified, flat, blurry, or lacking expected detail.

Unnatural Color Palette (UP): The overall scene or specific elements display a color scheme that is jarringly artificial, overly uniform, or illogical compared to natural reality.

Impossible Spatial Layout (SL): The physical arrangement of an environment defies logic (e.g., a hallway leads back to itself with no turns, a room has no discernible entrance/exit).

Behavioral Abnormalities of inhabitants/objects (B):

Robotic/Scripted Movement (SM): Non-player characters or background elements move in overly repetitive, synchronized, or unnaturally precise patterns.

Defiance of Physics (DP): Objects or people perform actions that visibly ignore fundamental physical laws like gravity, momentum, or inertia (e.g., a falling object stops mid-air without support, a person jumps an impossible distance).

Unresponsive Stasis (US) : Background characters or elements appear frozen or completely unresponsive to direct visual stimuli or events that should provoke a reaction.

Instantaneous “Correction” (IC) : After a minor glitch or displacement, an object or person immediately snaps back to its “correct” or initial state without fluid movement.

Impossible Gravitational Behavior (GB):

Unnatural Suspension (USp): Objects or people visually remain stationary in the air without any visible support or means of flight.

Inverted Trajectory (IT): Objects or people are seen to move upwards when they should naturally fall downwards.

Erratic Fall Speed (EFS): Falling objects visibly accelerate or decelerate unnaturally, or abruptly pause mid-air, defying consistent gravitational pull.

Momentum & Inertia Disruption (MID):

Instantaneous Stop/Start (ISS): Moving objects or people visibly halt from full speed to a complete stop, or instantly accelerate to full speed, with no visible period of deceleration or acceleration.

Abrupt Direction Change (DC): Moving objects or people visibly change direction at sharp, impossible angles without any arc or loss of speed.

Unending Glide (UG): Objects visibly slide continuously across surfaces without any apparent friction slowing them down.

Material & Interaction Contradictions (MIC):

Unnatural Fluid Dynamics (UFD): Liquids visibly flow uphill, form impossible stable shapes, or fail to produce natural splashes or ripples.

Impossible Material Resilience (IMR): Solid objects visibly bend like rubber, shatter into perfectly geometric pieces, or instantly reform after being broken, defying their inherent material properties.

Absent Impact Reaction (AIR): Colliding objects or people visibly show no realistic deformation, recoil, or transfer of force upon impact.

Causality Violation (C):

Preceding Effect (PE): A visual consequence (e.g., a light turning on) is seen before its obvious physical cause (e.g., the switch being flipped).

Common Physical Knowledge Graph:

I. Fundamental Physical Law Violations (Direct Contradictions of Core Laws)

Impossible Gravitational Behavior (GB):

Unnatural Suspension (USp):

Laws Related: Law of Universal Gravitation (objects with mass attract each other); Law of Support (objects require a continuous, opposing force to remain stationary against gravity).

Intrinsics: Mass, density (assumed to exist and be affected by gravity).

Extrinsics: Position (stable against gravitational pull), Velocity (zero, defying gravity’s constant acceleration).

Materials: All materials are subject to gravity.

Inverted Trajectory (IT):

Laws Related: Law of Universal Gravitation (defines direction of gravitational force); Newton’s Second Law of Motion (Force = mass × acceleration, meaning acceleration should be downwards due to gravity).

Intrinsics: Mass, density.

Extrinsics: Acceleration (visibly upwards, defying gravity’s direction); Velocity (upwards for falling objects).

Materials: All materials.

Erratic Fall Speed (EFS):

Laws Related: Law of Universal Gravitation (implies constant acceleration due to gravity in a vacuum, or predictable acceleration with air resistance); Newton’s Second Law of Motion.

Intrinsics: Mass, density (affecting air resistance).

Extrinsics: Acceleration (inconsistent or unnatural rate); Velocity (changing unpredictably).

Materials: Physical properties affecting air resistance (e.g., shape, surface area).

Momentum & Inertia Disruption (MID):

Instantaneous Stop/Start (ISS):

Laws Related: Newton’s First Law of Motion (Inertia: object at rest stays at rest, object in motion stays in motion with constant velocity unless acted upon by a net force); Newton’s Second Law of Motion (Force causes acceleration over time).

Intrinsics: Mass (determines inertia).

Extrinsics: Velocity (instantaneous change); Acceleration (appears infinite or instantaneous).

Materials: All materials (their mass is key).

Abrupt Direction Change (ADC):

Laws Related: Newton’s First Law of Motion (requires a force for a change in direction); Newton’s Second Law of Motion (force causes acceleration, implying a curved path over time for directional change).

Intrinsics: Mass.

Extrinsics: Velocity (instantaneous change in direction vector); Acceleration (appears infinite or instantaneous perpendicular to velocity).

Materials: All materials.

Unending Glide (UG):

Laws Related: Law of Friction (friction opposes relative motion between surfaces); Laws of Energy Dissipation (kinetic energy is typically converted to other forms like heat/sound over time due to friction).

Intrinsics: None specific beyond existence.

Extrinsics: Velocity (maintains indefinitely without external force); Force (absence of resistive forces).

Materials: Surface properties (coefficient of friction, texture).

Causality Violation:

Preceding Effect (PE):

Laws Related: Principle of Causality (cause must precede effect in time); Second Law of Thermodynamics (defines the arrow of time, irreversible processes).

Intrinsics: None specific, applies to events.

Extrinsics: Temporal order of events.

Materials: Not directly material-related, but event-related.

II. Object & Material Property Anomalies (Intrinsic & Reactive Physical Characteristics)

Unnatural Disproportion (UD):

Laws Related: Laws of Geometry and Rigid Body Physics (objects maintain defined dimensions and volumes); Conservation of Mass/Volume (often violated by stretching/shrinking).

Intrinsics: Dimensions (length, width, height), volume, shape, density (if mass is conserved).

Extrinsics: Scale.

Materials: Assumed to have consistent physical dimensions.

Impossible Rigidity/Angles (IR):

Laws Related: Laws of Statics and Kinematics (rigid bodies maintain shape, joints have specific range of motion); Material Rigidity/Flexibility.

Intrinsics: Rigidity, flexibility, specific joint limits (for articulated bodies).

Extrinsics: Orientation, internal forces/stresses.

Materials: Material stiffness (e.g., steel vs. cloth).

Rubber-banding/Snapping (RBS):

Laws Related: Laws of Elasticity (deformation and return to original shape within elastic limits); Laws of Material Strength (limits of deformation before breaking).

Intrinsics: Elasticity, material strength, cohesion.

Extrinsics: Position (sudden reset).

Materials: Elastic properties (e.g., rubber, spring steel, but applied impossibly).

Sudden Form Morphing (FM):

Laws Related: Conservation of Mass/Volume (often violated); Laws of Material Transformation (requires specific energy/processes for changes of state or form).

Intrinsics: Shape, dimensions, material composition (changing).

Extrinsics: None primarily.

Materials: All materials (violates their stable form/composition).

Unnatural Fluid Dynamics (UFD):

Laws Related: Principles of Fluid Dynamics (e.g., viscosity, surface tension, pressure, laminar/turbulent flow); Hydrostatics (fluids at rest); Law of Gravity (influencing fluid flow).

Intrinsics: Viscosity, density, surface tension.

Extrinsics: Flow rate, pressure, volume, form/shape of fluid.

Materials: Liquids and gases (e.g., water, smoke, steam).

Impossible Material Resilience (IMR):

Laws Related: Material Science (e.g., tensile strength, yield strength, ductility, brittleness, hardness); Laws of Force and Deformation.

Intrinsics: Strength, brittleness, elasticity, plasticity, hardness.

Extrinsics: Applied force.

Materials: All solids (e.g., glass, steel, wood, rock).

Absent Impact Reaction (AIR):

Laws Related: Newton’s Third Law (Action-Reaction: for every action, there is an equal and opposite reaction); Conservation of Momentum; Conservation of Energy (elastic vs. inelastic collisions).

Intrinsics: Mass, elasticity, rigidity (of colliding bodies).

Extrinsics: Force, momentum, kinetic energy, deformation, rebound angle/speed, sound/heat generation (visually inferred).

Materials: All materials involved in physical contact.

III. Spatial & Temporal Coherence Breaks (Consistency of Physical Space and Time)

Time Flow Aberrations (TF, TL, DM, RM):

Laws Related: All physical laws (presuppose a consistent, uniform, and generally unidirectional flow of time). Laws of Thermodynamics (define the “arrow of time”).

Intrinsics: None, apply to the fabric of spacetime.

Extrinsics: Temporal position/progression of events.

Materials: N/A (applies to the temporal framework).

Impossible Spatial Layout (ISL):

Laws Related: Laws of Euclidean Geometry and Topology (space is consistent, connected, and objects occupy unique positions relative to each other).

Intrinsics: None, applies to the spatial framework.

Extrinsics: Relative position, connectivity, dimensionality (implied).

Materials: N/A (applies to environmental arrangement).

Clipping/Interpenetration (CI):

Laws Related: Law of Impenetrability (two distinct physical objects cannot occupy the same space at the same time).

Intrinsics: Volume.

Extrinsics: Position (overlapping).

Materials: All solid materials.

Repeating Scenery (RSC):

Laws Related: Laws of Probability and Entropy (natural environments tend towards randomness and increasing disorder, not perfect, low-entropy repetition). Laws of Conservation of Information (a perfectly repeating pattern implies reduced information complexity).

Intrinsics: None.

Extrinsics: Spatial distribution of objects, variety.

Materials: N/A (applies to environmental arrangement).

IV. Dynamic Physical Behavior Anomalies (Unnatural Physical Movement & Response of Entities)

Robotic/Scripted Movement (RSM):

Laws Related: Principles of Biological Movement (complex, fluid, variable, adapted to environment).

Intrinsics: None specific, applies to animation/control of agents.

Extrinsics: Trajectory, rhythm, fluidity, reaction time, spontaneity of motion.

Materials: N/A (applies to animated agents).

Unresponsive Stasis (US):

Laws Related: Laws of Stimulus-Response (living beings respond to physical stimuli); Laws of Motion (objects persist in state unless acted upon, but this applies to response to forces/stimuli).

Intrinsics: None specific, applies to agent’s reactivity.

Extrinsics: Reaction to external physical forces/events.

Materials: N/A (applies to agents).

Instantaneous “Correction” (IC):

Laws Related: Conservation of Momentum/Energy (movement should be continuous, not instantaneous jumps); Laws of Continuity (physical processes are continuous).

Intrinsics: None.

Extrinsics: Position, velocity (instantaneous jump/teleportation).

Materials: N/A (applies to any object undergoing a “snap” back).

def forward(self, x: torch.Tensor) -> torch.Tensor:
    # x: [B,3,T,H,W] in [0,1] → normalize to [-1,1] for Wan VAE
    x = x * 2 - 1
    # VAE expects a list of [3,T,H,W] tensors; no grads through VAE
    with torch.no_grad():
        vids = [v for v in x]               # list length B
        mu_list = self.vae.encode(vids)     # list of [z_dim, T', H', W']
        mu = torch.stack(mu_list, dim=0)    # [B, z_dim, T', H', W']

    # Preserve temporal dynamics for objectness:
    # 1) Pool spatially to get per-frame latent summary
    mu_space = mu.mean(dim=(3, 4))          # [B, z_dim, T']
    # Perception fidelity proxies (before spatial pooling):
    # spatial variance (detail richness) and spatial max (sharp activations)
    mu_var = mu.var(dim=(3, 4))             # [B, z_dim, T']
    mu_spmax = mu.amax(dim=(3, 4))          # [B, z_dim, T']

    # 2) Temporal statistics (global): mean and std across T'
    f_mean = mu_space.mean(dim=2)           # [B, z_dim]
    f_std = mu_space.std(dim=2)             # [B, z_dim]
    # Temporal aggregation for perception proxies
    f_var_mean = mu_var.mean(dim=2)         # [B, z_dim]
    try:
        f_var_p90 = torch.quantile(mu_var, 0.9, dim=2)
    except Exception:
        k_var = max(1, int(0.9 * mu_var.size(2)))
        f_var_p90 = mu_var.topk(k_var, dim=2).values.min(dim=2).values
    f_spmax_mean = mu_spmax.mean(dim=2)     # [B, z_dim]
    try:
        f_spmax_p90 = torch.quantile(mu_spmax, 0.9, dim=2)
    except Exception:
        k_mx = max(1, int(0.9 * mu_spmax.size(2)))
        f_spmax_p90 = mu_spmax.topk(k_mx, dim=2).values.min(dim=2).values

    # 3) Temporal dynamics: first/second differences + extremal stats
    # Use safe fallbacks when T' is too short
    Tprime = mu_space.size(2)
    # f_dt (velocity) (first-order difference)
    if Tprime >= 2:
        dt = mu_space[:, :, 1:] - mu_space[:, :, :-1]   # [B, z_dim, T'-1]
        f_dt = dt.mean(dim=2)                            # [B, z_dim]
        abs_dt = dt.abs()
        f_dt_max = abs_dt.max(dim=2).values              # [B, z_dim]
        try:
            f_dt_p90 = torch.quantile(abs_dt, 0.9, dim=2)  # [B, z_dim]
        except Exception:
            k = max(1, int(0.9 * abs_dt.size(2)))
            f_dt_p90 = abs_dt.topk(k, dim=2).values.min(dim=2).values
    else:
        f_dt = torch.zeros_like(f_mean)
        f_dt_max = torch.zeros_like(f_mean)
        f_dt_p90 = torch.zeros_like(f_mean)

    # f_ddt (acceleration/jerk) (second-order difference)
    if Tprime >= 3:
        ddt = mu_space[:, :, 2:] - 2 * mu_space[:, :, 1:-1] + mu_space[:, :, :-2]  # [B, z_dim, T'-2]
        f_ddt = ddt.mean(dim=2)                          # [B, z_dim]
        abs_ddt = ddt.abs()
        f_ddt_max = abs_ddt.max(dim=2).values            # [B, z_dim]
        try:
            f_ddt_p90 = torch.quantile(abs_ddt, 0.9, dim=2)  # [B, z_dim]
        except Exception:
            k2 = max(1, int(0.9 * abs_ddt.size(2)))
            f_ddt_p90 = abs_ddt.topk(k2, dim=2).values.min(dim=2).values
    else:
        f_ddt = torch.zeros_like(f_mean)
        f_ddt_max = torch.zeros_like(f_mean)
        f_ddt_p90 = torch.zeros_like(f_mean)

    # 4) Multi-scale deltas (k=2,4)
    if Tprime >= 3:
        dt2 = mu_space[:, :, 2:] - mu_space[:, :, :-2]
        f_dt2 = dt2.mean(dim=2)
    else:
        f_dt2 = torch.zeros_like(f_mean)
    if Tprime >= 5:
        dt4 = mu_space[:, :, 4:] - mu_space[:, :, :-4]
        f_dt4 = dt4.mean(dim=2)
    else:
        f_dt4 = torch.zeros_like(f_mean)

    # 5) Attention pooling over time (simple content-based weights)
    att_logits = mu_space.abs().mean(dim=1)              # [B, T']
    att_weights = torch.softmax(att_logits, dim=1)       # [B, T']
    f_att = (mu_space * att_weights.unsqueeze(1)).sum(dim=2)  # [B, z_dim]

    # Final feature concat → [B, 15*z_dim]
    mu_vec = torch.cat([
        f_mean, f_std,
        f_dt, f_ddt,
        f_dt_max, f_ddt_max,
        f_dt_p90, f_ddt_p90,
        f_dt2, f_dt4,
        f_att,
        f_var_mean, f_var_p90,
        f_spmax_mean, f_spmax_p90
    ], dim=1)
    out = self.pred_head(mu_vec)
    return out

Many of these can be sensed by the latent-temporal descriptor above (especially objectness/time/behavioral spikes), but pure spatial-fidelity issues are weaker with full spatial pooling. Below is a quick mapping to the current features, plus gaps.

  • Visual Artifacting (VA)

    • Correlated: f_dt_max, f_dt_p90, f_ddt_max (brief spikes), f_std (temporal “noisiness”)
    • Gap: spatial localization/detail; consider adding per-frame spatial variance/max before spatial mean, or 1-in-k frame decoded SSIM/LPIPS.
  • Lighting Flickering/Shifting (LS)

    • Correlated: f_dt, f_ddt (global latent shifts), f_dt_p90; multi-scale f_dt2, f_dt4 for slower oscillations
    • Gap: color-specific evidence; decoded-frame luminance stats can help.
  • Repeating Scenery (RS)

    • Weak with current features (needs periodicity). Add temporal spectral energy/auto-correlation on mu_space.
  • Missing or Simplified Detail (MD)

    • Weak. Latent f_mean/f_std may drift but not reliable. Add spatial detail proxies (per-frame spatial variance/max of latents or decoded 1-in-k edge/texture metrics).
  • Unnatural Color Palette (UP)

    • Weak. f_mean bias shifts are crude. Consider decoded-frame color histograms or latent channel groups tied to chroma.
  • Object Pop-in/Out (OP)

    • Strong: f_dt_max, f_dt_p90, f_ddt_max (sudden appearance/disappearance), f_dt4 (step-like).
  • Unnatural Disproportion (UD)

    • Moderate/Strong: sustained drift in f_dt2/f_dt4, elevated f_std; spikes in f_dt/f_ddt during morph onsets.
  • Impossible Rigidity/Angles (IA)

    • Strong for onsets: f_ddt_max (jerk), f_dt peaks. Sustained poses are harder—attention pooling f_att may focus the static abnormal segment; spatial cues still help.
  • Sudden Form Morphing (SFM)

    • Strong: f_dt, f_dt_max, f_ddt_max; multi-scale f_dt2/f_dt4 for gradual identity drift; f_att to weight the morph window.
  • Impossible Spatial Layout (SL)

    • Weak with spatial pooling. Might show as unusual latent “state” (f_mean), but better with decoded-frame geometry (vanishing lines/depth/consistency checks).
  • Desynchronized Time (DT)

    • Moderate: broad f_std, mixed f_dt magnitudes; multi-scale deltas capture inconsistent rates. Per-object segmentation would help more.
  • Instantaneous “Correction” (IC)

    • Strong: f_ddt_max spike; often paired with large |dt| sign changes; p90 catches rare events.
  • Instantaneous Stop/Start (ISS)

    • Strong: f_ddt_max, f_dt_max; multi-scale deltas help discriminate true stops vs noise.
  • Preceding Effect (PE)

    • Weak: requires causality/ordering; not captured by magnitude-only deltas. Needs event detectors (cause vs effect streams) or cross-modal sync.
  • Clipping/Interpenetration (I)

    • Moderate: f_dt_max/f_ddt_max spikes when contact should occur; but spatial evidence is needed for high precision.
  • Unnatural Fluid Dynamics (UFD)

    • Moderate: sustained f_dt/f_dt2/f_dt4 patterns and elevated f_std; decoded sparse frames with simple flow/edge stats improve robustness.
  • Impossible Material Resilience (IMR)

    • Strong near impacts: f_ddt_max (jerk), f_dt patterns; multi-scale deltas capture rebound anomalies.
  • Erratic Movement (EM)

    • Strong: high |dt|, |ddt| spikes, peaks (max/p90), multi-scale deltas for sharp trajectory kinks.

Keep vs extend

  • Keep: f_dt/f_ddt (means), extremals (max/p90), multi-scale (k=2,4), and f_att. These carry most signal for objectness/time/behavioral anomalies.

  • Add for perception fidelity: simple per-frame spatial stats before pooling (e.g., latent spatial variance, spatial max), and optional 1-in-k decoded-frame metrics (SSIM/LPIPS or |frame_t−frame_t−1| mean).

  • Add for periodicity (RS): temporal FFT/ACF on mu_space to detect repeats (a minimal sketch follows this list).

  • Add for causality (PE): simple event detectors on separate latent channels or decoded cues to score cause→effect ordering.
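
A minimal sketch of the periodicity feature suggested above, reusing the mu_space tensor ([B, z_dim, T']) from forward(); the normalization and the number of lags are illustrative choices, not existing code:

import torch

def periodicity_features(mu_space: torch.Tensor, max_lag: int = 8) -> torch.Tensor:
    # mu_space: [B, z_dim, T'] spatially pooled latents.
    # Returns [B, max_lag] mean autocorrelation per temporal lag; a strong peak at a
    # nonzero lag suggests repeating content (Repeating Scenery).
    x = mu_space - mu_space.mean(dim=2, keepdim=True)    # remove per-channel mean
    denom = (x * x).sum(dim=2).clamp_min(1e-6)           # [B, z_dim] energy for normalization
    feats = []
    for lag in range(1, max_lag + 1):
        if x.size(2) > lag:
            ac = (x[:, :, lag:] * x[:, :, :-lag]).sum(dim=2) / denom  # normalized autocorrelation at this lag
        else:
            ac = torch.zeros_like(denom)
        feats.append(ac.mean(dim=1))                     # average over latent channels -> [B]
    return torch.stack(feats, dim=1)                     # [B, max_lag]

These could be concatenated into mu_vec alongside the existing features.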

Keep these 6 (drop the rest):

  • f_ddt_max: maps to IC, ISS/EM, SFM (jerk spikes)

  • f_dt_max: maps to OP, SFM, VA spikes

  • f_dt_p90: robust to noise; maps to OP/SFM/VA transient bursts

  • f_dt4: long-stride change; maps to DT and slow morph/drift

  • f_dt2: medium-stride change; complements dt4 for DT and gradual deformations

  • f_var_p90: perception fidelity; maps to VA (detail/texture instability)

Remove:

  • f_mean, f_std, f_ddt, f_ddt_p90, f_spmax_mean, f_spmax_p90, and f_att (f_dt2/f_dt4 and f_dt_max are already in the keep list; drop any other multi-scale deltas or redundant extremals if preferred). The concatenation for the kept set is sketched below.
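
In code, the reduced descriptor is just the corresponding concatenation, reusing the variables computed in forward(); the prediction head's input dimension would change from 15*z_dim to 6*z_dim:

import torch

def reduced_descriptor(f_ddt_max, f_dt_max, f_dt_p90, f_dt4, f_dt2, f_var_p90):
    # Each input is [B, z_dim]; output is [B, 6*z_dim].
    return torch.cat([f_ddt_max, f_dt_max, f_dt_p90, f_dt4, f_dt2, f_var_p90], dim=1)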