Notes on Playable World Models
Playable World Models
| Work / System | Main datasets / environments | Core architecture (high level) | Key interaction signals | Notable features / goals | Real‑time suitability |
|---|---|---|---|---|---|
| Matrix‑Game 2.0 | Internal “Matrix‑Game 2.0 Unreal Dataset” (Unreal Engine worlds with PPO agents), plus Minecraft, Sekai, GTA V driving logs, and Temple Run game data; these are described as in‑house collections, not public named benchmarks. | Teacher: SkyReelsV2‑style large I2V diffusion transformer (Wan 2.1–like) with 3D VAE tokenizer; student: distilled few‑step causal diffusion transformer with KV‑cache. | Frame‑level keyboard and mouse/camera actions. | Open‑source streaming interactive world model; action‑conditioned minute‑long videos; multi‑domain environments; focuses on stability and data/engineering pipeline end‑to‑end. | Yes, ~25 FPS with 3–4 steps and caching. |
| Genie 3 | Original Genie uses a large web‑scale video dataset and game‑like videos, described generically as “internet videos” without specific benchmark names; Genie 3 continues with similar large‑scale video corpora plus internal interactive trajectories. | Spatiotemporal VQ‑VAE video tokenizer + large autoregressive dynamics model with latent action model and visual memory. | Latent “gamepad‑like” controls; environment‑conditioned actions at latent level (details vary across versions). | General‑purpose world model from text/image to interactive environments; supports a wide range of styles and physics‑like behavior; focuses on diversity of worlds and minutes‑scale consistency. | Yes, real‑time navigation at about 24 FPS at 720p. |
| HY‑World 1.5 | Base: HunyuanVideo‑1.5 training corpus (Tencent’s web‑scale text‑video dataset); plus the internal HY‑World‑1.5 interactive dataset of real and stylized scenes and 3D reconstructions (introduced in the tech report, not a standard benchmark). | HY‑WorldPlay streaming video diffusion built on HunyuanVideo‑1.5 backbone; Context Forcing and Reconstituted Context Memory distillation. | Dual action representation for keyboard and mouse, plus camera trajectories and text prompts. | Open‑source framework from data collection to deployment; emphasizes long‑term geometric consistency, RL post‑training (WorldCompass), and memory‑aware distillation (Context Forcing) for stable long‑horizon control. | Yes, real‑time streaming video at about 24 FPS with long‑horizon consistency. |
| Yume 1.5 | Pretraining on generic web‑scale text‑video datasets (no specific benchmark names given), then an internal Yume‑1.5 interaction dataset of keyboard‑controlled worlds from single images/text prompts; the paper/blog does not name standard public datasets. | Temporal Sparse Context Memory (TSCM) long‑video generator distilled to a real‑time streaming world model. | Keyboard (WASD‑style) for camera control and exploration; text for world events and control. | Text‑controlled, interactive world generation from one image or prompt; focuses on long‑video continuity, context compression, and text‑driven events while remaining real‑time. | Yes, real‑time or near‑real‑time streaming after acceleration/distillation. |
| LingBot‑World | Hybrid internal dataset combining Unreal Engine environments (no specific map name) and “real‑world video corpora” for urban/indoor/driving scenes; public materials only refer to them generically. | Two‑expert diffusion video model (high‑noise and low‑noise experts) with LingBot‑World‑Base and LingBot‑World‑Fast checkpoints. | Keyboard and mouse for character/camera control; text commands for events and environment changes (e.g., weather, style). | Real‑world oriented world model; single image or screenshot to interactive world; supports text‑triggered events while maintaining spatial consistency; open‑sourced for low‑latency deployment. | Yes, about 16 FPS with end‑to‑end interaction latency under one second. |
LingBot-World (Jan 26’)
Advancing Open-source World Models
LingBot-World is an open-source, high-capacity world simulator built by evolving a large video diffusion model into an interactive, real-time world model with long-horizon consistency and explicit action control.
- Built on Wan2.2 + Qwen3-VL-2B
- Real-time interactive simulation at 720p@16fps with <1 s latency
- Minute-long contextual memory
Data Engine
- Uses a hybrid data pipeline combining real videos, game-engine data with synchronized actions (e.g., WASD, camera), and Unreal Engine synthetic renders with accurate poses and rich trajectories.
- Profiles data via basic filtering, segmentation, semantic analysis with VLMs, and pseudo camera pose estimation (e.g., MegaSAM) to provide geometric labels.
- Adds hierarchical captions per clip: narrative (global story), scene-static (environment only), and dense temporal captions for fine-grained time-aligned supervision and motion/scene decoupling.
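A minimal sketch of what one annotated clip record under this data engine could look like; the schema, field names, and example values are illustrative assumptions, not the released format.

```python
from dataclasses import dataclass, field

@dataclass
class ClipAnnotation:
    """Hypothetical per-clip record for the data engine described above."""
    clip_id: str
    source: str                                   # e.g. "real", "game_engine", "unreal_synthetic"
    narrative_caption: str                        # global story for the whole clip
    scene_caption: str                            # static environment description only
    temporal_captions: list = field(default_factory=list)   # (start_s, end_s, text) tuples
    camera_poses: list = field(default_factory=list)         # per-frame poses, e.g. from MegaSAM
    actions: list = field(default_factory=list)               # per-frame {"keys": [...], "cam_delta": [...]}

record = ClipAnnotation(
    clip_id="ue_city_00042",
    source="unreal_synthetic",
    narrative_caption="The camera walks down a rainy street toward a neon sign.",
    scene_caption="Night-time urban street with wet asphalt and glowing storefronts.",
    temporal_captions=[(0.0, 2.5, "camera moves forward"), (2.5, 4.0, "camera pans left")],
)
print(record.clip_id, len(record.temporal_captions))
```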
Model Formulation and Training Stages
- Formulates a conditional generative world model over future visual states given past frames and actions.
- Stage I (Pre-training): initializes from Wan2.2, a 14B image-to-video diffusion model, to get strong open-domain video priors (realism, temporal coherence).
- Stage II (Middle-training):
- Trains a 28B MoE diffusion “fundamental world model” (two 14B experts for high-noise and low-noise phases) on longer videos with curriculum from 5 s to 60 s, plus image-to-video and continuation tasks, to build long-term consistency and spatial memory.
- Adds action-conditioning via a hybrid action representation (Plücker-based continuous camera rotation + discrete keyboard vectors) injected with AdaLN into DiT blocks, while freezing the backbone for parameter-efficient finetuning (a minimal sketch of this injection follows this list).
- Stage III (Post-training):
- Adapts the bidirectional teacher to a causal autoregressive generator using block causal attention and diffusion forcing, enabling KV-cached streaming generation.
- Distills to a few-step generator with distribution matching distillation plus an adversarial head; training includes self-rollout long-horizon regimes to reduce drift and improve visual quality while preserving action-following.
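A minimal sketch of the Stage II action injection (the AdaLN bullet above), assuming a simple fused embedding of discrete keys and continuous camera features; the module name, dimensions, and the 6-D camera feature are illustrative, not LingBot-World's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionAdaLN(nn.Module):
    """Hypothetical AdaLN-style action injection for one DiT block.

    A discrete keyboard embedding and a continuous camera feature (a stand-in
    for Plücker-based rotation features) are fused and mapped to per-block
    scale/shift/gate terms that modulate the frozen DiT activations.
    """
    def __init__(self, hidden_dim=1024, num_keys=8, cam_dim=6):
        super().__init__()
        self.key_embed = nn.Embedding(num_keys, hidden_dim)   # W/A/S/D, jump, etc.
        self.cam_proj = nn.Linear(cam_dim, hidden_dim)        # continuous camera features
        self.to_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(hidden_dim, 3 * hidden_dim),            # -> scale, shift, gate
        )

    def forward(self, x, key_ids, cam_feats):
        # x: (B, N, D) DiT tokens; key_ids: (B,); cam_feats: (B, cam_dim)
        action = self.key_embed(key_ids) + self.cam_proj(cam_feats)
        scale, shift, gate = self.to_modulation(action).unsqueeze(1).chunk(3, dim=-1)
        h = F.layer_norm(x, x.shape[-1:])                     # parameter-free LayerNorm
        return x + gate * (h * (1 + scale) + shift)

inject = ActionAdaLN()
tokens = torch.randn(2, 4096, 1024)
out = inject(tokens, key_ids=torch.tensor([0, 3]), cam_feats=torch.randn(2, 6))
print(out.shape)  # torch.Size([2, 4096, 1024])
```

In practice such modulation layers are often zero-initialized (as in DiT's AdaLN-Zero) so finetuning starts from the unmodified frozen backbone.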
Architecture and Inference
- Uses DiT-style blocks over video latents: self-attention for spatiotemporal coherence and emergent spatial memory, AdaLN-modulated action injection via Plücker encoder, and cross-attention on text embeddings.
- Distinguishes LingBot-World-Base (high-fidelity, middle-trained teacher) from LingBot-World-Fast (post-trained, real-time variant) that trades little perceptual quality for speed.
Empirical Properties and Comparisons
- Qualitative samples show high-fidelity, coherent sequences across many domains for both Base and Fast variants.
- Demonstrates emergent long-term memory: structures (e.g., Stonehenge) remain consistent after being out of view for up to 60 seconds, and the model updates unobserved states plausibly (bridge getting closer, car continuing off-screen).
- Generates ultra-long videos up to 10 minutes while preserving coherence.
Yume-1.5 (Dec 25’)
Yume1.5: A Text-Controlled Interactive World Generation Model
Yume-1.5 is a video diffusion–based model for generating explorable first-person “worlds” from text or a single image, with real-time keyboard control of camera and ego-motion and optional text-driven event edits.
- The model turns a text prompt or image into a persistent virtual world that can be explored autoregressively with W/A/S/D and arrow-key inputs controlling movement and camera.
- It supports three modes: text-to-world, image-to-world, and text-based event editing (e.g., “a ghost appeared”), all in the same framework.
Core technical contributions
- Joint Temporal–Spatial–Channel Modeling (TSCM): Historical frames are compressed along time, space, and channels, combining standard and linear attention so that context length can grow without blowing up memory or latency.
- Self-Forcing–style acceleration: They distill a multi-step diffusion teacher into a few-step generator while feeding the model its own generated history (with TSCM instead of KV cache) to reduce error accumulation in long videos and enable ~4-step inference.
- Text-controlled actions and events: Captions are split into an “Event description” (semantic scene/event) and an “Action description” (discrete keyboard/mouse controls), encoded separately with T5 and concatenated, allowing cached action embeddings and efficient control.
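A small sketch of the split event/action conditioning in the last bullet. The encoder below is a stub standing in for a frozen T5 text encoder and the token ids are dummies; the point is that the small, fixed action vocabulary lets action embeddings be precomputed and cached, while event text is encoded per prompt.

```python
import torch
import torch.nn as nn

class StubTextEncoder(nn.Module):
    """Placeholder for a frozen T5 encoder: token ids -> sequence of features."""
    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):                 # (B, L) -> (B, L, dim)
        return self.embed(token_ids)

encoder = StubTextEncoder()

# Cache embeddings for the discrete action phrases once (dummy token ids).
ACTION_PHRASES = {"W": torch.tensor([[101, 7]]), "A": torch.tensor([[101, 8]]),
                  "S": torch.tensor([[101, 9]]), "D": torch.tensor([[101, 10]])}
with torch.no_grad():
    action_cache = {key: encoder(ids) for key, ids in ACTION_PHRASES.items()}

def build_condition(event_token_ids, action_key):
    """Concatenate the per-prompt event embedding with a cached action embedding."""
    event_emb = encoder(event_token_ids)          # (1, L_event, dim)
    action_emb = action_cache[action_key]         # (1, L_action, dim), precomputed
    return torch.cat([event_emb, action_emb], dim=1)

cond = build_condition(torch.randint(0, 32000, (1, 16)), "W")
print(cond.shape)  # torch.Size([1, 18, 512])
```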
Data and training
- Training uses three components:
- A real-world dataset (Sekai-Real-HQ) with walking videos, from which continuous trajectories are converted into discrete vocabularies for camera and human motion (e.g., W, A, S, D and arrow combinations); a toy discretization sketch follows this list.
- A synthetic video set built from OpenVid captions rendered with Wan 2.x and filtered by VBench scores to retain 50k high-quality samples.
- A curated event dataset (≈4k image-to-video clips) covering urban, sci-fi, fantasy, and weather events to improve event-specific text control.
- They alternate text-to-video and image-to-video batches when training the “foundation” model, then apply distillation/self-forcing with TSCM for fast inference.
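A toy version of the trajectory discretization mentioned for Sekai-Real-HQ above: per-step translation and rotation deltas are thresholded into keyboard/arrow tokens. The thresholds and the exact token set are assumptions, not the paper's recipe.

```python
import math

def discretize_step(dx, dz, dyaw, dpitch, move_thresh=0.02, turn_thresh=math.radians(1.0)):
    """Map one trajectory step to keyboard/arrow tokens.

    dx/dz are lateral/forward translation in the camera frame (metres),
    dyaw/dpitch are rotation deltas (radians).
    """
    keys = []
    if dz > move_thresh:   keys.append("W")
    if dz < -move_thresh:  keys.append("S")
    if dx < -move_thresh:  keys.append("A")
    if dx > move_thresh:   keys.append("D")
    if dyaw > turn_thresh:    keys.append("ARROW_RIGHT")
    if dyaw < -turn_thresh:   keys.append("ARROW_LEFT")
    if dpitch > turn_thresh:  keys.append("ARROW_UP")
    if dpitch < -turn_thresh: keys.append("ARROW_DOWN")
    return keys or ["NOOP"]

# Walking forward while turning slightly right:
print(discretize_step(dx=0.0, dz=0.05, dyaw=math.radians(2), dpitch=0.0))  # ['W', 'ARROW_RIGHT']
```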
Limitations and outlook
- The model can still generate artifacts such as reversed vehicle or human motion and degrades in very crowded scenes; higher resolution helps but does not fully fix these issues.
- The authors attribute limits partly to the 5B parameter budget and suggest MoE-style scaling as a future direction, along with richer interactive behaviors and broader simulation use cases.
HY-World 1.5 (Dec 25’)
HY-World 1.5 (WorldPlay + WorldCompass) is a world-modeling framework that turns a single image or text prompt into a real-time, interactive, traversable 3D-like world, generating streaming video at 24 FPS while maintaining long-horizon geometric consistency across diverse real and stylized scenes.
- Treats interactive world modeling as next-chunk (16-frame) video prediction conditioned on user actions (keyboard/mouse) to provide instant visual feedback.
- Supports first- and third-person views, AAA-style game scenes, real 3D captures, synthetic 4D content, promptable events, and 3D reconstruction outputs usable by systems like WorldMirror.
Data and Pre-training
- Trains on 320K curated clips: 170K AAA game recordings, 60K real-world 3D (DL3DV + 3DGS), 50K synthetic 4D from Unreal, and 40K real-world Sekai videos, with automated quality and motion filtering plus rich captions, camera poses, and action labels.
- Starts from a bidirectional video diffusion model (3D VAE + DiT with flow-matching objective), then converts it into a chunk-wise autoregressive generator with causal masking to support infinite-length interactive generation.
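A minimal sketch of the chunk-wise causal masking in the last bullet: tokens attend bidirectionally within their own 16-frame chunk and only causally across chunks. The token layout is illustrative.

```python
import torch

def chunk_causal_mask(num_frames, chunk_size=16, tokens_per_frame=1):
    """Boolean attention mask for chunk-wise autoregressive generation.

    Frames within the same chunk attend to each other bidirectionally;
    across chunks, attention is causal (a chunk sees only itself and
    earlier chunks). mask[i, j] is True where token i may attend to token j.
    """
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    chunk_idx = frame_idx // chunk_size
    return chunk_idx.unsqueeze(1) >= chunk_idx.unsqueeze(0)

mask = chunk_causal_mask(num_frames=48, chunk_size=16)
print(mask.shape, mask[0, 47].item(), mask[47, 0].item())  # torch.Size([48, 48]) False True
```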
Core Technical Contributions
- Dual action representation: Combines discrete actions (keys/mouse) and continuous camera poses via PRoPE so the model has both robust, scale-invariant control and accurate spatial localization for memory retrieval.
- Reconstituted context memory: Builds per-chunk context from recent temporal memory and sampled spatial memory (selected by FOV overlap and distance), then uses temporal reframing to reassign RoPE indices so far-past but geometrically important frames stay influential (a toy selection sketch follows this list).
- WorldCompass RL: An RL post-training framework with clip-level rollouts, complementary rewards (action-following + visual quality), and DiffusionNFT-based updates to improve long-horizon action following and reduce artifacts under complex composite actions.
- Context Forcing distillation: Aligns teacher and student memory contexts when distilling a memory-augmented bidirectional teacher into a fast autoregressive student, enabling 4-step denoising with preserved long-term consistency and reduced error accumulation.
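A toy version of the reconstituted-context-memory bullet above, assuming a camera position and viewing direction are stored per frame; the overlap score (viewing-direction cosine minus a distance penalty) and the contiguous re-indexing are crude stand-ins for the report's FOV-overlap selection and RoPE temporal reframing.

```python
import torch
import torch.nn.functional as F

def build_context(cam_pos, cam_dir, cur_pos, cur_dir, recent=8, k=4):
    """Keep the `recent` newest frames as temporal memory and sample `k` older
    frames as spatial memory, then re-index the selected frames contiguously."""
    n = cam_pos.shape[0]                                   # stored frames, ordered by time
    temporal = list(range(max(0, n - recent), n))
    old = torch.arange(0, max(0, n - recent))
    spatial = []
    if len(old) > 0:
        overlap = F.cosine_similarity(cam_dir[old], cur_dir.expand(len(old), -1), dim=-1)
        dist = (cam_pos[old] - cur_pos).norm(dim=-1)
        score = overlap - 0.1 * dist                       # crude FOV-overlap proxy
        spatial = old[score.topk(min(k, len(old))).indices].tolist()
    context = sorted(spatial) + temporal
    # "Temporal reframing": contiguous positional indices so far-past spatial
    # memory is not drowned out by a huge RoPE time gap.
    rope_index = {frame: i for i, frame in enumerate(context)}
    return context, rope_index

pos = torch.randn(32, 3)
dirs = F.normalize(torch.randn(32, 3), dim=-1)
context, rope_index = build_context(pos, dirs, cur_pos=pos[-1], cur_dir=dirs[-1])
print(context)
```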
Inference and Interaction Features
- Uses mixed parallelism across 8 GPUs, streaming deployment with progressive VAE decoding, quantization, SageAttention, and KV cache to achieve real-time latency for high-resolution streaming.
- Supports text-based event triggering during generation (e.g., adding objects, changing weather, starting explosions, character behaviors) for interactive storytelling and environment manipulation, beyond basic navigation control.
Matrix-Game 2.0 (Aug 25’)
Matrix-game 2.0: An open-source real-time and streaming interactive world model
Matrix-Game 2.0 is an open-source, real-time interactive world model that generates minute-long videos on the fly at 25 FPS using a few-step auto-regressive diffusion transformer conditioned on fine-grained keyboard and mouse actions, without any text input.
- Existing interactive video/world models rely on bidirectional attention and many diffusion steps, making them too slow, memory-heavy, and error-prone for real-time, long-horizon interaction.
- They also lack large, high-quality interactive datasets with precise action and camera annotations.
- Matrix-Game 2.0 targets real-time, streaming, action-conditioned video generation that remains stable over minutes and generalizes across multiple environments.
Data Pipelines
- The authors build a scalable data pipeline in Unreal Engine with: navigation-mesh path planning, RL-enhanced agents (PPO) with collision avoidance and exploration/diversity rewards, precise multi-key and camera logging, and filtering of redundant/invalid frames, all running in a multi-threaded setup.
- A complementary GTA5 system uses Script Hook plugins, OBS recording, synchronized CSV logs of mouse/keyboard, and configurable environment parameters (NPC density, vehicle density, weather, time-of-day) to capture complex interactive driving scenes; a sketch of frame/action alignment follows this list.
- Overall, the pipeline produces about 800 hours of action-annotated training video (Minecraft, Unreal, Sekai), plus hundreds of hours of GTA driving and Temple Run data for further finetuning, totaling over 1.2M clips with >99% data accuracy.
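A small sketch of aligning recorded video frames with the synchronized CSV action logs from the GTA5 pipeline above; the column names, file name, and nearest-previous-timestamp rule are assumptions for illustration.

```python
import bisect
import csv

def load_action_log(path):
    """Load a synchronized action log; columns (timestamp_ms, keys, mouse_dx, mouse_dy) are hypothetical."""
    rows = []
    with open(path, newline="") as f:
        for r in csv.DictReader(f):
            rows.append((int(r["timestamp_ms"]), r["keys"], float(r["mouse_dx"]), float(r["mouse_dy"])))
    rows.sort(key=lambda row: row[0])
    return rows

def actions_for_frames(frame_times_ms, log):
    """Give each video frame the most recent logged action at or before its timestamp."""
    times = [t for t, *_ in log]
    aligned = []
    for ft in frame_times_ms:
        i = bisect.bisect_right(times, ft) - 1
        aligned.append(log[max(i, 0)][1:])        # (keys, mouse_dx, mouse_dy)
    return aligned

# Usage (25 FPS video: frame k sits at k * 40 ms):
# frame_actions = actions_for_frames([k * 40 for k in range(1500)], load_action_log("session_0001.csv"))
```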
Model Architecture and Training
- The foundation model is a 3D causal VAE + diffusion transformer initialized from SkyReelsV2 I2V (Wan 2.1 style), with text conditioning removed and CLIP image features plus 3D VAE latents used as visual conditions.
- Frame-level actions are injected as two streams: continuous mouse actions concatenated to latents and processed with temporal self-attention; discrete keyboard actions queried via cross-attention with RoPE for long temporal horizons.
- They then distill this bidirectional model into a causal, few-step auto-regressive student via Self-Forcing: initialize from ODE trajectories, then train with DMD-based distribution matching where the student conditions on its own generated history, combined with KV-cache for efficient long-context generation.
Performance and Comparisons
- On Minecraft, Matrix-Game 2.0 outperforms Oasis in image quality, aesthetics, temporal consistency, and action controllability, and avoids the rapid quality collapse seen in Oasis during long rollouts.
- On wild scenes, it matches or slightly exceeds YUME in visual/temporal metrics while being fast enough for interactive use; YUME suffers artifacts and saturation over long sequences.
- Ablations show that a moderate KV-cache window (6 frames) yields better long-term quality than larger caches, and that combining VAE caching, halved action modules, and reducing denoising steps from 4 to 3 achieves about 25 FPS with similar quality.
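A minimal sketch of the sliding KV-cache idea behind that ablation: only the last few frames' keys and values are kept for attention. The tensor shapes and simple eviction rule are illustrative.

```python
import torch

class SlidingKVCache:
    """Keep only the last `window` frames of attention keys/values."""
    def __init__(self, window=6):
        self.window = window
        self.keys, self.values = [], []           # one (heads, tokens, dim) tensor per frame

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.window:          # evict the oldest frame
            self.keys.pop(0)
            self.values.pop(0)

    def context(self):
        # Concatenate cached frames along the token axis for attention.
        return torch.cat(self.keys, dim=1), torch.cat(self.values, dim=1)

cache = SlidingKVCache(window=6)
for frame in range(10):
    cache.append(torch.randn(8, 880, 64), torch.randn(8, 880, 64))
k, v = cache.context()
print(k.shape)  # torch.Size([8, 5280, 64]) -- only the last 6 frames remain
```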
Capabilities and Limitations
- The system can generate high-quality, minute-level videos interactively across diverse domains (Minecraft, wild images, GTA driving, Temple Run) while closely following user action sequences.
- Limitations include weaker generalization to strongly out-of-domain scenes (e.g., extreme camera motions), relatively low output resolution (352×640), and challenges in preserving very long-term content consistency due to limited explicit memory.
Genie (Feb 24’)
Genie: Generative Interactive Environments
The paper introduces Genie, a large (≈11B parameter) generative world model that turns single text or image prompts (including sketches and real photos) into controllable, game-like interactive environments learned purely from unlabeled Internet video.
Core idea and architecture
- Genie is trained on video-only data (no action or text labels) and learns a discrete latent action space that lets users control dynamics frame by frame, essentially “playing” generated worlds.
- The system has three main components: a latent action model (LAM) that infers discrete actions between frames, a video tokenizer that compresses videos into VQ-VAE tokens using a spatiotemporal transformer (ST-ViViT), and a MaskGIT-based dynamics model that predicts future tokens conditioned on past tokens and latent actions.
- All components use a memory-efficient spatiotemporal transformer with separate spatial and temporal attention, making cost scale linearly in the number of frames instead of quadratically.
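A minimal sketch of such a factorized spatiotemporal block, built from two standard attention layers; sizes and normalization placement are illustrative rather than Genie's exact design.

```python
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    """Factorized spatiotemporal transformer block.

    Spatial attention mixes the S tokens within each frame; temporal attention
    mixes the same spatial position across the T frames. The spatial part costs
    one SxS attention per frame (linear in T), and the temporal part attends over
    only T tokens per position, instead of full attention over all T*S tokens.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, T, S, D) -- batch, frames, spatial tokens per frame, channels
        B, T, S, D = x.shape
        xs = x.reshape(B * T, S, D)                        # spatial attention within each frame
        h = self.norm1(xs)
        xs = xs + self.spatial_attn(h, h, h, need_weights=False)[0]
        xt = xs.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        h = self.norm2(xt)                                 # temporal attention per spatial position
        xt = xt + self.temporal_attn(h, h, h, need_weights=False)[0]
        return xt.reshape(B, S, T, D).permute(0, 2, 1, 3)  # back to (B, T, S, D)

block = FactorizedSTBlock()
print(block(torch.randn(2, 16, 64, 256)).shape)  # torch.Size([2, 16, 64, 256])
```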
Latent actions and controllability
- The LAM takes past frames and the next frame, then learns discrete actions via a VQ-VAE objective with a small action codebook (|A|=8 by default) to keep control human-playable and semantically consistent (e.g., left, right, jump, no-op).
- At inference, users select discrete latent actions which index the learned codebook; the dynamics model rolls forward tokenized frames under these actions, and the tokenizer’s decoder renders the video.
- Controllability is measured by ΔtPSNR: the PSNR difference between reconstructions driven by inferred actions from ground truth vs random latent actions, with higher values indicating stronger action influence.
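The metric itself is simple to write down; a small sketch, with frame shapes chosen arbitrarily:

```python
import torch

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two images/videos scaled to [0, max_val]."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def delta_t_psnr(ground_truth_t, recon_inferred_actions_t, recon_random_actions_t):
    """Delta_t PSNR as described above: how much worse the frame at time t gets when
    driven by random latent actions instead of actions inferred from ground truth.
    Higher values mean the latent actions have more influence."""
    return psnr(ground_truth_t, recon_inferred_actions_t) - psnr(ground_truth_t, recon_random_actions_t)

gt = torch.rand(3, 90, 160)
score = delta_t_psnr(gt, gt + 0.01 * torch.randn_like(gt), gt + 0.1 * torch.randn_like(gt))
print(float(score))  # positive when inferred actions track ground truth better
```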
Training data, scaling, and performance
- Main training data is a curated 30k-hour “Platformers” dataset of 2D platformer gameplay videos (6.8M clips of 16 s at 10 FPS, 160×90), obtained by filtering and quality-classifying an initial 244k-hour scrape.
- A second “Robotics” dataset combines RT1 robot demos and related simulation/real-robot videos, again used without action labels.
- Scaling experiments over dynamics models from 40M to 2.7B parameters and larger batch sizes show smooth loss improvements, motivating the final ≈10.7B-parameter Genie model (10.1B dynamics + tokenizer + LAM) trained on ≈942B tokens.
- The ST-ViViT tokenizer outperforms spatial-only ViT and more expensive C‑ViViT in both video quality (FVD) and controllability (ΔtPSNR) at similar parameter counts.
Using Genie for agents
- A frozen LAM trained on Internet videos can be applied to unseen expert trajectories (e.g., CoinRun) to label them with latent actions, enabling behavioral cloning from observation without environment-specific action labels.
- Mapping latent actions to real environment actions using a small labeled dataset (≈200 expert samples) yields performance matching an oracle BC agent that has full access to expert action labels, in both easy and hard CoinRun settings.
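One simple way such a mapping could be realized (the paper's exact procedure may differ): count co-occurrences between inferred latent actions and ground-truth actions on the small labeled set, and map each latent code to its most frequent real action.

```python
from collections import Counter, defaultdict

def fit_latent_to_real(latent_actions, real_actions):
    """Map each discrete latent action id to the real action it most often co-occurs with."""
    counts = defaultdict(Counter)
    for la, ra in zip(latent_actions, real_actions):
        counts[la][ra] += 1
    return {la: c.most_common(1)[0][0] for la, c in counts.items()}

# ~200 labeled expert samples: latent ids inferred by the frozen LAM, real env actions.
latent = [0, 0, 1, 2, 1, 0, 2, 2]
real   = ["LEFT", "LEFT", "RIGHT", "JUMP", "RIGHT", "LEFT", "JUMP", "JUMP"]
print(fit_latent_to_real(latent, real))  # {0: 'LEFT', 1: 'RIGHT', 2: 'JUMP'}
```

With the mapping fixed, a policy that predicts latent actions from observations can act in the real environment by looking up the corresponding real action.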
Limitations, impact, and reproducibility
- Current limitations include short effective temporal memory (16-frame context), autoregressive hallucinations, and slow inference (~1 FPS), which hinder long-horizon consistency and real-time interaction.
- The authors see potential for large-scale training on broader Internet video to create general foundation world models and richer training grounds for generalist RL agents, but they withhold weights and training data for now, citing safety and responsible release concerns.
- To aid reproducibility, they provide a smaller CoinRun-based case study: a single-TPU/GPU setup with reduced tokenizer, LAM, and dynamics sizes that still yields playable latent actions within about a week of training.





