Paper Review - Post-SD3 era Image Generation models
NExT-GPT (ICML 2024 Oral)
NExT-GPT: Any-to-Any Multimodal LLM
- MLLM-based Image generation / editing.

NExT-GPT integrates diffusion decoders and an LLM in a three-tier architecture to enable any-to-any multimodal understanding and generation. Here’s a breakdown (a minimal code sketch follows the list):
- Multimodal Encoding Stage:
  - It starts by leveraging established encoders (like CLIP or ImageBind) to process inputs from various modalities (text, image, video, audio).
  - These representations are then projected into a “language-like” format that the LLM can understand, using a projection layer.
- LLM Understanding and Reasoning Stage:
  - An open-source LLM (specifically Vicuna 7B-v0) acts as the central agent.
  - It takes the language-like representations from the encoding stage and performs semantic understanding and reasoning.
  - The LLM then outputs two things:
    - Direct textual responses.
    - “Modality signal tokens” for each desired output modality. These tokens serve as instructions for the decoding layers, dictating what multimodal content to generate.
- Multimodal Generation Stage (Decoding):
  - The modality signal tokens, carrying specific instructions from the LLM, are passed to Transformer-based output projection layers.
  - These projection layers map the signal tokens into representations understandable by the multimodal decoders.
  - Finally, off-the-shelf latent-conditioned diffusion models (Stable Diffusion for images, Zeroscope for video, AudioLDM for audio) receive these representations and generate the content in the corresponding modalities.
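As a mental model of the three stages, here is a minimal PyTorch sketch. The encoder features, LLM, and diffusion decoder are stand-ins (random tensors and small linear/Transformer layers) and all dimensions are illustrative; this is not the actual ImageBind/Vicuna/Stable Diffusion stack.

```python
import torch
import torch.nn as nn

# Minimal sketch of the three-tier NExT-GPT pipeline.
# All components below are illustrative stand-ins, not the real
# ImageBind encoder, Vicuna LLM, or Stable Diffusion decoder.

ENC_DIM, LLM_DIM, COND_DIM = 1024, 4096, 768

class InputProjection(nn.Module):
    """Maps modality-encoder features into the LLM embedding space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(ENC_DIM, LLM_DIM)

    def forward(self, feats):            # (B, N, ENC_DIM)
        return self.proj(feats)          # (B, N, LLM_DIM) "language-like" tokens

class OutputProjection(nn.Module):
    """Transformer layers mapping modality signal tokens to diffusion conditioning."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=LLM_DIM, nhead=8, batch_first=True),
            num_layers=2)
        self.to_cond = nn.Linear(LLM_DIM, COND_DIM)

    def forward(self, signal_tokens):    # (B, K, LLM_DIM) hidden states of signal tokens
        return self.to_cond(self.blocks(signal_tokens))  # (B, K, COND_DIM)

# Stage 1: encode the input with a frozen encoder (stubbed) and project it.
image_feats = torch.randn(1, 256, ENC_DIM)           # pretend ImageBind features
llm_inputs = InputProjection()(image_feats)

# Stage 2: the LLM consumes the projected tokens plus text and emits both a
# textual reply and hidden states for special modality signal tokens (stubbed).
signal_hidden = torch.randn(1, 4, LLM_DIM)            # hidden states of 4 image-signal tokens

# Stage 3: project the signal tokens into the conditioning space of the frozen
# diffusion decoder (standing in for its usual text-encoder embeddings).
cond = OutputProjection()(signal_hidden)
print(cond.shape)                                     # torch.Size([1, 4, 768])
```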
Crucially, the system is trained end-to-end. Instead of the LLM simply generating text that then prompts a separate, pre-trained diffusion model (as in pipeline-style approaches), NExT-GPT fine-tunes the input and output projection layers, plus a small set of LLM parameters via LoRA. This allows a more direct alignment between the LLM’s understanding and the diffusion models’ generation capabilities, avoiding the “noise” and limitations that arise when information is passed between modules solely as discrete text. The modality signal tokens are designed to implicitly carry rich and flexible instructions for the downstream diffusion models, letting the LLM learn what content to generate.
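The parameter-efficient recipe above (frozen encoders, LLM backbone, and decoders; trainable projections; LoRA inside the LLM) can be sketched as follows. The `LoRALinear` wrapper, module names, and hyperparameters are illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update (W + B @ A)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze the original LLM weight
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Hypothetical training setup mirroring the description above:
#   frozen:    modality encoders, LLM backbone, diffusion decoders
#   trainable: input/output projection layers + LoRA adapters inside the LLM
attn_q = LoRALinear(nn.Linear(4096, 4096))             # stand-in LLM attention projection
input_proj = nn.Linear(1024, 4096)                     # trainable input projection
output_proj = nn.Linear(4096, 768)                     # trainable output projection

trainable = [p for m in (attn_q, input_proj, output_proj)
             for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable))               # only a small fraction of the full model
```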
VAR (NeurIPS 2024 Best Paper)
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
- “Visual Autoregressive Transformer”
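As the title suggests, VAR autoregresses over resolution scales rather than individual tokens: each step predicts the full token map of the next, larger scale conditioned on all coarser scales. Below is a minimal sketch of that coarse-to-fine loop with a stand-in transformer and a made-up scale schedule; it is not the paper’s multi-scale VQ tokenizer or model, and block-causal masking across scales is omitted.

```python
import torch
import torch.nn as nn

# Coarse-to-fine "next-scale" autoregression (illustrative only).
scales = [1, 2, 3, 4, 8]           # token-map side lengths, smallest to largest
d_model, vocab = 256, 1024

embed = nn.Embedding(vocab, d_model)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
head = nn.Linear(d_model, vocab)

context = torch.zeros(1, 0, d_model)           # grows as coarser scales are generated
token_maps = []
for s in scales:
    n = s * s
    # Predict all n tokens of the next scale in one forward pass,
    # conditioned on every previously generated (coarser) scale.
    queries = torch.zeros(1, n, d_model)        # placeholder queries for the new scale
    hidden = backbone(torch.cat([context, queries], dim=1))[:, -n:]
    tokens = head(hidden).argmax(-1)            # (1, n) discrete tokens for this scale
    token_maps.append(tokens.view(1, s, s))
    context = torch.cat([context, embed(tokens)], dim=1)

print([t.shape for t in token_maps])            # coarse-to-fine token maps
```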
UNO (ICCV 2025)
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
GPT-ImgEval (May 2025, Arxiv)
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
- Guessing how GPT-4o does image generation / editing
- GPT-ImgEval: the first comprehensive benchmark for evaluating the image generation capabilities of OpenAI’s GPT-4o model.
- The benchmark assesses GPT-4o across three key dimensions:
- generation quality,
- editing proficiency, and
- world knowledge-informed semantic synthesis.
This paper introduces GPT-ImgEval, the first comprehensive benchmark designed to evaluate the image generation capabilities of OpenAI’s GPT-4o model. It quantitatively and qualitatively diagnoses GPT-4o’s performance across three critical dimensions: generation quality, editing proficiency, and world knowledge-informed semantic synthesis. The research also delves into GPT-4o’s potential underlying architecture, offering empirical evidence that it might combine an auto-regressive model with a diffusion-based head for image decoding. Furthermore, the paper identifies and visualizes GPT-4o’s specific limitations and common synthetic artifacts in image generation, conducts a comparative study of multi-round image editing between GPT-4o and Gemini 2.0 Flash, and discusses the safety implications of GPT-4o’s outputs, including their detectability by existing image forensic models. The authors aim for their work to provide valuable insight, foster reproducibility, and accelerate innovation in the field of image generation.
Guessing how GPT-4o does image generation / editing
- Hypothesis-1: A VAR-based Architecture with Next-Scale Prediction.
- In this design, the image generation process happens in progressive steps. It starts by creating a low-resolution, somewhat blurry image, and then gradually refines it, adding more detail and increasing the resolution until the final high-quality image is produced. The paper notes that this aligns with what users observe when GPT-4o generates an image, as it often appears to get sharper and more detailed over time in the interface.
- Hypothesis-2: A Hybrid AR Architecture with a Diffusion-Based Head.
- This alternative hypothesis proposes that GPT-4o uses a combination of two different architectural styles. It suggests an AR backbone, which is good at understanding language and predicting sequences, coupled with a “diffusion-based generation head” for the actual image creation. In this setup, the AR model would first generate intermediate visual “tokens” or latent representations, which then act as input for the diffusion model. The diffusion model then decodes these representations into the final image. This hypothesis is supported by OpenAI’s own descriptions and an “easter egg” image that seemingly depicts this pipeline: “token → [transformer] → [diffusion] → image pixels.” This hybrid approach aims to combine the strong semantic understanding of AR models with the visual fidelity of diffusion models.
- Based on the empirical evidence presented in the paper (the authors trained a CLIP ViT-B classifier to tell diffusion-generated images from AR-generated ones), Hypothesis-2, the hybrid AR architecture with a diffusion-based head, is considered the more likely architecture behind GPT-4o’s image generation. The authors say it could be one of the designs shown in Figure 7. (A rough sketch of such a probe follows.)
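One way such a probe could be set up: freeze a CLIP ViT-B image encoder, train a linear classifier on images from known diffusion vs. autoregressive generators, then run it on GPT-4o outputs. The use of `open_clip`, the checkpoint name, and the folder layout are illustrative assumptions, not the paper’s exact setup.

```python
import torch
import torch.nn as nn
import open_clip
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Hypothetical probe: does a GPT-4o image "look like" diffusion or AR output?
# Assumed folder layout: train/autoregressive/*.png and train/diffusion/*.png,
# filled with samples from known AR and diffusion generators.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad = False                     # frozen CLIP features, linear probe on top

classifier = nn.Linear(clip_model.visual.output_dim, 2)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

loader = DataLoader(ImageFolder("train", transform=preprocess),
                    batch_size=32, shuffle=True)
for images, labels in loader:                   # label 0: AR-generated, 1: diffusion-generated
    with torch.no_grad():
        feats = clip_model.encode_image(images)
    loss = loss_fn(classifier(feats.float()), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At test time, run the trained probe on GPT-4o images; per the paper's findings,
# the evidence points to the diffusion side, supporting Hypothesis-2.
```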
Weakness Analysis of GPT-4o (Section 4)
- Inconsistency in image generation: GPT-4o sometimes struggles to perfectly reproduce input images without edits, introducing subtle modifications or unpredictable changes in aspect ratio and cropping.
- High-resolution & over-refinement limitations: The model tends to prioritize high-frequency visual information, often producing overly detailed outputs even when a blurry or low-resolution image is requested. This limits its ability to reproduce certain artistic styles.
- Brush tool limitations: Although GPT-4o has a brush tool for localized editing, the underlying process regenerates the entire image, leading to unintended changes in global properties like texture, color, or fine details. The model also exhibits a warm color bias.
- Failures in complex scene generation: GPT-4o faces challenges in creating coherent multi-person scenarios and object-human interactions, often resulting in abnormal poses, anatomical structures, or spatially implausible object overlaps.
- Non-English text capability limitation: While strong in English text generation, GPT-4o’s performance in generating Chinese text within complex scenes is limited, often producing errors in fonts or using unintended traditional characters.
Editing Comparison
BAGEL (May 2025, Arxiv)
Emerging Properties in Unified Multimodal Pretraining
OmniGen2 (June 2025, Arxiv)
OmniGen2: Exploration to Advanced Multimodal Generation
- Multimodal Large Language Model (AR) + Diffusion Decoder for image generation.
- Multimodal Rotary Position Embedding (Omni-RoPE)
This paper introduces OmniGen2, an open-source generative model designed for a wide range of tasks including text-to-image, image editing, and in-context generation. It features distinct decoding pathways for text and image, unshared parameters, and a decoupled image tokenizer, which allows it to build on existing multimodal understanding models without re-adapting VAE inputs, thereby preserving original text generation capabilities. The authors also developed comprehensive data construction pipelines, a reflection mechanism, and a new benchmark called OmniContext for evaluating in-context generation.
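A skeletal sketch of that decoupled design: the (frozen) MLLM path handles understanding and text generation with its own tokenizer, while a separate diffusion transformer operating on VAE latents handles image synthesis, conditioned on the MLLM’s hidden states. All modules, names, and dimensions below are illustrative assumptions, not OmniGen2’s actual implementation.

```python
import torch
import torch.nn as nn

# Skeleton of a decoupled MLLM + diffusion-decoder design (illustrative).
# The MLLM path (understanding, text generation) and the image-generation path
# (diffusion transformer over VAE latents) keep separate, unshared parameters;
# the diffusion decoder is conditioned on the MLLM's hidden states.

LLM_DIM, LATENT_CH, COND_DIM = 3584, 16, 1024

class DiffusionDecoder(nn.Module):
    """Stand-in diffusion transformer over VAE latents, conditioned on MLLM states."""
    def __init__(self):
        super().__init__()
        self.cond_proj = nn.Linear(LLM_DIM, COND_DIM)
        self.latent_in = nn.Linear(LATENT_CH, COND_DIM)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(COND_DIM, nhead=8, batch_first=True), num_layers=2)
        self.latent_out = nn.Linear(COND_DIM, LATENT_CH)

    def forward(self, noisy_latents, mllm_hidden):
        # noisy_latents: (B, H*W, LATENT_CH); mllm_hidden: (B, T, LLM_DIM)
        tokens = torch.cat([self.cond_proj(mllm_hidden),
                            self.latent_in(noisy_latents)], dim=1)
        out = self.blocks(tokens)[:, mllm_hidden.size(1):]    # keep the latent positions
        return self.latent_out(out)                            # predicted noise / velocity

# MLLM path (frozen): its own ViT tokenizer and text decoding, stubbed here.
mllm_hidden = torch.randn(1, 77, LLM_DIM)        # hidden states from the prompt (+ ref images)

# Image path: denoise VAE latents with the separate diffusion decoder.
noisy = torch.randn(1, 32 * 32, LATENT_CH)
pred = DiffusionDecoder()(noisy, mllm_hidden)
print(pred.shape)                                 # torch.Size([1, 1024, 16])
```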
What OmniGen2 found
Things that worked
Things that didn't work
Qwen-Image (Aug 2025, Arxiv)
- MMDiT with MSRoPE (multimodal scalable RoPE)
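For reference, an MMDiT-style block (as introduced with SD3) keeps separate per-modality projections for text and image tokens but runs a single joint attention over their concatenation; a minimal sketch is below. MSRoPE itself is not implemented here, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTBlock(nn.Module):
    """One joint-attention block: separate QKV/output per modality, shared attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.heads = heads
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.txt_out = nn.Linear(dim, dim)
        self.img_out = nn.Linear(dim, dim)

    def forward(self, txt, img):
        B, T, D = txt.shape
        N = img.shape[1]
        # Modality-specific projections, then joint attention over all tokens.
        qkv = torch.cat([self.txt_qkv(txt), self.img_qkv(img)], dim=1)  # (B, T+N, 3D)
        q, k, v = qkv.chunk(3, dim=-1)
        def split(x):  # (B, L, D) -> (B, heads, L, D/heads)
            return x.view(B, -1, self.heads, D // self.heads).transpose(1, 2)
        # In the real model, rotary position embeddings (e.g. MSRoPE) would be
        # applied to q/k before attention; omitted in this sketch.
        attn = F.scaled_dot_product_attention(split(q), split(k), split(v))
        attn = attn.transpose(1, 2).reshape(B, T + N, D)
        return self.txt_out(attn[:, :T]), self.img_out(attn[:, T:])

block = MMDiTBlock()
txt_tokens = torch.randn(1, 77, 512)      # prompt tokens
img_tokens = torch.randn(1, 256, 512)     # latent image patches
txt_out, img_out = block(txt_tokens, img_tokens)
print(txt_out.shape, img_out.shape)
```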





