Paper Review - Diffusion Models Applications
Diffusion Models Applications
Image Synthesis, Text-to-image, Controllable Generation
Text-to-image
- Inverse of image captioning problem
- Conditional generation: given a text prompt $c$, generate high-resolution images $x \sim p(x \mid c)$.
Text-to-image: GLIDE
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models: https://arxiv.org/abs/2112.10741
Main contribution:
- A 64x64 base model + a 64x64 to 256x256 super-resolution model
- Tried classifier-free and CLIP guidance, finding that classifier-free guidance works better than CLIP guidance.
CLIP
Learning Transferable Visual Models From Natural Language Supervision: https://arxiv.org/abs/2103.00020
- CLIP (Contrastive Language–Image Pre-training) model contains two components:
- Image Encoder
- Text Encoder
During CLIP training, a batch of (image, caption) pairs is sampled from a large dataset.
The model optimizes a contrastive cross-entropy loss that encourages a high dot product between the image embedding $f(x)$ and the text embedding $g(c)$ when the image and caption come from the same pair, and a low dot product when they come from different pairs. Formulated as:

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log \frac{\exp\big(f(x_i)\cdot g(c_i)/\tau\big)}{\sum_{j}\exp\big(f(x_i)\cdot g(c_j)/\tau\big)} + \log \frac{\exp\big(f(x_i)\cdot g(c_i)/\tau\big)}{\sum_{j}\exp\big(f(x_j)\cdot g(c_i)/\tau\big)}\right]$$
The optimal value of the dot product (up to a constant and the temperature $\tau$) should be:

$$f(x)\cdot g(c) = \log \frac{p(x, c)}{p(x)\,p(c)} + \text{const} = \log p(c \mid x) - \log p(c) + \text{const}$$
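For concreteness, a minimal PyTorch sketch of this symmetric contrastive loss (the random embeddings in the usage line stand in for the encoder outputs; the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (N, d) embeddings for N matching (image, caption) pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (N, N) matrix of dot products; the diagonal holds the matching pairs.
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    # Symmetric cross-entropy: image -> text and text -> image directions.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Usage with random embeddings standing in for encoder outputs:
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```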
CLIP Guidance: replace the classifier in classifier guidance with the CLIP model
Given the formula above, we can use the CLIP model in place of the classifier for guidance.
Recall that in classifier guidance we sample with a modified score:

$$\nabla_{x_t} \log p(x_t) + \omega\, \nabla_{x_t} \log p(c \mid x_t)$$

We augment the second term to fit our CLIP model: at the optimum, $f(x_t)\cdot g(c) = \log p(c \mid x_t) - \log p(c) + \text{const}$, and when we take the gradient over $x_t$ the $\log p(c)$ term disappears, so $\nabla_{x_t} \log p(c \mid x_t) = \nabla_{x_t}\big(f(x_t)\cdot g(c)\big)$.
Then we basically modify the score function to

$$\nabla_{x_t} \log p(x_t) + \omega\, \nabla_{x_t}\big(f(x_t)\cdot g(c)\big)$$
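A minimal sketch of one CLIP-guided score evaluation; `score_model` and `clip_image_encoder` are assumed interfaces, the guidance weight is illustrative, and GLIDE's detail of using a CLIP model trained on noised images is omitted:

```python
import torch

def clip_guided_score(x_t, t, caption_emb, score_model, clip_image_encoder, weight=3.0):
    """Unconditional score plus the CLIP guidance term  w * grad_x ( f(x_t) . g(c) )."""
    base_score = score_model(x_t, t)                    # approx. grad_x log p(x_t)
    x = x_t.detach().requires_grad_(True)
    sim = (clip_image_encoder(x) * caption_emb).sum()   # f(x_t) . g(c), summed over the batch
    clip_grad = torch.autograd.grad(sim, x)[0]          # gradient of the CLIP similarity w.r.t. x_t
    return base_score + weight * clip_grad
```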
However, in GLIDE they showed that classifier-free guidance works better.
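For comparison, a minimal sketch of the classifier-free guidance combination at sampling time (the `eps_model` interface, the empty-prompt conditioning `null_cond`, and the weight value are assumptions):

```python
def cfg_epsilon(eps_model, x_t, t, cond, null_cond, w=3.0):
    """Combine conditional and unconditional noise predictions with guidance weight w."""
    eps_cond = eps_model(x_t, t, cond)
    eps_uncond = eps_model(x_t, t, null_cond)
    return eps_uncond + w * (eps_cond - eps_uncond)   # w = 0: unconditional, w = 1: conditional
```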
Text-to-image: DALL-E 2
https://openai.com/dall-e-2/
Hierarchical Text-Conditional Image Generation with CLIP Latents: https://arxiv.org/abs/2204.06125
Note that DALL-E 1 was an autoregressive transformer model.
- 1024x1024 text-to-image generation model
Main Idea:
- Built up on a pre-trained CLIP model
- The text embedding comes from the pre-trained CLIP text encoder, which is kept frozen
- A prior model followed by a decoder model form the generation pipeline
- Prior model: produces CLIP image embeddings conditioned on the input caption
- Option 1: Autoregressive prior: quantize image embedding to a sequence of discrete codes and predict them autoregressively
- Option 2 (better): Diffusion Prior: model the continuous image embedding by diffusion models conditioned on caption
- Decoder: produces images conditioned on the CLIP image embedding and the text
- Cascaded diffusion model: 1 base model (64x64), 2 super-resolution models (64x64 -> 256x256, 256x256->1024x1024)
- Largest super-resolution model trained on patches of 1/4 size, takes full-res inputs at inference time
- Classifier-free guidance & noise conditioning augmentation are important
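Putting the pieces together, a high-level sketch of the generation pipeline; all model handles (`clip_text_encoder`, `prior`, `decoder_64`, `sr_256`, `sr_1024`) are hypothetical stand-ins for the trained components:

```python
import torch

@torch.no_grad()
def dalle2_generate(caption, clip_text_encoder, prior, decoder_64, sr_256, sr_1024):
    text_emb = clip_text_encoder(caption)     # frozen, pre-trained CLIP text encoder
    image_emb = prior(text_emb, caption)      # prior: caption -> CLIP image embedding
    x64 = decoder_64(image_emb, caption)      # 64x64 base diffusion decoder
    x256 = sr_256(x64)                        # 64 -> 256 diffusion super-resolution
    return sr_1024(x256)                      # 256 -> 1024 diffusion super-resolution
```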
Why condition the decoder on the CLIP image embedding?
- CLIP Image embeddings capture high-level semantic meaning; latents in the decoder model take care of the low-level details
- The bipartite latent representation (CLIP image embedding plus decoder latent) enables several text-guided image manipulation tasks
Bipartite latent representations
Given an input image, we can obtain its bipartite latent representation $(z_i, x_T)$, which contains:
- $z_i$: the CLIP image embedding, obtained by running the CLIP image encoder
- $x_T$: the decoder latent, obtained by running the decoder's DDIM sampler in reverse (DDIM inversion)
The paper shows that running the decoder on $(z_i, x_T)$ gives a near-exact reconstruction of the input image.
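A minimal sketch of DDIM inversion for recovering $x_T$, assuming an epsilon-prediction decoder `decoder_eps(x, t, image_emb)` and a cumulative alpha schedule running from roughly 1 down to roughly 0; reconstruction then amounts to running the ordinary DDIM sampler from the returned latent together with $z_i$:

```python
import torch

@torch.no_grad()
def ddim_invert(x0, image_emb, decoder_eps, alphas_cumprod):
    """Run the deterministic DDIM update in reverse (x_0 -> x_T) to recover the decoder latent."""
    x = x0
    for t in range(len(alphas_cumprod) - 1):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = decoder_eps(x, t, image_emb)                       # predicted noise at the current step
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean image
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # step toward the higher noise level
    return x  # x_T; DDIM sampling from (x_T, z_i) gives a near-exact reconstruction
```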
Some interesting examples using the bipartite latent representation:
- DALL-E 2 Image variations
- Generate image variations given the input image
- Fix the CLIP image embedding $z_i$ while decoding with different decoder latents $x_T$
- DALL-E 2 Image interpolation
- Interpolate any two images
- Interpolate the two images' CLIP embeddings $z_{i_1}$ and $z_{i_2}$; use different decoder latents $x_T$ to get different interpolation trajectories
- DALL-E 2 Text Diffs
- Edit the image towards a different prompt given the input image and the caption
- Move the image CLIP embedding $z_i$ toward the difference of the text CLIP embeddings of the two prompts, while the decoder latent $x_T$ is kept constant
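A minimal sketch of the text-diffs manipulation; the `clip_text_encoder` and `decode` handles, the spherical interpolation, and the angle `theta` are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def slerp(a, b, t):
    """Spherical interpolation between two (unit-normalized) embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    omega = torch.acos((a * b).sum(dim=-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

def text_diff_edit(z_i, x_T, source_prompt, target_prompt, clip_text_encoder, decode, theta=0.3):
    """Rotate the image embedding toward the text diff; keep the decoder latent x_T fixed."""
    z_diff = F.normalize(clip_text_encoder(target_prompt) - clip_text_encoder(source_prompt), dim=-1)
    z_edit = slerp(z_i, z_diff, theta)
    return decode(z_edit, x_T)
```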
Text-to-image: Imagen
https://imagen.research.google/
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding: https://arxiv.org/abs/2205.11487
- 1024x1024 text-to-image generation model
Highlight:
- Photorealism; SOTA automatic scores (FID scores) and human ratings
- Deep language understanding of the prompts
- Simple: works directly in pixel space, with no latent space and no quantization
- DrawBench: a new benchmark for text-to-image evaluations
- a set of 200 prompts to evaluate text-to-image models across multiple dimensions
Key modeling components of Imagen
- Cascaded diffusion models
- Classifier-free guidance
- Dynamic thresholding
- Frozen large pretrained language models as text encoders. (T5-XXL)
Key observations:
- Beneficial to use text conditioning for all super-res models
- Noise conditioning augmentation weakens the information coming from the low-resolution input, so text conditioning is needed as an extra source of information (see the sketch after this list)
- Scaling the text encoder is extremely effective at improving performance
- More important than scaling diffusion model size
- Human raters prefer T5-XXL as the text encoder over CLIP encoder on DrawBench
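A simplified sketch of noise conditioning augmentation for a super-resolution stage; the exact corruption schedule Imagen uses differs, this only illustrates corrupting the low-res conditioning and conditioning on the corruption level:

```python
import torch

def augment_lowres_condition(x_lowres: torch.Tensor):
    """Corrupt the low-res conditioning with Gaussian noise at a random per-example level."""
    aug_level = torch.rand(x_lowres.shape[0], device=x_lowres.device)            # sampled corruption level
    noisy = x_lowres + aug_level.view(-1, 1, 1, 1) * torch.randn_like(x_lowres)
    return noisy, aug_level   # the super-res model is conditioned on both the noisy image and the level
```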
Dynamic thresholding
- New technique introduced by Imagen
Large classifier-free guidance weights improve text alignment but hurt image fidelity (higher FID score).
- Hypothesis: at large guidance weights, the guided updates push the predicted $\hat{x}_0$ pixel values far outside the training range, so the generated images become saturated
- Solution: dynamic thresholding: at each sampling step, clip the predicted pixel values to a dynamic range $[-s, s]$, where $s$ is computed from the statistics (a high percentile of absolute values) of the current prediction, then rescale by $s$ (see the sketch after this list)
- Static thresholding => images look saturated
- Dynamic thresholding => images look less saturated and more realistic
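A minimal sketch of static vs. dynamic thresholding applied to the predicted $\hat{x}_0$ at each sampling step (the percentile value is a tunable hyperparameter):

```python
import torch

def static_threshold(x0_pred: torch.Tensor) -> torch.Tensor:
    return x0_pred.clamp(-1.0, 1.0)

def dynamic_threshold(x0_pred: torch.Tensor, p: float = 99.5) -> torch.Tensor:
    # Per-image percentile of absolute predicted pixel values.
    s = torch.quantile(x0_pred.abs().flatten(1), p / 100.0, dim=1)
    s = s.clamp(min=1.0).view(-1, 1, 1, 1)      # never shrink the range below [-1, 1]
    return x0_pred.clamp(-s, s) / s             # clip to [-s, s], then rescale back into [-1, 1]
```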
Controllable Generation: Diffusion Autoencoders
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation https://arxiv.org/abs/2111.15640
Learning semantically meaningful latent representations in diffusion models
- $x_T$ is obtained by inverting the DDIM sampler and captures the low-level stochastic variations of the image
- By additionally conditioning on a low-dimensional semantic latent $z_{\text{sem}}$ (produced by a learned encoder), the model learns semantically meaningful representations
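A schematic sketch of the resulting encode/decode split; `semantic_encoder`, `ddim_invert`, and `ddim_sample` are assumed handles to the paper's components:

```python
def diffae_encode(x0, semantic_encoder, ddim_invert):
    z_sem = semantic_encoder(x0)       # low-dimensional, high-level semantics
    x_T = ddim_invert(x0, z_sem)       # low-level stochastic detail via DDIM inversion
    return z_sem, x_T

def diffae_decode(z_sem, x_T, ddim_sample):
    return ddim_sample(x_T, z_sem)     # near-exact reconstruction when both latents are kept
```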
Image Editing, Image-to-Image, Super-resolution, Segmentation
Super-Resolution: SR3
https://iterative-refinement.github.io/
Image Super-Resolution via Iterative Refinement https://arxiv.org/abs/2104.07636
- Image super-resolution can be considered as training $p(\mathbf{y} \mid \mathbf{x})$, where
- $\mathbf{x}$ = low-resolution image
- $\mathbf{y}$ = high-resolution image
Train a score model for $\mathbf{y}$ conditioned on $\mathbf{x}$ using the denoising objective

$$\mathbb{E}_{(\mathbf{x},\mathbf{y})}\,\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, t}\left[\big\| \epsilon_\theta(\mathbf{y}_t, \mathbf{x}, t) - \epsilon \big\|_p^p\right], \quad p \in \{1, 2\}$$

- where the L1 norm gives better diversity and the L2 norm gives better quality
The conditional score model is simply a U-Net with $\mathbf{x}$ (upsampled) and the noisy $\mathbf{y}_t$ concatenated along the channel dimension
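A minimal sketch of this concatenation-based conditioning; the U-Net itself is a placeholder and the bicubic upsampling is an assumed choice:

```python
import torch
import torch.nn.functional as F

def sr3_eps(unet, y_t: torch.Tensor, x_lowres: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Predict the noise in y_t, conditioned on the low-res image by channel concatenation."""
    x_up = F.interpolate(x_lowres, size=y_t.shape[-2:], mode="bicubic", align_corners=False)
    return unet(torch.cat([y_t, x_up], dim=1), t)   # U-Net sees 2*C input channels
```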
Image-to-Image Translation: Palette
Palette: Image-to-Image Diffusion Models https://arxiv.org/abs/2111.05826
Many image-to-image translation applications can be considered as training $p(\mathbf{y} \mid \mathbf{x})$, where
- $\mathbf{x}$ is the input image
- $\mathbf{y}$ is the target image
Train a score model for $\mathbf{y}$ conditioned on $\mathbf{x}$ using the same denoising objective as above
The conditional score model is again a U-Net with $\mathbf{x}$ and the noisy $\mathbf{y}_t$ concatenated
- Works on colorization, uncropping
- Assumes access to paired input/output data
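A minimal sketch of one training step on paired data, reusing the concatenation-conditioned U-Net from the SR3 sketch above; the epsilon-prediction parameterization and the L1/L2 switch are illustrative:

```python
import torch
import torch.nn.functional as F

def palette_train_step(unet, x, y, alphas_cumprod, use_l1: bool = False):
    """One denoising training step on a paired batch (input x, target y)."""
    t = torch.randint(0, len(alphas_cumprod), (y.shape[0],), device=y.device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(y)
    y_t = a.sqrt() * y + (1 - a).sqrt() * noise         # forward-diffuse the target
    eps_pred = unet(torch.cat([y_t, x], dim=1), t)      # condition on the input image by concatenation
    return F.l1_loss(eps_pred, noise) if use_l1 else F.mse_loss(eps_pred, noise)
```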
Conditional Generation: ILVR
ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models: https://arxiv.org/abs/2108.02938
Iterative Latent Variable Refinement
A simple technique to guide the generation process of an unconditional diffusion model using a reference image
- Given the conditioning (reference) image $\mathbf{y}$, the generation process is modified to pull the samples towards the reference image
- Modify each reverse denoising step by replacing the low-frequency content of the sample with that of the noised reference: $\mathbf{x}_{t-1} \leftarrow \mathbf{x}_{t-1} - \phi_N(\mathbf{x}_{t-1}) + \phi_N(\mathbf{y}_{t-1})$, where $\phi_N$ is a low-pass filter (downsample by a factor of $N$, then upsample back)
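A minimal sketch of the ILVR refinement of one reverse step; using a bicubic resize as the low-pass filter and passing the cumulative alpha as a plain float are assumptions:

```python
import torch
import torch.nn.functional as F

def phi_N(img: torch.Tensor, N: int = 8) -> torch.Tensor:
    """Low-pass filter: downsample by a factor of N, then upsample back."""
    low = F.interpolate(img, scale_factor=1.0 / N, mode="bicubic", align_corners=False)
    return F.interpolate(low, size=img.shape[-2:], mode="bicubic", align_corners=False)

def ilvr_step(x_prev: torch.Tensor, y_ref: torch.Tensor, a_prev: float, N: int = 8) -> torch.Tensor:
    """Refine the unconditional proposal x_{t-1} with the low-frequency content of the reference."""
    noise = torch.randn_like(y_ref)
    y_prev = (a_prev ** 0.5) * y_ref + ((1 - a_prev) ** 0.5) * noise   # reference diffused to step t-1
    return x_prev - phi_N(x_prev, N) + phi_N(y_prev, N)                # swap in the reference's low frequencies
```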
Semantic Segmentation
Label-Efficient Semantic Segmentation with Diffusion Models https://arxiv.org/abs/2112.03126
Image Editing: SDEdit
https://sde-image-editing.github.io/
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations https://arxiv.org/abs/2108.01073
- Key observation: the forward diffusion process brings two different distributions (e.g., stroke paintings and realistic photos) close to each other
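So a guide image (e.g., a stroke painting) can be partially noised and then denoised into a realistic sample. A minimal sketch under assumed interfaces, where `denoise_step` stands for one reverse-diffusion update with the pretrained model:

```python
import torch

@torch.no_grad()
def sdedit(guide_image, eps_model, alphas_cumprod, t0: int, denoise_step):
    """Diffuse the guide image to an intermediate time t0, then denoise back to t = 0."""
    a = alphas_cumprod[t0]
    x = a.sqrt() * guide_image + (1 - a).sqrt() * torch.randn_like(guide_image)
    for t in range(t0, -1, -1):
        x = denoise_step(eps_model, x, t)   # one reverse-diffusion update (implementation assumed)
    return x
```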