DreamBooth

Paper: DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation


  • Given only a few (typically 3-5) casually captured images of a specific subject, without any textual description, our objective is to generate new images of the subject with high detail fidelity and with variations guided by text prompts.

DreamBooth: the model itself is fine-tuned

  • +ve : most effective, highest subject fidelity
  • -ve : storage-inefficient, because a full new model is trained and saved
    • Output size is around 2000 MB

Method

Personalization of Text-to-Image Models

  • Exploits the ability of large text-to-image diffusion models, which seem to excel at integrating new information into their domain without forgetting the prior or overfitting to a small set of training images

Designing Prompts for Few-Shot Personalization

  • Implant a new (unique text identifier, subject) pair into the diffusion model's dictionary
  • The unique text identifier is a rare identifier containing random characters, e.g. "xxy5syt00" (see the prompt sketch below)
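
As a concrete illustration, here is a minimal sketch of how the two fine-tuning prompts can be assembled (the class noun and the exact prompt wording are illustrative choices, not fixed by the paper):

```python
# Minimal sketch of DreamBooth prompt construction. The identifier is the
# document's example; the class noun "dog" is an illustrative choice.
unique_identifier = "xxy5syt00"   # rare identifier implanted into the model's "dictionary"
class_noun = "dog"                # coarse class describing the subject

# Prompt paired with the 3-5 subject images during fine-tuning:
instance_prompt = f"a {unique_identifier} {class_noun}"

# Prompt used to generate and supervise class-prior images (no identifier):
class_prompt = f"a {class_noun}"

print(instance_prompt)  # -> "a xxy5syt00 dog"
print(class_prompt)     # -> "a dog"
```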


Class-specific Prior Preservation Loss

  • The best results for maximum subject fidelity are achieved by fine-tuning all layers of the model.
    • This includes fine-tuning layers that are conditioned on the text embeddings

Check out LoRA + DreamBooth

  • Language drift
    • an observed problem in language models
    • a model that is pre-trained on a large text corpus and later fine-tuned for a specific task progressively loses syntactic and semantic knowledge of the language

A similar phenomenon happens when fine-tuning diffusion models:

  • Reduced output diversity
    • When fine-tuning on a small set of images, we would like to still be able to generate the subject in novel viewpoints, poses, and articulations
    • This diversity is lost when the model is trained for too long

Class-specific prior preservation loss

  • encourages diversity
  • counters language drift

To mitigate the two aforementioned issues, we propose an autogenous class-specific prior preservation loss that encourages diversity and counters language drift. In essence, the method supervises the model with its own generated samples, in order for it to retain the prior once the few-shot fine-tuning begins.

This allows it to generate diverse images of the class prior, as well as retain knowledge about the class prior that it can use in conjunction with knowledge about the subject instance.
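
In practice, this means sampling the frozen pre-trained model with a class-only prompt to build the prior dataset before fine-tuning. A minimal sketch, assuming the diffusers StableDiffusionPipeline (model ID and sample count are illustrative):

```python
# Sketch: generating the prior-preservation images x_pr with the frozen,
# pre-trained model before fine-tuning starts.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_prompt = "a dog"                      # c_pr: class noun only, no identifier
os.makedirs("class_images", exist_ok=True)
for i in range(200):                        # a few hundred prior samples (assumed count)
    image = pipe(class_prompt).images[0]    # x_pr: the model's own generation
    image.save(f"class_images/{i:04d}.png")
```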

Originally, the diffusion model is trained with the loss:

$$\mathbb{E}_{\mathbf{x}, \mathbf{c}, \boldsymbol{\epsilon}, t}\left[ w_t \left\| \hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x} + \sigma_t \boldsymbol{\epsilon}, \mathbf{c}) - \mathbf{x} \right\|_2^2 \right]$$

With the class-specific prior preservation loss added, the new fine-tuning loss becomes:

$$\mathbb{E}_{\mathbf{x}, \mathbf{c}, \boldsymbol{\epsilon}, \boldsymbol{\epsilon}', t}\left[ w_t \left\| \hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x} + \sigma_t \boldsymbol{\epsilon}, \mathbf{c}) - \mathbf{x} \right\|_2^2 + \lambda\, w_{t'} \left\| \hat{\mathbf{x}}_\theta(\alpha_{t'} \mathbf{x}_{\mathrm{pr}} + \sigma_{t'} \boldsymbol{\epsilon}', \mathbf{c}_{\mathrm{pr}}) - \mathbf{x}_{\mathrm{pr}} \right\|_2^2 \right]$$

Implementation:

```python
# https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#LL1001C1-L1015C92
# ...
if args.with_prior_preservation:
    # Chunk the noise and model_pred into two parts and compute the loss on each part separately.
    model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
    target, target_prior = torch.chunk(target, 2, dim=0)

    # Compute instance loss
    loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")

    # Compute prior loss
    prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")

    # Add the prior loss to the instance loss.
    loss = loss + args.prior_loss_weight * prior_loss
else:
    loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
# ...
```
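
Note that in this training script, each batch already contains the instance images and the class-prior images concatenated along the batch dimension, which is why `torch.chunk(..., 2, dim=0)` recovers the two halves; `args.prior_loss_weight` plays the role of λ in the loss above.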

Metrics

DINO Fidelity

  • Our proposed DINO metric is the average pairwise cosine similarity between the ViT-S/16 DINO embeddings of generated and real images.
  • This is our preferred metric, since, by construction and in contrast to supervised networks, DINO is not trained to ignore differences between subjects of the same class. Instead, the self-supervised training objective encourages distinction of unique features of a subject or image.
```python
# Helper code
import torch
import numpy as np

def compute_cosine_distance(image_features, image_features2):
    # L2-normalize the features; the dot product of normalized vectors
    # is the cosine similarity (despite the function's name)
    image_features = image_features / np.linalg.norm(image_features, ord=2)
    image_features2 = image_features2 / np.linalg.norm(image_features2, ord=2)
    return np.dot(image_features, image_features2)

class VITs16:
    def __init__(self, device="cuda"):
        self.model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16').to(device)
        self.model.eval()

    def get_embeddings(self, tensor_image):
        output = self.model(tensor_image)
        return output

    def get_embeddings_intermediate(self, tensor_image, n_last_block=4):
        """
        We use `n_last_block=4` when evaluating ViT-Small
        """
        intermediate_output = self.model.get_intermediate_layers(tensor_image, n=n_last_block)
        output = torch.cat([x[:, 0] for x in intermediate_output], dim=-1)
        return output
```
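
A hypothetical usage example of the helpers above (file names are placeholders; the preprocessing follows the standard ImageNet normalization commonly used with DINO):

```python
# Hypothetical usage: DINO similarity between one real and one generated image.
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

dino = VITs16(device="cuda")
real = preprocess(Image.open("real_dog.png").convert("RGB")).unsqueeze(0).to("cuda")
gen = preprocess(Image.open("generated_dog.png").convert("RGB")).unsqueeze(0).to("cuda")

with torch.no_grad():
    emb_real = dino.get_embeddings(real).squeeze(0).cpu().numpy()
    emb_gen = dino.get_embeddings(gen).squeeze(0).cpu().numpy()

# The DINO metric averages this value over all (generated, real) pairs.
print(compute_cosine_distance(emb_real, emb_gen))
```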

Textual Inversion

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Key ideas:

  • Distill only a few (3 to 5) images into an input token embedding
    • e.g. 3 cat images become a token <cat>
  • The model weights themselves are frozen and remain unchanged
  • Gradient updates are performed on the token vector instead of the model

Basically: the idea assumes that the model already understands the concept; we just need to find the "right input vector".


Method

  • The model is assumed to already understand the concept; we just need to find the "right input vector" (a minimal sketch follows this list)
    • Gradient updates are performed on the token vector S* instead of the model
    • The output is the S* embedding (a lightweight, tiny embedding)
      • Output size ~ 0.013 MB
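
A minimal sketch of this setup, assuming the Hugging Face transformers CLIP classes; the model ID, learning rate, and training-loop details are illustrative, not the paper's exact recipe:

```python
# Sketch: only the new token's embedding row is trained; all weights stay frozen.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"            # assumed model
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Register the placeholder token S* and make room for its embedding
tokenizer.add_tokens("<cat>")
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids("<cat>")

# Freeze the model; only the embedding matrix keeps gradients
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)
# In the training loop: compute the usual diffusion loss with prompts that
# contain "<cat>", call loss.backward(), zero out embeddings.weight.grad for
# every row except `token_id`, then optimizer.step(). The learned row is the
# tiny embedding that gets saved as the output.
```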


LoRA

LoRA: Low-Rank Adaptation of Large Language Models

  • Introduces new weights to the model
    • Inserts new low-rank layers between the model's intermediate states
    • Only the new layers are trained; the original weights are frozen
  • Teaches the model a new concept without fine-tuning the whole model
    • Much faster training
    • Note that LoRA produces a small file that stores only the weight changes for some of the model's layers
    • Output size is around 145 MB

Originally LoRA was used for large language models, but it is now also used in diffusion models for image generation.

  • Adds a tiny number of weights to the diffusion model and trains only those layers until the modified model understands the concept (see the sketch below)
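
A minimal sketch of the core idea (illustrative, not the actual diffusers/PEFT implementation): a frozen linear layer W is augmented with a trainable low-rank product B·A, so only r·(d_in + d_out) extra parameters are trained.

```python
# Minimal LoRA sketch: y = W x + (alpha/r) * B A x, with W frozen and only
# A, B trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                  # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), r=4)
y = layer(torch.randn(2, 77, 768))   # e.g. a batch of 77 text-token features
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the LoRA weights
```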

LoRA in Stable Diffusion


LoRA applies small changes to the most critical part of Stable Diffusion models: the cross-attention layers. This is the part of the model where the image and the prompt meet.
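
As an illustrative sketch of picking those targets (assumes `pipe` is a loaded StableDiffusionPipeline as in the earlier sketch; in diffusers' UNet, `attn2` is the cross-attention over the text embeddings):

```python
# Hypothetical sketch: listing cross-attention projections as LoRA targets.
lora_targets = [
    name for name, module in pipe.unet.named_modules()
    if "attn2" in name and name.split(".")[-1] in ("to_q", "to_k", "to_v")
]
print(lora_targets[:3])
```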