DreamBooth

Paper: DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation


  • Given only a few (typically 3-5) casually captured images of a specific subject, without any textual description, our objective is to generate new images of the subject with high detail fidelity and with variations guided by text prompts.

DreamBooth: the model itself is fine-tuned

  • +ve : most effective, highest subject fidelity
  • -ve : storage-inefficient, because a full new model is trained and saved
    • Output size is around 2000 MB

Method

Personalization of Text-to-Image Models

  • Exploits the ability of large text-to-image diffusion models, which seem to excel at integrating new information into their domain without forgetting the prior or overfitting to a small set of training images

Designing Prompts for Few-Shot Personalization

  • Implant a new (unique text identifier, subject) pair into the diffusion model's dictionary
  • The unique text identifier is a rare identifier containing random characters, e.g. "xxy5syt00" (see the prompt sketch below)
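
As a concrete illustration, here is a minimal sketch of how the two fine-tuning prompts can be assembled (the class noun and the exact prompt wording are illustrative choices, not fixed by the paper):

```python
# Minimal sketch of DreamBooth prompt construction. The identifier is the
# document's example; the class noun "dog" is an illustrative choice.
unique_identifier = "xxy5syt00"   # rare identifier implanted into the model's "dictionary"
class_noun = "dog"                # coarse class describing the subject

# Prompt paired with the 3-5 subject images during fine-tuning:
instance_prompt = f"a {unique_identifier} {class_noun}"

# Prompt used to generate and supervise class-prior images (no identifier):
class_prompt = f"a {class_noun}"

print(instance_prompt)  # -> "a xxy5syt00 dog"
print(class_prompt)     # -> "a dog"
```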


Class-specific Prior Preservation Loss

  • The best results for maximum subject fidelity are achieved by fine-tuning all layers of the model.
    • This includes fine-tuning layers that are conditioned on the text embeddings

Check out LoRA + DreamBooth

  • Language drift
    • an observed problem in language models
    • a model that is pre-trained on a large text corpus and later fine-tuned for a specific task progressively loses syntactic and semantic knowledge of the language

A similar phenomenon happens when fine-tuning diffusion models:

  • Reduced output diversity
    • When fine-tuning on a small set of images, we would like to still be able to generate the subject in novel viewpoints, poses, and articulations
    • This diversity is lost when the model is trained for too long

Class-specific prior preservation loss

  • encourages diversity
  • counters language drift

To mitigate the two aforementioned issues, we propose an autogenous class-specific prior preservation loss that encourages diversity and counters language drift. In essence, the method supervises the model with its own generated samples, in order for it to retain the prior once the few-shot fine-tuning begins.

This allows it to generate diverse images of the class prior, as well as retain knowledge about the class prior that it can use in conjunction with knowledge about the subject instance.
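
In practice, this means sampling the frozen pre-trained model with a class-only prompt to build the prior dataset before fine-tuning. A minimal sketch, assuming the diffusers StableDiffusionPipeline (model ID and sample count are illustrative):

```python
# Sketch: generating the prior-preservation images x_pr with the frozen,
# pre-trained model before fine-tuning starts.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

class_prompt = "a dog"                      # c_pr: class noun only, no identifier
os.makedirs("class_images", exist_ok=True)
for i in range(200):                        # a few hundred prior samples (assumed count)
    image = pipe(class_prompt).images[0]    # x_pr: the model's own generation
    image.save(f"class_images/{i:04d}.png")
```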

Originally, the diffusion model is trained with the loss:

$$\mathbb{E}_{\mathbf{x}, \mathbf{c}, \boldsymbol{\epsilon}, t}\left[ w_t \left\| \hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x} + \sigma_t \boldsymbol{\epsilon}, \mathbf{c}) - \mathbf{x} \right\|_2^2 \right]$$

With the class-specific prior preservation loss added, the new fine-tuning loss becomes:

$$\mathbb{E}_{\mathbf{x}, \mathbf{c}, \boldsymbol{\epsilon}, \boldsymbol{\epsilon}', t}\left[ w_t \left\| \hat{\mathbf{x}}_\theta(\alpha_t \mathbf{x} + \sigma_t \boldsymbol{\epsilon}, \mathbf{c}) - \mathbf{x} \right\|_2^2 + \lambda\, w_{t'} \left\| \hat{\mathbf{x}}_\theta(\alpha_{t'} \mathbf{x}_{\mathrm{pr}} + \sigma_{t'} \boldsymbol{\epsilon}', \mathbf{c}_{\mathrm{pr}}) - \mathbf{x}_{\mathrm{pr}} \right\|_2^2 \right]$$

Implementation:

```python
# https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py#LL1001C1-L1015C92
# ...
if args.with_prior_preservation:
    # Chunk the noise and model_pred into two parts and compute the loss on each part separately.
    model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
    target, target_prior = torch.chunk(target, 2, dim=0)

    # Compute instance loss
    loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")

    # Compute prior loss
    prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")

    # Add the prior loss to the instance loss.
    loss = loss + args.prior_loss_weight * prior_loss
else:
    loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
# ...
```
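
Note that in this training script, each batch already contains the instance images and the class-prior images concatenated along the batch dimension, which is why `torch.chunk(..., 2, dim=0)` recovers the two halves; `args.prior_loss_weight` plays the role of λ in the loss above.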

Metrics

DINO Fidelity

  • Our proposed DINO metric is the average pairwise cosine similarity between the ViT-S/16 DINO embeddings of generated and real images.
  • This is our preferred metric, since, by construction and in contrast to supervised networks, DINO is not trained to ignore differences between subjects of the same class. Instead, the self-supervised training objective encourages distinction of unique features of a subject or image.
```python
# Helper code
import torch
import numpy as np

def compute_cosine_distance(image_features, image_features2):
    # L2-normalize the features; the dot product of normalized vectors
    # is the cosine similarity (despite the function's name)
    image_features = image_features / np.linalg.norm(image_features, ord=2)
    image_features2 = image_features2 / np.linalg.norm(image_features2, ord=2)
    return np.dot(image_features, image_features2)

class VITs16:
    def __init__(self, device="cuda"):
        self.model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16').to(device)
        self.model.eval()

    def get_embeddings(self, tensor_image):
        output = self.model(tensor_image)
        return output

    def get_embeddings_intermediate(self, tensor_image, n_last_block=4):
        """
        We use `n_last_block=4` when evaluating ViT-Small
        """
        intermediate_output = self.model.get_intermediate_layers(tensor_image, n=n_last_block)
        output = torch.cat([x[:, 0] for x in intermediate_output], dim=-1)
        return output
```
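
A hypothetical usage example of the helpers above (file names are placeholders; the preprocessing follows the standard ImageNet normalization commonly used with DINO):

```python
# Hypothetical usage: DINO similarity between one real and one generated image.
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

dino = VITs16(device="cuda")
real = preprocess(Image.open("real_dog.png").convert("RGB")).unsqueeze(0).to("cuda")
gen = preprocess(Image.open("generated_dog.png").convert("RGB")).unsqueeze(0).to("cuda")

with torch.no_grad():
    emb_real = dino.get_embeddings(real).squeeze(0).cpu().numpy()
    emb_gen = dino.get_embeddings(gen).squeeze(0).cpu().numpy()

# The DINO metric averages this value over all (generated, real) pairs.
print(compute_cosine_distance(emb_real, emb_gen))
```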

Textual Inversion

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Key ideas:

  • Distill only a few (3 to 5) images into an input token embedding
    • e.g. 3 cat images become a token <cat>
  • The model weights themselves are frozen and remain unchanged
  • Gradient updates are performed on the token vector instead of the model

Basically: the idea assumes that the model already understands the concept; we just need to find the "right input vector".


Method

  • The model is assumed to already understand the concept; we just need to find the "right input vector" (a minimal sketch follows this list)
    • Gradient updates are performed on the token vector S* instead of the model
    • The output is the S* embedding (a lightweight, tiny embedding)
      • Output size ~ 0.013 MB
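
A minimal sketch of this setup, assuming the Hugging Face transformers CLIP classes; the model ID, learning rate, and training-loop details are illustrative, not the paper's exact recipe:

```python
# Sketch: only the new token's embedding row is trained; all weights stay frozen.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"            # assumed model
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# Register the placeholder token S* and make room for its embedding
tokenizer.add_tokens("<cat>")
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids("<cat>")

# Freeze the model; only the embedding matrix keeps gradients
text_encoder.requires_grad_(False)
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad_(True)

optimizer = torch.optim.AdamW([embeddings.weight], lr=5e-4)
# In the training loop: compute the usual diffusion loss with prompts that
# contain "<cat>", call loss.backward(), zero out embeddings.weight.grad for
# every row except `token_id`, then optimizer.step(). The learned row is the
# tiny embedding that gets saved as the output.
```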


LoRA

LoRA: Low-Rank Adaptation of Large Language Models

  • Introduces new weights to the model
    • Inserts new low-rank layers between the model's intermediate states
    • Only the new layers are trained; the original weights are frozen
  • Teaches the model a new concept without fine-tuning the whole model
    • Much faster training
    • Note that LoRA produces a small file that stores only the weight changes for some of the model's layers
    • Output size is around 145 MB

Originally LoRA was used for large language models, but it is now also used in diffusion models for image generation.

  • Adds a tiny number of weights to the diffusion model and trains only those layers until the modified model understands the concept (see the sketch below)
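
A minimal sketch of the core idea (illustrative, not the actual diffusers/PEFT implementation): a frozen linear layer W is augmented with a trainable low-rank product B·A, so only r·(d_in + d_out) extra parameters are trained.

```python
# Minimal LoRA sketch: y = W x + (alpha/r) * B A x, with W frozen and only
# A, B trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)                  # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), r=4)
y = layer(torch.randn(2, 77, 768))   # e.g. a batch of 77 text-token features
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the LoRA weights
```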

LoRA in Stable Diffusion


LoRA applies small changes to the most critical part of Stable Diffusion models: the cross-attention layers. This is the part of the model where the image and the prompt meet.
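
As an illustrative sketch of picking those targets (assumes `pipe` is a loaded StableDiffusionPipeline as in the earlier sketch; in diffusers' UNet, `attn2` is the cross-attention over the text embeddings):

```python
# Hypothetical sketch: listing cross-attention projections as LoRA targets.
lora_targets = [
    name for name, module in pipe.unet.named_modules()
    if "attn2" in name and name.split(".")[-1] in ("to_q", "to_k", "to_v")
]
print(lora_targets[:3])
```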