ColorPeel

Paper: ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement (ECCV 2024)

“disentanglement” refers to the process of decoupling shapes and colors from a set of auto-generated colored geometries in this paper.

This paper improves disentanglement:

  • Introduced a dataset consists on basic geometric objects in precise RGB code color
  • Finetuned T2I model (SD 1.4) on the dataset with an auxiliary special loss function
    • Cross Attention Alignment (CAA) Loss
  • Resulting precise control on rgb color

Teaser Image

Method

We denote

  • color concepts as cc^*
  • shape concepts as ss^*

The prompt template will be “A photo of sis_i^* shape with cjc_j^* color”, and some text varient then paired with the dataset of the basic geometric objects in sis_i^* shape with precise RGB code cjc^*_j color

Then directly optimze the new tokens (Vc\mathcal{V}^{c^*}, Vs\mathcal{V}^{s^*}) and the Stable Diffusion UNet by minimizing the Latent Diffusion Model (LDM) Loss:

V=argminVEz0E(x),y,ϵN(0,1)Lrec\mathcal{V}^* = \arg\min_{\mathcal{V}} \mathbb{E}_{z_0 \sim \mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0, 1)} \mathcal{L}_{\text{rec}}

However, this method failed to correctly disentangle color from the shape. It is because:

  • the misalignment between color attention and shape attention in the model’s cross-attention maps.

Therefore ColorPeel proposed the Cross Attention Alignment (CAA) Loss to achieve agreement between these cross-attentions, as defined by the cosine similarity between the cross-attention maps:

Lcaa=1cos(Atc,Ats)\mathcal{L}_{\text{caa}} = 1 - cos(\mathcal{A}_t^{c^*}, \mathcal{A}_t^{s^*})

  • cos(Atc,Ats)cos(\mathcal{A}_t^{c^*}, \mathcal{A}_t^{s^*}) measures the cosine similarity between the attention maps of color and shape.
    • cosine similarity measures how similar two vectors are. Higher cosine similarity means the vectors are more aligned.
  • By minimizing Lcaa\mathcal{L}_{\text{caa}} , it encourages higher similarity between the attention maps for color and shape
    • the model focuses on the correct aspects for both color and shape simultaneously.

How does it achieve disentanglement?

Its strengthing the cross attention alignment to achieve disentanglement.

  • It ensure that both color and shape attentions are directed towards the correct regions/features of the image.
    • avoid color information incorrectly attributed to shape features

So the new loss become

V=argminVEz0E(x),y,ϵN(0,1)[Lrec+λLcaa]\mathcal{V}^* = \arg\min_{\mathcal{V}} \mathbb{E}_{z_0 \sim \mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0, 1)} [\mathcal{L}_{\text{rec}}+\lambda \cdot \mathcal{L}_{\text{caa}}]

  • λ=0.2\lambda = 0.2 seems to achieve the best result.

Overall

img

Illustration of the ColorPeel method.

  • Firstly, instance images along with the templates are generated, given the user-provided RGB or color coordinates.

  • Next, new modifier tokens are introduced, i.e., sis^∗_i and cic^∗_i which correspond to shapes and colors to ensure the disentanglement of shape from color.

  • Following Custom Diffusion, the key and value projection matrices in the diffusion model cross-attention layers are optimized along with the modifier tokens while training.

  • Cross attention alignment is introduced to enforce the color and shape cross-attentions disentanglement.