ColorPeel

Paper: ColorPeel: Color Prompt Learning with Diffusion Models via Color and Shape Disentanglement (ECCV 2024)

“disentanglement” refers to the process of decoupling shapes and colors from a set of auto-generated colored geometries in this paper.

This paper improves disentanglement:

Introduced a dataset consists on basic geometric objects in precise RGB code color
Finetuned T2I model (SD 1.4) on the dataset with an auxiliary special loss function
- Cross Attention Alignment (CAA) Loss
Resulting precise control on rgb color

Teaser Image

Method

We denote

color concepts as $c^*$
shape concepts as $s^*$

The prompt template will be “A photo of $s_i^*$ shape with $c_j^*$ color”, and some text varient then paired with the dataset of the basic geometric objects in $s_i^*$ shape with precise RGB code $c^*_j$ color

Then directly optimze the new tokens ( $\mathcal{V}^{c^*}$ , $\mathcal{V}^{s^*}$ ) and the Stable Diffusion UNet by minimizing the Latent Diffusion Model (LDM) Loss:

$\mathcal{V}^* = \arg\min_{\mathcal{V}} \mathbb{E}_{z_0 \sim \mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0, 1)} \mathcal{L}_{\text{rec}}$

However, this method failed to correctly disentangle color from the shape. It is because:

the misalignment between color attention and shape attention in the model’s cross-attention maps.

Therefore ColorPeel proposed the Cross Attention Alignment (CAA) Loss to achieve agreement between these cross-attentions, as defined by the cosine similarity between the cross-attention maps:

$\mathcal{L}_{\text{caa}} = 1 - cos(\mathcal{A}_t^{c^*}, \mathcal{A}_t^{s^*})$

$cos(\mathcal{A}_t^{c^*}, \mathcal{A}_t^{s^*})$ $cos (A_{t}^{c^{*}}, A_{t}^{s^{*}})$ measures the cosine similarity between the attention maps of color and shape.
- cosine similarity measures how similar two vectors are. Higher cosine similarity means the vectors are more aligned.
By minimizing $\mathcal{L}_{\text{caa}}$ $L_{caa}$ , it encourages higher similarity between the attention maps for color and shape
- the model focuses on the correct aspects for both color and shape simultaneously.

How does it achieve disentanglement?

Its strengthing the cross attention alignment to achieve disentanglement.

It ensure that both color and shape attentions are directed towards the correct regions/features of the image.

avoid color information incorrectly attributed to shape features

So the new loss become

$\mathcal{V}^* = \arg\min_{\mathcal{V}} \mathbb{E}_{z_0 \sim \mathcal{E}(x), y, \epsilon \sim \mathcal{N}(0, 1)} [\mathcal{L}_{\text{rec}}+\lambda \cdot \mathcal{L}_{\text{caa}}]$

$\lambda = 0.2$ seems to achieve the best result.

Overall

Illustration of the ColorPeel method.

Firstly, instance images along with the templates are generated, given the user-provided RGB or color coordinates.

Next, new modifier tokens are introduced, i.e., $s^∗_i$ and $c^∗_i$ which correspond to shapes and colors to ensure the disentanglement of shape from color.

Following Custom Diffusion, the key and value projection matrices in the diffusion model cross-attention layers are optimized along with the modifier tokens while training.

Cross attention alignment is introduced to enforce the color and shape cross-attentions disentanglement.