Background Information

The Identity-Preserving Generation task refers to synthesizing an image that contains a desired person's identity. InstantID tackles this task with only a single reference facial image. Moreover, InstantID requires no test-time training, instead leveraging the learned prior of diffusion-based foundation models such as Stable Diffusion. This makes it highly efficient compared to previous methods that require fine-tuning the foundation model. According to the paper, InstantID achieves superior results on identity-preserving generation, qualitatively outperforming LoRA-based and IP-Adapter-based methods.

Technical Details

The InstantID pipeline is shown in the figure. At its core is the UNet of the diffusion model, which takes noise as input. On top of this, InstantID adds two components, inspired by IP-Adapter (2023) and ControlNet (2023), that control image generation to achieve the identity-preserving generation task. We discuss the two components below (shown in the top and bottom parts of the figure, respectively).

The first component, shown in the upper part of the figure, consists of a face embedding and a decoupled cross-attention mechanism, inspired by IP-Adapter. IP-Adapter introduced decoupled cross-attention, in which a new trainable cross-attention layer is added to the UNet to inject image features. InstantID leverages a pre-trained face encoder (Antelopev2) to extract features from the reference image and obtains the face embedding through a lightweight adapter. The face embedding acts as an image prompt and is fed into the trainable cross-attention layer, following the IP-Adapter approach. This component provides coarse-grained control over the output image.
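To make the decoupled cross-attention concrete, here is a minimal PyTorch sketch (not the official implementation); the dimensions, the scale factor, and the module names are illustrative assumptions. The text branch reuses the original (frozen) projections, while only the new image key/value projections for the face embedding would be trained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledCrossAttention(nn.Module):
    """Sketch of IP-Adapter-style decoupled cross-attention.
    Dimensions and `scale` are illustrative assumptions."""
    def __init__(self, hidden_dim=768, context_dim=768, num_heads=8, scale=1.0):
        super().__init__()
        self.num_heads = num_heads
        self.scale = scale
        # Original (frozen) text cross-attention projections.
        self.to_q = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.to_k_text = nn.Linear(context_dim, hidden_dim, bias=False)
        self.to_v_text = nn.Linear(context_dim, hidden_dim, bias=False)
        # New trainable projections for the face (image-prompt) embedding.
        self.to_k_image = nn.Linear(context_dim, hidden_dim, bias=False)
        self.to_v_image = nn.Linear(context_dim, hidden_dim, bias=False)

    def _attention(self, q, k, v):
        b, n, d = q.shape
        h = self.num_heads
        q, k, v = (t.reshape(b, -1, h, d // h).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, x, text_emb, face_emb):
        q = self.to_q(x)
        # Attention over the text prompt (original branch).
        text_out = self._attention(q, self.to_k_text(text_emb), self.to_v_text(text_emb))
        # Attention over the face embedding (new trainable branch).
        image_out = self._attention(q, self.to_k_image(face_emb), self.to_v_image(face_emb))
        # The two attention results are summed, weighted by a scale factor.
        return text_out + self.scale * image_out
```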

The second component is the IdentityNet, a variant of ControlNet proposed in this work. ControlNet introduces a trainable copy of the foundation model that allows users to incorporate task-specific conditions. In InstantID, facial landmarks are derived from the reference image using OpenPose but are restricted to only five key points. The IdentityNet takes these facial landmarks as its control signal. Instead of using the text prompt as the condition in the cross-attention layers (as in ControlNet), the previously obtained face embedding is used as the condition, keeping the focus on the identity representation. This setup provides fine-grained control over the output image.
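As a rough illustration of this wiring, the sketch below uses the diffusers ControlNetModel as a stand-in for the IdentityNet and a Stable Diffusion 1.5 UNet as the base model; the checkpoint ID, tensor shapes, and the dummy face-embedding tokens are placeholders, and the official InstantID implementation differs in its details.

```python
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

# Base UNet from a Stable Diffusion checkpoint (placeholder model ID).
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
# ControlNet-style trainable copy standing in for the IdentityNet.
identitynet = ControlNetModel.from_unet(unet)

latents = torch.randn(1, 4, 64, 64)           # noisy latent
timestep = torch.tensor([500])
landmark_image = torch.zeros(1, 3, 512, 512)  # rendered 5-keypoint landmark map
face_emb = torch.randn(1, 4, 768)             # projected face-embedding tokens (dummy)

# Key difference from vanilla ControlNet: the cross-attention context is the
# face embedding rather than a text-prompt embedding.
down_res, mid_res = identitynet(
    latents,
    timestep,
    encoder_hidden_states=face_emb,
    controlnet_cond=landmark_image,
    return_dict=False,
)

# The control residuals are added into the frozen UNet's forward pass.
noise_pred = unet(
    latents,
    timestep,
    encoder_hidden_states=face_emb,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample
```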

During training, only the parameters of the IP-Adapter modules (the projection and the added cross-attention layers) and the IdentityNet are jointly optimized, while all other components are frozen. Once training is complete, no further tuning is required, and identity-preserving generation is achieved with a single forward pass given the reference image.
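The snippet below sketches how this training setup could look in PyTorch; the module names (image_proj, adapter_attn_layers), the learning rate, and the loss wiring are assumptions for illustration, not the authors' code.

```python
import itertools
import torch
import torch.nn.functional as F

def configure_training(unet, identitynet, image_proj, adapter_attn_layers, lr=1e-4):
    """Freeze the base diffusion UNet and return an optimizer over only the
    IdentityNet and the IP-Adapter-style modules. Module names and the
    learning rate are illustrative assumptions."""
    for p in unet.parameters():
        p.requires_grad_(False)  # base model stays frozen

    trainable = itertools.chain(
        identitynet.parameters(),          # ControlNet-style identity branch
        image_proj.parameters(),           # lightweight face-embedding projector
        adapter_attn_layers.parameters(),  # new decoupled cross-attention layers
    )
    return torch.optim.AdamW(trainable, lr=lr)

def training_step(optimizer, noise_pred, noise_target):
    """One optimization step with the standard diffusion (noise-prediction) loss."""
    loss = F.mse_loss(noise_pred, noise_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```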