GAN Inversion

GAN Inversion: A Survey

Deep generative models such as GANs learn the underlying variation factors of the training data through the weak supervision of image generation. Discovering and steering interpretable latent representations in image generation facilitates a wide range of image editing applications.

This paper presents a comprehensive survey of GAN inversion methods with an emphasis on algorithms and applications.

What is GAN Inversion?

GAN inversion aims to invert a given image back into the latent space of a pretrained GAN model, so that the image can be faithfully reconstructed from the inverted code by the generator.

GANs effectively encode rich semantic information in their intermediate features and latent spaces through the supervision of image generation.

  • GANs can synthesize images with a diverse range of attributes, such as faces with different ages and expressions, and scenes with different lighting conditions.
  • By varying the latent code of the GAN generator, we can manipulate certain attributes of the generated image while retaining the others.
  • However, such manipulation in the latent space is only applicable to images generated by the GAN generator rather than to arbitrary real images, due to the lack of inference capability in GANs.
    • This makes it challenging to apply these unconditional GANs to the editing of real images.

With GAN inversion:

  • GAN inversion plays an essential role in bridging the real and fake image domains
  • Pretrained GAN models, such as StyleGAN and BigGAN, can be used for real image editing applications
    • GAN inversion aims to recover the latent code in the latent space of a pretrained unconditional GAN model, and thus enables numerous image editing applications by manipulating the latent code
    • In this case, the pretrained unconditional GAN model can be used without modifying its architecture
  • GAN inversion interprets GAN’s latent space and examines how realistic images can be generated.

Problem formulation of GAN Inversion

Ideally, the latent code found for a given image should achieve two goals:

  • reconstructing the input image faithfully and photorealistically
  • facilitating downstream tasks

The generator of an unconditional GAN learns the distribution mapping $Z \rightarrow X$.

It has been shown that when $z_1, z_2 \in Z$ are close in the $Z$ space, the corresponding images $x_1, x_2 \in X$ are visually similar in the image space $X$.

GAN inversion can map generated data $x$ back to a latent representation $z^{*}$.

  • From the obtained $z^*$, we can reconstruct the original image, or vary $z^*$ to obtain a manipulated image

GAN inversion can also be seen as finding an image $x^*$ that can be entirely synthesized by the well-trained generator $G$ while remaining close to the real image $x$.

Therefore, the inversion problem can be formulated as:

$$\mathbf{z}^{*}=\underset{\mathbf{z}}{\arg\min}\ \ell(G(\mathbf{z}), x)$$

where

  • $\ell(\cdot)$ is a distance metric in the image or feature space
    • it can be based on $\ell_1$, $\ell_2$, perceptual, or LPIPS metrics
  • $G$ is a feed-forward neural network

If we can solve this equation accurately, the first goal is achieved. However, since it is a nonconvex optimization problem due to the nonconvexity of $G(z)$, it is not easy to find accurate solutions.
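To make the objective concrete, here is a minimal sketch of one common instantiation of $\ell(\cdot)$, combining a pixel-wise $\ell_2$ term with a perceptual LPIPS term (the `lpips` package is assumed; the weight `lambda_percep` is illustrative, not prescribed by the survey):

```python
import torch
import lpips  # pip install lpips; a widely used perceptual distance

# LPIPS expects images scaled to [-1, 1]
percep = lpips.LPIPS(net='vgg')

def inversion_loss(gx, x, lambda_percep=0.5):
    # pixel-wise L2 plus a perceptual term: one common choice for l(G(z), x)
    l2 = torch.mean((gx - x) ** 2)
    return l2 + lambda_percep * percep(gx, x).mean()
```

The later sketches in this document reuse this `inversion_loss` as the distance metric.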

GAN Inversion Methods

This section introduces different latent spaces of GAN models, representative GAN inversion methods, and their properties. As the StyleGAN models achieve state-of-the-art image synthesis, numerous GAN inversion methods have been developed using various latent spaces of the StyleGANs. In addition to the $Z$ space for generic GANs, several latent spaces are designed specifically for StyleGANs, including the $W$, $W^+$, $S$, and $P$ spaces.


Which Space to Embed?

Regardless of the GAN inversion method, one important design choice is which latent space to embed the image into.

  • A good latent space should be disentangled and easy to embed images into.

Z Space

The generative model in the GAN architecture learns to map the values (sampled from a normal or uniform distribution) to the generated images.

These values are called latent codes or latent representations (denoted by $z \in Z$).

Z Space can be found in all unconditional GAN models.

  • The latent $Z$ space is applicable to all unconditional GAN models.
  • However, the constraint that the $Z$ space follows a normal distribution limits its representation capacity and its disentanglement of semantic attributes.

W and W+ Space

Recent GAN inversion methods mostly adopt the latent spaces used in StyleGANs. These latent spaces have higher degrees of freedom and are thus significantly more expressive than the $Z$ space.

W and W+ Space can be found in StyleGAN

  • StyleGAN converts the native $z$ into mapped style vectors $w$ through a nonlinear mapping network $f$ implemented with an 8-layer MLP. This intermediate latent space is called the $W$ space.
    • Due to the mapping network and affine transformations, the $W$ space contains more disentangled features than the $Z$ space.
    • The expressiveness of the $W$ space is, however, still limited, restricting the range of images that can be faithfully reconstructed.
      • Therefore, some works use another, layer-wise latent space, $W^+$, where a different intermediate latent vector $w$ is fed into each of the generator’s layers via AdaIN.
      • However, inverting images into the $W^+$ space alleviates distortion at the expense of compromised editability.
      • For a StyleGAN with 18 layers (see the sketch below):
        • $w \in W$ has 512 dimensions
        • $w \in W^+$ has $18 \times 512 = 9216$ dimensions
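The relationship among the three spaces can be summarized in a few lines of PyTorch (a sketch; `mapping` is a stand-in for StyleGAN's 8-layer MLP, and the layer count of 18 corresponds to a 1024×1024 generator):

```python
import torch

num_layers = 18                   # a 1024x1024 StyleGAN has 18 style inputs

z = torch.randn(1, 512)           # z in Z: sampled from a normal distribution
w = mapping(z)                    # w in W: output of the 8-layer MLP, shape (1, 512)

# W+: one latent vector per generator layer, shape (1, 18, 512);
# replicating w gives a W+ code that still lies in W, while an inversion
# in W+ is free to optimize each of the 18 rows independently
w_plus = w.unsqueeze(1).repeat(1, num_layers, 1)
```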


S Space

The $S$ space is proposed to achieve better disentanglement in the spatial dimension, beyond the semantic level.

S Space can be found in StyleGAN2

  • The style space $S$ is spanned by channel-wise style parameters $s$, where $s$ is transformed from $w \in W$ by using a different learned affine transformation for each layer of the generator.
  • In a 1024×1024 StyleGAN2 with 18 layers:
    • $W$ has 512 dimensions
    • $W^+$ has $18 \times 512 = 9216$ dimensions
    • $S$ has 9088 dimensions

P Space

PULSE, a recent method, observed a “soap bubble” effect when searching a generative model’s latent space for the desired points.

  • the “soap bubble” effect refers to the fact that much of the density of a high-dimensional Gaussian lies close to the surface of a hypersphere
  • the transformation from the $W$ space to the $P$ space is $x = \text{LeakyReLU}_{5.0}(w)$, where $w$ and $x$ are latent codes in the $W$ and $P$ spaces, respectively
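Since the transformation is an elementwise LeakyReLU, it is trivially invertible; a minimal PyTorch sketch:

```python
import torch.nn.functional as F

def w_to_p(w):
    # LeakyReLU with negative slope 5.0 maps W codes to P space
    return F.leaky_relu(w, negative_slope=5.0)

def p_to_w(x):
    # the inverse uses the reciprocal slope: 1 / 5.0 = 0.2
    return F.leaky_relu(x, negative_slope=0.2)
```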

Method 1: Learning-based GAN Inversion

This approach typically involves training an encoding neural network (encoder) $E(x; \theta_E)$ to map an image $x$ into the latent code $z$ by

$$\theta_{E}^{*}=\underset{\theta_{E}}{\arg\min} \sum_{n} \mathcal{L}\left(G\left(E\left(x_{n}; \theta_{E}\right)\right), x_{n}\right)$$

where

  • $\theta_{E}$ denotes the trainable parameters of the encoder, and $\theta_{E}^{*}$ their optimal value
  • $x_n$ denotes the $n$-th image in the dataset
  • $G$ is treated as a decoder and $E$ is the encoder

Note that the objective function is similar to that of an autoencoder pipeline.

But note that the decoder $G$ is fixed throughout the training.

A learning-based inversion method aims to learn an encoder network that maps an image into the latent space such that the image reconstructed from the latent code looks as similar to the original one as possible.
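A minimal sketch of such an encoder training loop, assuming a frozen pretrained generator `G`, an encoder `E`, a `dataloader` of real images, and the `inversion_loss` sketched earlier (all names are illustrative):

```python
import torch

G.eval()                              # the pretrained generator stays frozen
for p in G.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(E.parameters(), lr=1e-4)

for x in dataloader:                  # real training images
    z_hat = E(x)                      # encoder predicts a latent code
    x_hat = G(z_hat)                  # frozen generator reconstructs
    loss = inversion_loss(x_hat, x)
    opt.zero_grad()
    loss.backward()                   # gradients flow through G but update E only
    opt.step()
```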

Advantages of the Learning-based Method

  • Compared with optimization-based methods, the learning-based approach often performs better and does not fall into local optima.

Concerns of Learning-based GAN Inversion

  • These methods simply learn a deterministic model with no regard to whether the codes produced by the encoder align with the semantic knowledge learned by $G(\cdot)$.

What is considered a Good Encoder? We want our encoder to:

  • achieve accurate reconstruction
  • be lightweight
  • be data-efficient
  • support high-resolution images
  • generalize to arbitrary images

In some sense, the loss for one image provides information for many more images that share a similar appearance. However, the learned inversion is not always perfect and can often be improved further by a few additional steps of optimization => see the Hybrid method below.

Method 2: Optimization-based GAN Inversion

This approach typically reconstructs a target image by optimizing the latent vector:

$$\mathbf{z}^{*}=\underset{\mathbf{z}}{\arg\min}\ \ell(x, G(\mathbf{z}; \theta))$$

where

  • $\ell(\cdot)$ is a distance metric in the image or feature space
  • $x$ denotes the target image
  • $G$ is a generator parameterized by $\theta$

An optimization-based inversion approach directly solves the objective function through back-propagation to find a latent code that minimizes pixel-wise reconstruction loss.

Concerns of Optimization-based GAN Inversion

  • It is critical to choose the optimizer since a good optimizer helps alleviate the local minima problem.
  • the initialization problem
    • Since the objective is highly nonconvex, the reconstruction quality strongly relies on a good initialization of $z$ (sometimes $w$ for StyleGAN).
    • Experiments show that different initial values lead to a significant perceptual difference in the generated images.
    • An intuitive solution is to start with several random initial values and keep the result with the minimal cost (see the sketch below).
  • Optimization-based methods typically require an expensive iterative process in terms of both memory and runtime, as they have to be applied to each latent code independently.
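The multi-restart heuristic mentioned in the list above can be sketched as follows (illustrative hyperparameters; `inversion_loss` as before):

```python
import torch

def invert_with_restarts(G, x, n_restarts=5, steps=500, lr=0.01, nz=512):
    best_z, best_loss = None, float('inf')
    for _ in range(n_restarts):                   # several random initial values
        z = torch.randn(x.size(0), nz, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            loss = inversion_loss(G(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():                     # score the final code
            final = inversion_loss(G(z), x).item()
        if final < best_loss:                     # keep the best restart
            best_loss, best_z = final, z.detach()
    return best_z
```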

Method 3: Hybrid GAN Inversion

The hybrid methods exploit the advantages of both learning-based and optimization-based approaches.

One of the pioneering works proposes a framework that:

  • first predicts the $z$ of a given real photo $x$ by training a separate encoder $E(x; \theta_E)$
  • then uses the obtained $z$ as the initialization for the optimization objective
  • The learned predictive model serves as a fast bottom-up initialization for the nonconvex optimization problem

A hybrid approach first uses an encoder to generate an initial latent code and then refines it with an optimization algorithm.
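A sketch of this two-stage scheme, reusing the illustrative `E`, `G`, and `inversion_loss` from the previous sections:

```python
import torch

z0 = E(x).detach()                    # stage 1: encoder gives a fast initialization
z = z0.clone().requires_grad_(True)

opt = torch.optim.Adam([z], lr=0.01)
for _ in range(100):                  # stage 2: a few optimization steps refine z0
    loss = inversion_loss(G(z), x)
    opt.zero_grad()
    loss.backward()
    opt.step()
```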


Subsequent studies follow this framework and have proposed several variants.

Properties of GAN Inversion Methods

Important properties of GAN inversion methods include:

  • the supported resolution
  • semantic awareness
  • layerwise inversion
  • out-of-distribution generalizability

Supported Resolution

The image resolution that a GAN inversion method can support is mainly determined by

  • the capacity of generators and inversion mechanisms

Some methods proposed:

  • IDinvert: proposes an encoder to map the given images to the latent space of StyleGAN.
    • It performs well for images of 256×256 pixels but does not scale up well to images of 1024×1024 pixels due to the high computational cost (where 1/n denotes semantic feature maps at 1/n of the original input resolution).
  • pSp: proposes 18 map2style modules to predict the 18 single-layer latent codes separately.
    • It can synthesize images of 1024×1024 pixels, regardless of the input image size.
  • Wei et al.: propose a model similar to pSp but with a lightweight encoder.
    • Features from three semantic levels are used to predict different parts of the latent code.


Semantic Awareness

GAN inversion methods with the semantic-awareness property can perform image reconstruction at the pixel level while aligning the inverted code with the knowledge that emerges in the latent space.

Semantic-aware latent codes can better support image editing by reusing the rich knowledge encoded in the GAN models.

The existing approaches typically sample a collection of latent codes $z$ at random and feed them into $G(\cdot)$ to obtain the corresponding syntheses $x'$. The encoder $E(\cdot)$ is then trained by

$$\min_{\Theta_{E}} \mathcal{L}_{E}=\|\mathbf{z}-E(G(\mathbf{z}))\|_{2}$$

where

  • $\Theta_E$ denotes the parameters of the encoder $E(\cdot)$.
  • $\|\cdot\|_2$ denotes the $\ell_2$ distance.
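A sketch of this latent-reconstruction training on synthesized data (illustrative names; contrast it with the learning-based loop earlier, which trains on real images with an image-space loss):

```python
import torch

opt = torch.optim.Adam(E.parameters(), lr=1e-4)

for _ in range(num_steps):
    z = torch.randn(batch_size, 512)               # randomly sampled latent codes
    with torch.no_grad():
        x_syn = G(z)                               # corresponding syntheses
    loss = torch.norm(z - E(x_syn), dim=1).mean()  # || z - E(G(z)) ||_2
    opt.zero_grad()
    loss.backward()
    opt.step()
```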

Some methods proposed:

  • A latent object representation can be used to synthesize images with different styles and reduce artifacts.

    • However, the supervision from only reconstructing $z$ (or, equivalently, the synthesized data) is not sufficient to train an accurate encoder.
  • To alleviate this issue, a domain-specific GAN inversion approach is proposed to recover the input real image at both the pixel and semantic levels.

    • It first trains a domain-guided encoder $E$ to map the image space to the latent space such that all codes produced by the encoder are in-domain latent codes.
    • The encoder $E$ is trained to recover the real images, instead of being trained with synthesized data to recover the latent code.
    • Then, they perform instance-level domain-regularized optimization, using the well-trained $E$ as a regularization term to fine-tune the latent code in the semantic domain during the $z$ optimization.
    • Such optimization helps better reconstruct the pixel values without affecting the semantic property of the inverted code.
    • However…

Layerwise

When the number of layers is large, solving the full inversion problem over the entire generator is not feasible.

  • To invert complex state-of-the-art GANs, Bau et al. propose solving the easier problem of inverting the final layers.
    • “Seeing what a GAN cannot generate,” in ICCV, 2019

Out-of-Distribution Generalizability

GAN inversion methods should be able to invert any given real image, including images not generated by the same process as the training data.

We call this out-of-distribution generalizability.

For example, given a StyleGAN pretrained on the FFHQ dataset, this property helps:

  • to generate face images with all combinations of facial attributes, even if some combinations do not exist in the training dataset
  • to handle images different from the samples of the training set, such as corrupted images, caricatures, or black-and-white photos.

This property is a prerequisite for GAN inversion methods to edit a wider range of images.

Out-of-distribution generalizability has been demonstrated in many GAN inversion methods.

  • Zhu et al. propose a domain-specific GAN inversion approach to recover the input image at both the pixel and semantic levels.
    • Although trained only with the FFHQ dataset, their model can generalize to not only real face images from multiple face datasets but also paintings, caricatures, and black and white photos collected from the Internet.
  • Kang et al. propose a method to invert out-of-range images. Taking facial images as an example, out-of-range images could be images with extreme poses or corrupted images, which previous methods often fail to handle.
    • Being able to invert out-of-range images allows GAN inversion methods to be applied to wider domains rather than limited settings.

One notable drawback is that inverting images that contain unseen attributes can easily lead to unexpected results as they lie outside the domain of the pretrained image generators. This limits extending GAN inversion to broader applications such as image synthesis guided by uncommon textual descriptions.

Some recent approaches aim to alleviate this issue by transferring the GANs pretrained on one image domain to a new one, guided by certain references or semantics from one or few target images (few-shot and one-shot), pretrained language-image models (zero-shot), or both.

Some Papers

First Inversion for GAN | Hybrid

“Generative visual manipulation on the natural image manifold”, ECCV, 2016

  • Code: iGAN
  • model used: DCGAN
  • The objective is to find a generated image $x^*$ in the latent space that is close to the real image $x$.

First using the term inversion | Optimization-based

“Inverting the generator of a generative adversarial network”, NeurIPS Workshop 2016, TNNLS 2018


```python
import torch
import torch.nn.functional as F
from os.path import join
from torchvision.utils import save_image

def find_z(gen, x, nz, lr, exDir, maxEpochs=100):
    # generator in eval mode
    gen.eval()

    # save the "original" images
    save_image(x.data, join(exDir, 'original.png'), normalize=True)

    # the latent code is the only variable being optimized
    if gen.useCUDA:
        gen.cuda()
        Zinit = torch.randn(x.size(0), nz, device='cuda', requires_grad=True)
    else:
        Zinit = torch.randn(x.size(0), nz, requires_grad=True)

    # optimizer updates only the latent code, not the generator weights
    optZ = torch.optim.RMSprop([Zinit], lr=lr)

    losses = {'rec': []}
    for e in range(maxEpochs):
        # reconstruct from the current latent code and measure the pixel-wise error
        xHAT = gen.forward(Zinit)
        recLoss = F.mse_loss(xHAT, x)

        optZ.zero_grad()
        recLoss.backward()
        optZ.step()

        losses['rec'].append(recLoss.item())
        print('[%d] loss: %0.5f' % (e, recLoss.item()))

        # plot training losses (helpers from the original repo)
        if e > 0:
            plot_losses(losses, exDir, e + 1)
            plot_log_losses(losses, exDir, e + 1)

    # visualise the final output
    xHAT = gen.forward(Zinit)
    save_image(xHAT.data, join(exDir, 'rec.png'), normalize=True)

    return Zinit
```

Inversion for conditional GAN | Learning-based

“Invertible conditional GANs for image editing” NeurIPS Workshop 2016

Visualization of mode collapse | Hybrid

“Seeing what a GAN cannot generate” ICCV 2019

  • model used: StyleGAN, Progressive GAN
  • space used: Z, W
  • Inverting a GAN layer instead of the entire generator
    • “While previous work has investigated inversion of 5-layer DCGAN generators, we find that when moving to a 15-layer Progressive GAN, high-quality inversions are much more difficult to obtain.”
    • Proposed a layer-wise inversion method that is more effective for these large-scale GANs
    • Instead of inverting the full generator, it solves a tractable subproblem: decompose the generator $G$ into layers $G = G_f(g_n(\cdots(g_1(z))))$, where $g_1, \ldots, g_n$ are the early layers of the generator.

The authors use this framework to analyze several recent GANs trained on multiple datasets and identify their typical failure cases.
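A sketch of the layer-wise idea (illustrative names: `g_early` for the composed early layers $g_n \circ \cdots \circ g_1$ and `G_f` for the final layers):

```python
import torch

# rather than searching for z with G(z) close to x, search for an intermediate
# representation r with G_f(r) close to x, a more tractable subproblem
z0 = torch.randn(1, nz)
with torch.no_grad():
    r = g_early(z0)                   # initialize r from the early layers

r = r.detach().requires_grad_(True)
opt = torch.optim.Adam([r], lr=0.01)
for _ in range(steps):
    loss = inversion_loss(G_f(r), x)  # invert only the final layers
    opt.zero_grad()
    loss.backward()
    opt.step()
```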


First inversion for StyleGAN | Optimization-based

“Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?”, ICCV 2019

  • space used: W
  • model used: StyleGAN

“Image2StyleGAN++: How to edit the embedded images?”, CVPR 2020

  • space used: W+
  • model used: StyleGAN

P and P+ Space | Optimization-based

“Improved StyleGAN Embedding: Where are the Good Latents?”, arXiv preprint 2020

  • space used: P

Applications of GAN Inversion

Finding an accurate solution to the inversion problem allows us to match the target image without compromising editing capabilities in downstream tasks.

GAN inversion does not require task-specific, densely labeled datasets and can be applied to many tasks, such as:

  • image manipulation
  • image interpolation
  • image restoration
  • style transfer
  • novel-view synthesis
  • adversarial defense
  • 3D reconstruction
  • image understanding
  • multimodal learning
  • medical imaging

Image Manipulation

Given an image $x$, we can vary its latent code $z$.

Then we can obtain the code $z'$ of a target image $x'$ by linearly transforming the latent representation from a trained GAN model $G$.

This can be formulated in the framework of GAN inversion as the operation of adding a scaled difference vector:

$$x' = G(z^* + \alpha \mathbf{n})$$

where

  • $\mathbf{n}$ is the normal direction corresponding to a particular semantic in the latent space
  • $\alpha$ is the step size for manipulation.

If a latent code is moved in a certain direction, then the semantics contained in the output image should vary accordingly.
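A sketch of such an edit (`direction` stands for an illustrative semantic direction $\mathbf{n}$, e.g. one discovered by a latent-space analysis method):

```python
import torch

def edit(G, z_star, direction, alpha):
    # move the inverted code along a semantic direction n and re-synthesize
    direction = direction / direction.norm()   # keep n a unit vector
    return G(z_star + alpha * direction)

# sweeping alpha gradually strengthens or weakens the attribute
# images = [edit(G, z_star, n, a) for a in torch.linspace(-3.0, 3.0, 7)]
```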

Challenges and Future Directions

Theoretical Understanding

While significant effort has been made to apply GAN inversion to image editing applications, much less attention has been paid to a better theoretical understanding of the latent space.

Some recent methods treat the latent space as a manifold structure.

Evaluation Metrics

New perceptual quality metrics that can better evaluate photorealism, diversity, and identity consistency with the original image remain to be explored.

  • there is a lack of effective assessment tools to evaluate the difference between the predicted results and the expected outcome, or to measure the inverted latent codes more directly