StyleGAN

Paper: A Style-Based Generator Architecture for Generative Adversarial Networks

Note: You need to first understand ProGAN before understanding StyleGAN.

StyleGAN is based on ProGAN, but it redesigns the generator.

Key feature:

  • Offers control over the style of generated images at different levels of detail
  • Based on ProGAN to produce high-resolution images
    • Capable of generating very high-resolution images, up to 1024×1024
  • Control over the generated images via style mixing
  • Style-based generator
    • The style-based generator consists of a mapping network $f$ and a synthesis network $g$
    • An intermediate latent space $\mathcal{W}$ is introduced between the mapping network and the synthesis network
    • Learned affine transforms produce styles that control the layers of the synthesis network $g$
    • Adaptive instance normalization (AdaIN) controls the style locally at different places in the network
      • It manipulates the per-channel mean and variance, which effectively controls the style of an image
  • Noise injection to introduce stochastic details / stochastic variation
  • Style injection in different conv layers to control the style locally at different places in the network
    • Affine transform of the intermediate latent $w$ + adaptive instance normalization (AdaIN)

Our generator architecture makes it possible to control the image synthesis via scale-specific modifications to the styles. We can view the mapping network and affine transformations as a way to draw samples for each style from a learned distribution, and the synthesis network as a way to generate a novel image based on a collection of styles. The effects of each style are localized in the network, i.e., modifying a specific subset of the styles can be expected to affect only certain aspects of the image.

Style-based generator

The style-based generator is very different from the traditional generator.

  • Traditionally, the latent code is provided to the generator through an input layer, i.e., the first layer of a feedforward network (a).
    • Note that generator (a) is the generator of ProGAN.
  • The style-based generator (b) omits the input layer and starts from a learned constant instead. The input latent is mapped to an intermediate latent space $\mathcal{W}$, which controls the generator at each convolution layer through AdaIN (adaptive instance normalization).

Where:

  • A is a learned affine transform
    • It specializes $w$ into styles $y = (y_s, y_b)$
  • B applies learned per-channel scaling factors to the noise input
    • Noise controls fine details (low-level features); it does not affect high-level features
    • The generator produces stochastic (i.e., random) detail through these explicit noise inputs
  • AdaIN is adaptive instance normalization
    • It uses the style derived from the intermediate latent $w$ to scale and shift the activations, controlling the generator at each convolution layer
  • The mapping network $f$ consists of 8 MLP layers
    • Why? See "Why do we need a mapping network?" below.
  • The synthesis network $g$ consists of 18 layers (2 layers for each resolution, from $4\times4$ to $1024\times1024$)
    • The output of the last layer is converted to RGB using a separate 1×1 convolution, as in ProGAN
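
A minimal sketch of how one synthesis-network step combines A, B, and AdaIN. This is an illustration only, with hypothetical module and parameter names such as `StyleBlock`, `to_style`, and `noise_scale`, not the official implementation:

```python
import torch
import torch.nn as nn

class StyleBlock(nn.Module):
    """One conv step of the synthesis network g: conv -> noise (B) -> AdaIN with style (A)."""
    def __init__(self, w_dim: int, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # A: learned affine transform that specializes w into a style y = (y_s, y_b)
        self.to_style = nn.Linear(w_dim, 2 * out_ch)
        # B: learned per-channel scaling factors for the noise input
        self.noise_scale = nn.Parameter(torch.zeros(1, out_ch, 1, 1))
        self.norm = nn.InstanceNorm2d(out_ch)  # the normalization part of AdaIN

    def forward(self, x, w, noise=None):
        x = self.conv(x)
        if noise is None:
            # one noise value per spatial position, shared across channels
            noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        x = x + self.noise_scale * noise              # inject per-pixel noise (B)
        y_s, y_b = self.to_style(w).chunk(2, dim=1)   # A: w -> (y_s, y_b)
        x = self.norm(x)                              # per-channel zero mean / unit variance
        return y_s[:, :, None, None] * x + y_b[:, :, None, None]  # AdaIN: scale and shift
```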


For StyleGAN and StyleGAN2, the number of layers $L$ in the synthesis network $g$ is determined by the output image size $R$:

$L = 2 \log_2 R - 2$

At the maximum resolution of 1024×1024, this gives 18 layers.
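
As a quick sanity check of the formula (a sketch; the function name `num_synthesis_layers` is made up here):

```python
import math

def num_synthesis_layers(resolution: int) -> int:
    """L = 2 * log2(R) - 2: two layers per resolution from 4x4 up to R x R."""
    return 2 * int(math.log2(resolution)) - 2

assert num_synthesis_layers(4) == 2       # the 4x4 base block
assert num_synthesis_layers(1024) == 18   # maximum resolution in StyleGAN
```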

Why do we need a mapping network?

StackOverflow: How does Mapping Network in StyleGAN work?

Note that $z$ and $w$ have the same dimensionality (512 in the paper).

  • However, $w$ is more disentangled than $z$.

Finding a $w$ in the intermediate latent space $\mathcal{W}$ for a given image (GAN inversion) enables targeted image editing.

  • The intermediate latent space $\mathcal{W}$ does not have to support sampling according to any fixed distribution
    • The intermediate latent space $\mathcal{W}$ more faithfully reflects the distribution of the training data than the standard Gaussian latent space.
    • The mapping can be adapted to "unwarp" $\mathcal{W}$ so that the factors of variation become more linear.
    • We expect training to yield a less entangled $\mathcal{W}$ in an unsupervised setting, i.e., when the factors of variation are not known in advance
      • The disentangled properties allow extensive image manipulation by leveraging a pretrained StyleGAN.
      • It should be easier to generate realistic images from a disentangled representation than from an entangled one.
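
A minimal sketch of the 8-layer MLP mapping network $f$ (illustrative only; the paper's version also uses equalized learning rate, which is omitted here):

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """f: Z -> W, an 8-layer MLP that 'unwarps' the latent space."""
    def __init__(self, d_latent: int = 512, n_layers: int = 8):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(d_latent, d_latent), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # normalize z before mapping, as in typical implementations (pixel norm)
        z = z / torch.sqrt(torch.mean(z ** 2, dim=1, keepdim=True) + 1e-8)
        return self.net(z)
```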

Z Space

The generative model in a GAN learns to map values sampled from a normal or uniform distribution to generated images.

These values are called latent codes or latent representations (denoted by $z \in \mathcal{Z}$).

  • The latent space $\mathcal{Z}$ exists in all unconditional GAN models.
  • However, constraining $\mathcal{Z}$ to a normal distribution limits its representation capacity and the disentanglement of semantic attributes.
    • Limited representation capacity due to the normal-distribution constraint

W and W+ Space

Recent GAN inversion methods mostly adopt the latent spaces used in StyleGANs. These latent spaces have higher degrees of freedom and are thus significantly more expressive than the $\mathcal{Z}$ space.

  • StyleGAN converts the native $z$ into mapped style vectors $w$ with a nonlinear mapping network $f$ implemented as an 8-layer MLP.
    • Due to the mapping network and the affine transformations, the $\mathcal{W}$ space contains more disentangled features than the $\mathcal{Z}$ space.
  • The $\mathcal{W}+$ space uses a separate $w$ for each layer of the synthesis network, which gives GAN inversion even more degrees of freedom.
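
A shape-level sketch of the three spaces, reusing the `MappingNetwork` sketch above and assuming a 512-dimensional latent with an 18-layer synthesis network:

```python
import torch

mapping = MappingNetwork(d_latent=512)      # from the sketch above
z = torch.randn(16, 512)                    # Z space: z ~ N(0, I)
w = mapping(z)                              # W space: one 512-d style vector per image
# W+ space: one style vector per synthesis layer; here simply repeated,
# but GAN inversion methods may optimize a different w for every layer.
w_plus = w[None, :, :].expand(18, -1, -1)
print(z.shape, w.shape, w_plus.shape)       # (16, 512), (16, 512), (18, 16, 512)
```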


Adaptive instance normalization

  • Adaptive instance normalization is used so that the injected style (derived from $w$) can directly control the style locally at different places in the network.
    • Each style controls only one convolution before being overridden by the next AdaIN operation.
  • The idea is to normalize each channel to zero mean and unit variance,
    • each feature map $x_i$ is normalized separately,
    • and then apply scales and biases based on the style to adjust the channel weightings and achieve style transfer.

$$\operatorname{AdaIN}(\mathbf{x}_i, \mathbf{y}) = \mathbf{y}_{s,i}\,\frac{\mathbf{x}_i - \mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)} + \mathbf{y}_{b,i}$$

Where:

  • $x_i$ is the $i$-th feature map
  • $y$ is the style
  • $y_{s,i}$ is the scale
  • $y_{b,i}$ is the bias

The dimensionality of $y$ is twice the number of feature maps on that layer.

Applying these per-channel scales and biases to the feature maps changes the style.
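
A minimal functional sketch of the AdaIN operation above (the style scale and bias are assumed to come from the affine transform A):

```python
import torch

def adain(x: torch.Tensor, y_s: torch.Tensor, y_b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """x: feature maps (B, C, H, W); y_s, y_b: per-channel style scale and bias (B, C)."""
    mu = x.mean(dim=(2, 3), keepdim=True)           # per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True) + eps   # per-channel std (eps avoids division by zero)
    x_norm = (x - mu) / sigma                       # instance normalization
    return y_s[:, :, None, None] * x_norm + y_b[:, :, None, None]
```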

Why inject the W code in different convolution layers?

  • Convolutions at different resolutions represent different levels of style:
    • Coarse (4×4 – 8×8): high-level aspects such as pose and general hairstyle
    • Middle (16×16 – 32×32): smaller-scale features such as finer hairstyle and eyes open/closed
    • Fine (64×64 – 1024×1024): mainly the color scheme and microstructure


Importance of Noise input | Stochastic Variation

Noise is used to control fine details (low-level features); it does not affect high-level features such as pose and identity.

  • The generator produces stochastic (i.e., random) detail by introducing explicit noise inputs
  • This stochastic detail noticeably increases the quality of the generated images
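
A minimal sketch of the noise input B as its own module (hypothetical `NoiseInjection` name; a single noise image per sample is scaled by learned per-channel factors and added to the feature maps):

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """B: add per-pixel Gaussian noise, scaled by a learned per-channel factor."""
    def __init__(self, channels: int):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x, noise=None):
        if noise is None:
            # one noise value per spatial position, shared across channels
            noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.scale * noise
```

Keeping $w$ fixed and re-sampling only the noise changes the fine details while leaving the overall image intact.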


Mixing Regularization for Style Mixing

If we use only one $z$ passed through the mapping network to get a single $w$, the synthesis network may learn to assume that adjacent styles are correlated.
Therefore, during training we can pass different $z$ samples through the mapping network to get different $w$ vectors, and then mix them.

  • A given percentage of images is generated using two random latent codes instead of one during training
    • Run two latent codes $z_1, z_2$ through the mapping network, and have the corresponding $w_1, w_2$ control the styles before and after a randomly chosen crossover point
    • This regularization technique prevents the network from assuming that adjacent styles are correlated
```python
def get_w(self, batch_size: int, style_mix: bool = True):
    # Number of generator blocks; 8 is used here for demo only.
    n_gen_blocks = 8
    # If mixing styles, form two w vectors and merge them at a random crossover point.
    if style_mix:
        # Random crossover point
        cross_over_point = int(torch.rand(()).item() * n_gen_blocks)
        z1 = torch.randn(batch_size, self.d_latent).to(self.device)
        z2 = torch.randn(batch_size, self.d_latent).to(self.device)
        w1 = self.mapping_network(z1)
        w2 = self.mapping_network(z2)
        # Blocks before the crossover point use w1, the remaining blocks use w2.
        w1 = w1[None, :, :].expand(cross_over_point, -1, -1)
        w2 = w2[None, :, :].expand(n_gen_blocks - cross_over_point, -1, -1)
        return torch.cat((w1, w2), dim=0)
    else:
        z = torch.randn(batch_size, self.d_latent).to(self.device)
        w = self.mapping_network(z)
        return w[None, :, :].expand(n_gen_blocks, -1, -1)
```
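
In this sketch the returned tensor has shape (n_gen_blocks, batch_size, d_latent), i.e., one style vector per generator block, so each block consumes its own row of $w$. During training, mixing is typically applied only to a fraction of the batches (the paper uses mixing regularization for a given percentage of images, e.g., 90%), which can be done by gating `style_mix` on a random draw.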

Perceptual Path Length (PPL)

PPL measures how smoothly the generator maps interpolations in latent space to changes in image space (the related path length regularization in StyleGAN2 encourages a fixed-size step in $\mathcal{W}$ to result in a fixed-magnitude change in the image).

  • Used to measure how smooth the interpolation of latent vectors is.
  • The idea is that we want the path length to be short in some perceptual space.

For the $\mathcal{W}$ space, using linear interpolation between two mapped latents:

$$l_{\mathcal{W}} = \mathbb{E}\left[\frac{1}{\epsilon^2}\, d\Big(g\big(\mathrm{lerp}(f(z_1), f(z_2); t)\big),\ g\big(\mathrm{lerp}(f(z_1), f(z_2); t+\epsilon)\big)\Big)\right]$$

Where

  • $d(\cdot,\cdot)$ is the perceptual image distance (a weighted difference between VGG16 embeddings, not a plain pixel-wise L2 distance)
  • $\epsilon$ is a small subdivision step ($10^{-4}$ in the paper)

As a basis for our metric, we use a perceptually-based pairwise image distance that is calculated as a weighted difference between two VGG16 embeddings, where the weights are fit so that the metric agrees with human perceptual similarity judgments. If we subdivide a latent space interpolation path into linear segments, we can define the total perceptual length of this segmented path as the sum of perceptual differences over each segment, as reported by the image distance metric.
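
A rough sketch of a single PPL segment in $\mathcal{W}$, assuming the `lpips` package (its VGG variant approximates the perceptual distance described above) and a hypothetical `synthesis` network that maps $w$ to images in $[-1, 1]$:

```python
import torch
import lpips  # pip install lpips

def ppl_segment_w(synthesis, w1, w2, eps: float = 1e-4):
    """Perceptual path length of one interpolation segment between two styles in W."""
    dist = lpips.LPIPS(net='vgg')                     # VGG16-based perceptual distance
    t = torch.rand(w1.shape[0], 1, device=w1.device)  # random position along the path
    img_a = synthesis(torch.lerp(w1, w2, t))          # linear interpolation in W
    img_b = synthesis(torch.lerp(w1, w2, t + eps))
    return dist(img_a, img_b) / (eps ** 2)            # scaled by 1/eps^2 as in the paper
```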

Truncation Trick

Generally, the truncation trick is used to improve average image quality at the cost of variation. (Note that the StyleGAN paper reports FID without truncation, since truncation reduces diversity.)

  • When the latent is far away from the mean,
    • the image quality is usually less stable, but the variation is high.
  • When the latent is close to the mean,
    • the image quality is usually stable, but the variation is limited.

From the paper: "it is known that drawing latent vectors from a truncated or otherwise shrunk sampling space tends to improve average image quality, although some amount of variation is lost."
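
A minimal sketch of the truncation trick in $\mathcal{W}$: pull each $w$ toward the mean style $\bar{w}$ with a factor $\psi$ (ψ = 1 leaves $w$ unchanged; smaller ψ trades variation for stability):

```python
import torch

def truncate_w(w: torch.Tensor, w_mean: torch.Tensor, psi: float = 0.7) -> torch.Tensor:
    """w' = w_mean + psi * (w - w_mean)."""
    return w_mean + psi * (w - w_mean)

# w_mean is typically estimated as a running average of mapping_network(z) over many samples of z.
```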

Training Details

  • Adam Optimizer
  • WGAN-GP Loss
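
A generic WGAN-GP gradient-penalty sketch (standard formulation, not StyleGAN-specific training code; `discriminator` is assumed to map images to scalar scores):

```python
import torch

def gradient_penalty(discriminator, real, fake, gp_weight: float = 10.0):
    """WGAN-GP: penalize the gradient norm of D at random interpolates between real and fake images."""
    alpha = torch.rand(real.shape[0], 1, 1, 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_out = discriminator(interp)
    grads = torch.autograd.grad(outputs=d_out.sum(), inputs=interp, create_graph=True)[0]
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return gp_weight * ((grad_norm - 1) ** 2).mean()
```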