ProGAN

Paper: Progressive Growing of GANs for Improved Quality, Stability, and Variation

Key features of ProGAN:

  • Train Generator and Discriminator Progressively
    • From 4x4 to 1024x1024 (How?)
    • Generator and Discriminator are Symmetrical
  • Minibatch std on Discriminator
  • Normalization with PixelNorm
  • Equalized Learning Rate
  • Fading in Layers

The central idea of ProGAN is to train both the generator and discriminator at gradually increasing resolutions, giving the network the ability to learn low-level structure first and finer details later. For example, in anime face generation, the generator can learn to place the eyes in the right place and make them circular before having to learn to paint the iris and eyelashes.


The idea of progressive training was actually first proposed in LAPGAN (Laplacian Pyramid GAN, 2015).

  • Multiple generators and discriminators were used
  • But in ProGAN, only 1 Generator and 1 Discriminator are required.

Progressive growing (of model and layers)

  • start with low-resolution images, and then progressively increase the resolution by adding layers to the networks
    • This incremental nature allows the training to first discover large-scale structure of the image distribution and then shift attention to increasingly finer scale detail, instead of having to learn all scales simultaneously.
  • Generator and Discriminator are symmetrical, such that they are mirror images of each other and always grow in synchrony


How progressive?

  • Start by training the model for a number of epochs (let's say 30) at 4x4 resolution
  • Then add a new layer to both the generator and discriminator, and train for another number of epochs (again, say 30) at 8x8 resolution
  • Repeat the process until reaching 1024x1024 (see the sketch below)
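As a rough illustration, this schedule is just a loop over resolutions. The following is only a minimal sketch, not the paper's training code: `generator`, `discriminator`, their `grow()` methods, and `train_one_epoch` are hypothetical placeholders for however the model-growing and training steps are actually implemented.

# Hypothetical sketch of the progressive schedule (not the official implementation).
RESOLUTIONS = [4, 8, 16, 32, 64, 128, 256, 512, 1024]
EPOCHS_PER_RESOLUTION = 30  # "let's say 30"

for step, resolution in enumerate(RESOLUTIONS):
    if step > 0:
        # Add one block to each network; generator and discriminator always grow in synchrony.
        generator.grow()        # hypothetical method
        discriminator.grow()    # hypothetical method
    for epoch in range(EPOCHS_PER_RESOLUTION):
        train_one_epoch(generator, discriminator, resolution)  # hypothetical helper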


Why?

  • The generation of smaller images is substantially more stable because there is less class information and fewer modes

  • By increasing the resolution little by little we are continuously asking a much simpler question compared to the end goal of discovering a mapping from latent vectors to e.g. 1024^2 images.

  • Reduced training time. With progressively growing GANs most of the iterations are done at lower resolutions, and comparable result quality is often obtained up to 2–6 times faster, depending on the final output resolution.

  • the complex mapping from latents to high-resolution images is easier to learn in steps

Fading in Layers

  • When new layers are added to the networks, we fade them in smoothly
    • This avoids sudden shocks to the already well-trained, smaller-resolution layers


  • 2x is done by bilinear upsampling, while 0.5x is done by average pooling with kernel size = 2 and stride = 2 (the output has 75% fewer pixels: half in both width and height).
  • toRGB and fromRGB are realized using 1x1 conv layers (see the sketch after this list).
    • toRGB in the Generator maps a feature map from channel = in_channel to channel = 3.
    • fromRGB in the Discriminator maps an image from channel = 3 to channel = in_channel.
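A minimal sketch of these building blocks in PyTorch, assuming plain nn.Conv2d layers (the actual ProGAN layers would use the weight-scaled convolutions described later):

import torch
import torch.nn as nn
import torch.nn.functional as F

# toRGB / fromRGB as 1x1 convolutions (sketch; plain Conv2d assumed here)
to_rgb = nn.Conv2d(in_channels=512, out_channels=3, kernel_size=1)    # features -> RGB
from_rgb = nn.Conv2d(in_channels=3, out_channels=512, kernel_size=1)  # RGB -> features

x = torch.randn(16, 512, 8, 8)   # N x C x H x W feature map
img = to_rgb(x)                  # -> 16 x 3 x 8 x 8

# 2x: bilinear upsampling; 0.5x: average pooling with kernel 2, stride 2
img_up = F.interpolate(img, scale_factor=2, mode='bilinear', align_corners=False)  # 16 x 3 x 16 x 16
img_down = F.avg_pool2d(img_up, kernel_size=2, stride=2)                            # 16 x 3 x 8 x 8
feat = from_rgb(img_down)                                                           # 16 x 512 x 8 x 8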

The fade-in operation is easy to implement.

# From the view of the generator
def fade_in(self, alpha, upscaled_img, generated_img):
    # alpha should be a scalar in [0, 1], and the images have the same shape
    return alpha * generated_img + (1 - alpha) * upscaled_img

# From the view of the discriminator
def fade_in(self, alpha, downscaled_img, out):
    # alpha should be a scalar in [0, 1], and the tensors have the same shape
    return alpha * out + (1 - alpha) * downscaled_img
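During the fade-in phase, alpha is ramped from 0 to 1 while the new layer is blended in. A linear ramp over the number of images seen is a common choice; the exact schedule below is an assumption for illustration, not something taken from the paper.

# Hypothetical linear fade-in schedule: alpha grows from 0 to 1 over the fade-in phase.
def get_alpha(images_seen, fade_in_images):
    return min(1.0, images_seen / fade_in_images)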

Minibatch std on Discriminator

GANs have a tendency to capture only a subset of the variation found in training data, and Salimans et al. (2016) suggest “minibatch discrimination” as a solution.

  • compute feature statistics not only from individual images but also across the minibatch
    • encouraging the minibatches of generated and training images to show similar statistics
    • A separate set of statistics is produced for each example in a minibatch and it is concatenated to the layer’s output, so that the discriminator can use the statistics internally
def minibatch_std(self, x):
    # 1. Compute the std of each feature over the minibatch: N x C x H x W -> C x H x W
    batch_std = torch.std(x, dim=0)
    # 2. Average over all features and spatial locations -> a single scalar
    batch_mean = batch_std.mean()
    # 3. Expand the scalar to an extra feature map: N x 1 x H x W
    batch_stat = batch_mean.repeat(x.shape[0], 1, x.shape[2], x.shape[3])
    # 4. Concatenate it with the input: N x C x H x W -> N x (C+1) x H x W (e.g. 512 -> 513 channels)
    return torch.cat([x, batch_stat], dim=1)
  • In the original minibatch discrimination, a layer added towards the end of the discriminator learns a large tensor that projects the input activations to an array of statistics; ProGAN replaces this with the parameter-free layer above, inserted towards the end of the discriminator at every resolution.

Our simplified solution has neither learnable parameters nor new hyperparameters. We first compute the standard deviation for each feature in each spatial location over the minibatch. We then average these estimates over all features and spatial locations to arrive at a single value. We replicate the value and concatenate it to all spatial locations and over the minibatch, yielding one additional (constant) feature map. This layer could be inserted anywhere in the discriminator, but we have found it best to insert it towards the end.
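A quick, self-contained shape check that replays the steps above on a random minibatch:

import torch

x = torch.randn(16, 512, 4, 4)                        # a minibatch of discriminator features
stat = torch.std(x, dim=0).mean()                     # a single scalar, as in the layer above
stat_map = stat.repeat(x.shape[0], 1, x.shape[2], x.shape[3])
out = torch.cat([x, stat_map], dim=1)
print(out.shape)                                      # torch.Size([16, 513, 4, 4])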

Normalization with PixelNorm on Generator

Pixelwise Feature Vector Normalization In Generator

  • Normalize the feature vector in each pixel to unit length in the generator after each convolutional layer

    • a variant of local response normalization (as used in AlexNet)
  • Avoid escalation of signal magnitudes

    • Escalation of signal magnitudes happens when the discriminator gets too good too soon, causing the generator's activations to spike as it tries to catch up.
    • Escalation of signal magnitudes will cause unhealthy competition between generator and discriminator, causing e.g. Mode collapse
  • It can be formulated as the equation below, where $\epsilon = 10^{-8}$:

$b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N}\sum_{j=0}^{N-1}\left(a^j_{x,y}\right)^2 + \epsilon}}$

class PixelNorm(nn.Module):
    def __init__(self, epsilon=1e-8) -> None:
        super().__init__()
        self.epsilon = epsilon

    def forward(self, x):
        return x / torch.sqrt(torch.mean(x**2, dim=1, keepdim=True) + self.epsilon)

To disallow the scenario where the magnitudes in the generator and discriminator spiral out of control as a result of competition, we normalize the feature vector in each pixel to unit length in the generator after each convolutional layer.

We find it surprising that this heavy-handed constraint does not seem to harm the generator in any way, and indeed with most datasets it does not change the results much, but it prevents the escalation of signal magnitudes very effectively when needed.
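For illustration, PixelNorm is placed after each convolution in the generator. Below is a minimal sketch of one generator conv block, assuming a plain nn.Conv2d and LeakyReLU (the paper's implementation uses the weight-scaled convolutions described next), with the PixelNorm class above in scope:

import torch
import torch.nn as nn

# Sketch of one generator block: conv -> LeakyReLU -> PixelNorm (assumed layout)
block = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, padding=1),
    nn.LeakyReLU(0.2),
    PixelNorm(),  # assumes the PixelNorm class defined above is in scope
)

x = torch.randn(16, 512, 8, 8)
y = block(x)
# Each pixel's feature vector now has (approximately) unit RMS over channels:
print(torch.mean(y**2, dim=1)[0, 0, 0])  # ~1.0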

Equalized Learning Rate in learnable layers

  • The idea behind equalized learning rate is to scale the weights at each layer with a constant at runtime

    • keep the weights in the network at a similar scale during training
  • Scale the weights at runtime

    • $\hat{w}_i = \frac{w_i}{c}$
      • where $w_i$ are the weights and $c$ is the per-layer normalization constant from He's initializer (the weights themselves are initialized with torch.nn.init.normal_).
  • In implementation, it is done by multiplying the weights by the scale at runtime, as illustrated below.
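As a concrete (illustrative) instance of the scale used in the code below, which computes it as $\sqrt{\text{gain}/\text{in\_channels}}$ with gain $= 2$: a layer with 512 input channels multiplies every weight by

$\text{scale} = \sqrt{2 / 512} = 0.0625$

on every forward pass.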

An example of modifying nn.ConvTranspose2d to have Equalized Learning Rate:

class WSConvTranspose2d(nn.ConvTranspose2d):
    """
    Weight-scaled ConvTranspose2d (Equalized Learning Rate)
    Done by multiplying the weights at runtime.
    https://github.com/akanimax/pro_gan_pytorch/blob/master/pro_gan_pytorch/custom_layers.py
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0,
                 output_padding=0, groups=1, bias=True, dilation=1,
                 padding_mode='zeros', gain=2) -> None:
        super().__init__(in_channels, out_channels, kernel_size, stride, padding,
                         output_padding, groups, bias, dilation, padding_mode)
        torch.nn.init.normal_(self.weight)
        if bias:
            torch.nn.init.zeros_(self.bias)
        # define the scale for the weights
        self.scale = (gain / in_channels) ** 0.5

    def forward(self, x):
        return torch.conv_transpose2d(
            input=x,
            weight=self.weight * self.scale,  # scale the weights at runtime
            bias=self.bias,
            stride=self.stride,
            padding=self.padding,
            output_padding=self.output_padding,
            groups=self.groups,
            dilation=self.dilation,
        )
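A quick usage check of the layer above (a sketch, assuming the WSConvTranspose2d class is in scope): because the raw weights are drawn from N(0, 1) and multiplied by the scale at runtime, the effective weights have standard deviation roughly sqrt(gain / in_channels) regardless of layer size.

import torch

layer = WSConvTranspose2d(in_channels=512, out_channels=256, kernel_size=4, stride=2, padding=1)
x = torch.randn(16, 512, 8, 8)
y = layer(x)

print(y.shape)                              # torch.Size([16, 256, 16, 16])
print(layer.scale)                          # sqrt(2/512) = 0.0625
print((layer.weight * layer.scale).std())   # ~0.0625, since the raw weights are ~N(0, 1)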

  • Someone reported that ProGAN cannot be trained without using Equalized Learning Rate.
    • Training directly with kaiming_normal_ initialization (and no runtime scaling) reportedly results in failed training
  • Equalized Learning Rate matters because modern optimizers such as RMSProp and Adam normalize each gradient update by its estimated standard deviation, making the update independent of the parameter's scale. This becomes problematic when some weights have a much larger dynamic range than others, since those weights then take longer to adjust (see the quote below).

The benefit of doing this dynamically (Equalized Learning Rate) instead of during initialization is somewhat subtle, and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time.

Supp. Information

  • Used WGAN-GP instead of LSGAN

We find that LSGAN is generally a less stable loss function than WGAN-GP, and it also has a tendency to lose some of the variation towards the end of long runs. Thus we prefer WGAN-GP, but have also produced high-resolution images by building on top of LSGAN

  • Adam Optimizer with alpha = 0.001, beta1 = 0, beta2 = 0.99
  • Batch size 16 for resolutions 4x4 to 128x128, then gradually reduced at higher resolutions to fit in memory:
    • Res: 4, 8, 16, 32, 64, 128, 256, 512, 1024
    • Batch: [16, 16, 16, 16, 16, 16, 14, 6, 3]
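For reference, these hyperparameters could be written down as a small config sketch (the dictionary layout and function name here are illustrative, not taken from the paper's code):

import torch

# Hypothetical config sketch of the hyperparameters listed above.
BATCH_SIZES = {4: 16, 8: 16, 16: 16, 32: 16, 64: 16, 128: 16, 256: 14, 512: 6, 1024: 3}

def make_optimizers(generator, discriminator):
    # Adam with alpha (learning rate) = 0.001, beta1 = 0, beta2 = 0.99
    opt_g = torch.optim.Adam(generator.parameters(), lr=0.001, betas=(0.0, 0.99))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=0.001, betas=(0.0, 0.99))
    return opt_g, opt_d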