ProGAN

Paper: Progressive Growing of GANs for Improved Quality, Stability, and Variation

Key features of ProGAN:

  • Train Generator and Discriminator Progressively
    • From 4x4 to 1024x1024 (How?)
    • Generator and Discriminator are Symmetrical
  • Minibatch std on Discriminator
  • Normalization with PixelNorm
  • Equalized Learning Rate
  • Fading in Layers

The central idea of ProGAN is to train both the generator and discriminator at gradually increasing resolutions, giving the network the ability to learn low-level structure first and finer details later. For example, in anime face generation, the generator can learn to place the eyes in the right place and make them circular before having to learn to paint the iris and eyelashes.


The idea of progressive training was actually first proposed in LAPGAN (Laplacian Pyramid GAN, 2015).

  • Multiple generators and discriminators were used
  • But in ProGAN, only 1 Generator and 1 Discriminator are required.

Progressive growing (of model and layers)

  • start with low-resolution images, and then progressively increase the resolution by adding layers to the networks
    • This incremental nature allows the training to first discover large-scale structure of the image distribution and then shift attention to increasingly finer scale detail, instead of having to learn all scales simultaneously.
  • Generator and Discriminator are symmetrical, such that they are mirror images of each other and always grow in synchrony


How progressive?

  • Start by training the model for a number of epochs (let's say 30) at 4x4 resolution
  • Then add a new layer to both the generator and discriminator, and train for another number of epochs (again, say 30) at 8x8 resolution
  • Repeat the process until reaching 1024x1024 (see the sketch below)
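As a rough illustration, this schedule is just a loop over resolutions. The following is only a minimal sketch, not the paper's training code: `generator`, `discriminator`, their `grow()` methods, and `train_one_epoch` are hypothetical placeholders for however the model-growing and training steps are actually implemented.

# Hypothetical sketch of the progressive schedule (not the official implementation).
RESOLUTIONS = [4, 8, 16, 32, 64, 128, 256, 512, 1024]
EPOCHS_PER_RESOLUTION = 30  # "let's say 30"

for step, resolution in enumerate(RESOLUTIONS):
    if step > 0:
        # Add one block to each network; generator and discriminator always grow in synchrony.
        generator.grow()        # hypothetical method
        discriminator.grow()    # hypothetical method
    for epoch in range(EPOCHS_PER_RESOLUTION):
        train_one_epoch(generator, discriminator, resolution)  # hypothetical helper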


Why?

  • The generation of smaller images is substantially more stable because there is less class information and fewer modes

  • By increasing the resolution little by little we are continuously asking a much simpler question compared to the end goal of discovering a mapping from latent vectors to e.g. 1024^2 images.

  • Reduced training time. With progressively growing GANs most of the iterations are done at lower resolutions, and comparable result quality is often obtained up to 2–6 times faster, depending on the final output resolution.

  • the complex mapping from latents to high-resolution images is easier to learn in steps

Fading in Layers

  • When new layers are added to the networks, we fade them in smoothly
    • This avoids sudden shocks to the already well-trained, smaller-resolution layers


  • 2x is done by bilinear upsampling, while 0.5x is done by average pooling with kernel size = 2 and stride = 2 (the output has 75% fewer pixels: half in both width and height).
  • toRGB and fromRGB are realized using 1x1 conv layers (see the sketch after this list).
    • toRGB in the Generator maps a feature map from channel = in_channel to channel = 3.
    • fromRGB in the Discriminator maps an image from channel = 3 to channel = in_channel.
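A minimal sketch of these building blocks in PyTorch, assuming plain nn.Conv2d layers (the actual ProGAN layers would use the weight-scaled convolutions described later):

import torch
import torch.nn as nn
import torch.nn.functional as F

# toRGB / fromRGB as 1x1 convolutions (sketch; plain Conv2d assumed here)
to_rgb = nn.Conv2d(in_channels=512, out_channels=3, kernel_size=1)    # features -> RGB
from_rgb = nn.Conv2d(in_channels=3, out_channels=512, kernel_size=1)  # RGB -> features

x = torch.randn(16, 512, 8, 8)   # N x C x H x W feature map
img = to_rgb(x)                  # -> 16 x 3 x 8 x 8

# 2x: bilinear upsampling; 0.5x: average pooling with kernel 2, stride 2
img_up = F.interpolate(img, scale_factor=2, mode='bilinear', align_corners=False)  # 16 x 3 x 16 x 16
img_down = F.avg_pool2d(img_up, kernel_size=2, stride=2)                            # 16 x 3 x 8 x 8
feat = from_rgb(img_down)                                                           # 16 x 512 x 8 x 8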

The fade-in operation is easy to implement.

# From the view of the generator
def fade_in(self, alpha, upscaled_img, generated_img):
    # alpha should be a scalar in [0, 1], and the images have the same shape
    return alpha * generated_img + (1 - alpha) * upscaled_img

# From the view of the discriminator
def fade_in(self, alpha, downscaled_img, out):
    # alpha should be a scalar in [0, 1], and the tensors have the same shape
    return alpha * out + (1 - alpha) * downscaled_img
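During the fade-in phase, alpha is ramped from 0 to 1 while the new layer is blended in. A linear ramp over the number of images seen is a common choice; the exact schedule below is an assumption for illustration, not something taken from the paper.

# Hypothetical linear fade-in schedule: alpha grows from 0 to 1 over the fade-in phase.
def get_alpha(images_seen, fade_in_images):
    return min(1.0, images_seen / fade_in_images)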

Minibatch std on Discriminator

GANs have a tendency to capture only a subset of the variation found in training data, and Salimans et al. (2016) suggest “minibatch discrimination” as a solution.

  • compute feature statistics not only from individual images but also across the minibatch
    • encouraging the minibatches of generated and training images to show similar statistics
    • A separate set of statistics is produced for each example in a minibatch and it is concatenated to the layer’s output, so that the discriminator can use the statistics internally
def minibatch_std(self, x):
    # 1. Compute the std of each feature over the minibatch: N x C x H x W -> C x H x W
    batch_std = torch.std(x, dim=0)
    # 2. Average over all features and spatial locations -> a single scalar
    batch_mean = batch_std.mean()
    # 3. Expand the scalar to an extra feature map: N x 1 x H x W
    batch_stat = batch_mean.repeat(x.shape[0], 1, x.shape[2], x.shape[3])
    # 4. Concatenate it with the input: N x C x H x W -> N x (C+1) x H x W (e.g. 512 -> 513 channels)
    return torch.cat([x, batch_stat], dim=1)
  • In the original minibatch discrimination, a layer added towards the end of the discriminator learns a large tensor that projects the input activations to an array of statistics; ProGAN replaces this with the parameter-free layer above, inserted towards the end of the discriminator at every resolution.

Our simplified solution has neither learnable parameters nor new hyperparameters. We first compute the standard deviation for each feature in each spatial location over the minibatch. We then average these estimates over all features and spatial locations to arrive at a single value. We replicate the value and concatenate it to all spatial locations and over the minibatch, yielding one additional (constant) feature map. This layer could be inserted anywhere in the discriminator, but we have found it best to insert it towards the end.
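A quick, self-contained shape check that replays the steps above on a random minibatch:

import torch

x = torch.randn(16, 512, 4, 4)                        # a minibatch of discriminator features
stat = torch.std(x, dim=0).mean()                     # a single scalar, as in the layer above
stat_map = stat.repeat(x.shape[0], 1, x.shape[2], x.shape[3])
out = torch.cat([x, stat_map], dim=1)
print(out.shape)                                      # torch.Size([16, 513, 4, 4])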

Normalization with PixelNorm on Generator

Pixelwise Feature Vector Normalization In Generator

  • Normalize the feature vector in each pixel to unit length in the generator after each convolutional layer

    • a variant of local response normalization (as used in AlexNet)
  • Avoid escalation of signal magnitudes

    • Escalation of signal magnitudes happens when the discriminator gets too good too soon, causing the generator's activations to spike as it tries to catch up.
    • Escalation of signal magnitudes will cause unhealthy competition between generator and discriminator, causing e.g. Mode collapse
  • It can be formulated as the equation below, where $\epsilon = 10^{-8}$:

$b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N}\sum_{j=0}^{N-1}\left(a^j_{x,y}\right)^2 + \epsilon}}$

class PixelNorm(nn.Module):
    def __init__(self, epsilon=1e-8) -> None:
        super().__init__()
        self.epsilon = epsilon

    def forward(self, x):
        return x / torch.sqrt(torch.mean(x**2, dim=1, keepdim=True) + self.epsilon)

To disallow the scenario where the magnitudes in the generator and discriminator spiral out of control as a result of competition, we normalize the feature vector in each pixel to unit length in the generator after each convolutional layer.

We find it surprising that this heavy-handed constraint does not seem to harm the generator in any way, and indeed with most datasets it does not change the results much, but it prevents the escalation of signal magnitudes very effectively when needed.
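For illustration, PixelNorm is placed after each convolution in the generator. Below is a minimal sketch of one generator conv block, assuming a plain nn.Conv2d and LeakyReLU (the paper's implementation uses the weight-scaled convolutions described next), with the PixelNorm class above in scope:

import torch
import torch.nn as nn

# Sketch of one generator block: conv -> LeakyReLU -> PixelNorm (assumed layout)
block = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, padding=1),
    nn.LeakyReLU(0.2),
    PixelNorm(),  # assumes the PixelNorm class defined above is in scope
)

x = torch.randn(16, 512, 8, 8)
y = block(x)
# Each pixel's feature vector now has (approximately) unit RMS over channels:
print(torch.mean(y**2, dim=1)[0, 0, 0])  # ~1.0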

Equalized Learning Rate in learnable layers

  • The idea behind equalized learning rate is to scale the weights at each layer with a constant at runtime

    • keep the weights in the network at a similar scale during training
  • Scale the weights at runtime

    • $\hat{w}_i = \frac{w_i}{c}$
      • where $w_i$ are the weights and $c$ is the per-layer normalization constant from He's initializer (the weights themselves are initialized with torch.nn.init.normal_).
  • In implementation, it is done by multiplying the weights by the scale at runtime, as illustrated below.
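As a concrete (illustrative) instance of the scale used in the code below, which computes it as $\sqrt{\text{gain}/\text{in\_channels}}$ with gain $= 2$: a layer with 512 input channels multiplies every weight by

$\text{scale} = \sqrt{2 / 512} = 0.0625$

on every forward pass.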

An example of modifying nn.ConvTranspose2d to have Equalized Learning Rate:

class WSConvTranspose2d(nn.ConvTranspose2d):
    """
    Weight-scaled ConvTranspose2d (Equalized Learning Rate)
    Done by multiplying the weights at runtime.
    https://github.com/akanimax/pro_gan_pytorch/blob/master/pro_gan_pytorch/custom_layers.py
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0,
                 output_padding=0, groups=1, bias=True, dilation=1,
                 padding_mode='zeros', gain=2) -> None:
        super().__init__(in_channels, out_channels, kernel_size, stride, padding,
                         output_padding, groups, bias, dilation, padding_mode)
        torch.nn.init.normal_(self.weight)
        if bias:
            torch.nn.init.zeros_(self.bias)
        # define the scale for the weights
        self.scale = (gain / in_channels) ** 0.5

    def forward(self, x):
        return torch.conv_transpose2d(
            input=x,
            weight=self.weight * self.scale,  # scale the weights at runtime
            bias=self.bias,
            stride=self.stride,
            padding=self.padding,
            output_padding=self.output_padding,
            groups=self.groups,
            dilation=self.dilation,
        )
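A quick usage check of the layer above (a sketch, assuming the WSConvTranspose2d class is in scope): because the raw weights are drawn from N(0, 1) and multiplied by the scale at runtime, the effective weights have standard deviation roughly sqrt(gain / in_channels) regardless of layer size.

import torch

layer = WSConvTranspose2d(in_channels=512, out_channels=256, kernel_size=4, stride=2, padding=1)
x = torch.randn(16, 512, 8, 8)
y = layer(x)

print(y.shape)                              # torch.Size([16, 256, 16, 16])
print(layer.scale)                          # sqrt(2/512) = 0.0625
print((layer.weight * layer.scale).std())   # ~0.0625, since the raw weights are ~N(0, 1)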

  • Someone reported that ProGAN cannot be trained without using Equalized Learning Rate.
    • Training directly with kaiming_normal_ initialization (and no runtime scaling) reportedly results in failed training
  • Equalized Learning Rate matters because modern optimizers such as RMSProp and Adam normalize each gradient update by its estimated standard deviation, making the update independent of the parameter's scale. This becomes problematic when some weights have a much larger dynamic range than others, since those weights then take longer to adjust (see the quote below).

The benefit of doing this dynamically (Equalized Learning Rate) instead of during initialization is somewhat subtle, and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time.

Supp. Information

  • Used WGAN-GP instead of LSGAN

We find that LSGAN is generally a less stable loss function than WGAN-GP, and it also has a tendency to lose some of the variation towards the end of long runs. Thus we prefer WGAN-GP, but have also produced high-resolution images by building on top of LSGAN

  • Adam Optimizer with alpha = 0.001, beta1 = 0, beta2 = 0.99
  • Batch size 16 for resolutions 4x4 to 128x128, then gradually reduced at higher resolutions to fit in memory:
    • Res: 4, 8, 16, 32, 64, 128, 256, 512, 1024
    • Batch: [16, 16, 16, 16, 16, 16, 14, 6, 3]
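For reference, these hyperparameters could be written down as a small config sketch (the dictionary layout and function name here are illustrative, not taken from the paper's code):

import torch

# Hypothetical config sketch of the hyperparameters listed above.
BATCH_SIZES = {4: 16, 8: 16, 16: 16, 32: 16, 64: 16, 128: 16, 256: 14, 512: 6, 1024: 3}

def make_optimizers(generator, discriminator):
    # Adam with alpha (learning rate) = 0.001, beta1 = 0, beta2 = 0.99
    opt_g = torch.optim.Adam(generator.parameters(), lr=0.001, betas=(0.0, 0.99))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=0.001, betas=(0.0, 0.99))
    return opt_g, opt_d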