Paper Review - Pix2Pix, CycleGAN
Conditional-GAN
The objective of a conditional GAN:

$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$$

where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it, i.e.

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D)$$
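As a concrete reading of this objective, here is a minimal PyTorch sketch (the conditional models `D(x, y)` and `G(x, z)` are assumed placeholders, not the paper's networks):

```python
import torch
import torch.nn.functional as F

def cgan_losses(D, G, x, y, z):
    """Conditional GAN losses: D scores the input x together with either
    a real target y or a generated G(x, z)."""
    fake = G(x, z)
    # Discriminator: maximize log D(x, y) + log(1 - D(x, G(x, z)))
    d_real = D(x, y)              # logits for (input, real target)
    d_fake = D(x, fake.detach())  # detach so G gets no gradient here
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    # Generator: maximize log D(x, G(x, z)) (non-saturating form)
    g_logits = D(x, fake)
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss
```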
Pix2Pix
Paper: Image-to-Image Translation with Conditional Adversarial Networks (CVPR 2017)
Official Github: https://github.com/phillipi/pix2pix
Key features of Pix2Pix:
- Requires paired images for training
Loss function of Pix2Pix
- Learns not only the mapping from input image to output image, but also a loss function to train this mapping.
- The loss function is not hand-engineered
If we take a naïve approach and ask the CNN to minimize the Euclidean distance between predicted and ground truth pixels, it will tend to produce blurry results. This is because Euclidean distance is minimized by averaging all plausible outputs, which causes blurring.
It would be highly desirable if we could instead specify only a high-level goal, like “make the output indistinguishable from reality”, and then automatically learn a loss function appropriate for satisfying this goal.
The objective of Pix2Pix:

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G)$$

where

$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\|y - G(x, z)\|_1]$$
Previous approaches have found it beneficial to mix the GAN objective with a more traditional loss, such as L2 distance. The discriminator’s job remains unchanged, but the generator is tasked to not only fool the discriminator but also to be near the ground truth output in an L2 sense. The paper found that using L1 distance encourages less blurring.
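A minimal sketch of this mixed generator loss in PyTorch, assuming a discriminator `D(x, fake)` that returns logits and the paper's $\lambda = 100$:

```python
import torch
import torch.nn.functional as F

LAMBDA_L1 = 100.0  # weight used in the paper

def pix2pix_g_loss(D, x, fake, y):
    """Generator loss: fool the discriminator AND stay close to the
    ground truth in an L1 sense."""
    logits = D(x, fake)
    gan_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    l1_loss = F.l1_loss(fake, y)
    return gan_loss + LAMBDA_L1 * l1_loss
```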
Architecture of Pix2Pix
Both generator and discriminator use modules of the form Convolution-BatchNorm-ReLU (LeakyReLU with slope 0.2 in the encoder and discriminator, plain ReLU in the decoder).
Generator of Pix2Pix
The generator is a variant of U-Net.
- The U-Net is an encoder-decoder with skip connections between mirrored layers in the encoder and decoder stacks.
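To illustrate the skip connections, here is a toy two-level encoder-decoder sketch (not the paper's exact architecture, which has 8 down/up-sampling levels):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy 2-level U-Net to illustrate skip connections; the real
    Pix2Pix generator is much deeper."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 64, 4, stride=2, padding=1)    # H -> H/2
        self.enc2 = nn.Conv2d(64, 128, 4, stride=2, padding=1)  # H/2 -> H/4
        self.dec2 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)
        # input channels doubled: upsampled features + skip from enc1
        self.dec1 = nn.ConvTranspose2d(64 + 64, 3, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))
        e2 = torch.relu(self.enc2(e1))
        d2 = torch.relu(self.dec2(e2))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))  # skip connection
        return torch.tanh(d1)
```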
Discriminator of Pix2Pix
The Discriminator model is PatchGAN.
- The PatchGAN only penalizes structure at the scale of patches. This discriminator tries to classify whether each $N \times N$ patch in an image is real or fake. The discriminator is run convolutionally across the image, averaging all responses to provide the ultimate output of $D$.
- $N$ can be much smaller than the full size of the image and still produce high-quality results. This is advantageous because a smaller PatchGAN has fewer parameters, runs faster, and can be applied to arbitrarily large images.
- The paper found that a $70 \times 70$ PatchGAN gives the best results, while lower values generate artifacts.
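A minimal PatchGAN-style discriminator sketch (the channel and layer counts are illustrative, not the exact $70 \times 70$ configuration):

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator: outputs one logit per patch
    instead of a single scalar for the whole image."""
    def __init__(self, in_ch=6):  # input image + target image concatenated
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=1, padding=1),  # patch logits
        )

    def forward(self, x, y):
        # Output shape (N, 1, H', W'): one response per patch; average
        # (or apply BCE per patch) to get the final discriminator output.
        return self.net(torch.cat([x, y], dim=1))
```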
Training Details of Pix2Pix
- Alternate between one gradient descent step on $D$, then one step on $G$. $G$ is trained to maximize $\log D(x, G(x, z))$, as suggested in the original GAN paper
- The objective is divided by 2 while optimizing $D$, which slows down the rate at which $D$ learns relative to $G$
- Uses Adam optimizer (learning rate = 0.0002, beta1 = 0.5, beta2 = 0.999)
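Putting these details together, a sketch of one alternating training step (`G`, `D`, and `pix2pix_g_loss` are assumed from the earlier sketches):

```python
import torch
import torch.nn.functional as F

# Assumed: G, D are defined models; pix2pix_g_loss is the sketch above.
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))

def d_loss_fn(D, x, y, fake):
    real_logits, fake_logits = D(x, y), D(x, fake)
    return F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
         + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

def train_step(x, y):
    fake = G(x)
    # One step on D; the D objective is halved to slow D relative to G
    opt_D.zero_grad()
    (0.5 * d_loss_fn(D, x, y, fake.detach())).backward()
    opt_D.step()
    # One step on G (adversarial + L1, as sketched earlier)
    opt_G.zero_grad()
    pix2pix_g_loss(D, x, fake, y).backward()
    opt_G.step()
```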
CycleGAN
Paper: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (ICCV 2017)
Official Github: https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
Key features of CycleGAN:
- Requires unpaired images for training
- Uses 2 Generators and 2 Discriminators
- Learns both the direct mapping $G: X \rightarrow Y$ and the inverse mapping $F: Y \rightarrow X$
- Given any two unordered image collections $X$ and $Y$, the algorithm learns to automatically "translate" an image from one domain into the other and vice versa.
- A cycle-consistency loss is introduced to enforce $F(G(x)) \approx x$ (and vice versa, $G(F(y)) \approx y$).
For many tasks, paired training data is not available; this approach can learn to translate an image from a source domain to a target domain in the absence of paired examples.
Loss function of CycleGAN
Adversarial Loss of CycleGAN
Note that there is a second discriminator $D_X$ and a second, architecturally identical generator $F$ trained with the analogous objective for the inverse mapping.

The paper first states the adversarial loss part in the objective of CycleGAN as:

$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))]$$

which is the same equation the normal GAN uses. In implementation, it is a BCE-with-logits loss.
However, in a later part the paper says they used the loss from LSGAN for the adversarial loss.

Therefore the adversarial loss part in the objective of CycleGAN should be:

$$\mathcal{L}_{LSGAN}(D_Y) = \mathbb{E}_{y \sim p_{data}(y)}[(D_Y(y) - 1)^2] + \mathbb{E}_{x \sim p_{data}(x)}[(D_Y(G(x)) - 0)^2]$$

$$\mathcal{L}_{LSGAN}(G) = \mathbb{E}_{x \sim p_{data}(x)}[(D_Y(G(x)) - 1)^2]$$

where:
- $0$ is the label for fake samples
- $1$ is the label for real samples
- the $1$ in the generator term denotes the value that the Generator wants the Discriminator to believe for a fake sample
In implementation, LSGAN uses an MSE loss (without a sigmoid in the discriminator).
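A minimal sketch of these LSGAN losses as MSE against the 0/1 labels (the discriminator `D_Y` is assumed to output raw scores with no sigmoid):

```python
import torch
import torch.nn.functional as F

def lsgan_d_loss(D_Y, real_y, fake_y):
    """Discriminator: push scores for real samples toward 1 and fakes toward 0."""
    real_score = D_Y(real_y)
    fake_score = D_Y(fake_y.detach())  # no gradient into the generator here
    return F.mse_loss(real_score, torch.ones_like(real_score)) \
         + F.mse_loss(fake_score, torch.zeros_like(fake_score))

def lsgan_g_loss(D_Y, fake_y):
    """Generator: push scores for its fakes toward 1, the 'real' label."""
    fake_score = D_Y(fake_y)
    return F.mse_loss(fake_score, torch.ones_like(fake_score))
```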
Cycle-Consistency Loss of CycleGAN
Cycle-Consistency Loss learns an inverse mapping from the output domain back to the input and checks whether the input can be reconstructed:

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1]$$

In implementation, it is an L1 loss.
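A sketch of this loss, assuming generators `G` ($X \rightarrow Y$) and `F_inv` ($Y \rightarrow X$; renamed to avoid clashing with the usual `torch.nn.functional` alias `F`):

```python
import torch.nn.functional as F

def cycle_consistency_loss(G, F_inv, real_x, real_y):
    """Round-trip reconstruction error in both directions, measured with L1."""
    forward_cycle = F.l1_loss(F_inv(G(real_x)), real_x)   # x -> G(x) -> F(G(x)) ~ x
    backward_cycle = F.l1_loss(G(F_inv(real_y)), real_y)  # y -> F(y) -> G(F(y)) ~ y
    return forward_cycle + backward_cycle
```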
Full Objective of CycleGAN
Therefore the full objective of CycleGAN:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{GAN}(G, D_Y, X, Y) + \mathcal{L}_{GAN}(F, D_X, Y, X) + \lambda \mathcal{L}_{cyc}(G, F)$$

- where $\lambda = 10$, according to the paper
- $\mathcal{L}_{GAN}$ here is the adversarial loss; its exact functional form depends on which adversarial loss you use (vanilla GAN or LSGAN)
“I think LSGAN is a more stable loss compared to vanilla GAN. It has a better gradient property. You are free to use LSGAN in your task. Maybe you want to change --lambda_L1 to 10 or 25, as LSGAN’s GAN loss has a larger range compared to vanilla GANs.”
Identity Loss of CycleGAN
For photo generation from paintings, the Identity Loss encourages the mapping to preserve color composition between the input and output. In implementation, it is an L1 loss:

$$\mathcal{L}_{identity}(G, F) = \mathbb{E}_{y \sim p_{data}(y)}[\|G(y) - y\|_1] + \mathbb{E}_{x \sim p_{data}(x)}[\|F(x) - x\|_1]$$

Without $\mathcal{L}_{identity}$, the generators $G$ and $F$ are free to change the tint of input images when there is no need to.

So we do not need to use this loss when we don't care about the coloring.
We can set up our total generator loss with a formula like the following, so we can tweak the weights easily:

$$\mathcal{L}_{G} = \mathcal{L}_{GAN} + \lambda_{cycle} \mathcal{L}_{cyc} + \lambda_{identity} \mathcal{L}_{identity}$$
Here is an example of a Generator loss implementation (a minimal sketch reusing the LSGAN helpers above; `G`, `F_inv`, `D_X`, `D_Y` are the two generators and two discriminators):
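```python
import torch.nn.functional as F

lambda_cycle = 10.0     # cycle-consistency weight from the paper
lambda_identity = 0.5   # identity weight; set to 0 to disable the identity loss

def generator_loss(G, F_inv, D_X, D_Y, real_x, real_y):
    """Total generator loss: LSGAN + cycle-consistency + identity."""
    fake_y = G(real_x)
    fake_x = F_inv(real_y)
    # Adversarial terms (LSGAN): each generator tries to look real to its discriminator
    gan = lsgan_g_loss(D_Y, fake_y) + lsgan_g_loss(D_X, fake_x)
    # Cycle-consistency terms: round-trip reconstruction in both directions
    cyc = F.l1_loss(F_inv(fake_y), real_x) + F.l1_loss(G(fake_x), real_y)
    # Identity terms: feed a generator an image already in its output domain
    idt = F.l1_loss(G(real_y), real_y) + F.l1_loss(F_inv(real_x), real_x)
    return gan + lambda_cycle * cyc + lambda_identity * idt
```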
- Since Identity Loss is optional, we can set lambda_identity to 0 when identity loss is not used.
Architecture of CycleGAN
- There are two generators, $G: X \rightarrow Y$ and $F: Y \rightarrow X$
- There are two discriminators, $D_X$ and $D_Y$
- Generator networks are ResNets with 9 residual blocks (a U-Net will also give a good result)
- Discriminator networks are PatchGANs (same as Pix2Pix)
- InstanceNorm instead of BatchNorm everywhere
- ReLU is used only in the generator; the discriminators use LeakyReLU
- Reflection padding was used to reduce artifacts
Training Details of CycleGAN
- Replaced the negative log likelihood objective by a least-squares loss in $\mathcal{L}_{GAN}$
  - More stable during training and generates higher quality results
- Batch size of 1 (could be because 2 discriminators + 2 generators take more VRAM)
- Adam optimizer (learning rate = 0.0002)
- Keep the same learning rate for the first 100 epochs and linearly decay the rate to 0 over the next 100 epochs
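A sketch of that learning-rate schedule with PyTorch's `LambdaLR` (the optimizer setup is an assumption consistent with the details above):

```python
import torch

# Assumed: G and F_inv are the two generator modules defined elsewhere.
opt = torch.optim.Adam(
    list(G.parameters()) + list(F_inv.parameters()),
    lr=2e-4, betas=(0.5, 0.999),
)

def lr_lambda(epoch):
    # Flat for the first 100 epochs, then linear decay to 0 over the next 100.
    return 1.0 - max(0, epoch - 100) / 100.0

scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)
# Call scheduler.step() once per epoch after training on all batches.
```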
Limitations of CycleGAN
- The results are far from uniformly positive
  - On translation tasks that involve color and texture changes, the method often succeeds
  - On translation tasks that require geometric changes, the method has little success (e.g. dog → cat transfiguration). The learned translation degenerates into making minimal changes to the input. This failure might be caused by generator architectures that are tailored for good performance on appearance changes.
- CycleGAN is more memory-intensive than pix2pix, as it requires two generators and two discriminators.
- Simultaneously training two GAN models often converges slowly, resulting in a time-consuming training process.
Problem of Cycle-consistency
- Cycle-consistency assumes that the relationship between the two domains is a bijection, which is often too restrictive. Perfect reconstruction is difficult to achieve, especially when images from one domain have additional information compared to the other domain.
Problem of L1 Loss
- L1 loss is a per-pixel reconstruction metric. It does not reflect human perceptual preferences and can lead to blurry results (even though L1 produces much less blurry results than L2).