Paper Review - CartoonGAN
Paper: CartoonGAN: Generative Adversarial Networks for Photo Cartoonization (CVPR2018)
Official Github (Lua Torch Version): https://github.com/FlyingGoblin/CartoonGAN
Github (PyTorch Version): https://github.com/znxlwm/pytorch-CartoonGAN
Github (Tensorflow Version): https://github.com/FilipAndersson245/cartoon-gan
Github (My PyTorch implementation): https://github.com/vinesmsuic/CartoonGAN-PyTorch
“From the perspective of computer vision algorithms, the goal of cartoon stylization is to map images in the photo manifold into the cartoon manifold while keeping the content unchanged.”
The paper, as its name suggests, performs Image Cartoonization. The paper points out two properties of cartoons:
- (1) Cartoon styles have unique characteristics, with a high level of simplification and abstraction.
  - Cartoon images are highly simplified and abstracted from real-world photos. Cartoonization is not equivalent to applying textures such as brush strokes, as in many other styles.
- (2) Cartoon images tend to have clear edges, smooth color shading and relatively simple textures, which pose significant challenges for the texture-descriptor-based loss functions used in existing methods.
Key features of CartoonGAN:
- Requires only unpaired images for training
- Produces higher-quality cartoon stylization than CycleGAN and NST
  - In terms of both content preservation and style creation
- Less training time than CycleGAN, because CartoonGAN uses only 1 generator and 1 discriminator
- A different adversarial loss, due to the involvement of an edge-smoothed dataset
- A new initialization phase to improve the convergence of the network (Pre-train the generator network with only content loss)
Loss functions of CartoonGAN
Edge-promoting Adversarial loss of CartoonGAN
The paper found that training the discriminator only to separate true cartoon images from generated cartoon images is not sufficient.
“we observe that simply training the discriminator to separate generated and true cartoon images is not sufficient for transforming photos to cartoons. This is because the presentation of clear edges is an important characteristic of cartoon images, but the proportion of these edges is usually very small in the whole image. Therefore, an output image without clearly reproduced edges but with correct shading is likely to confuse the discriminator trained with a standard loss.”
Since cartoon images have clear edges, the discriminator has to focus on the edges and be able to classify a fake cartoon without clear edges as fake (even when its shading is correct). The generator can then be guided to convert the input into the correct manifold. Thus the paper proposes creating an edge-smoothed version of the original cartoon dataset as guidance. The edge-smoothed dataset is obtained by applying:
- (1) detect edge pixels using a standard Canny edge detector
- (2) dilate the edge regions
- (3) apply a Gaussian smoothing in the dilated edge regions
Here is a sketch of the edge-smoothing implementation (the kernel size and Canny thresholds here are illustrative choices, not values fixed by the paper):

```python
import cv2
import numpy as np

def edge_smooth(bgr, kernel_size=5, canny_low=100, canny_high=200):
    """Return an edge-smoothed copy of a BGR cartoon image."""
    # (1) detect edge pixels using a standard Canny edge detector
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, canny_low, canny_high)
    # (2) dilate the edge regions
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(edges, kernel)
    # (3) apply Gaussian smoothing in the dilated edge regions
    blurred = cv2.GaussianBlur(bgr, (kernel_size, kernel_size), 0)
    result = bgr.copy()
    result[dilated != 0] = blurred[dilated != 0]
    return result
```
Comparison:
This new dataset can be used to help the discriminator learn.
The goal of training the discriminator is to maximize the probability of assigning the correct label to three kinds of images: fake generated cartoon images, the edge-smoothed (without clear edges) versions of cartoon images, and real cartoon images. The generator can thus be guided to convert the input into the correct manifold.
Therefore the edge-promoting adversarial loss is formulated as:

$$\mathcal{L}_{adv}(G, D) = \mathbb{E}_{c_i \sim S_{data}(c)}\left[\log D(c_i)\right] + \mathbb{E}_{e_j \sim S_{data}(e)}\left[\log\left(1 - D(e_j)\right)\right] + \mathbb{E}_{p_k \sim S_{data}(p)}\left[\log\left(1 - D(G(p_k))\right)\right]$$

Where:
- $c_i$ is a real cartoon image.
- $e_j$ is an edge-smoothed cartoon image.
- $p_k$ is a photo.
- $G(p_k)$ is a fake cartoon image generated from a photo input.
In the implementations, the adversarial loss is computed with the LSGAN loss (MSE).
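As a concrete illustration, the LSGAN-style losses can be sketched in PyTorch as follows (the function names are my own; the three discriminator terms mirror the three terms of the edge-promoting adversarial loss):

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def discriminator_loss(d_real_cartoon, d_edge_smoothed, d_fake_cartoon):
    # Real cartoons should be classified as real (label 1);
    # edge-smoothed cartoons and generated cartoons as fake (label 0).
    return (mse(d_real_cartoon, torch.ones_like(d_real_cartoon))
            + mse(d_edge_smoothed, torch.zeros_like(d_edge_smoothed))
            + mse(d_fake_cartoon, torch.zeros_like(d_fake_cartoon)))

def generator_adv_loss(d_fake_cartoon):
    # The generator tries to make generated cartoons look real to D.
    return mse(d_fake_cartoon, torch.ones_like(d_fake_cartoon))
```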
Content loss of CartoonGAN
The content loss is introduced to ensure that the resulting images retain the semantic content of the input.
CartoonGAN uses high-level feature maps from a VGG network pre-trained on ImageNet, which preserves the content of objects.
$$\mathcal{L}_{con}(G, D) = \mathbb{E}_{p_i \sim S_{data}(p)}\left[\left\lVert VGG_l(G(p_i)) - VGG_l(p_i) \right\rVert_1\right]$$

Where:
- $VGG_l(\cdot)$ refers to the feature maps of a specific VGG layer $l$.
- $p_i$ is a photo.
- $G(p_i)$ is a fake cartoon image generated from a photo input.
- The paper used the feature maps of the `conv4_4` layer of the VGG network.
- The $\ell_1$ sparse regularization is used here; the paper states it copes with the changes between cartoons and photos much better than the standard $\ell_2$ norm.
Total Objective function of CartoonGAN
$$\mathcal{L}(G, D) = \mathcal{L}_{adv}(G, D) + \omega \mathcal{L}_{con}(G, D)$$

where the paper set $\omega = 10$.
Initialization Phase of CartoonGAN
CartoonGAN proposed a new initialization phase to improve the convergence of the network.
The GAN model is highly nonlinear; with random initialization, the optimization can easily be trapped at a suboptimal local minimum.
The new initialization phase is done by:
- Pre-train the generator network with only the content loss for $N$ epochs (the paper used $N = 10$), letting the generator only reconstruct the content of input images.
According to the paper, this initialization phase helps CartoonGAN converge fast to a good configuration, without premature convergence.
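The initialization phase can be sketched as a plain reconstruction loop. To keep the example self-contained, `TinyGenerator` is a stand-in for the full CartoonGAN generator, and a pixel-space L1 loss stands in for the VGG-feature content loss:

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    # Stand-in for the full CartoonGAN generator (illustrative only).
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x):
        return self.net(x)

def pretrain(generator, photos, epochs=10, lr=2e-4):
    # Pre-train with ONLY the content loss: the generator just learns
    # to reconstruct its input photos before adversarial training begins.
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    l1 = nn.L1Loss()
    for _ in range(epochs):
        for photo in photos:
            loss = l1(generator(photo), photo)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return generator
```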
Architecture of CartoonGAN
Architecture of Generator and Discriminator in CartoonGAN
Refer to figure 2 of the Paper.
Key point:
- Generator: an encoder-decoder-like network with 8 residual blocks
  - Batch Norm, ReLU
- Discriminator: a PatchGAN
  - Batch Norm, LeakyReLU
Generator Implementation
A PyTorch sketch of the generator described in Figure 2 (the `Tanh` output activation is a common choice in public implementations and an assumption here):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, 1, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, 1, 1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # elementwise-sum skip connection

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            # flat convolution stage: k7n64s1
            nn.Conv2d(3, 64, 7, 1, 3),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            # down-convolution: k3n128s2 + k3n128s1
            nn.Conv2d(64, 128, 3, 2, 1),
            nn.Conv2d(128, 128, 3, 1, 1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            # down-convolution: k3n256s2 + k3n256s1
            nn.Conv2d(128, 256, 3, 2, 1),
            nn.Conv2d(256, 256, 3, 1, 1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            # 8 residual blocks
            *[ResidualBlock(256) for _ in range(8)],
            # up-convolution back to half resolution
            nn.ConvTranspose2d(256, 128, 3, 2, 1, output_padding=1),
            nn.Conv2d(128, 128, 3, 1, 1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            # up-convolution back to full resolution
            nn.ConvTranspose2d(128, 64, 3, 2, 1, output_padding=1),
            nn.Conv2d(64, 64, 3, 1, 1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            # final convolution: k7n3s1
            nn.Conv2d(64, 3, 7, 1, 3),
            nn.Tanh(),
        )

    def forward(self, x):
        return self.model(x)
```
Discriminator Implementation
A PyTorch sketch of the patch-level discriminator (layer sizes follow Figure 2; details such as the LeakyReLU slope of 0.2 are common choices and assumptions of this sketch):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    # Patch-level discriminator: outputs a map of per-patch real/fake scores.
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            # flat convolution: k3n32s1
            nn.Conv2d(3, 32, 3, 1, 1),
            nn.LeakyReLU(0.2, inplace=True),
            # strided block: k3n64s2 + k3n128s1
            nn.Conv2d(32, 64, 3, 2, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 3, 1, 1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # strided block: k3n128s2 + k3n256s1
            nn.Conv2d(128, 128, 3, 2, 1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 3, 1, 1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # feature construction: k3n256s1
            nn.Conv2d(256, 256, 3, 1, 1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # per-patch real/fake output: k3n1s1
            nn.Conv2d(256, 1, 3, 1, 1),
        )

    def forward(self, x):
        return self.model(x)
```
Supplemental Information
Sensitivity of Parameters
Feature maps from VGG
- The author mentioned in the supplementary material that using either the `conv3_2` or the `conv4_4` layer produces visually similar results.
The weighting factor $\omega$ of the content loss
- The idea is that:
  - With more weight on the content loss, the style is harder to apply to the images.
  - With less weight on the content loss, the style is applied but less content is preserved.
  - If the content loss weight is too small, the model fails to preserve the content and training fails.
  - If the content loss weight is too large, the model fails to apply the style and the generator simply returns the input as output.
Limitations of CartoonGAN
- The outputs contain a lot of obvious “artifacts”
- Generates low-resolution outputs
- People report checkerboard effects
  - I believe this problem happens with some datasets; with other datasets the checkerboard effects are reduced
  - A possibly relevant post: How to avoid checkerboard pattern in your generated images?
- Many people claimed they could not reproduce the results
  - I believe this problem comes from the weighting factor of the content loss.
- The author mentioned in the supplementary material that cartoonization results on dark images are not very recognizable, mainly because the input images are of low contrast, especially in their background.