This paper, as its name suggests, performs image cartoonization. It also makes use of GANs and, in my opinion, performs better than CartoonGAN. The paper lists the properties of cartoons as:
(1) Global structures composed of sparse color blocks
(2) Details outlined by sharp and clear edges
(3) Flat and smooth surfaces
Although a black-box model can also perform cartoonization, its stylization quality and generality are neither optimal nor stable. The white-box model divides images into a surface representation, a structure representation, and a texture representation. Adjusting and balancing the weightings of the losses on these three representations produces different artistic styles of cartoonized images.
Key features of White-box Cartoonization:
Requires unpaired images for training
Produces high-quality cartoon stylization (compared to CartoonGAN)
Significantly fewer artifacts than CartoonGAN
Unlike previous black-box models that guide network training with loss terms, this model decomposes images into several representations, which enforces the network to learn different features with separate objectives, making the learning process controllable and tunable.
Identified three white-box representations from cartoon images:
surface representation to represent the smooth surface of images
structure representation to represent the sparse color blocks and flattened global content in the celluloid style
texture representation to represent high-frequency texture, contours, and details of images
Proposed a GAN framework with 1 generator G and 2 discriminators Ds and Dt
Ds aims to distinguish between surface representation extracted from model outputs and cartoons
Dt aims to distinguish between texture representation extracted from outputs and cartoons
Pre-train the generator network with only content loss (Same as CartoonGAN)
The 3 White-box Representations
The representations are extracted through traditional hand-crafted methods (non-network methods).
“The separately extracted cartoon representations enable the cartoonization problem to be optimized end-to-end within a Generative Neural Networks (GAN) framework, making it scalable and controllable for practical use cases and easy to meet diversified artistic demands with task-specific fine-tuning.”
Surface representation
Extract a weighted low-frequency component from an image.
Preserve color composition and surface texture
Ignore edges, textures and details
“This Surface representation design is inspired by the cartoon painting behavior where artists usually draw composition drafts before the details are retouched, and is used to achieve a flexible and learnable feature representation for smoothed surfaces.”
In implementation, it is done by guided filtering, which uses a differentiable guided filter to extract a smooth surface (textures and details removed).
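A minimal PyTorch sketch of such a guided filter, built from box (mean) filters, is shown below. The radius r and regularization eps are illustrative assumptions, not values stated in this post.

import torch
import torch.nn.functional as F

def box_filter(x, r):
    # Mean filter with window size (2r+1), reflection-padded so the
    # output keeps the input resolution.
    x = F.pad(x, (r, r, r, r), mode="reflect")
    return F.avg_pool2d(x, kernel_size=2 * r + 1, stride=1)

def guided_filter(I, p, r=1, eps=1e-2):
    # I: guide image, p: filtering input; both (N, C, H, W) in [0, 1].
    # r and eps are illustrative defaults, not the paper's settings.
    mean_I = box_filter(I, r)
    mean_p = box_filter(p, r)
    cov_Ip = box_filter(I * p, r) - mean_I * mean_p
    var_I  = box_filter(I * I, r) - mean_I * mean_I
    a = cov_Ip / (var_I + eps)      # per-pixel linear coefficients
    b = mean_p - a * mean_I
    mean_a = box_filter(a, r)
    mean_b = box_filter(b, r)
    return mean_a * I + mean_b      # edge-aware smoothed output

# Surface representation: filter the image with itself as the guide.
# surface = guided_filter(img, img, r=5, eps=2e-1)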
Structure representation
Apply an adaptive coloring algorithm on each segmented region to generate sparse visual effects.
Seize the global structural information
Sparse color blocks in celluloid cartoon style
“This Structure representation design is motivated to emulate the celluloid cartoon style, which is featured by clear boundaries and sparse color blocks.”
In implementation, it is done by super-pixel segmentation (Felzenszwalb's algorithm), followed by selective search to merge segmented regions and extract a sparse segmentation map.
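A simplified sketch of the segmentation step using skimage is shown below. It fills each region with its mean color, which is exactly the naive scheme the next paragraph criticizes, and it omits the selective-search merging; the scale, sigma and min_size parameters are illustrative assumptions.

import numpy as np
from skimage.segmentation import felzenszwalb

def structure_rep_naive(img, scale=100, sigma=0.8, min_size=50):
    # img: uint8 RGB array (H, W, 3). Parameters are illustrative.
    # Felzenszwalb superpixel segmentation; the paper additionally merges
    # regions with selective search, which this sketch omits.
    seg = felzenszwalb(img, scale=scale, sigma=sigma, min_size=min_size)
    out = np.zeros_like(img, dtype=np.float32)
    for label in np.unique(seg):
        mask = seg == label
        # Naive coloring: fill each region with its mean color.
        # The paper replaces this with the adaptive coloring described below.
        out[mask] = img[mask].mean(axis=0)
    return out.astype(np.uint8)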
The paper uses an adaptive coloring algorithm instead of a standard coloring scheme. Standard superpixel algorithms color each segmented region with the average of its pixel values, which the authors found lowers global contrast, darkens images, and causes a hazing effect on the final results. The adaptive coloring algorithm instead adjusts each region's color using two thresholds γ1, γ2 and an exponent μ (see the paper for the full formulation).
The paper found that the setting γ1=20, γ2=40 and μ=1.2 effectively enhances the contrast of images and reduces the hazing effect on their processed dataset.
Texture representation
Shift the color of the image to generate random intensity maps with luminance and color information removed
Retains high-frequency textures
Decreases the influence of color and luminance
"This Texture representation design is motivated by a cartoon painting method where artists firstly draw a line sketch with contours and details, and then apply color on it. It guides the network to learn the high-frequency textural details independently with the color and luminance patterns excluded.
In implementation, it is done by a random color shift. The paper proposed a random color shift algorithm Frcs to extract a single-channel texture representation from color images.
Frcs(Irgb)=(1−α)(β1×Ir+β2×Ig+β3×Ib)+α×Y
Where:
I is an image
r, g, b represent the red, green, and blue color channels
Y is a standard grayscale image converted from RGB color image I.
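A hedged PyTorch sketch of this random color shift is below. The value α = 0.8 and the U(−1, 1) sampling range for β1, β2, β3 are assumptions for illustration rather than values stated in this post.

import torch

def random_color_shift(img_rgb, alpha=0.8):
    # img_rgb: (N, 3, H, W) tensor in [0, 1].
    # alpha and the sampling range of the betas are assumptions.
    beta = torch.empty(3).uniform_(-1.0, 1.0)
    r, g, b = img_rgb[:, 0:1], img_rgb[:, 1:2], img_rgb[:, 2:3]
    # Standard RGB-to-grayscale luminance.
    Y = 0.299 * r + 0.587 * g + 0.114 * b
    mix = beta[0] * r + beta[1] * g + beta[2] * b
    # Single-channel texture representation Frcs(Irgb).
    return (1 - alpha) * mix + alpha * Y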
As mentioned, this paper proposes a GAN framework with one generator G and two discriminators Ds and Dt, where:
Generator G aims to convert an input photo into a cartoon image by
learning the information stored in the extracted surface representations
learning the clear contours and fine textures stored in the texture representations
Discriminator Ds aims to distinguish between surface representation extracted from model outputs and cartoons
Discriminator Dt aims to distinguish between texture representation extracted from outputs and cartoons
The framework also involves a pre-trained VGG network to extract high-level features and to impose spatial constraints on global contents, both between extracted structure representations and outputs and between input photos and outputs.
Here we denote:
Ds is the surface discriminator
Dt is the texture discriminator
Ic is a cartoon image
Ip is a photo image
G(Ip) is a fake cartoon image generated from photo image
Surface loss of White-box Cartoonization
Surface loss of White-box Cartoonization is used to guide the generator to learn the information stored in the extracted surface representations, with the help of the surface discriminator.
Fdgf(I, I) is the output of the differentiable guided filter mentioned above. The filter takes an image as input and uses the input itself as the guide map, returning the extracted surface representation (textures and details removed).
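The paper writes the surface loss as an adversarial objective over guided-filtered outputs and cartoons. The sketch below assumes the least-squares (LSGAN) form used in the author's released code (see the "No sigmoid for LSGAN adv loss" comment in the network code later in this post), and assumes G, Ds and the guided_filter sketch from the surface-representation section are defined.

import torch
import torch.nn.functional as F

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push real scores toward 1 and fake scores toward 0.
    return F.mse_loss(d_real, torch.ones_like(d_real)) + \
           F.mse_loss(d_fake, torch.zeros_like(d_fake))

def lsgan_g_loss(d_fake):
    # Generator: push the discriminator's scores on fakes toward 1.
    return F.mse_loss(d_fake, torch.ones_like(d_fake))

# Surface branch (sketch): feed guided-filtered images to Ds.
# fake = G(Ip)
# d_loss_surface = lsgan_d_loss(Ds(guided_filter(Ic, Ic)),
#                               Ds(guided_filter(fake.detach(), fake.detach())))
# g_loss_surface = lsgan_g_loss(Ds(guided_filter(fake, fake)))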
Structure loss of White-box Cartoonization
Structure loss of White-box Cartoonization is used to enforce a spatial constraint between the results and the extracted structure representations. This is done using the high-level features extracted by a pre-trained VGG16 network.
Lstructure =∥VGGn(G(Ip))−VGGn(Fst(G(Ip)))∥
Where:
Fst(I) is the extracted Structure representation. Output of (Felzenszwalb’s Algorithm + Selective Search).
l1 sparse regularization is used here
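A sketch of this loss with torchvision's VGG16 features is below. The feature-layer cutoff is an assumption, ImageNet input normalization is omitted for brevity, and Fst (Felzenszwalb + selective search) is assumed to be available; since it is not differentiable, it is applied to a detached copy of the output.

import torch
import torchvision

# Frozen VGG16 feature extractor; the cutoff index 26 is an assumption.
vgg = torchvision.models.vgg16(
    weights=torchvision.models.VGG16_Weights.DEFAULT
).features[:26].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def vgg_l1_loss(x, y):
    # L1 distance between VGG feature maps of x and y.
    return torch.nn.functional.l1_loss(vgg(x), vgg(y))

# Structure loss (sketch): compare the output with its structure representation.
# fake = G(Ip)
# structure_target = Fst(fake.detach())
# loss_structure = vgg_l1_loss(fake, structure_target)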
Texture loss of White-box Cartoonization
Texture loss of White-box Cartoonization is used to guide the generator to learn the clear contours and fine textures stored in the texture representations, with the help of the texture discriminator.
Frcs(I) is the output of the random color shift algorithm mentioned above.
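Assuming the lsgan_* helpers and random_color_shift sketched earlier are defined, the texture branch can be expressed the same way as the surface branch:

def texture_losses(G, Dt, Ip, Ic):
    # Texture branch (sketch): Dt compares the random-color-shift
    # representations of real cartoons and generated outputs.
    fake = G(Ip)
    t_fake = random_color_shift(fake)
    t_real = random_color_shift(Ic)
    d_loss = lsgan_d_loss(Dt(t_real), Dt(t_fake.detach()))
    g_loss = lsgan_g_loss(Dt(t_fake))
    return d_loss, g_loss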
Content loss of White-box Cartoonization
“The content loss is used to ensure that the cartoonized results and input photos are semantically invariant, and the sparsity of L1 norm allows for local features to be cartoonized. Similar to the structure loss, it is calculated on pre-trained VGG16 feature space.”
Lcontent =∥VGGn(G(Ip))−VGGn(Ip)∥
l1 sparse regularization is used here
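Reusing the vgg_l1_loss helper sketched under the structure loss, the content loss simply swaps the target for the input photo:

def content_loss(G, Ip):
    # L1 distance in VGG16 feature space between the output and the input photo.
    return vgg_l1_loss(G(Ip), Ip)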
Total-variation loss of White-box Cartoonization
Total-variation loss of White-box Cartoonization is used to impose spatial smoothness on generated images. It also reduces high-frequency noise such as salt-and-pepper noise.
Ltv = (1 / (H×W×C)) ∥∇x(G(Ip)) + ∇y(G(Ip))∥
Where:
H, W, C represent the image dimensions: height, width, and channels
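A direct PyTorch translation of this term (a sketch; the mean over all elements supplies the 1/(H×W×C) normalization in the equation above):

import torch

def tv_loss(x):
    # x: (N, C, H, W) generator output.
    dh = x[:, :, 1:, :] - x[:, :, :-1, :]   # vertical finite differences
    dw = x[:, :, :, 1:] - x[:, :, :, :-1]   # horizontal finite differences
    return dh.abs().mean() + dw.abs().mean()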
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

# PyTorch implementation by vinesmsuic
# Referenced from official tensorflow implementation:
# https://github.com/SystemErrorWang/White-box-Cartoonization/blob/master/train_code/network.py
# slim.convolution2d uses constant padding (zeros).
# Paper used spectral_norm

class Block(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding):
        super().__init__()
        self.sn_conv = spectral_norm(nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size,
            stride,
            padding,
            padding_mode="zeros"  # Author's code used slim.convolution2d, which uses SAME padding (zero padding in PyTorch)
        ))
        self.LReLU = nn.LeakyReLU(negative_slope=0.2, inplace=True)

    def forward(self, x):
        x = self.sn_conv(x)
        x = self.LReLU(x)
        # No sigmoid for LSGAN adv loss
        # return torch.sigmoid(x)
        return x
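A quick shape check of the block (parameters here are illustrative, not the network's actual configuration):

# A stride-2 Block halves the spatial resolution.
block = Block(in_channels=3, out_channels=32, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 256, 256)
print(block(x).shape)  # torch.Size([1, 32, 128, 128])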
Training Details of White-box Cartoonization
Adam optimizer for both the generator and the discriminators
Learning rate = 2×10−4
Batch size = 16
The generator is pretrained with only the content loss for N = 50000 iterations, and then the GAN-based framework is jointly optimized. Training is stopped after 100000 iterations or upon convergence.
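A minimal sketch of this two-phase schedule with the hyper-parameters above. G, Ds, Dt and the loss terms are assumed to be defined as in the earlier sketches, and the Adam betas are left at PyTorch defaults as an assumption.

import torch

# Optimizers for the generator and both discriminators (lr = 2e-4, batch size 16).
g_opt  = torch.optim.Adam(G.parameters(),  lr=2e-4)
ds_opt = torch.optim.Adam(Ds.parameters(), lr=2e-4)
dt_opt = torch.optim.Adam(Dt.parameters(), lr=2e-4)

# Phase 1: pretrain G with the content loss only (~50000 iterations).
# Phase 2: jointly optimize G, Ds and Dt with the full objective,
#          stopping after ~100000 iterations or on convergence.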
Dataset
Human face and landscape data are collected for generalization on diverse scenes. For real-world photos, the authors collect 10000 images from the FFHQ dataset for human faces and 5000 images from the dataset cited in the paper for landscapes. For cartoon images, they collect 10000 images from animations for human faces and 10000 images for landscapes. Producers of the collected animations include Kyoto Animation, P.A.Works, Shinkai Makoto, Hosoda Mamoru, and Miyazaki Hayao. For the validation set, they collect 3011 animation images and 1978 real-world photos. Images shown in the main paper are collected from the DIV2K dataset, and images in the user study are collected from the Internet and the Microsoft COCO dataset. During training, all images are resized to 256×256 resolution, and face images are fed only once in every five iterations.
Training images are resized to 256×256
Face images are fed only once in every five iterations