AnimeGAN

(ISICA 2019)

Paper: AnimeGAN: A Novel Lightweight GAN for Photo Animation

Official Github (Tensorflow implementation): https://github.com/TachibanaYoshino/AnimeGAN

Github (PyTorch implementation): https://github.com/ptran1203/pytorch-animeGAN


Key features:

  • Proposed three loss functions to guide the generator to produce better animation visual effects:
    • grayscale style loss
    • grayscale adversarial loss
    • color reconstruction loss
      • The use of Huber loss and $l_1$ loss for the YUV format
  • The use of depthwise separable convolutions and inverted residual blocks (IRBs) in generator
  • Can be trained with unpaired data
  • Different learning rate for generator and discriminator
  • Derived three datasets from the original anime dataset $S_{data}(a)$ (see the preprocessing sketch after this list):
    • $S_{data}(x)$: grayscale version of $S_{data}(a)$
    • $S_{data}(e)$: $S_{data}(a)$ with the edges removed (smoothed)
    • $S_{data}(y)$: grayscale version of $S_{data}(e)$
      • The grayscale version is used to avoid the colors of the images in $S_{data}(e)$ influencing the colors of the generated images
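
The three derived datasets can be produced with simple image operations. Below is a minimal preprocessing sketch, assuming OpenCV and NumPy; the edge-smoothing step (Canny edges, dilation, then a local Gaussian blur) is an approximation of the edge removal described by CartoonGAN/AnimeGAN rather than the official script, and the file name is a placeholder.

```python
import cv2
import numpy as np

def to_grayscale_3ch(img_bgr):
    # Grayscale copy kept in 3 channels so it matches the network's input shape.
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)

def smooth_edges(img_bgr, ksize=5, canny_lo=100, canny_hi=200):
    # Blur only the regions around detected edges (approximation of S_data(e)).
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    edge_mask = cv2.dilate(cv2.Canny(gray, canny_lo, canny_hi),
                           np.ones((ksize, ksize), np.uint8))
    blurred = cv2.GaussianBlur(img_bgr, (ksize, ksize), 0)
    out = img_bgr.copy()
    out[edge_mask > 0] = blurred[edge_mask > 0]
    return out

anime = cv2.imread("anime_frame.png")   # S_data(a): an original anime frame (placeholder path)
s_x = to_grayscale_3ch(anime)           # S_data(x)
s_e = smooth_edges(anime)               # S_data(e)
s_y = to_grayscale_3ch(s_e)             # S_data(y)
```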

Loss functions of AnimeGAN

Adversarial loss of AnimeGAN

  • LSGAN adversarial loss is used for $L_{adv}$ (a minimal sketch is given below).
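
As a reference, here is a minimal sketch of the LSGAN loss shape in PyTorch, assuming the discriminator returns real-valued scores (for example a patch map); the full AnimeGAN objectives built on top of it are given later in this note.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # The discriminator pushes scores on real images toward 1 and on fakes toward 0.
    return torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # The generator pushes the discriminator's scores on its outputs toward 1.
    return torch.mean((d_fake - 1.0) ** 2)
```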

Content loss of AnimeGAN

The content loss is introduced to ensure that the generated images retain the semantic content of the input photos.

AnimeGAN uses a high-level feature map from a VGG network pre-trained on ImageNet, which helps preserve the content of objects (a feature-extractor sketch follows the definitions below).

$$L_{con}(G, D) = E_{p_i \sim S_{data}(p)}\left[\left\|VGG_l(p_i) - VGG_l(G(p_i))\right\|_1\right]$$

Where:

  • $l$ refers to the feature maps of a specific VGG layer.
    • The paper uses the feature maps of the conv4_4 layer of a VGG network (same as CartoonGAN and WBCartoonization).
  • $p_i \in S_{data}(p)$ is a photo.
  • $G(p_i)$ is the fake cartoon image generated from the input photo.
  • $l_1$ sparse regularization is used here.
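
A minimal PyTorch sketch of the content loss, assuming torchvision's pre-trained VGG19; the slice index 26 is my assumption for where conv4_4 sits in torchvision's layer ordering and should be checked against the implementation you actually use.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class VGGFeature(nn.Module):
    # Frozen VGG19 sub-network; end_idx=26 is assumed to end at conv4_4.
    def __init__(self, end_idx=26):
        super().__init__()
        layers = list(vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.children())
        self.slice = nn.Sequential(*layers[:end_idx]).eval()
        for p in self.slice.parameters():
            p.requires_grad = False

    def forward(self, x):
        return self.slice(x)

def content_loss(vgg, photo, generated):
    # L1 distance between VGG feature maps of the input photo and the generated image.
    return F.l1_loss(vgg(generated), vgg(photo))
```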

Grayscale loss of AnimeGAN

The Gram matrix is used to obtain more vivid style images. AnimeGAN uses the Gram matrix so that the generated images take on the texture of the anime images rather than their colors (see the sketch after the definitions below).

$$L_{gra}(G, D) = E_{p_i \sim S_{data}(p)}, E_{x_i \sim S_{data}(x)}\left[\left\|\operatorname{Gram}\left(VGG_l(G(p_i))\right) - \operatorname{Gram}\left(VGG_l(x_i)\right)\right\|_1\right]$$

Where:

  • $l$ refers to the feature maps of a specific VGG layer.
    • The paper uses the feature maps of the conv4_4 layer of a VGG network (same as CartoonGAN and WBCartoonization).
  • $p_i \in S_{data}(p)$ is a photo.
  • $x_i \in S_{data}(x)$ is a grayscale anime image.
  • $G(p_i)$ is the fake cartoon image generated from the input photo.
  • $l_1$ sparse regularization is used here.
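
A minimal sketch of the grayscale style loss, reusing the VGGFeature extractor from the content-loss sketch above; the Gram normalization by the number of feature entries is a common convention and an assumption here.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (B, C, H, W) -> (B, C, C), normalized by the number of entries.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def grayscale_style_loss(vgg, generated, gray_anime):
    # Match texture statistics (Gram matrices) of grayscale anime images, not their colors.
    return F.l1_loss(gram_matrix(vgg(generated)), gram_matrix(vgg(gray_anime)))
```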

Color reconstruction loss of AnimeGAN

Images are converted from RGB to YUV for the color reconstruction loss (a conversion sketch follows the definitions below).

$$L_{col}(G, D) = E_{p_i \sim S_{data}(p)}\left[\left\|Y(G(p_i)) - Y(p_i)\right\|_1 + \left\|U(G(p_i)) - U(p_i)\right\|_H + \left\|V(G(p_i)) - V(p_i)\right\|_H\right]$$

Where:

  • $H$ denotes the Huber loss
  • $l_1$ loss is used for the $Y$ channel
  • Huber loss is used for the $U$ and $V$ channels
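
A minimal sketch of the color reconstruction loss, assuming inputs are RGB tensors of shape (B, 3, H, W); the BT.601 RGB-to-YUV matrix and the default Huber delta are assumptions, not values taken from the official code.

```python
import torch
import torch.nn.functional as F

# BT.601 RGB -> YUV conversion matrix (assumed; the official code may differ slightly).
_RGB2YUV = torch.tensor([[ 0.299,    0.587,    0.114  ],
                         [-0.14713, -0.28886,  0.436  ],
                         [ 0.615,   -0.51499, -0.10001]])

def rgb_to_yuv(img):
    # img: (B, 3, H, W) in RGB -> (B, 3, H, W) in YUV.
    return torch.einsum('ij,bjhw->bihw', _RGB2YUV.to(img.device, img.dtype), img)

def color_loss(photo, generated):
    yuv_p, yuv_g = rgb_to_yuv(photo), rgb_to_yuv(generated)
    l_y = F.l1_loss(yuv_g[:, 0], yuv_p[:, 0])      # l1 on the Y channel
    l_u = F.huber_loss(yuv_g[:, 1], yuv_p[:, 1])   # Huber on U
    l_v = F.huber_loss(yuv_g[:, 2], yuv_p[:, 2])   # Huber on V
    return l_y + l_u + l_v
```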

Total Objective function of AnimeGAN

Generator

$$L(G) = \omega_{adv} E_{p_i \sim S_{data}(p)}\left[\left(G(p_i) - 1\right)^2\right] + \omega_{con} L_{con}(G, D) + \omega_{gra} L_{gra}(G, D) + \omega_{col} L_{col}(G, D)$$

Discriminator

$$L(D) = \omega_{adv}\left[E_{a_i \sim S_{data}(a)}\left[\left(D(a_i) - 1\right)^2\right] + E_{p_i \sim S_{data}(p)}\left[\left(D(G(p_i))\right)^2\right] + E_{x_i \sim S_{data}(x)}\left[\left(D(x_i)\right)^2\right] + 0.1\, E_{y_i \sim S_{data}(y)}\left[\left(D(y_i)\right)^2\right]\right]$$

  • The 0.1 scaling factor on the $S_{data}(y)$ term is applied to avoid the edges of the generated images being too sharp (see the sketch below).
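
A minimal sketch of the discriminator objective above, assuming D returns real-valued scores and using the four image sources described earlier; the weighting follows the equation as written here, while the official implementation may use additional per-term scaling.

```python
import torch

def discriminator_loss(D, anime, generated, gray_anime, gray_smoothed, w_adv=300.0):
    # anime: S_data(a), gray_anime: S_data(x), gray_smoothed: S_data(y).
    real = torch.mean((D(anime) - 1.0) ** 2)          # real anime scored toward 1
    fake = torch.mean(D(generated) ** 2)              # generated images scored toward 0
    gray = torch.mean(D(gray_anime) ** 2)             # grayscale anime treated as fake
    smooth = 0.1 * torch.mean(D(gray_smoothed) ** 2)  # 0.1 keeps generated edges from being too sharp
    return w_adv * (real + fake + gray + smooth)
```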

Total

$$L(G, D) = \omega_{adv} L_{adv}(G, D) + \omega_{con} L_{con}(G, D) + \omega_{gra} L_{gra}(G, D) + \omega_{col} L_{col}(G, D)$$

where the paper set the weight factors:

  • $\omega_{adv} = 300$
  • $\omega_{con} = 1.5$
  • $\omega_{gra} = 3$
  • $\omega_{col} = 10$

Compared to AnimeGAN with $\omega_{col} = 10$, the images generated by AnimeGAN with $\omega_{col} = 50$ have more realistic content, but their animation style is less obvious. Therefore, with $\omega_{col} = 10$ and $\omega_{adv} = 300$, AnimeGAN produces satisfactory animated visual effects.
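
Putting the pieces together, here is a minimal sketch of the generator objective with the paper's weight factors, reusing content_loss, grayscale_style_loss and color_loss from the sketches above (all of which are illustrative, not the official implementation).

```python
import torch

def generator_loss(D, vgg, photo, generated, gray_anime,
                   w_adv=300.0, w_con=1.5, w_gra=3.0, w_col=10.0):
    # Adversarial term written as in the paper's generator equation.
    adv = torch.mean((D(generated) - 1.0) ** 2)
    return (w_adv * adv
            + w_con * content_loss(vgg, photo, generated)
            + w_gra * grayscale_style_loss(vgg, generated, gray_anime)
            + w_col * color_loss(photo, generated))
```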

Architecture of AnimeGAN

Refer to the paper’s figure 1.

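The generator relies on depthwise separable convolutions and inverted residual blocks (IRBs). Below is a minimal sketch of one such block in PyTorch; the channel count, expansion ratio, normalization and activation are illustrative choices, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class InvertedResidualBlock(nn.Module):
    # Pointwise expansion -> depthwise 3x3 -> pointwise projection, with a residual add.
    def __init__(self, channels=256, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),   # pointwise expansion
            nn.InstanceNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                     # depthwise convolution
            nn.InstanceNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),   # pointwise projection
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)
```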

Training Detail of AnimeGAN

  • Initialization phase: generator learning rate = 0.0001, Adam optimizer (the optimizer setup is sketched after this list)
  • Training phase:
    • Generator learning rate = 0.00008, Adam optimizer
    • Discriminator learning rate = 0.00016, Adam optimizer
  • Training epochs = 100
  • Batch size = 4
  • Training image size is $256 \times 256$
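
A minimal sketch of the optimizer setup with these learning rates; the Adam betas are left at PyTorch defaults, which is an assumption rather than a value from the paper.

```python
import torch

def build_optimizers(generator, discriminator, init_phase=False):
    # The initialization phase uses only the generator optimizer, per the learning rates above.
    if init_phase:
        return torch.optim.Adam(generator.parameters(), lr=1e-4), None
    g_opt = torch.optim.Adam(generator.parameters(), lr=8e-5)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1.6e-4)
    return g_opt, d_opt
```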

Some suggestions from the author’s github:

  1. Since the real photos in the training set are all landscape photos, if you want to stylize photos with people as the main subject, you should add at least 3000 photos of people to the training set and retrain to obtain a new model.
  2. To obtain a better face animation effect, when using two images as a data pair for training, the faces in the photos and the faces in the anime-style data should be as consistent as possible in terms of gender.
  3. The generated stylized images are affected by the overall brightness and tone of the style data, so try not to select nighttime anime images as style data, and apply exposure compensation to the style data to keep brightness consistent across the entire style dataset.

AnimeGAN v2

Official Github (Tensorflow implementation): https://github.com/TachibanaYoshino/AnimeGANv2

Github (PyTorch implementation): https://github.com/bryandlee/animegan2-pytorch

Key features compared to AnimeGAN:

  • AnimeGANv2 adds a total variation loss to the generator loss (a minimal sketch follows this list).
    • Solves the problem of high-frequency artifacts in the generated images.
  • Easier to train and directly achieves the effects shown in the paper.
  • Further reduces the number of parameters of the generator network (generator size: 8.17 MB); the lite version has an even smaller generator model.
  • Uses new high-quality style data, taken from BD movies as much as possible.
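
AnimeGANv2's total variation term can be sketched as the standard anisotropic TV penalty below; this is a generic formulation, assumed rather than copied from the official implementation.

```python
import torch

def total_variation_loss(img):
    # img: (B, C, H, W). Penalize differences between neighboring pixels to
    # suppress high-frequency artifacts in the generated image.
    dh = img[:, :, 1:, :] - img[:, :, :-1, :]
    dw = img[:, :, :, 1:] - img[:, :, :, :-1]
    return torch.mean(torch.abs(dh)) + torch.mean(torch.abs(dw))
```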