AnimeGAN

(ISICA 2019)

Paper: AnimeGAN: A Novel Lightweight GAN for Photo Animation

Official Github (Tensorflow implementation): https://github.com/TachibanaYoshino/AnimeGAN

Github (PyTorch implementation): https://github.com/ptran1203/pytorch-animeGAN


Key features:

  • Proposed three loss functions to guide the generator to produce better animation visual effects:
    • grayscale style loss
    • grayscale adversarial loss
    • color reconstruction loss
      • The use of Huber loss and $l_1$ loss for the YUV format
  • The use of depthwise separable convolutions and inverted residual blocks (IRBs) in generator
  • Can be trained with unpaired data
  • Different learning rate for generator and discriminator
  • Derived three datasets from the original anime dataset $S_{data}(a)$ (see the preprocessing sketch after this list):
    • $S_{data}(x)$: grayscale version of $S_{data}(a)$
    • $S_{data}(e)$: $S_{data}(a)$ with the edges removed (smoothed)
    • $S_{data}(y)$: grayscale version of $S_{data}(e)$
      • The grayscale version is used to avoid the colors of the images in $S_{data}(e)$ influencing the colors of the generated images
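
The three derived datasets can be produced with simple image operations. Below is a minimal preprocessing sketch, assuming OpenCV and NumPy; the edge-smoothing step (Canny edges, dilation, then a local Gaussian blur) is an approximation of the edge removal described by CartoonGAN/AnimeGAN rather than the official script, and the file name is a placeholder.

```python
import cv2
import numpy as np

def to_grayscale_3ch(img_bgr):
    # Grayscale copy kept in 3 channels so it matches the network's input shape.
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)

def smooth_edges(img_bgr, ksize=5, canny_lo=100, canny_hi=200):
    # Blur only the regions around detected edges (approximation of S_data(e)).
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    edge_mask = cv2.dilate(cv2.Canny(gray, canny_lo, canny_hi),
                           np.ones((ksize, ksize), np.uint8))
    blurred = cv2.GaussianBlur(img_bgr, (ksize, ksize), 0)
    out = img_bgr.copy()
    out[edge_mask > 0] = blurred[edge_mask > 0]
    return out

anime = cv2.imread("anime_frame.png")   # S_data(a): an original anime frame (placeholder path)
s_x = to_grayscale_3ch(anime)           # S_data(x)
s_e = smooth_edges(anime)               # S_data(e)
s_y = to_grayscale_3ch(s_e)             # S_data(y)
```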

Loss functions of AnimeGAN

Adversarial loss of AnimeGAN

  • LSGAN adversarial loss is used for $L_{adv}$ (a minimal sketch is given below).
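
As a reference, here is a minimal sketch of the LSGAN loss shape in PyTorch, assuming the discriminator returns real-valued scores (for example a patch map); the full AnimeGAN objectives built on top of it are given later in this note.

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # The discriminator pushes scores on real images toward 1 and on fakes toward 0.
    return torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # The generator pushes the discriminator's scores on its outputs toward 1.
    return torch.mean((d_fake - 1.0) ** 2)
```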

Content loss of AnimeGAN

The content loss is introduced to ensure that the generated images retain the semantic content of the input photos.

AnimeGAN uses a high-level feature map from a VGG network pre-trained on ImageNet, which helps preserve the content of objects (a feature-extractor sketch follows the definitions below).

$$L_{con}(G, D) = E_{p_i \sim S_{data}(p)}\left[\left\|VGG_l(p_i) - VGG_l(G(p_i))\right\|_1\right]$$

Where:

  • $l$ refers to the feature maps of a specific VGG layer.
    • The paper uses the feature maps of the conv4_4 layer of a VGG network (same as CartoonGAN and WBCartoonization).
  • $p_i \in S_{data}(p)$ is a photo.
  • $G(p_i)$ is the fake cartoon image generated from the input photo.
  • $l_1$ sparse regularization is used here.
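
A minimal PyTorch sketch of the content loss, assuming torchvision's pre-trained VGG19; the slice index 26 is my assumption for where conv4_4 sits in torchvision's layer ordering and should be checked against the implementation you actually use.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class VGGFeature(nn.Module):
    # Frozen VGG19 sub-network; end_idx=26 is assumed to end at conv4_4.
    def __init__(self, end_idx=26):
        super().__init__()
        layers = list(vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features.children())
        self.slice = nn.Sequential(*layers[:end_idx]).eval()
        for p in self.slice.parameters():
            p.requires_grad = False

    def forward(self, x):
        return self.slice(x)

def content_loss(vgg, photo, generated):
    # L1 distance between VGG feature maps of the input photo and the generated image.
    return F.l1_loss(vgg(generated), vgg(photo))
```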

Grayscale loss of AnimeGAN

The Gram matrix is used to obtain more vivid style images. AnimeGAN uses the Gram matrix so that the generated images take on the texture of the anime images rather than their colors (see the sketch after the definitions below).

$$L_{gra}(G, D) = E_{p_i \sim S_{data}(p)}, E_{x_i \sim S_{data}(x)}\left[\left\|\operatorname{Gram}\left(VGG_l(G(p_i))\right) - \operatorname{Gram}\left(VGG_l(x_i)\right)\right\|_1\right]$$

Where:

  • $l$ refers to the feature maps of a specific VGG layer.
    • The paper uses the feature maps of the conv4_4 layer of a VGG network (same as CartoonGAN and WBCartoonization).
  • $p_i \in S_{data}(p)$ is a photo.
  • $x_i \in S_{data}(x)$ is a grayscale anime image.
  • $G(p_i)$ is the fake cartoon image generated from the input photo.
  • $l_1$ sparse regularization is used here.
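
A minimal sketch of the grayscale style loss, reusing the VGGFeature extractor from the content-loss sketch above; the Gram normalization by the number of feature entries is a common convention and an assumption here.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (B, C, H, W) -> (B, C, C), normalized by the number of entries.
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def grayscale_style_loss(vgg, generated, gray_anime):
    # Match texture statistics (Gram matrices) of grayscale anime images, not their colors.
    return F.l1_loss(gram_matrix(vgg(generated)), gram_matrix(vgg(gray_anime)))
```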

Color reconstruction loss of AnimeGAN

Images are converted from RGB to YUV for the color reconstruction loss (a conversion sketch follows the definitions below).

$$L_{col}(G, D) = E_{p_i \sim S_{data}(p)}\left[\left\|Y(G(p_i)) - Y(p_i)\right\|_1 + \left\|U(G(p_i)) - U(p_i)\right\|_H + \left\|V(G(p_i)) - V(p_i)\right\|_H\right]$$

Where:

  • $H$ denotes the Huber loss
  • $l_1$ loss is used for the $Y$ channel
  • Huber loss is used for the $U$ and $V$ channels
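
A minimal sketch of the color reconstruction loss, assuming inputs are RGB tensors of shape (B, 3, H, W); the BT.601 RGB-to-YUV matrix and the default Huber delta are assumptions, not values taken from the official code.

```python
import torch
import torch.nn.functional as F

# BT.601 RGB -> YUV conversion matrix (assumed; the official code may differ slightly).
_RGB2YUV = torch.tensor([[ 0.299,    0.587,    0.114  ],
                         [-0.14713, -0.28886,  0.436  ],
                         [ 0.615,   -0.51499, -0.10001]])

def rgb_to_yuv(img):
    # img: (B, 3, H, W) in RGB -> (B, 3, H, W) in YUV.
    return torch.einsum('ij,bjhw->bihw', _RGB2YUV.to(img.device, img.dtype), img)

def color_loss(photo, generated):
    yuv_p, yuv_g = rgb_to_yuv(photo), rgb_to_yuv(generated)
    l_y = F.l1_loss(yuv_g[:, 0], yuv_p[:, 0])      # l1 on the Y channel
    l_u = F.huber_loss(yuv_g[:, 1], yuv_p[:, 1])   # Huber on U
    l_v = F.huber_loss(yuv_g[:, 2], yuv_p[:, 2])   # Huber on V
    return l_y + l_u + l_v
```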

Total Objective function of AnimeGAN

Generator

$$L(G) = \omega_{adv} E_{p_i \sim S_{data}(p)}\left[\left(G(p_i) - 1\right)^2\right] + \omega_{con} L_{con}(G, D) + \omega_{gra} L_{gra}(G, D) + \omega_{col} L_{col}(G, D)$$

Discriminator

$$L(D) = \omega_{adv}\left[E_{a_i \sim S_{data}(a)}\left[\left(D(a_i) - 1\right)^2\right] + E_{p_i \sim S_{data}(p)}\left[\left(D(G(p_i))\right)^2\right] + E_{x_i \sim S_{data}(x)}\left[\left(D(x_i)\right)^2\right] + 0.1\, E_{y_i \sim S_{data}(y)}\left[\left(D(y_i)\right)^2\right]\right]$$

  • The 0.1 scaling factor on the $S_{data}(y)$ term is applied to avoid the edges of the generated images being too sharp (see the sketch below).
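
A minimal sketch of the discriminator objective above, assuming D returns real-valued scores and using the four image sources described earlier; the weighting follows the equation as written here, while the official implementation may use additional per-term scaling.

```python
import torch

def discriminator_loss(D, anime, generated, gray_anime, gray_smoothed, w_adv=300.0):
    # anime: S_data(a), gray_anime: S_data(x), gray_smoothed: S_data(y).
    real = torch.mean((D(anime) - 1.0) ** 2)          # real anime scored toward 1
    fake = torch.mean(D(generated) ** 2)              # generated images scored toward 0
    gray = torch.mean(D(gray_anime) ** 2)             # grayscale anime treated as fake
    smooth = 0.1 * torch.mean(D(gray_smoothed) ** 2)  # 0.1 keeps generated edges from being too sharp
    return w_adv * (real + fake + gray + smooth)
```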

Total

$$L(G, D) = \omega_{adv} L_{adv}(G, D) + \omega_{con} L_{con}(G, D) + \omega_{gra} L_{gra}(G, D) + \omega_{col} L_{col}(G, D)$$

where the paper set the weight factors:

  • $\omega_{adv} = 300$
  • $\omega_{con} = 1.5$
  • $\omega_{gra} = 3$
  • $\omega_{col} = 10$

Compared to AnimeGAN with $\omega_{col} = 10$, the images generated by AnimeGAN with $\omega_{col} = 50$ have more realistic content, but their animation style is less obvious. Therefore, with $\omega_{col} = 10$ and $\omega_{adv} = 300$, AnimeGAN produces satisfactory animated visual effects.
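
Putting the pieces together, here is a minimal sketch of the generator objective with the paper's weight factors, reusing content_loss, grayscale_style_loss and color_loss from the sketches above (all of which are illustrative, not the official implementation).

```python
import torch

def generator_loss(D, vgg, photo, generated, gray_anime,
                   w_adv=300.0, w_con=1.5, w_gra=3.0, w_col=10.0):
    # Adversarial term written as in the paper's generator equation.
    adv = torch.mean((D(generated) - 1.0) ** 2)
    return (w_adv * adv
            + w_con * content_loss(vgg, photo, generated)
            + w_gra * grayscale_style_loss(vgg, generated, gray_anime)
            + w_col * color_loss(photo, generated))
```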

Architecture of AnimeGAN

Refer to the paper’s figure 1.

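The generator relies on depthwise separable convolutions and inverted residual blocks (IRBs). Below is a minimal sketch of one such block in PyTorch; the channel count, expansion ratio, normalization and activation are illustrative choices, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class InvertedResidualBlock(nn.Module):
    # Pointwise expansion -> depthwise 3x3 -> pointwise projection, with a residual add.
    def __init__(self, channels=256, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),   # pointwise expansion
            nn.InstanceNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                      groups=hidden, bias=False),                     # depthwise convolution
            nn.InstanceNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),   # pointwise projection
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)
```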

Training Detail of AnimeGAN

  • Initialization phase: generator learning rate = 0.0001, Adam optimizer (the optimizer setup is sketched after this list)
  • Training phase:
    • Generator learning rate = 0.00008, Adam optimizer
    • Discriminator learning rate = 0.00016, Adam optimizer
  • Training epochs = 100
  • Batch size = 4
  • Training image size is $256 \times 256$
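
A minimal sketch of the optimizer setup with these learning rates; the Adam betas are left at PyTorch defaults, which is an assumption rather than a value from the paper.

```python
import torch

def build_optimizers(generator, discriminator, init_phase=False):
    # The initialization phase uses only the generator optimizer, per the learning rates above.
    if init_phase:
        return torch.optim.Adam(generator.parameters(), lr=1e-4), None
    g_opt = torch.optim.Adam(generator.parameters(), lr=8e-5)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=1.6e-4)
    return g_opt, d_opt
```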

Some suggestions from the author’s github:

  1. Since the real photos in the training set are all landscape photos, if you want to stylize photos with people as the main subject, you should add at least 3000 photos of people to the training set and retrain to obtain a new model.
  2. To obtain a better face animation effect, when using two images as a data pair for training, the faces in the photos and the faces in the anime-style data should be as consistent as possible in terms of gender.
  3. The generated stylized images are affected by the overall brightness and tone of the style data, so try not to select nighttime anime images as style data, and apply exposure compensation to the style data to keep brightness consistent across the entire style dataset.

AnimeGAN v2

Official Github (Tensorflow implementation): https://github.com/TachibanaYoshino/AnimeGANv2

Github (PyTorch implementation): https://github.com/bryandlee/animegan2-pytorch

Key features compared to AnimeGAN:

  • AnimeGANv2 adds a total variation loss to the generator loss (a minimal sketch follows this list).
    • Solves the problem of high-frequency artifacts in the generated images.
  • Easier to train and directly achieves the effects shown in the paper.
  • Further reduces the number of parameters of the generator network (generator size: 8.17 MB); the lite version has an even smaller generator model.
  • Uses new high-quality style data, taken from BD movies as much as possible.
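
AnimeGANv2's total variation term can be sketched as the standard anisotropic TV penalty below; this is a generic formulation, assumed rather than copied from the official implementation.

```python
import torch

def total_variation_loss(img):
    # img: (B, C, H, W). Penalize differences between neighboring pixels to
    # suppress high-frequency artifacts in the generated image.
    dh = img[:, :, 1:, :] - img[:, :, :-1, :]
    dw = img[:, :, :, 1:] - img[:, :, :, :-1]
    return torch.mean(torch.abs(dh)) + torch.mean(torch.abs(dw))
```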