RePaint

RePaint: Inpainting using Denoising Diffusion Probabilistic Models (CVPR 2022)

Code: https://github.com/andreas128/RePaint

Most existing inpainting approaches train for a certain distribution of masks, which limits their generalization capabilities to unseen mask types. Furthermore, training with pixel-wise and perceptual losses often leads to simple textural extensions towards the missing areas instead of semantically meaningful generation. In this work, we propose RePaint: a Denoising Diffusion Probabilistic Model (DDPM) based inpainting approach that is applicable to even extreme masks. We employ a pretrained unconditional DDPM as the generative prior.

  • A pretrained unconditional DDPM is used
  • Only the reverse diffusion iterations are altered, by
    • sampling the unmasked regions using the given image information
  • Modified GLIDE code

Advantages:

  • Allows our network to generalize to any mask during inference.
  • Enables our network to learn more semantic generation capabilities since it has a powerful DDPM image synthesis prior.
  • Works even for extreme mask cases.

Method

img

Conditioning on the known region

We apply a mask and treat the masked pixels as the unknown region. We denote

  • Known regions: $m \odot x$

  • Unknown regions: $(1-m) \odot x$

  • We can alter the known regions $m \odot x$, since every reverse step from $x_t$ to $x_{t-1}$ depends solely on $x_t$, as long as we keep the correct properties of the corresponding distribution.

During inference with a trained unconditional DDPM, each single reverse step is modified such that the known part sampled from the forward process, $m \odot x^{\text{known}}_{t-1}$ with $x^{\text{known}}_{t-1} \sim q$, and the predicted unknown part, $(1-m) \odot x^{\text{unknown}}_{t-1}$ with $x^{\text{unknown}}_{t-1} \sim p_\theta$, are added together:

$$x^{\text{known}}_{t-1} \sim \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$$

$$x^{\text{unknown}}_{t-1} \sim \mathcal{N}\left(\mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

$$x_{t-1} = m \odot x^{\text{known}}_{t-1} + (1-m) \odot x^{\text{unknown}}_{t-1}$$
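The conditioning step above can be sketched in a few lines (a minimal NumPy sketch, not the actual RePaint implementation; `mu_theta` and `sigma_theta` stand in for the DDPM's predicted mean and standard deviation, which the real method obtains from the pretrained network):

```python
import numpy as np

rng = np.random.default_rng(0)

def conditioned_reverse_step(x0, t, mask, alpha_bar, mu_theta, sigma_theta):
    """One conditioned reverse step: sample the known region from the
    forward process q and the unknown region from the reverse process
    p_theta, then combine them with the mask (mask == 1 marks known pixels)."""
    # Known region: x_{t-1}^known ~ N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
    x_known = (np.sqrt(alpha_bar[t]) * x0
               + np.sqrt(1.0 - alpha_bar[t]) * rng.standard_normal(x0.shape))
    # Unknown region: x_{t-1}^unknown ~ N(mu_theta(x_t, t), Sigma_theta(x_t, t))
    x_unknown = mu_theta + sigma_theta * rng.standard_normal(x0.shape)
    # x_{t-1} = m * x^known + (1 - m) * x^unknown
    return mask * x_known + (1.0 - mask) * x_unknown
```

Note that the known part is sampled from $q(x_{t-1} \mid x_0)$ directly, so it never depends on the generated content, which is exactly the issue the resampling scheme below addresses.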

Resampling

When directly applying the method described above, we observe that only the content type matches the known regions.

  • Although the inpainted region matches the texture of the neighboring region, it is semantically incorrect.
  • The DDPM leverages the context of the known region, but does not harmonize it well with the rest of the image.

The model predicts $x_{t-1}$ using $x_t$, which comprises both the output of the DDPM and the sample from the known region. However, the known pixels are sampled via the forward process without considering the generated parts of the image, which introduces disharmony.

Although the model tries to harmonize the image again in every step, it can never fully converge because the same issue occurs in the next step. Moreover, in each reverse step, the maximum change to the image declines due to the variance schedule of $\beta_t$. Thus, due to this restricted flexibility, the method cannot correct mistakes that lead to disharmonious boundaries in the subsequent steps.

We want the model to also take the generated parts of the image into account. Therefore, the authors introduce a resampling approach, which makes use of the DDPM's ability to harmonize its input.

  • The model needs more time to harmonize the conditional information $x^{\text{known}}_{t-1}$ with the generated information $x^{\text{unknown}}_{t-1}$ in one step before advancing to the next denoising step.
  • Originally, a denoising step goes $x_t \rightarrow x_{t-1}$.
  • We diffuse the output $x_{t-1}$ back to $x_t$ by sampling from the forward process. Although this operation scales back the output and adds noise, some of the information merged into the generated region $x^{\text{unknown}}_{t-1}$ is still preserved in $x^{\text{unknown}}_{t}$.
    • Therefore, the new $x^{\text{unknown}}_{t}$ is more harmonized with $x^{\text{known}}_{t}$ and contains conditional information from it.
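The "diffuse back" operation is simply one step of the forward process; a minimal sketch (assuming a given $\beta$ schedule is available):

```python
import numpy as np

rng = np.random.default_rng(0)

def one_step_forward(x_prev, t, betas):
    # q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I):
    # scales the image slightly and adds fresh Gaussian noise, but most
    # of the (already partially harmonized) content survives.
    return (np.sqrt(1.0 - betas[t]) * x_prev
            + np.sqrt(betas[t]) * rng.standard_normal(x_prev.shape))
```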

img

But a new problem arises: the resampling operation only harmonizes a single step, so it might not be able to merge the semantic information over the entire denoising process.

  • We denote the time horizon of this operation as the jump length $j$.
    • For example, for a chosen jump length of $j = 10$, we apply 10 forward transitions before applying 10 reverse transitions.
      • For jump length $j = 1$, the DDPM is more likely to output a blurry image.
    • The resampling also increases the runtime of the reverse diffusion.
    • Smaller jump lengths $j$ tend to produce blurrier images.
    • An increased number of resamplings $r$ improves the overall image consistency (performance).
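As a back-of-the-envelope count (assuming the schedule construction used in the official code, where a resampling opportunity is placed every $j$ steps): each such position is revisited $r - 1$ extra times, and each revisit costs $j$ forward plus $j$ reverse transitions, so the runtime grows quickly with $r$.

```python
t_T, j, r = 250, 10, 10

n_positions = len(range(0, t_T - j, j))  # timesteps where resampling can trigger
extra = n_positions * (r - 1) * 2 * j    # j forward + j reverse transitions per revisit
total = t_T + extra                      # total transitions in the full schedule
print(n_positions, extra, total)         # 24 4320 4570
```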

The authors found no visible improvement from merely slowing down the diffusion process (i.e., increasing the number of diffusion steps $T$).

img

Therefore, the resampling method is built into the sampling schedule, which can be illustrated in code:

```python
t_T = 250
jump_len = 10       # j
jump_n_sample = 10  # r

def get_schedule(t_T=t_T, jump_len=jump_len, jump_n_sample=jump_n_sample):
    jumps = {}
    for j in range(0, t_T - jump_len, jump_len):
        jumps[j] = jump_n_sample - 1

    t = t_T
    ts = []

    while t >= 1:
        t = t - 1
        ts.append(t)

        # =================================== Schedule for ReSampling
        if jumps.get(t, 0) > 0:
            jumps[t] = jumps[t] - 1
            for _ in range(jump_len):
                t = t + 1
                ts.append(t)
        # ===================================

    ts.append(-1)
    _check_times(ts, -1, t_T)
    return ts

def _check_times(times, t_0, t_T):
    # Check end
    assert times[0] > times[1], (times[0], times[1])
    # Check beginning
    assert times[-1] == -1, times[-1]
    # Step length = 1
    for t_last, t_cur in zip(times[:-1], times[1:]):
        assert abs(t_last - t_cur) == 1, (t_last, t_cur)
    # Value range
    for t in times:
        assert t >= t_0, (t, t_0)
        assert t <= t_T, (t, t_T)
```

img

```python
# Get the time schedule with the parameters time T, jump length, and number of resamplings
times = get_schedule(t_T=250, jump_len=10, jump_n_sample=10)

x = random_noise()

for t_last, t_cur in zip(times[:-1], times[1:]):
    if t_cur < t_last:  # denoising step from t to t-1, nothing special
        # Reverse diffusion (the inpainting version, conditioned on the known region)
        x = reverse_diffusion(x, t_last, x_known)
    else:               # harmonizing step from t-1 back to t  <======== ReSampling (new)
        # Apply one forward diffusion step
        x = forward_diffusion(x, t_cur)
```
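The transition functions are left undefined in the pseudocode above. Below is a self-contained sketch of what they might look like, assuming a linear $\beta$ schedule and using a dummy zero noise predictor in place of the pretrained DDPM; the mask argument to `reverse_diffusion` is made explicit here, and all of these names are illustrative, not the repository's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 250
betas = np.linspace(1e-4, 0.02, T + 1)  # assumed linear beta schedule, t = 0..T
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def predict_noise(x, t):
    # Dummy stand-in for the pretrained DDPM's noise prediction eps_theta(x_t, t)
    return np.zeros_like(x)

def forward_diffusion(x, t):
    # One forward step q(x_t | x_{t-1})
    return np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

def reverse_diffusion(x, t, x_known, mask):
    # Unknown region: standard DDPM reverse step from the predicted noise
    eps = predict_noise(x, t)
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
    x_unknown = mean + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    # Known region: sampled directly from q(x_{t-1} | x_0)
    tm1 = max(t - 1, 0)
    x_kn = (np.sqrt(alpha_bar[tm1]) * x_known
            + np.sqrt(1.0 - alpha_bar[tm1]) * rng.standard_normal(x.shape))
    return mask * x_kn + (1.0 - mask) * x_unknown

def random_noise(shape=(4, 4)):
    return rng.standard_normal(shape)
```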

img

  • There is no visible improvement from slowing down the diffusion process (increasing $T$).
  • Applying the larger jump length $j = 10$ performs better than smaller jump lengths.

Possible Applications

Face anonymization

RePaint could be used for the anonymization of faces.

  • For example, one could remove the information about the identity of people shown at public events and hallucinate
    artificial faces for data protection.

Super-resolution | Image Upscaling

Limitations

  • The algorithm might be biased towards the dataset
    • since it relies on an unconditional pretrained DDPM
  • Difficult to apply in real time
    • DDPM sampling is significantly slower than GAN-based and autoregressive methods