Nvidia Filling in the Blanks: A Partial Convolutions Research Paper

Written by singhuddeshyaofficial | Published 2018/09/14
Tech Story Tags: machine-learning | partial-convolutions | convolutions-research | algorithms


So, it’s 2018 and Nvidia researchers are at it again. This time, with a remarkable image in-painting algorithm that, in essence, fills holes and enhances image quality.

Comparison to other techniques

This research paper (arXiv:1804.07723v1 [cs.CV]) came out on 20 April 2018. It makes the point that recent in-painting approaches that do not use deep learning rely on the image statistics of the remaining image to fill in the hole.

PatchMatch, one of the state-of-the-art methods, iteratively searches for the best-fitting patches to fill in the holes. While this approach generally produces smooth results, it is limited by the available image statistics and has no concept of visual semantics.

PatchMatch was able to smoothly fill in the missing components of the painting using image patches from the surrounding shadow and wall, but a semantically-aware approach would make use of patches from the painting instead.

Deep neural networks learn semantic priors and meaningful hidden representations in an end-to-end fashion, which have been used for recent image in-painting efforts. These networks employ convolutional filters on images, replacing the removed content with a fixed value.

Other techniques, like that of Iizuka et al., use fast marching and Poisson image blending, while Yu et al. employ a follow-up refinement network to refine their raw network predictions. Another limitation of many recent approaches is the focus on rectangular-shaped holes, often assumed to be centered in the image. We find these limitations may lead to over-fitting to the rectangular holes, and ultimately limit the utility of these models in application.

In order to focus on the more practical irregular hole use case, we collect a large benchmark of images with irregular masks of varying sizes. In our analysis, we look at the effects of not just the size of the hole, but also whether the holes are in contact with the image border.

What’s different in this model?

The researchers have proposed the following modifications to standard U-Net-like structures:

  • Use partial convolutions with an automatic mask update step for achieving state-of-the-art on image in-painting.
  • While previous works fail to achieve good in-painting results with skip links in a U-Net with typical convolutions, they demonstrate that substituting convolutional layers with partial convolutions and mask updates can achieve state-of-the-art in-painting results.
  • They have proposed a large irregular mask dataset.

The researchers gave a rather lucid reason for the mask update. Allow me to quote it here as it is:

To properly handle irregular masks, we propose the use of a Partial Convolutional Layer, comprising a masked and re-normalized convolution operation followed by a mask-update step. The concept of a masked and re-normalized convolution has also been referred to as segmentation-aware convolution in the image segmentation literature; however, that work did not make modifications to the input mask. Our use of partial convolutions is such that, given a binary mask, our convolutional results depend only on the non-hole regions at every layer. Our main extension is the automatic mask update step, which removes any masking where the partial convolution was able to operate on an unmasked value. Given sufficient layers of successive updates, even the largest masked holes will eventually shrink away, leaving only valid responses in the feature map. The partial convolutional layer ultimately makes our model agnostic to placeholder hole values.

The Model Approach and Architecture

The proposed model uses stacked partial convolutional operations and mask-updating steps to perform image in-painting. Let’s start by defining the partial convolution and the mask-update mechanism.

For brevity, we refer to our partial convolution operation and mask update function jointly as the Partial Convolutional Layer.

Let W be the weights of the convolution filter and b its corresponding bias. X holds the feature values (pixel values) for the current convolution (sliding) window and M is the corresponding binary mask. The partial convolution at every location is then expressed as:

Partial Convolution mechanism
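Since the equation image may not render here, the masked and re-normalized convolution from the paper can be written out as (⊙ denotes element-wise multiplication):

```
x' = \begin{cases}
  W^{\top} (X \odot M) \, \dfrac{\mathrm{sum}(\mathbf{1})}{\mathrm{sum}(M)} + b, & \text{if } \mathrm{sum}(M) > 0 \\
  0, & \text{otherwise}
\end{cases}
```

The scaling factor sum(1)/sum(M), i.e. the window size divided by the number of valid entries, adjusts for the varying amount of valid input available in each window.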

After each partial convolution operation, we then update our mask. Our unmasking rule is simple: if the convolution was able to condition its output on at least one valid input value, then we remove the mask for that location. This is expressed as:

Mask Update Scheme
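Written out, the mask-update rule from the paper is:

```
m' = \begin{cases}
  1, & \text{if } \mathrm{sum}(M) > 0 \\
  0, & \text{otherwise}
\end{cases}
```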

and can easily be implemented in any deep learning framework as part of the forward pass. With sufficient successive applications of the partial convolution layer, any mask will eventually be all ones, if the input contained any valid pixels.
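To make the mechanism concrete, here is a minimal single-channel NumPy sketch of one partial convolution followed by the mask update (stride 1, no padding). The paper's layers are multi-channel and strided, so treat this as an illustration rather than the authors' implementation:

```python
import numpy as np

def partial_conv2d(X, M, W, b):
    """Single-channel partial convolution with mask update.
    X: (H, W) feature map; M: (H, W) binary mask (1 = valid, 0 = hole);
    W: (k, k) filter; b: scalar bias. Stride 1, valid padding."""
    k = W.shape[0]
    H, Wd = X.shape
    out = np.zeros((H - k + 1, Wd - k + 1))
    new_mask = np.zeros_like(out)
    window_elems = k * k  # sum(1): number of elements per window
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            x_win = X[i:i + k, j:j + k]
            m_win = M[i:i + k, j:j + k]
            valid = m_win.sum()  # sum(M): valid inputs in this window
            if valid > 0:
                # masked convolution, re-normalized by the valid fraction
                out[i, j] = (W * (x_win * m_win)).sum() * (window_elems / valid) + b
                new_mask[i, j] = 1.0  # at least one valid input: unmask
            # else: output stays 0 and the location stays masked
    return out, new_mask
```

Note how the output never depends on values under the hole: a window with a single valid pixel still produces a (strongly re-normalized) response, and the mask at that location is removed for the next layer.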

Network Design

The network design is largely based on U-Net-like architectures, with one key modification: every convolutional layer is replaced with a partial convolutional one.

The network architecture

Elaborating on the network architecture: PConv1 to PConv8 form the encoding network, and the following layers, which use nearest-neighbor upsampling and skip links, form the decoding half of the same.
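As a rough sketch of the encoder half, the kernel sizes and channel widths below are my reading of the paper's architecture table (treat the exact numbers as an assumption); all eight partial convolutions use stride 2, halving the spatial resolution each time, and `output_size` is just a hypothetical helper to show the bottleneck size:

```python
# (name, kernel, out_channels, stride) per encoder layer, as read from
# the paper's architecture table -- assumed, not copied verbatim.
ENCODER = [
    ("PConv1", 7, 64, 2),
    ("PConv2", 5, 128, 2),
    ("PConv3", 5, 256, 2),
    ("PConv4", 3, 512, 2),
    ("PConv5", 3, 512, 2),
    ("PConv6", 3, 512, 2),
    ("PConv7", 3, 512, 2),
    ("PConv8", 3, 512, 2),
]

def output_size(side, layers):
    """Spatial side length after the stride-2 encoder, assuming 'same' padding."""
    for _name, _kernel, _channels, stride in layers:
        side = (side + stride - 1) // stride
    return side
```

For the 256×256 images used in testing, `output_size(256, ENCODER)` gives a 1×1 bottleneck, which is why the decoder needs a matching number of upsampling stages with skip links back to each encoder level.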

The BatchNorm column indicates whether PConv is followed by a Batch Normalization layer. The Non-linearity column shows whether and what non-linearity layer is used (following the BatchNorm if BatchNorm is used).

Loss Functions

From the excerpts of the research paper:

Our loss functions target both per-pixel reconstruction accuracy as well as composition, i.e. how smoothly the predicted hole values transition into their surrounding context.

Given the input image with holes I_in, the initial binary mask M (0 for holes), the network prediction I_out, and the ground truth image I_gt, we first define our per-pixel losses L_hole = ‖(1 − M) ⊙ (I_out − I_gt)‖₁ and L_valid = ‖M ⊙ (I_out − I_gt)‖₁. These are the L1 losses on the network output for the hole and the non-hole pixels, respectively.
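Assuming, as in the paper, that each term is normalized by the total number of elements in I_gt, a small NumPy sketch of the two per-pixel losses:

```python
import numpy as np

def per_pixel_losses(I_out, I_gt, M):
    """L1 losses over the hole (M == 0) and valid (M == 1) regions,
    each normalized by the total number of elements in I_gt."""
    N = I_gt.size
    L_hole = np.abs((1 - M) * (I_out - I_gt)).sum() / N
    L_valid = np.abs(M * (I_out - I_gt)).sum() / N
    return L_hole, L_valid
```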

The perceptual loss has been calculated using:

where Ψn is the activation map of the nth selected layer.

Perceptual Loss
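Written out from the paper, with I_comp denoting the raw output I_out but with the non-hole pixels set to the ground truth, and N_Ψ the number of elements in the activation map Ψ (the paper selects the VGG-16 layers pool1, pool2 and pool3):

```
L_{perceptual} = \sum_{n=0}^{N-1} \frac{\lVert \Psi_n^{I_{out}} - \Psi_n^{I_{gt}} \rVert_1}{N_{\Psi_n^{I_{gt}}}}
               + \sum_{n=0}^{N-1} \frac{\lVert \Psi_n^{I_{comp}} - \Psi_n^{I_{gt}} \rVert_1}{N_{\Psi_n^{I_{gt}}}}
```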

The style losses have also been taken into consideration and are used as:

Style Losses
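Written out from the paper, the style loss applies an auto-correlation (Gram matrix) to each feature map before taking the L1 distance; each Ψ_n is viewed as an (H_n W_n) × C_n matrix, and K_n = 1/(C_n H_n W_n) is a normalization factor:

```
L_{style_{out}} = \sum_{n=0}^{N-1} \frac{1}{C_n C_n}
  \left\lVert K_n \left( (\Psi_n^{I_{out}})^{\top} \Psi_n^{I_{out}}
  - (\Psi_n^{I_{gt}})^{\top} \Psi_n^{I_{gt}} \right) \right\rVert_1
```

L_style_comp is defined identically, with I_comp in place of I_out.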

Our final loss term is the total variation (TV) loss L_tv, which is a smoothing penalty on P, where P is the region of 1-pixel dilation of the hole region.

Smoothing Penalty
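Written out from the paper (N_{I_comp} is the number of elements in I_comp):

```
L_{tv} = \sum_{(i,j)\in P,\,(i,j+1)\in P} \frac{\lVert I_{comp}^{i,j+1} - I_{comp}^{i,j} \rVert_1}{N_{I_{comp}}}
       + \sum_{(i,j)\in P,\,(i+1,j)\in P} \frac{\lVert I_{comp}^{i+1,j} - I_{comp}^{i,j} \rVert_1}{N_{I_{comp}}}
```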

So, the total loss (after coefficient hyper-parameter tuning) comes out to be:

Total Loss
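For reference, the weighted combination reported in the paper (weights determined by a hyper-parameter search on 100 validation images) is:

```
L_{total} = L_{valid} + 6\,L_{hole} + 0.05\,L_{perceptual}
          + 120\,(L_{style_{out}} + L_{style_{comp}}) + 0.1\,L_{tv}
```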

Testing and Results in Hole Filling

Comparisons among PConv

One can easily have a look at the researchers' results (PConv) and see how they fare against other models on 256×256-pixel images.

An important thing to note is that the ImageNet and Places2 models each train for 10 days, whereas the CelebA-HQ model trains in 3 days. All fine-tuning is performed in one day.

So, you can see how long it takes to train such models, even on an NVIDIA V100 GPU (16 GB) with a batch size of 6!

Model Benchmarks

As a personal opinion, I would note that the precision results are sensitive to the time limits under which each method was run. Even considering that, this PConv model is really convincing as an alternative in both the L1 scores and IScores.

The researchers have claimed that:

Our method outperforms the other methods in most cases across different time periods and hole-to-image area ratios.

Graphical Benchmarks

and looking at the graphical benchmarks, I would not really argue.

Other Uses

Image super resolution task

Resolution Enhance Task Results

Yes, this algorithm can be used to enhance image resolution too. Let the following image tell its own story; I will leave this one as a mystery 😉

Mask Updates with one-to-one mappings

Limitations

I would like to cite the research paper here: the model in itself does not suffer catastrophic performance degradation as holes increase in size, but it does fail for some sparsely structured images, such as the bars on doors, and, like most methods, struggles with the largest of holes.

Note: I have cited the research paper (arXiv:1804.07723) as the primary resource for this article.

Until next time, happy learning!


Published by HackerNoon on 2018/09/14