Infinite Nature: Fly Into a 2D Image and Explore it as a Drone

Written by whatsai | Published 2021/05/03
Tech Story Tags: view-synthesis | image-synthesis | ai | machine-learning | artificial-intelligence | ml | computer-vision | hackernoon-top-story | web-monetization

TLDR The next step for view synthesis is Perpetual View Generation, where the goal is to take a single image, fly into it, and explore the landscape. The task is extremely complex and will improve over time: from one picture, the method generates an entire bird's-eye-view video in the desired direction, using a state-of-the-art depth estimation network called MiDaS to recover the scene's geometry. Imagine, just a couple of papers down the line, how useful this technology could be for video games or flight simulators.

The next step for view synthesis: Perpetual View Generation, where the goal is to take an image, fly into it, and explore the landscape!

Watch the video

References

Read the full article: https://www.louisbouchard.me/infinite-nature/
Paper: Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image. https://arxiv.org/pdf/2012.09855.pdf

Project link: https://infinite-nature.github.io/

Code: https://github.com/google-research/google-research/tree/master/infinite_nature

Colab demo: https://colab.research.google.com/github/google-research/google-research/blob/master/infinite_nature/infinite_nature_demo.ipynb#scrollTo=sCuRX1liUEVM

Video Transcript

This week's paper is about a new task called "Perpetual View Generation," where the goal is to take an image, fly into it, and explore the landscape. This is the first solution to this problem, but it is extremely impressive considering that we only feed one image into the network and it can generate what it would look like to fly into it like a bird. Of course, the task is extremely complex and will improve over time. As Two Minute Papers would say, imagine in just a couple of papers down the line how useful this technology could be for video games or flight simulators! I'm amazed to see how well it already works, even though this is the paper introducing this new task, and especially considering how complex the task is.
It is complex not only because it has to generate new viewpoints, like GANverse3D does (which I covered in a previous video), but also because it has to generate a new image at each frame; once you pass a couple of dozen frames, you will have close to nothing left of the original image to use. And yes, this can be done over hundreds of frames while still looking a lot better than current view synthesis approaches. Let's see how they generate an entire bird's-eye-view video in the desired direction from a single picture, and how you can try it yourself right now without having to set up anything!
To do that, they have to use the geometry of the image, so they first need to produce a disparity map of it. This is done using a state-of-the-art network called MiDaS, which I will not go into, but this is the output it gives. A disparity map is basically an inverse depth map, informing the network of the depths inside the scene.
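Since the video glosses over MiDaS, here is a minimal sketch of how you could obtain a disparity map like this one through MiDaS's public torch.hub entry point. The choice of the small model variant, the transform, and the resizing step are assumptions on my part and are not taken from the Infinite Nature code.

```python
import cv2
import torch

# Minimal sketch: monocular disparity estimation with MiDaS via torch.hub.
# Assumption: the small model variant and its default transform; the Infinite
# Nature pipeline may use a different model or preprocessing.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("coastline.jpg"), cv2.COLOR_BGR2RGB)
batch = midas_transforms.small_transform(img).to(device)

with torch.no_grad():
    prediction = midas(batch)                  # (1, H', W') relative inverse depth
    disparity = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],                    # back to the input resolution
        mode="bicubic",
        align_corners=False,
    ).squeeze()                                # (H, W) disparity map
```

The output is relative (up to scale and shift), which is enough here since the network only needs a consistent notion of which parts of the scene are near and which are far.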
Then we enter the real first step of their technique: the renderer. The goal of this renderer is to generate a new view based on the old view. The new view will be the next frame and, as you understood, the old view is the input image. This is done using a differentiable renderer; differentiable simply means we can use backpropagation to train it, just like we traditionally do with conventional deep nets. The renderer takes the image and the disparity map and produces a three-dimensional mesh representing the scene. Then we simply use this 3D mesh to render an image from a novel viewpoint, P1 in this case. This gives us an amazing new picture that looks just a bit zoomed, but it is not simply zoomed in. As you can see, there are some pink marks on the rendered image and black marks on the disparity map. They correspond to occluded regions and regions outside the field of view of the previous image used as input to the renderer, since the renderer only re-projects what it has already seen and is unable to invent unseen details.
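To make this step more concrete, here is a minimal sketch of the same idea using a simple point-cloud forward warp instead of the paper's differentiable mesh renderer: each pixel is unprojected with the disparity map, moved to the new camera pose, and splatted into the target image, leaving holes where nothing lands. The function name, the pinhole intrinsics K, and the pose convention (R, t of the new camera relative to the old one) are my own assumptions for illustration.

```python
import numpy as np

def render_novel_view(rgb, disparity, K, R, t, eps=1e-6):
    """Sketch of re-rendering an (H, W, 3) image from a new camera pose.

    disparity: (H, W) inverse depth; K: 3x3 intrinsics; R, t: rotation and
    translation of the new camera relative to the old one (assumed convention).
    Returns the warped image and a validity mask (False where holes remain).
    """
    H, W = disparity.shape
    depth = 1.0 / np.maximum(disparity, eps)

    # Unproject every pixel to a 3D point in the old camera frame.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    points = (np.linalg.inv(K) @ pixels.T) * depth.reshape(1, -1)

    # Move the points into the new camera frame and project them back.
    points_new = R @ points + t.reshape(3, 1)
    projected = K @ points_new
    z = projected[2]
    u_new = np.round(projected[0] / np.maximum(z, eps)).astype(int)
    v_new = np.round(projected[1] / np.maximum(z, eps)).astype(int)

    # Splat source colors into the target image; far points are drawn first
    # so that nearer points overwrite them (a crude painter's algorithm).
    colors = rgb.reshape(-1, 3)
    output = np.zeros_like(rgb)
    mask = np.zeros((H, W), dtype=bool)
    for i in np.argsort(-z):
        if z[i] > 0 and 0 <= u_new[i] < W and 0 <= v_new[i] < H:
            output[v_new[i], u_new[i]] = colors[i]
            mask[v_new[i], u_new[i]] = True
    return output, mask  # ~mask marks the occluded / out-of-view holes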
This leads us to quite a problem: how can we produce a complete and realistic image if we do not know what goes in those regions? Well, we can use another network that takes this new disparity map and image as input to 'refine' them. This other network, called SPADE, is also a state-of-the-art network, but for conditional image synthesis. It is a conditional image synthesis network because we need to tell it about some conditions, which in this case are the pink and black missing parts. We basically send this incomplete image to the second network to fill in the holes and add the necessary details. You can see this SPADE network as a GAN architecture where the image is first encoded into a latent code that gives us the style of the image. Then, this code is decoded to generate a new version of the initial image, simply filling the missing parts with new information that follows the style present in the encoded information.
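For intuition about what "conditional" means here, this is a minimal sketch of SPADE's core mechanism, spatially-adaptive denormalization: activations are normalized without learned affine parameters and then re-scaled and re-shifted per pixel by parameters predicted from the conditioning map. The channel counts and the choice of conditioning input (rendered RGB plus hole mask) are illustrative assumptions, not the configuration used by the SPADE paper or Infinite Nature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADEBlock(nn.Module):
    """Sketch of SPADE's core idea: normalize features, then re-scale and
    re-shift them per pixel using parameters predicted from a conditioning
    map (here, the incomplete rendered RGB plus a hole mask)."""

    def __init__(self, feature_channels, cond_channels=4, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feature_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gamma = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)

    def forward(self, features, condition):
        # Resize the condition to match the feature resolution.
        condition = F.interpolate(condition, size=features.shape[2:], mode="nearest")
        shared = self.shared(condition)
        # Parameter-free normalization, then spatially-varying modulation.
        return self.norm(features) * (1 + self.gamma(shared)) + self.beta(shared)

# Usage sketch: modulate generator features with the rendered frame and its mask.
features = torch.randn(1, 64, 32, 32)    # some intermediate generator features
condition = torch.randn(1, 4, 256, 256)  # rendered RGB (3 channels) + mask (1)
out = SPADEBlock(feature_channels=64)(features, condition)
```

The point of this design is that the conditioning signal is injected at every spatial location, so the generator knows exactly where the holes are and can fill them in a style consistent with the rest of the frame.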
And voilà! You have your new frame and its inverse depth map. You can now simply repeat the process over and over to get all future frames, which now looks like this. Using each output as the input of the next iteration, you can produce as many frames as you want, always following the desired viewpoint and the context of the previous frame!
[Figure 2 of the paper: the render-refine-repeat loop, followed by video examples.]
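Putting the pieces together, the whole render-refine-repeat loop can be summarized in a few lines. The three callables stand in for the hypothetical building blocks sketched above, and the camera trajectory and intrinsics are assumed to be given; this is a schematic of the loop, not the authors' implementation.

```python
def perpetual_view_generation(image, camera_trajectory, K,
                              estimate_disparity, render_novel_view, refine):
    """Schematic render-refine-repeat loop. The three callables stand in for
    monocular disparity estimation, view re-projection, and SPADE-style
    refinement (hypothetical signatures, for illustration only)."""
    rgb = image
    disparity = estimate_disparity(rgb)   # e.g. a MiDaS prediction
    frames = []
    for R, t in camera_trajectory:        # one small camera motion per frame
        # Step 1: re-project the current frame to the next camera pose.
        rendered_rgb, mask = render_novel_view(rgb, disparity, K, R, t)
        # Step 2: the refinement network fills occlusions and adds detail,
        # returning a complete RGB frame together with its new disparity map.
        rgb, disparity = refine(rendered_rgb, disparity, mask)
        # Step 3: the refined output becomes the input of the next iteration.
        frames.append(rgb)
    return frames
```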
As you know, such powerful algorithms frequently need data and annotations to be trained on, and this one is no exception. To create their training set, the authors needed aerial footage of nature taken from drones, which they collected from YouTube, manually curated, and pre-processed into their own dataset. Fortunately for other researchers wanting to attack this challenge, you don't have to do the same thing, since they released this dataset of aerial footage of natural coastal scenes used to train their algorithm. It is available for download on their project page, which is linked in the description below.
As I mentioned, you can even try it yourself, as they made the code publicly available, and they also created a demo you can run right now on Google Colab. The link is in the description below. You just have to run the first few cells, which will install the code and dependencies and load their model, and there you go: you can now fly freely around the images they provide and even upload your own! Of course, all the steps I just mentioned are already implemented. Simply run the code and enjoy!
You can find the article covering this paper on my newly created website, along with our Discord community, my guide to learning machine learning, and more exciting things I will be sharing there. Feel free to become a free member and get notified of new articles I share! Congratulations to the winners of the NVIDIA GTC giveaway, all appearing on the screen right now; you should have received an email from me with the DLI code. Thank you for watching!



Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.