Infinite Nature: Fly Into a 2D Image and Explore it as a Drone

Written by whatsai | Published 2021/05/03
Tech Story Tags: view-synthesis | image-synthesis | ai | machine-learning | artificial-intelligence | ml | computer-vision | hackernoon-top-story | web-monetization

TLDR The next step for view synthesis is Perpetual View Generation, where the goal is to take a single image, fly into it, and explore the landscape. The task is extremely complex and will improve over time: from one picture, the method generates an entire bird's-eye-view video in the desired direction, using a state-of-the-art depth estimation network called MiDaS to recover the scene's geometry. Imagine, just a couple of papers down the line, how useful this technology could be for video games or flight simulators.

The next step for view synthesis: Perpetual View Generation, where the goal is to take an image, fly into it, and explore the landscape!

Watch the video

References

Read the full article: https://www.louisbouchard.me/infinite-nature/
Paper: Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image. https://arxiv.org/pdf/2012.09855.pdf

Project link: https://infinite-nature.github.io/

Code: https://github.com/google-research/google-research/tree/master/infinite_nature

Colab demo: https://colab.research.google.com/github/google-research/google-research/blob/master/infinite_nature/infinite_nature_demo.ipynb#scrollTo=sCuRX1liUEVM

Video Transcript

This week's paper is about a new task called "Perpetual View Generation," where the goal is to take an image, fly into it, and explore the landscape. This is the first solution to this problem, but it is extremely impressive considering that we only feed one image into the network and it can generate what it would look like to fly into it like a bird. Of course, the task is extremely complex and will improve over time. As Two Minute Papers would say, imagine in just a couple of papers down the line how useful this technology could be for video games or flight simulators! I'm amazed to see how well it already works, even though this is the paper introducing this new task, and especially considering how complex the task is.
It is complex not only because it has to generate new viewpoints, like GANverse3D does (which I covered in a previous video), but also because it has to generate a new image at each frame; once you pass a couple of dozen frames, you will have close to nothing left of the original image to use. And yes, this can be done over hundreds of frames while still looking a lot better than current view synthesis approaches. Let's see how they generate an entire bird's-eye-view video in the desired direction from a single picture, and how you can try it yourself right now without having to set up anything!
To do that, they have to use the geometry of the image, so they first need to produce a disparity map of it. This is done using a state-of-the-art network called MiDaS, which I will not go into, but this is the output it gives. A disparity map is basically an inverse depth map, informing the network of the depths inside the scene.
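Since the video glosses over MiDaS, here is a minimal sketch of how you could obtain a disparity map like this one through MiDaS's public torch.hub entry point. The choice of the small model variant, the transform, and the resizing step are assumptions on my part and are not taken from the Infinite Nature code.

```python
import cv2
import torch

# Minimal sketch: monocular disparity estimation with MiDaS via torch.hub.
# Assumption: the small model variant and its default transform; the Infinite
# Nature pipeline may use a different model or preprocessing.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").to(device).eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("coastline.jpg"), cv2.COLOR_BGR2RGB)
batch = midas_transforms.small_transform(img).to(device)

with torch.no_grad():
    prediction = midas(batch)                  # (1, H', W') relative inverse depth
    disparity = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],                    # back to the input resolution
        mode="bicubic",
        align_corners=False,
    ).squeeze()                                # (H, W) disparity map
```

The output is relative (up to scale and shift), which is enough here since the network only needs a consistent notion of which parts of the scene are near and which are far.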
Then we enter the real first step of their technique: the renderer. The goal of this renderer is to generate a new view based on the old view. The new view will be the next frame and, as you understood, the old view is the input image. This is done using a differentiable renderer; differentiable simply means we can use backpropagation to train it, just like we traditionally do with conventional deep nets. The renderer takes the image and the disparity map and produces a three-dimensional mesh representing the scene. Then we simply use this 3D mesh to render an image from a novel viewpoint, P1 in this case. This gives us an amazing new picture that looks just a bit zoomed, but it is not simply zoomed in. As you can see, there are some pink marks on the rendered image and black marks on the disparity map. They correspond to occluded regions and regions outside the field of view of the previous image used as input to the renderer, since the renderer only re-projects what it has already seen and is unable to invent unseen details.
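To make this step more concrete, here is a minimal sketch of the same idea using a simple point-cloud forward warp instead of the paper's differentiable mesh renderer: each pixel is unprojected with the disparity map, moved to the new camera pose, and splatted into the target image, leaving holes where nothing lands. The function name, the pinhole intrinsics K, and the pose convention (R, t of the new camera relative to the old one) are my own assumptions for illustration.

```python
import numpy as np

def render_novel_view(rgb, disparity, K, R, t, eps=1e-6):
    """Sketch of re-rendering an (H, W, 3) image from a new camera pose.

    disparity: (H, W) inverse depth; K: 3x3 intrinsics; R, t: rotation and
    translation of the new camera relative to the old one (assumed convention).
    Returns the warped image and a validity mask (False where holes remain).
    """
    H, W = disparity.shape
    depth = 1.0 / np.maximum(disparity, eps)

    # Unproject every pixel to a 3D point in the old camera frame.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    points = (np.linalg.inv(K) @ pixels.T) * depth.reshape(1, -1)

    # Move the points into the new camera frame and project them back.
    points_new = R @ points + t.reshape(3, 1)
    projected = K @ points_new
    z = projected[2]
    u_new = np.round(projected[0] / np.maximum(z, eps)).astype(int)
    v_new = np.round(projected[1] / np.maximum(z, eps)).astype(int)

    # Splat source colors into the target image; far points are drawn first
    # so that nearer points overwrite them (a crude painter's algorithm).
    colors = rgb.reshape(-1, 3)
    output = np.zeros_like(rgb)
    mask = np.zeros((H, W), dtype=bool)
    for i in np.argsort(-z):
        if z[i] > 0 and 0 <= u_new[i] < W and 0 <= v_new[i] < H:
            output[v_new[i], u_new[i]] = colors[i]
            mask[v_new[i], u_new[i]] = True
    return output, mask  # ~mask marks the occluded / out-of-view holes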
This leads us to quite a problem: how can we produce a complete and realistic image if we do not know what goes in those regions? Well, we can use another network that takes this new disparity map and image as input to 'refine' them. This other network, called SPADE, is also a state-of-the-art network, but for conditional image synthesis. It is a conditional image synthesis network because we need to tell it about some conditions, which in this case are the pink and black missing parts. We basically send this incomplete image to the second network to fill in the holes and add the necessary details. You can see this SPADE network as a GAN architecture where the image is first encoded into a latent code that gives us the style of the image. Then, this code is decoded to generate a new version of the initial image, simply filling the missing parts with new information that follows the style present in the encoded information.
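For intuition about what "conditional" means here, this is a minimal sketch of SPADE's core mechanism, spatially-adaptive denormalization: activations are normalized without learned affine parameters and then re-scaled and re-shifted per pixel by parameters predicted from the conditioning map. The channel counts and the choice of conditioning input (rendered RGB plus hole mask) are illustrative assumptions, not the configuration used by the SPADE paper or Infinite Nature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADEBlock(nn.Module):
    """Sketch of SPADE's core idea: normalize features, then re-scale and
    re-shift them per pixel using parameters predicted from a conditioning
    map (here, the incomplete rendered RGB plus a hole mask)."""

    def __init__(self, feature_channels, cond_channels=4, hidden=128):
        super().__init__()
        self.norm = nn.BatchNorm2d(feature_channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(cond_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.gamma = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)

    def forward(self, features, condition):
        # Resize the condition to match the feature resolution.
        condition = F.interpolate(condition, size=features.shape[2:], mode="nearest")
        shared = self.shared(condition)
        # Parameter-free normalization, then spatially-varying modulation.
        return self.norm(features) * (1 + self.gamma(shared)) + self.beta(shared)

# Usage sketch: modulate generator features with the rendered frame and its mask.
features = torch.randn(1, 64, 32, 32)    # some intermediate generator features
condition = torch.randn(1, 4, 256, 256)  # rendered RGB (3 channels) + mask (1)
out = SPADEBlock(feature_channels=64)(features, condition)
```

The point of this design is that the conditioning signal is injected at every spatial location, so the generator knows exactly where the holes are and can fill them in a style consistent with the rest of the frame.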
And voilà! You have your new frame and its inverse depth map. You can now simply repeat the process over and over to get all future frames, which now looks like this. Using each output as the input of the next iteration, you can produce as many frames as you want, always following the desired viewpoint and the context of the previous frame!
[Figure 2 of the paper: the render-refine-repeat loop, followed by video examples.]
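Putting the pieces together, the whole render-refine-repeat loop can be summarized in a few lines. The three callables stand in for the hypothetical building blocks sketched above, and the camera trajectory and intrinsics are assumed to be given; this is a schematic of the loop, not the authors' implementation.

```python
def perpetual_view_generation(image, camera_trajectory, K,
                              estimate_disparity, render_novel_view, refine):
    """Schematic render-refine-repeat loop. The three callables stand in for
    monocular disparity estimation, view re-projection, and SPADE-style
    refinement (hypothetical signatures, for illustration only)."""
    rgb = image
    disparity = estimate_disparity(rgb)   # e.g. a MiDaS prediction
    frames = []
    for R, t in camera_trajectory:        # one small camera motion per frame
        # Step 1: re-project the current frame to the next camera pose.
        rendered_rgb, mask = render_novel_view(rgb, disparity, K, R, t)
        # Step 2: the refinement network fills occlusions and adds detail,
        # returning a complete RGB frame together with its new disparity map.
        rgb, disparity = refine(rendered_rgb, disparity, mask)
        # Step 3: the refined output becomes the input of the next iteration.
        frames.append(rgb)
    return frames
```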
As you know, such powerful algorithms frequently need data and annotations to be trained on, and this one is no exception. To create their training set, the authors needed aerial footage of nature taken from drones, which they collected from YouTube, manually curated, and pre-processed into their own dataset. Fortunately for other researchers wanting to attack this challenge, you don't have to do the same thing, since they released this dataset of aerial footage of natural coastal scenes used to train their algorithm. It is available for download on their project page, which is linked in the description below.
As I mentioned, you can even try it yourself, as they made the code publicly available, and they also created a demo you can run right now on Google Colab. The link is in the description below. You just have to run the first few cells, which will install the code and dependencies and load their model, and there you go: you can now fly freely around the images they provide and even upload your own! Of course, all the steps I just mentioned are already implemented. Simply run the code and enjoy!
You can find the article covering this paper on my newly created website, along with our Discord community, my guide to learning machine learning, and more exciting things I will be sharing there. Feel free to become a free member and get notified of new articles I share! Congratulations to the winners of the NVIDIA GTC giveaway, all appearing on the screen right now; you should have received an email from me with the DLI code. Thank you for watching!



Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.