3D Articulated Shape Reconstruction from Videos

Written by whatsai | Published 2021/05/20
Tech Story Tags: artificial-intelligence | ai | machine-learning | computer-vision | 3d-reconstruction | hackernoon-top-story | youtubers | youtube-transcripts | web-monetization

TLDR: 3D Articulated Shape Reconstruction from Videos is a new method for generating 3D models of humans or animals moving from only a short video as input. Google Research, along with Carnegie Mellon University, just published a paper called LASR. Learn more about the project below. Watch the video with Louis Bouchard at the bottom of the page. Read the full article: https://www.louisbouchard.ai/3d-reconstruction-from-videos

With LASR, you can generate 3D models of humans or animals moving from only a short video as input.
This task is called 3D reconstruction, and Google Research, along with Carnegie Mellon University, just published a paper called LASR: Learning Articulated Shape Reconstruction from a Monocular Video.
Learn more about the project below.

Watch the video

References

Read the full article: https://www.louisbouchard.ai/3d-reconstruction-from-videos
Gengshan Yang et al. (2021), LASR: Learning Articulated Shape Reconstruction from a Monocular Video, CVPR, https://lasr-google.github.io/

Video Transcript

00:00
How hard is it for a machine to understand an image?
00:03
Researchers have made a lot of progress in image classification, image detection, and
00:08
image segmentation.
00:09
These three tasks progressively deepen our understanding of what's going on in an image.
00:14
In the same order, classification tells us what's in the image.
00:18
Detection tells us where it is approximately, and segmentation precisely tells us where
00:23
it is.
00:24
Now, an even more complex step would be to represent this image in the real world.
00:29
In other words, it would be to represent an object taken from an image or video into a
00:34
3D surface, just like GANverse3D can do for inanimate objects, as I showed in a recent
00:41
video.
00:42
This demonstrates a deep understanding of the image or video by the model, representing
00:46
the complete shape of an object, which is why it is such a complex task.
00:51
Even more challenging is to do the same thing on nonrigid shapes.
00:55
Or rather, on humans and animals, objects that can be weirdly shaped and even deformed
01:00
to a certain extent.
01:02
This task of generating a 3D model based on a video or images is called 3D reconstruction,
01:08
and Google Research, along with Carnegie Mellon University just published a paper called LASR:
01:14
Learning Articulated Shape Reconstruction from a Monocular Video.
01:18
As the name says, this is a new method for generating 3D models of humans or animals
01:23
moving from only a short video as input.
01:26
Indeed, it actually understands that this is an odd shape, that it can move, but still
01:31
needs to stay attached, as this is still one "object" and not just many objects together.
01:37
Typically, 3D modeling techniques needed a data prior.
01:40
In this case, the data prior was an approximate shape of the complex objects, which looks
01:45
like this...
01:46
As you can see, it had to be quite similar to the actual human or animal, which is not
01:51
very intelligent.
01:52
With LASR, you can produce even better results.
01:55
With no prior at all, it starts with just a plain sphere, whatever object it has to reconstruct.
02:01
You can imagine what this means for generalizability and how powerful this can be when you don't
02:06
have to explicitly tell the network both what the object is and how it "typically" looks.
02:11
This is a significant step forward!
02:13
But how does it work?
02:15
As I said, it only needs a video, but there are still some pre-processing steps to do.
02:19
Don't worry.
02:20
These steps are quite well-understood in computer vision.
02:22
As you may recall, I mentioned image segmentation at the beginning of the video.
02:27
We need this segmentation of the object, which can be done easily using a trained neural
02:30
network.
02:31
Then, we need the optical flow for each frame, which is the motion of objects between consecutive
02:37
frames of the video.
02:38
This is also easily found using computer vision techniques and improved with neural networks,
02:43
as I covered not even a year ago on my channel.
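To make these two pre-processing steps concrete, here is a rough sketch of how the per-frame masks and optical flow could be computed with off-the-shelf networks. Mask R-CNN and RAFT from torchvision are illustrative choices, not necessarily the exact models used in the paper.

```python
# Illustrative pre-processing: one object mask per frame plus optical flow
# between consecutive frames, using off-the-shelf torchvision models.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.optical_flow import raft_large

segmenter = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
flow_net = raft_large(weights="DEFAULT").eval()

@torch.no_grad()
def preprocess(frames):
    """frames: [T, 3, H, W] float tensor in [0, 1], with H and W divisible by 8."""
    masks, flows = [], []
    for t in range(len(frames)):
        det = segmenter([frames[t]])[0]                    # assumes at least one detection
        masks.append((det["masks"][0, 0] > 0.5).float())   # highest-scoring instance mask
        if t + 1 < len(frames):
            a = frames[t:t + 1] * 2 - 1                    # RAFT expects inputs in [-1, 1]
            b = frames[t + 1:t + 2] * 2 - 1
            flows.append(flow_net(a, b)[-1][0])            # last refinement iteration, [2, H, W]
    return torch.stack(masks), torch.stack(flows)
```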
02:46
They start the rendering process with a sphere, assuming it is a rigid object, so an object
02:51
that does not have articulations.
02:53
With this assumption, they iteratively optimize their model's estimates of the shape and
02:57
the camera viewpoint for 20 epochs.
03:00
This rigid assumption is shown here with the number of bones equal to zero, meaning that
03:05
nothing can move separately.
03:07
Then, we get back to real life, where the human is not rigid.
03:11
Now, the goal is to have an accurate 3D model that can move realistically.
03:16
This is achieved by increasing the number of bones and vertices to make the model more
03:20
and more precise.
03:21
Here, the vertices are 3-dimensional points where the edges and surfaces of the rendered
03:26
object connect, and the bones are, well, they are basically bones.
03:30
These bones are all the parts of the objects that move during the video with either translations
03:35
or rotations.
03:36
Both the bones and vertices are incrementally augmented until we reach stage 3, where the
03:41
model has learned to generate a pretty accurate render of the current object.
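Here is a hypothetical sketch of that coarse-to-fine schedule. The bone and vertex counts, the epoch numbers, and the resize/fit helpers are placeholders, not the paper's exact settings.

```python
# Coarse-to-fine optimization: start rigid (zero bones) and grow the number
# of bones and mesh vertices at each stage. All numbers are illustrative.
COARSE_TO_FINE = [
    dict(bones=0,  vertices=642,  epochs=20),   # stage 0: rigid sphere
    dict(bones=8,  vertices=1280, epochs=20),   # stage 1: first articulations
    dict(bones=16, vertices=2560, epochs=20),   # stage 2
    dict(bones=32, vertices=5120, epochs=20),   # stage 3: finest model
]

def optimize(model, frames, masks, flows):
    """`model` is a hypothetical stand-in that exposes resize() and fit()."""
    for stage in COARSE_TO_FINE:
        model.resize(bones=stage["bones"], vertices=stage["vertices"])
        for _ in range(stage["epochs"]):
            model.fit(frames, masks, flows)
```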
03:45
Here, they also need a model to render this object, which is called a differentiable renderer.
03:50
I won't dive into how it works as I already covered it in previous videos, but basically,
03:55
it is a model able to create a 3-dimensional representation of an object.
03:59
What's special about it is that it is differentiable,
04:03
meaning that you can train this model in a similar way as a typical neural network with
04:07
back-propagation.
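To show what "differentiable" means in practice, here is a minimal silhouette renderer built with PyTorch3D. It is only an illustration of the idea; LASR's own renderer, mesh, and camera setup may differ.

```python
# A differentiable silhouette renderer: the rendered silhouette is a
# differentiable function of the mesh vertices, so gradients from a loss
# on the image can flow back to the 3D shape.
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, look_at_view_transform, RasterizationSettings,
    MeshRasterizer, SoftSilhouetteShader, MeshRenderer,
)

R, T = look_at_view_transform(dist=2.7, elev=0, azim=0)   # a placeholder camera pose
cameras = FoVPerspectiveCameras(R=R, T=T)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(
        cameras=cameras,
        raster_settings=RasterizationSettings(
            image_size=256,
            blur_radius=1e-4,        # soft edges are what make the silhouette differentiable
            faces_per_pixel=50,
        ),
    ),
    shader=SoftSilhouetteShader(),
)

sphere = ico_sphere(3)                                     # the plain sphere everything starts from
verts = sphere.verts_packed().clone().requires_grad_(True)
faces = sphere.faces_packed()
silhouette = renderer(Meshes(verts=[verts], faces=[faces]))[..., 3]  # alpha channel = silhouette
silhouette.sum().backward()                                # gradients reach the vertices
```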
04:08
Here, everything is trained together, optimizing the results following the four stages we just
04:13
saw, improving the rendered result at each stage.
04:17
The model then learns just like any other machine learning model using gradient descent
04:22
and updating the model's parameters based on the difference between the rendered output
04:27
and the ground-truth video measurements.
04:29
So it doesn't even need to see a ground-truth version of the rendered object.
04:34
It only needs the video, segmentation, and optical flow results to learn, by transforming
04:39
the rendered object back into a segmented image and its optical flow and comparing them
04:44
to the input.
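A hedged sketch of that comparison: the rendered silhouette and flow are matched against the pre-computed segmentation and optical flow of the input video. The distance functions and weights below are illustrative; the paper's full objective may include additional terms.

```python
# Self-supervised reconstruction loss: compare what the model renders
# (silhouette + flow) against what was measured from the input video.
import torch.nn.functional as F

def reconstruction_loss(rendered_sil, rendered_flow, observed_mask, observed_flow,
                        w_sil=1.0, w_flow=1.0):
    sil_loss = F.mse_loss(rendered_sil, observed_mask)     # does the shape cover the right pixels?
    flow_loss = F.l1_loss(rendered_flow, observed_flow)    # does the shape move the right way?
    return w_sil * sil_loss + w_flow * flow_loss
```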
04:45
What is even better is that all this is done in a self-supervised learning process,
04:50
meaning that you give the model the videos to train on with their corresponding segmentation
04:55
and optical flow results, and it iteratively learns to render the objects during training.
05:00
No annotations are needed at all!
05:03
And Voilà, you have your complex 3D renderer without any special training or ground truth
05:08
needed!
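Putting the pieces together, the whole optimization could look roughly like the loop below. model.render and reconstruction_loss refer to the hypothetical stand-ins from the earlier sketches; note that no 3D ground truth appears anywhere.

```python
# Hypothetical outer loop: only the video-derived masks and flows supervise
# the model; gradients flow through the differentiable renderer.
import torch

def fit_video(model, frames, masks, flows, epochs=20, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        rendered_sil, rendered_flow = model.render(frames)   # hypothetical differentiable render
        loss = reconstruction_loss(rendered_sil, rendered_flow, masks, flows)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()        # update shape, bones, and camera parameters together
    return model
```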
05:09
If gradient descent, epoch, parameters, or self-supervised learning are still unclear
05:14
concepts to you, I invite you to watch the series of short videos I made explaining the
05:18
basics of machine learning.
05:20
As always, the full article is available on my website louisbouchard.ai, with many other
05:25
great papers explained and more information.
05:28
Thank you for watching.



Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2021/05/20