3D Articulated Shape Reconstruction from Videos

Written by whatsai | Published 2021/05/20
Tech Story Tags: artificial-intelligence | ai | machine-learning | computer-vision | 3d-reconstruction | hackernoon-top-story | youtubers | youtube-transcripts | web-monetization

TLDR: 3D Articulated Shape Reconstruction from Videos is a new method for generating 3D models of humans or animals moving from only a short video as input. Google Research, along with Carnegie Mellon University, just published a paper called LASR. Learn more about the project below. Watch the video with Louis Bouchard at the bottom of the page. Read the full article: https://www.louisbouchard.ai/3d-reconstruction-from-videos

With LASR, you can generate 3D models of humans or animals moving from only a short video as input.
This task is called 3D reconstruction, and Google Research, along with Carnegie Mellon University, just published a paper called LASR: Learning Articulated Shape Reconstruction from a Monocular Video.
Learn more about the project below.

Watch the video

References

Read the full article: https://www.louisbouchard.ai/3d-reconstruction-from-videos
Gengshan Yang et al. (2021), LASR: Learning Articulated Shape Reconstruction from a Monocular Video, CVPR, https://lasr-google.github.io/

Video Transcript

00:00
How hard is it for a machine to understand an image?
00:03
Researchers have made a lot of progress in image classification, image detection, and
00:08
image segmentation.
00:09
These three tasks progressively deepen our understanding of what's going on in an image.
00:14
In the same order, classification tells us what's in the image.
00:18
Detection tells us where it is approximately, and segmentation precisely tells us where
00:23
it is.
00:24
Now, an even more complex step would be to represent this image in the real world.
00:29
In other words, it would be to represent an object taken from an image or video into a
00:34
3D surface, just like GANverse3D can do for inanimate objects, as I showed in a recent
00:41
video.
00:42
This demonstrates a deep understanding of the image or video by the model, representing
00:46
the complete shape of an object, which is why it is such a complex task.
00:51
Even more challenging is to do the same thing on nonrigid shapes.
00:55
Or rather, on humans and animals, objects that can be weirdly shaped and even deformed
01:00
to a certain extent.
01:02
This task of generating a 3D model based on a video or images is called 3D reconstruction,
01:08
and Google Research, along with Carnegie Mellon University just published a paper called LASR:
01:14
Learning Articulated Shape Reconstruction from a Monocular Video.
01:18
As the name says, this is a new method for generating 3D models of humans or animals
01:23
moving from only a short video as input.
01:26
Indeed, it actually understands that this is an odd shape, that it can move, but still
01:31
needs to stay attached, as this is still one "object" and not just many objects together.
01:37
Typically, 3D modeling techniques needed a data prior.
01:40
In this case, the data prior was an approximate shape of the complex objects, which looks
01:45
like this...
01:46
As you can see, it had to be quite similar to the actual human or animal, which is not
01:51
very intelligent.
01:52
With LASR, you can produce even better results.
01:55
With no prior at all, it starts with just a plain sphere, whatever object it has to reconstruct.
02:01
You can imagine what this means for generalizability and how powerful this can be when you don't
02:06
have to explicitly tell the network both what the object is and how it "typically" looks.
02:11
This is a significant step forward!
02:13
But how does it work?
02:15
As I said, it only needs a video, but there are still some pre-processing steps to do.
02:19
Don't worry.
02:20
These steps are quite well-understood in computer vision.
02:22
As you may recall, I mentioned image segmentation at the beginning of the video.
02:27
We need this segmentation of the object, which can be done easily using a trained neural
02:30
network.
02:31
Then, we need the optical flow for each frame, which is the motion of objects between consecutive
02:37
frames of the video.
02:38
This is also easily found using computer vision techniques and improved with neural networks,
02:43
as I covered not even a year ago on my channel.
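To make these two pre-processing steps concrete, here is a rough sketch of how the per-frame masks and optical flow could be computed with off-the-shelf networks. Mask R-CNN and RAFT from torchvision are illustrative choices, not necessarily the exact models used in the paper.

```python
# Illustrative pre-processing: one object mask per frame plus optical flow
# between consecutive frames, using off-the-shelf torchvision models.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.optical_flow import raft_large

segmenter = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
flow_net = raft_large(weights="DEFAULT").eval()

@torch.no_grad()
def preprocess(frames):
    """frames: [T, 3, H, W] float tensor in [0, 1], with H and W divisible by 8."""
    masks, flows = [], []
    for t in range(len(frames)):
        det = segmenter([frames[t]])[0]                    # assumes at least one detection
        masks.append((det["masks"][0, 0] > 0.5).float())   # highest-scoring instance mask
        if t + 1 < len(frames):
            a = frames[t:t + 1] * 2 - 1                    # RAFT expects inputs in [-1, 1]
            b = frames[t + 1:t + 2] * 2 - 1
            flows.append(flow_net(a, b)[-1][0])            # last refinement iteration, [2, H, W]
    return torch.stack(masks), torch.stack(flows)
```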
02:46
They start the rendering process with a sphere, assuming it is a rigid object, so an object
02:51
that does not have articulations.
02:53
With this assumption, they iteratively optimize their model's estimates of the shape and
02:57
the camera viewpoint for 20 epochs.
03:00
This rigid assumption is shown here with the number of bones equal to zero, meaning that
03:05
nothing can move separately.
03:07
Then, we get back to real life, where the human is not rigid.
03:11
Now, the goal is to have an accurate 3D model that can move realistically.
03:16
This is achieved by increasing the number of bones and vertices to make the model more
03:20
and more precise.
03:21
Here, the vertices are 3-dimensional points where the edges and surfaces of the rendered
03:26
object connect, and the bones are, well, they are basically bones.
03:30
These bones are all the parts of the objects that move during the video with either translations
03:35
or rotations.
03:36
Both the bones and vertices are incrementally augmented until we reach stage 3, where the
03:41
model has learned to generate a pretty accurate render of the current object.
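Here is a hypothetical sketch of that coarse-to-fine schedule. The bone and vertex counts, the epoch numbers, and the resize/fit helpers are placeholders, not the paper's exact settings.

```python
# Coarse-to-fine optimization: start rigid (zero bones) and grow the number
# of bones and mesh vertices at each stage. All numbers are illustrative.
COARSE_TO_FINE = [
    dict(bones=0,  vertices=642,  epochs=20),   # stage 0: rigid sphere
    dict(bones=8,  vertices=1280, epochs=20),   # stage 1: first articulations
    dict(bones=16, vertices=2560, epochs=20),   # stage 2
    dict(bones=32, vertices=5120, epochs=20),   # stage 3: finest model
]

def optimize(model, frames, masks, flows):
    """`model` is a hypothetical stand-in that exposes resize() and fit()."""
    for stage in COARSE_TO_FINE:
        model.resize(bones=stage["bones"], vertices=stage["vertices"])
        for _ in range(stage["epochs"]):
            model.fit(frames, masks, flows)
```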
03:45
Here, they also need a model to render this object, which is called a differentiable renderer.
03:50
I won't dive into how it works as I already covered it in previous videos, but basically,
03:55
it is a model able to create a 3-dimensional representation of an object.
03:59
What's special about it is that it is differentiable,
04:03
meaning that you can train this model in a similar way as a typical neural network with
04:07
back-propagation.
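To show what "differentiable" means in practice, here is a minimal silhouette renderer built with PyTorch3D. It is only an illustration of the idea; LASR's own renderer, mesh, and camera setup may differ.

```python
# A differentiable silhouette renderer: the rendered silhouette is a
# differentiable function of the mesh vertices, so gradients from a loss
# on the image can flow back to the 3D shape.
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, look_at_view_transform, RasterizationSettings,
    MeshRasterizer, SoftSilhouetteShader, MeshRenderer,
)

R, T = look_at_view_transform(dist=2.7, elev=0, azim=0)   # a placeholder camera pose
cameras = FoVPerspectiveCameras(R=R, T=T)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(
        cameras=cameras,
        raster_settings=RasterizationSettings(
            image_size=256,
            blur_radius=1e-4,        # soft edges are what make the silhouette differentiable
            faces_per_pixel=50,
        ),
    ),
    shader=SoftSilhouetteShader(),
)

sphere = ico_sphere(3)                                     # the plain sphere everything starts from
verts = sphere.verts_packed().clone().requires_grad_(True)
faces = sphere.faces_packed()
silhouette = renderer(Meshes(verts=[verts], faces=[faces]))[..., 3]  # alpha channel = silhouette
silhouette.sum().backward()                                # gradients reach the vertices
```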
04:08
Here, everything is trained together, optimizing the results following the four stages we just
04:13
saw, improving the rendered result at each stage.
04:17
The model then learns just like any other machine learning model using gradient descent
04:22
and updating the model's parameters based on the difference between the rendered output
04:27
and the ground-truth video measurements.
04:29
So it doesn't even need to see a ground-truth version of the rendered object.
04:34
It only needs the video, segmentation, and optical flow results to learn, by transforming
04:39
the rendered object back into a segmented image and its optical flow and comparing them
04:44
to the input.
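A hedged sketch of that comparison: the rendered silhouette and flow are matched against the pre-computed segmentation and optical flow of the input video. The distance functions and weights below are illustrative; the paper's full objective may include additional terms.

```python
# Self-supervised reconstruction loss: compare what the model renders
# (silhouette + flow) against what was measured from the input video.
import torch.nn.functional as F

def reconstruction_loss(rendered_sil, rendered_flow, observed_mask, observed_flow,
                        w_sil=1.0, w_flow=1.0):
    sil_loss = F.mse_loss(rendered_sil, observed_mask)     # does the shape cover the right pixels?
    flow_loss = F.l1_loss(rendered_flow, observed_flow)    # does the shape move the right way?
    return w_sil * sil_loss + w_flow * flow_loss
```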
04:45
What is even better is that all this is done in a self-supervised learning process,
04:50
meaning that you give the model the videos to train on with their corresponding segmentation
04:55
and optical flow results, and it iteratively learns to render the objects during training.
05:00
No annotations are needed at all!
05:03
And Voilà, you have your complex 3D renderer without any special training or ground truth
05:08
needed!
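Putting the pieces together, the whole optimization could look roughly like the loop below. model.render and reconstruction_loss refer to the hypothetical stand-ins from the earlier sketches; note that no 3D ground truth appears anywhere.

```python
# Hypothetical outer loop: only the video-derived masks and flows supervise
# the model; gradients flow through the differentiable renderer.
import torch

def fit_video(model, frames, masks, flows, epochs=20, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        rendered_sil, rendered_flow = model.render(frames)   # hypothetical differentiable render
        loss = reconstruction_loss(rendered_sil, rendered_flow, masks, flows)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()        # update shape, bones, and camera parameters together
    return model
```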
05:09
If gradient descent, epoch, parameters, or self-supervised learning are still unclear
05:14
concepts to you, I invite you to watch the series of short videos I made explaining the
05:18
basics of machine learning.
05:20
As always, the full article is available on my website louisbouchard.ai, with many other
05:25
great papers explained and more information.
05:28
Thank you for watching.



Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2021/05/20