Efficient NeRFs for Real-Time Portrait Synthesis (RAD-NeRF)

Written by whatsai | Published 2022/12/05
Tech Story Tags: machine-learning | artificial-intelligence | synthetic-media | ai | computer-vision | youtubers | youtube-transcripts | hackernoon-top-story | web-monetization

TLDR: We’ve heard of deepfakes and NeRFs, and we’ve seen applications that recreate someone’s face and make them say pretty much whatever you want. What you might not know is how inefficient those methods are and how much computing and time they require; we also only ever see the best results, produced for well-documented faces using lots of expensive hardware. RAD-NeRF, a new model by Jiaxiang Tang and colleagues, tackles exactly that: from a single video, it can synthesize a talking head driven by any audio track, in real time and with better quality.

We’ve heard of deepfakes, we’ve heard of NeRFs, and we’ve seen these kinds of applications allowing you to recreate someone’s face and pretty much make him or her say whatever you want.
What you might not know is how inefficient those methods are and how much computing and time they require. Plus, we only see the best results. Keep in mind that what we see online are the results for the faces we could find the most examples of, basically internet personalities, and that the models producing those results are trained with a lot of computing, meaning expensive resources like many graphics cards. Still, the results are really impressive and only getting better.
Fortunately, some people like Jiaxiang Tang and colleagues are working on making those methods more accessible and effective with a new model called RAD-NeRF.
From a single video, they can synthesize the person talking for pretty much any word or sentence, in real time and with better quality. You can animate a talking head following any audio track in real time. This is both so cool and so scary at the same time...

Learn more in the video

References

►Tang, J., Wang, K., Zhou, H., Chen, X., He, D., Hu, T., Liu, J., Zeng, G. and Wang, J., 2022. Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. arXiv preprint arXiv:2211.12368.
►Results/project page: https://me.kiui.moe/radnerf/

Video Transcript

We've heard of deepfakes, we've heard of NeRFs, and we've seen these kinds of applications allowing you to recreate someone's face and pretty much make him say whatever you want. What you might not know is how inefficient those methods are and how much computing and time they require. Plus, we only see the best results. Keep in mind that what we see online is the results associated with the faces we could find the most examples of, so basically internet personalities, and the models producing those results are trained using lots of computing, meaning expensive resources like many graphics cards. Still, the results are really impressive and only getting better.

Fortunately, some people like Jiaxiang Tang and colleagues are working on making those methods more accessible and effective with a new model called RAD-NeRF. But let's hear that from their own model: "Hello! Thanks for watching the supplementary video for our paper, Real-time Neural Radiance Talking Head Synthesis via Decomposed Audio-Spatial Encoding. Our method is person-specific and only needs a three-to-five-minute monocular video for training. After training, the model can synthesize realistic talking heads driven by arbitrary audio in real time, while keeping comparable or better rendering quality compared to previous methods."

So you heard that right: from a single video, they can synthesize the person talking for pretty much any word or sentence, in real time and with better quality. You can animate a talking head following any audio track in real time. This is both so cool and so scary at the same time. Just imagine what could be done if we could make you say anything. At least they still need access to a video of you speaking in front of the camera for five minutes, so it's hard to achieve that without you knowing. Still, as soon as you appear online, anyone will be able to use such a model and create infinite videos of you talking about anything they want. They can even host live streams with this method, which is even more dangerous and makes it even harder to tell what's true or not.

Anyway, even though this is interesting, and I'd love to hear your thoughts in the comments to keep the discussion going, here I wanted to cover something that is only positive and exciting: the science. More precisely, how did they manage to animate talking heads in real time from any audio, using only a video of the face? As they state, their RAD-NeRF model can run 500 times faster than previous works, with better rendering quality and more control. You may ask how that is possible; we usually trade quality for efficiency, yet, incredibly, they manage to improve both. These immense improvements are possible thanks to three main points.

The first two are related to the architecture of the model, more specifically how they adapted the NeRF approach to make it more efficient and to improve the motions of the torso and head. The first step is to make NeRFs more efficient. I won't dive into how NeRFs work since we've covered them numerous times. Basically, it's an approach based on neural networks to reconstruct 3D volumetric scenes from a bunch of 2D images, which means regular images. This is why they take a video as input, as it basically gives you a lot of images of a person from many different angles. A NeRF usually uses a network to predict all the pixel colors and densities from the camera viewpoint you are visualizing, and it does that for every viewpoint you want to show when rotating around the subject, which is extremely computation-hungry, as you are predicting multiple parameters for each coordinate in the image every time, and you are learning to predict all of them.

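To give you a rough idea of what that looks like in practice, here is a minimal sketch of the vanilla NeRF recipe: a small network is asked for a density and a color at each 3D point along a camera ray, and the pixel color is obtained by compositing those predictions along the ray. This is only an illustration of the general idea, not the paper's code; the network size, the sampling scheme, and every name in it are made up for the example.

```python
# Minimal sketch of the vanilla NeRF idea (illustrative, not RAD-NeRF's code):
# an MLP maps a 3D point (plus view direction) to a density and an RGB color,
# and a pixel is rendered by integrating those predictions along a camera ray.
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 3, hidden), nn.ReLU(),   # (x, y, z) + view direction
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                  # density + RGB
        )

    def forward(self, points, view_dirs):
        out = self.mlp(torch.cat([points, view_dirs], dim=-1))
        sigma = torch.relu(out[..., :1])           # volume density >= 0
        rgb = torch.sigmoid(out[..., 1:])          # color in [0, 1]
        return sigma, rgb

def render_ray(model, origin, direction, n_samples=64, near=0.1, far=2.0):
    """Sample points along one ray and alpha-composite the predictions."""
    t = torch.linspace(near, far, n_samples)               # depths along the ray
    points = origin + t[:, None] * direction               # (n_samples, 3)
    dirs = direction.expand_as(points)
    sigma, rgb = model(points, dirs)
    delta = t[1] - t[0]                                    # uniform sample spacing
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)    # opacity per sample
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                # contribution of each sample
    return (weights[:, None] * rgb).sum(dim=0)             # final pixel color

# One pixel = one ray, with one network query per sample.
pixel = render_ray(TinyNeRF(), torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]))
```

One ray per pixel, dozens of samples per ray, one network query per sample, for every single frame: that is where all the computation goes.
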
Plus, in their case, it isn't only a NeRF producing a 3D scene; it also has to match an audio input and fit the lips, mouth, eyes, and movements to what the person says. Instead of predicting all the pixel densities and colors matching the audio for a specific frame, they work with two separate, new, and condensed spaces called grid spaces, or a grid-based NeRF. They translate their coordinates into a smaller 3D grid space, translate their audio into a smaller 2D grid space, and then send both to render the head. This means they never merge the audio data with the spatial data, which would increase the size exponentially by adding two-dimensional inputs to each coordinate. So reducing the size of the audio features, along with keeping the audio and spatial features separate, is what makes the approach so much more efficient.

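Here is a small sketch of that decomposition, just to make the idea tangible: the 3D coordinates are looked up in one compact spatial feature grid, the audio is looked up in a separate, even smaller 2D grid, and only the two short feature vectors are concatenated before a tiny MLP. The dense grids, sizes, and names below are simplifications I made up for the example; the paper's grid encoders are more sophisticated than this, but the point is the same: audio and spatial inputs are never merged into one big grid.

```python
# Sketch of the decomposed audio-spatial idea (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedTalkingHeadField(nn.Module):
    def __init__(self, feat=8, spatial_res=32, audio_res=16, hidden=64):
        super().__init__()
        # Learnable dense feature grids (stand-ins for the paper's grid encoders).
        self.spatial_grid = nn.Parameter(
            torch.randn(1, feat, spatial_res, spatial_res, spatial_res) * 0.01)
        self.audio_grid = nn.Parameter(torch.randn(1, feat, audio_res, audio_res) * 0.01)
        self.head = nn.Sequential(
            nn.Linear(2 * feat, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                    # density + RGB, as in a NeRF
        )

    def forward(self, xyz, audio_coord):
        # xyz: (P, 3) in [-1, 1]; audio_coord: (P, 2) in [-1, 1] (compressed audio code).
        P = xyz.shape[0]
        g3 = xyz.view(1, 1, 1, P, 3)                 # grid_sample layout for a 3D grid
        spatial_feat = F.grid_sample(self.spatial_grid, g3, align_corners=True)
        spatial_feat = spatial_feat.view(-1, P).t()  # (P, feat)
        g2 = audio_coord.view(1, 1, P, 2)            # grid_sample layout for a 2D grid
        audio_feat = F.grid_sample(self.audio_grid, g2, align_corners=True)
        audio_feat = audio_feat.view(-1, P).t()      # (P, feat)
        # The only place audio and spatial information meet: two short vectors.
        out = self.head(torch.cat([spatial_feat, audio_feat], dim=-1))
        return torch.relu(out[..., :1]), torch.sigmoid(out[..., 1:])   # density, color

field = DecomposedTalkingHeadField()
sigma, rgb = field(torch.rand(1024, 3) * 2 - 1, torch.rand(1024, 2) * 2 - 1)
```
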
But how can the results be better if they use condensed spaces that hold less information? By adding a few controllable features, like an eye-blinking control, to their grid-based NeRF, the model learns more realistic behaviors for the eyes compared to previous approaches, something really important for realism.

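To show how cheap such a control is, here is roughly what it boils down to: the blink signal is just one more number appended to the features that condition the head model, so you can set it yourself at test time. How the blink value is extracted from the training video and its exact range are assumptions on my part; the snippet only illustrates the mechanism.

```python
# Sketch only: an explicit, controllable eye-blink scalar is appended to the
# per-point features before the MLP, so it can be set directly at test time.
import torch

def condition_features(spatial_feat, audio_feat, blink):
    # spatial_feat: (P, F), audio_feat: (P, F), blink: assumed scalar in [0, 1]
    # (0 = eyes open, 1 = eyes closed).
    blink_feat = torch.full((spatial_feat.shape[0], 1), float(blink))
    return torch.cat([spatial_feat, audio_feat, blink_feat], dim=-1)

feats = condition_features(torch.randn(1024, 8), torch.randn(1024, 8), blink=0.2)
```
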
The second improvement they've made is to model the torso with another NeRF, following the same approach, instead of trying to model it with the same NeRF used for the head. This separate torso model requires far fewer parameters and has different needs, since the goal here is to animate a moving head and not a whole body, and the torso is pretty much static in these videos. So they use a much simpler and more efficient NeRF-based module that only works in 2D, operating in image space directly, instead of casting camera rays as we usually do with NeRFs to generate many different viewing angles, which aren't needed for the torso. It is basically much more efficient because they tailored the approach to this very specific use case of a rigid torso and a moving head. They then recompose the head with the torso to produce the final video, and voilà, this is how you produce talking head videos over any audio input super efficiently.

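Here is a toy version of that last part, under a simple setup I made up for the illustration: a small 2D network maps an image coordinate (plus a head-pose code) to a torso color and alpha, and the rendered head is then alpha-composited over the torso and a static background to form the final frame. The pose-code size, the network shape, and the compositing order are assumptions, not the authors' exact module.

```python
# Sketch of the 2D torso idea and the final composition (illustrative only).
import torch
import torch.nn as nn

class Torso2D(nn.Module):
    def __init__(self, pose_dim=6, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                    # torso RGB + alpha
        )

    def forward(self, uv, pose):
        # uv: (P, 2) image coordinates in [-1, 1]; pose: (P, pose_dim) head-pose code.
        out = self.mlp(torch.cat([uv, pose], dim=-1))
        return torch.sigmoid(out[..., :3]), torch.sigmoid(out[..., 3:])  # rgb, alpha

def compose_frame(head_rgb, head_alpha, torso_rgb, torso_alpha, background):
    """Alpha-composite: head over torso over a static background (per pixel)."""
    torso_over_bg = torso_alpha * torso_rgb + (1 - torso_alpha) * background
    return head_alpha * head_rgb + (1 - head_alpha) * torso_over_bg

P = 4096  # number of pixels in this toy example
torso_rgb, torso_alpha = Torso2D()(torch.rand(P, 2) * 2 - 1, torch.zeros(P, 6))
frame = compose_frame(torch.rand(P, 3), torch.rand(P, 1), torso_rgb, torso_alpha,
                      torch.ones(P, 3))
```

Because the torso network never has to reason about 3D geometry or viewpoints, it stays tiny compared to the head model, which is exactly the point made in the video.
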
Of course, this was just an overview of this new and exciting research publication. They also make other modifications during the training of their algorithm to make it more efficient, which is the third point I mentioned at the beginning of the video, if you were wondering. I invite you to read their paper for more information; the link is in the description below.

Before you leave, I just wanted to thank the people who recently supported this channel through Patreon. This is not necessary and is strictly to support the work I do here. Huge thanks to Artem Vladiken, Leopoldo Alta Murano, J Cole, Michael Carichao, Daniel Gimness, and a few anonymous generous donors. It will be greatly appreciated if you also want and can afford to support my work financially; the link to my Patreon page is in the description below as well. But no worries if not, a sincere comment below this video is all I need to be happier. I hope you've enjoyed this video, and I will see you next week with another amazing paper!


Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2022/12/05