Meta's Groundbreaking AI Film Maker: Make-A-Video

Written by whatsai | Published 2022/10/01
Tech Story Tags: machine-learning | ai | artificial-intelligence | computer-vision | stable-diffusion | text-to-video | hackernoon-top-story | youtubers | web-monetization

TLDR: Meta AI's new model Make-A-Video is out, and in a single sentence: it generates videos from text. It's not only able to generate videos, but it's also the new state-of-the-art method, producing higher-quality and more coherent videos than ever before. This is all information you must have seen already on a news website or just by reading the title of the article, but what you don't know yet is what it is exactly and how it works.

Meta AI's new model Make-A-Video is out, and in a single sentence: it generates videos from text. It's not only able to generate videos, but it's also the new state-of-the-art method, producing higher-quality and more coherent videos than ever before!
You can see this model as a Stable Diffusion model for videos, surely the next step after being able to generate images. This is all information you must have seen already on a news website or just by reading the title of the article, but what you don't know yet is what it is exactly and how it works.
Here's how...

References

► Read the full article: https://www.louisbouchard.ai/make-a-video/
► Meta's blog post: https://ai.facebook.com/blog/generative-ai-text-to-video/
► Singer et al. (Meta AI), 2022, "Make-A-Video: Text-to-Video Generation without Text-Video Data": https://makeavideo.studio/Make-A-Video.pdf
► Make-A-Video (official page): https://makeavideo.studio/
► PyTorch implementation: https://github.com/lucidrains/make-a-video-pytorch
► My Newsletter (a new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

0:00
Meta AI's new model Make-A-Video is out, and in a single sentence: it generates videos from text. It's not only able to generate videos, but it's also the new state-of-the-art method, producing higher-quality and more coherent videos than ever. You can see this model as a Stable Diffusion model for videos, surely the next step after being able to generate images. This is all information you must have seen already on a news website or just by reading the title of the video, but what you don't know yet is what it is exactly and how it works.
0:30
Make-A-Video is the most recent publication by Meta AI, and it allows you to generate a short video out of textual inputs, just like this. So you are adding complexity to the image generation task by not only having to generate multiple frames of the same subject and scene, but they also have to be coherent in time. You cannot simply generate 60 images using DALL·E and stitch them into a video: it will just look bad and not at all realistic. You need a model that understands the world in a better way and leverages this level of understanding to generate a coherent series of images that blend well together. You basically want to simulate a world and then simulate recordings of it.

1:11
But how can you do that? Typically, you would need tons of text-video pairs to train your model to generate such videos from textual input, but not in this case. Since this kind of data is really difficult to get and the training costs are super expensive, they approached the problem differently. Another way is to take the best text-to-image model and adapt it to videos, and that's what Meta AI did in a research paper they just released. In their case, the text-to-image model is another model by Meta called Make-A-Scene, which I covered in a previous video if you'd like to learn more about it.
1:47
But how do you adapt such a model to take time into consideration? You add a spatiotemporal pipeline so that your model is able to process videos. This means that the model will not only generate an image but, in this case, 16 of them in low resolution, to create a short, coherent video. It works in a similar manner to a text-to-image model, but adds a one-dimensional convolution along with the regular two-dimensional one. This simple addition allows them to keep the pre-trained two-dimensional convolutions the same and add a temporal dimension that they will train from scratch, reusing most of the code and the model's parameters from the image model they started from.
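To make that concrete, here is a minimal sketch of such a "pseudo-3D" convolution block in PyTorch. It is an illustration in the spirit of the paper and of the community implementation linked above, not Meta's actual code; the module name, tensor layout, and the identity initialization of the temporal layer are my own choices for the example, reflecting the idea that the network should start out behaving like the original image model.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """A 2D spatial convolution followed by a 1D temporal convolution.

    The 2D part can be taken from a pre-trained text-to-image model;
    the 1D temporal part is new and trained from scratch.
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        # Initialize the temporal conv as an identity so the block initially
        # reproduces the pure image model frame by frame.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        # Apply the 2D convolution to every frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial(x)
        x = x.reshape(b, f, c, h, w).permute(0, 2, 1, 3, 4)
        # Apply the 1D convolution across time at every spatial location.
        x = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, f)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, f).permute(0, 3, 4, 1, 2)
        return x


# Quick shape check: batch of 1, 64 channels, 16 frames of 32x32.
if __name__ == "__main__":
    video = torch.randn(1, 64, 16, 32, 32)
    print(Pseudo3DConv(64)(video).shape)  # torch.Size([1, 64, 16, 32, 32])
```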
2:27
We also want to guide our generations with text input, which is done very similarly to image models, using CLIP embeddings, a process I go into in detail in my Stable Diffusion video if you are not familiar with it.
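For reference, this is one way to obtain CLIP text embeddings for a prompt with the Hugging Face transformers library. It is a generic illustration of the kind of conditioning signal involved; the exact text encoder and checkpoint Meta used may differ.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Any public CLIP checkpoint works for illustration purposes.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a teddy bear painting a portrait"
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    output = text_encoder(**tokens)

per_token_embeddings = output.last_hidden_state  # (1, 77, 768): one embedding per token
pooled_embedding = output.pooler_output          # (1, 768): a single summary vector
print(per_token_embeddings.shape, pooled_embedding.shape)
```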
2:41
But they will also be adding the temporal dimension when blending the text features with the image features, doing the same thing as with the convolutions: keeping the attention module I described in my Make-A-Scene video and adding a one-dimensional attention module for temporal considerations, copy-pasting the image generator model and duplicating the generation modules for one more dimension to get all our 16 initial frames.
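Here is a minimal sketch of what such a factorized spatial-plus-temporal attention layer could look like, again my own illustration rather than the authors' code: spatial self-attention runs within each frame, and a new one-dimensional attention runs across the frames at each spatial location.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Per-frame spatial self-attention followed by per-location temporal self-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, dim) -- a sequence of tokens per frame.
        b, f, n, d = x.shape

        # Spatial attention: every frame attends over its own tokens.
        s = x.reshape(b * f, n, d)
        s, _ = self.spatial(s, s, s)
        x = x + s.reshape(b, f, n, d)

        # Temporal attention: every spatial location attends over the frames.
        t = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
        t, _ = self.temporal(t, t, t)
        x = x + t.reshape(b, n, f, d).permute(0, 2, 1, 3)
        return x


if __name__ == "__main__":
    frames = torch.randn(1, 16, 8 * 8, 256)  # 16 frames of 8x8 latent tokens
    print(SpatioTemporalAttention(256)(frames).shape)  # torch.Size([1, 16, 64, 256])
```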
3:07
But what can you do with 16 frames? Well, nothing really interesting. We need to make a high-definition video out of those frames. The model will do that by having access to previous and future frames and iteratively interpolating from them, in terms of both the temporal and spatial dimensions at the same time. So it basically generates new and larger frames in between those initial 16 frames, based on the frames before and after them, which helps make the movement coherent and the overall video fluid. This is done using a frame interpolation network, which I also described in other videos, and which will basically take the images we have and fill in the gaps, generating the in-between information. It will do the same thing for the spatial component, enlarging the images and filling in the pixel gaps to make them more high-definition.
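As a crude stand-in for those two learned components, the snippet below just stretches a video tensor along the time and space axes with trilinear interpolation. In the real system both steps are dedicated networks, and the target sizes used here are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

# A low-resolution clip: batch 1, 3 color channels, 16 frames of 64x64 pixels.
low_res_clip = torch.rand(1, 3, 16, 64, 64)

# Naive stand-in for frame interpolation + super-resolution:
# stretch the clip to 64 frames at 256x256 (placeholder sizes) by
# interpolating jointly over the temporal and spatial dimensions.
high_res_clip = F.interpolate(
    low_res_clip,
    size=(64, 256, 256),  # (frames, height, width)
    mode="trilinear",
    align_corners=False,
)

print(high_res_clip.shape)  # torch.Size([1, 3, 64, 256, 256])
```

A learned interpolation network does far better than this, of course: it generates plausible in-between motion and detail instead of merely blending neighboring frames and pixels.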
4:04
So, to summarize: they fine-tune a text-to-image model for video generation. This means they take a powerful model that is already trained and adapt and train it a little bit more to get used to videos. This retraining is done with unlabeled videos, just to teach the model to understand videos and video-frame consistency, which makes the dataset-building process much simpler. Then they use, once again, an image-optimized model to improve the spatial resolution, and a last frame interpolation component to add more frames and make the video fluid.
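Putting the whole recipe together, the high-level flow looks roughly like the hypothetical sketch below. Every function here is a placeholder stub standing in for one of the stages described above; none of them are actual APIs from the paper or from the lucidrains repository.

```python
import torch

# Placeholder stages: each stub stands in for a real network described in the
# article and just returns random tensors with plausible shapes.

def encode_text_with_clip(prompt: str) -> torch.Tensor:
    return torch.randn(1, 77, 768)                  # per-token text embeddings

def spatiotemporal_decoder(text: torch.Tensor, num_frames: int) -> torch.Tensor:
    return torch.rand(1, 3, num_frames, 64, 64)     # 16 coherent low-res frames

def interpolate_frames(frames: torch.Tensor) -> torch.Tensor:
    b, c, f, h, w = frames.shape
    return torch.rand(b, c, 4 * f, h, w)            # extra frames in between

def upscale_frames(frames: torch.Tensor) -> torch.Tensor:
    b, c, f, h, w = frames.shape
    return torch.rand(b, c, f, 4 * h, 4 * w)        # higher-resolution frames

def make_a_video(prompt: str) -> torch.Tensor:
    """End-to-end flow: text -> 16 low-res frames -> frame interpolation -> super-resolution."""
    text_embedding = encode_text_with_clip(prompt)
    low_res_frames = spatiotemporal_decoder(text_embedding, num_frames=16)
    more_frames = interpolate_frames(low_res_frames)
    return upscale_frames(more_frames)

print(make_a_video("a dog wearing a superhero cape flying through the sky").shape)
```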
4:38
Of course, the results aren't perfect yet, just like text-to-image models, but we know how fast the progress goes. This was just an overview of how Meta AI successfully tackled the text-to-video task in this great paper. All the links are in the description below if you'd like to learn more about their approach. A PyTorch implementation is also already being developed by the community, so stay tuned for that if you'd like to implement it yourself. Thank you for watching the whole video, and I will see you next time with another amazing paper!




Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2022/10/01