Meta AI's Make-A-Scene Generates Artwork with Text and Sketches

Written by whatsai | Published 2022/07/20
Tech Story Tags: ai | meta | artificial-intelligence | machine-learning | ml | computer-vision | make-a-scene | hackernoon-top-story | web-monetization

TL;DR: Make-A-Scene is not "just another DALL·E". The goal of this new model isn't to let users generate random images following a text prompt, as DALL·E does; that restricts the user's control over the generations. Instead, Meta wanted to push creative expression forward by merging this text-to-image trend with previous sketch-to-image models, leading to "Make-A-Scene": a fantastic blend between text- and sketch-conditioned image generation. Learn more in the video below.

Make-A-Scene is not "just another DALL·E". The goal of this new model isn't to allow users to generate random images following a text prompt, as DALL·E does (which is really cool), since that restricts the user's control over the generations.
Instead, Meta wanted to push creative expression forward, merging this text-to-image trend with previous sketch-to-image models, leading to "Make-A-Scene": a fantastic blend between text- and sketch-conditioned image generation. Learn more in the video below.

References

►Read the full article: https://www.louisbouchard.ai/make-a-scene/
►Meta's blog post: https://ai.facebook.com/blog/greater-creative-control-for-ai-image-generation
►Paper: Gafni, O., Polyak, A., Ashual, O., Sheynin, S., Parikh, D. and Taigman, Y., 2022. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors.
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

This is Make-A-Scene. It's not "just another DALL·E". The goal of this new model isn't to allow users to generate random images following a text prompt, as DALL·E does, which is really cool but restricts the user's control over the generations. Instead, Meta wanted to push creative expression forward, merging this text-to-image trend with previous sketch-to-image models, leading to Make-A-Scene: a fantastic blend between text- and sketch-conditioned image generation. This simply means that, using this new approach, you can quickly sketch out a cat, write what kind of image you would like, and the image generation process will follow both the sketch and the guidance of your text. It gets us even closer to being able to generate the perfect illustration we want in a few seconds.
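To make this concrete, here is a minimal, purely hypothetical usage sketch. Make-A-Scene has no public release or API, so the make_a_scene package, the generate function, and its parameters below are invented for illustration only.

```python
# Hypothetical usage sketch: Make-A-Scene is not publicly available, so this
# module, function and its arguments are assumptions for illustration only.
from PIL import Image

from make_a_scene import generate  # hypothetical package

# A rough sketch of a cat, drawn in any paint tool and saved as an image.
sketch = Image.open("cat_sketch.png")

# The text prompt guides style and content; the sketch constrains the layout.
image = generate(
    prompt="an oil painting of a ginger cat sleeping on a windowsill at sunset",
    sketch=sketch,          # segmentation-like layout guidance
    resolution=(512, 512),  # assumed output size
)
image.save("cat_painting.png")
```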
You can see this multimodal generative AI method as a DALL·E model with a bit more control over the generations, since it can also take a quick sketch as input. This is why we call it multimodal: it can take multiple modalities as inputs, like text and an image (a sketch, in this case), compared to DALL·E, which only takes text to generate an image. Multimodal models are something super promising, especially if they match the quality of the results we see online, since we have more control over the results, getting closer to a very interesting end goal: generating the perfect image we have in mind without any design skills.

Of course, this is still at the research stage and is an exploratory AI research concept. It doesn't mean what we see isn't achievable; it just means it will take a bit more time to reach the public. Progress is extremely fast in the field, and I wouldn't be surprised to see it live very shortly, or a similar model from other people to play with. I believe such sketch- and text-based models are even more interesting, especially for industry, which is why I wanted to cover it on my channel, even though the results are a bit behind those of DALL·E 2 that we see online.

It's not only interesting for industry but for artists too. Some use the sketch feature to generate even more unexpected results than what DALL·E could do. We can ask it to generate something and draw a form that doesn't represent that specific thing, like drawing a jellyfish in a flower shape, which may not be impossible with DALL·E but is much more complicated without sketch guidance, as the model will only reproduce what it learned from real-world images and illustrations.
So the main question is: how can they guide the generations with both text input, like DALL·E, and a sketch simultaneously, and have the model follow both guidelines? Well, it's very, very similar to how DALL·E works, so I won't go too much into the details of a generative model, as I covered at least five different approaches in the past two months, which you should definitely watch if you haven't yet, as these models, like DALL·E 2 or Imagen, are quite fantastic.

Typically, these models take millions of training examples to learn how to generate images from text, with data in the form of images and their captions scraped from the internet. Here, during training, instead of only relying on the caption, generating a first version of the image, comparing it to the actual image and repeating this process numerous times with all our images, we will also feed the model a sketch. What's cool is that the sketches are quite easy to produce for training: simply take a pre-trained network you can download online and perform instance segmentation. For those who want the details, they use a freely available VGG model pre-trained on ImageNet, so a quite small network compared to today's, super accurate and fast, producing results like this, called segmentation maps. They simply process all their images once and get these maps for training.
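As a rough illustration of this preprocessing step, here is a minimal sketch that turns an image into a segmentation map with a freely downloadable pre-trained network. It uses torchvision's DeepLabV3 with a ResNet-50 backbone as a stand-in; the exact network and label set Meta uses may differ from this.

```python
# Sketch of the preprocessing step: turn each training image into a
# segmentation map with a freely downloadable pre-trained network.
# DeepLabV3/ResNet-50 is a stand-in here, not the authors' exact model.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

# torchvision >= 0.13; older versions use pretrained=True instead.
model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def segmentation_map(path: str) -> torch.Tensor:
    """Return an (H, W) tensor of per-pixel class ids for one image."""
    image = Image.open(path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)      # (1, 3, H, W)
    with torch.no_grad():
        logits = model(batch)["out"]            # (1, num_classes, H, W)
    return logits.argmax(dim=1).squeeze(0)      # (H, W) class ids

# Processed once over the whole dataset, these maps become the "sketch"
# conditioning signal used during training.
seg_map = segmentation_map("training_image.jpg")
```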
The model then uses this map, as well as the caption, to orient the generation of the initial image. At inference time, or when one of us uses it, our sketch will replace those maps. As I said, they used a model called VGG to create fake sketches for training.
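To make the training recipe a bit more tangible, here is a heavily simplified, assumed sketch of one training step, where text tokens and segmentation-map tokens together condition the prediction of the image tokens. Every module name below (text_encoder, seg_tokenizer, image_tokenizer, transformer) is a placeholder, not Meta's code.

```python
# Simplified, assumed training step: the transformer learns to predict the
# tokens of the real image, conditioned on both the caption and the
# segmentation map. Every module below is a placeholder.
import torch
import torch.nn.functional as F

def training_step(image, caption, seg_map,
                  text_encoder, seg_tokenizer, image_tokenizer,
                  transformer, optimizer):
    # 1) Encode each modality into token sequences.
    text_tokens = text_encoder(caption)               # (B, T_text, D)
    seg_tokens = seg_tokenizer(seg_map)               # (B, T_seg, D)
    image_token_ids = image_tokenizer.encode(image)   # (B, T_img) ground truth

    # 2) Condition on [text; segmentation] and predict the image tokens
    #    with teacher forcing.
    conditioning = torch.cat([text_tokens, seg_tokens], dim=1)
    logits = transformer(conditioning, image_token_ids)  # (B, T_img, vocab)

    # 3) Compare the predicted tokens with the tokens of the actual image.
    loss = F.cross_entropy(logits.transpose(1, 2), image_token_ids)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```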
They use a transformer architecture for the image generation process, which is different from DALL·E 2, and I invite you to watch the video I made introducing transformers for vision applications if you'd like more details on how they can process and generate images. This sketch-guided transformer is the main difference of Make-A-Scene, along with not using an image-text ranker like CLIP to measure text-image pairs, which you can also learn about in my DALL·E video. Instead, all the encoded text and segmentation maps are sent to the transformer model. The model then generates the relevant image tokens, which are encoded and decoded by the corresponding networks to produce the image. The encoder is used during training to calculate the difference between the produced and the initial image, but only the decoder is needed to take the transformer's output and transform it into an image.
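And here is the matching, equally simplified sketch of the inference side: the transformer samples image tokens conditioned on the encoded text and sketch, and only the token decoder is needed to turn those tokens into pixels. Again, all names are placeholders rather than an actual API.

```python
# Assumed inference sketch: only the decoder is needed at generation time.
# All module names are placeholders for illustration.
import torch

@torch.no_grad()
def generate_image(caption, user_sketch,
                   text_encoder, seg_tokenizer, transformer, token_decoder,
                   num_image_tokens=1024):
    # The user's sketch plays the role the segmentation maps played in training.
    conditioning = torch.cat(
        [text_encoder(caption), seg_tokenizer(user_sketch)], dim=1
    )

    # Autoregressively sample image tokens, one at a time.
    image_token_ids = []
    for _ in range(num_image_tokens):
        logits = transformer.next_token_logits(conditioning, image_token_ids)
        probs = torch.softmax(logits, dim=-1)                # (B, vocab)
        image_token_ids.append(torch.multinomial(probs, num_samples=1))

    # Only the decoder is required to map the tokens back to pixels.
    return token_decoder(torch.cat(image_token_ids, dim=-1))
```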
And voilà! This is how Meta's new model is able to take sketch and text inputs and generate a high-definition image out of them, allowing more control over the results with great quality. And as they say, it's just the beginning of this new kind of AI model. The approaches will keep improving, both in terms of quality and availability for the public, which is super exciting. Many artists are already using the model for their own work, as described in Meta's blog post, and I'm excited about when we will be able to use it too. Their approach doesn't require any coding knowledge, only a good sketching hand and some prompt engineering, which means trial and error with the text inputs, tweaking the formulations and words used to produce different and better results.
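Reusing the hypothetical generate function from the earlier usage sketch, that trial-and-error loop might look like this: same sketch, several wordings, and a visual comparison of the results.

```python
# Trial and error with prompts: same sketch, different wordings.
# `generate` is the hypothetical function from the earlier usage sketch.
from PIL import Image

from make_a_scene import generate  # hypothetical package

sketch = Image.open("cat_sketch.png")

prompts = [
    "a cat on a windowsill",
    "an oil painting of a ginger cat on a windowsill at sunset",
    "a watercolor of a fluffy cat watching the rain through a window",
]

for i, prompt in enumerate(prompts):
    image = generate(prompt=prompt, sketch=sketch)
    image.save(f"attempt_{i}.png")  # compare results and refine the wording
```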
Of course, this was just an overview of the new Make-A-Scene approach, and I invite you to read the full paper, linked in the references, for a complete overview of how it works. I hope you've enjoyed this video, and I will see you next week with another amazing paper!



Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2022/07/20