Combining CNNs, GANs, and Transformers to Outperform Image-GPT

Written by whatsai | Published 2021/02/01
Tech Story Tags: artificial-intelligence | gans | transformers | deep-learning | machine-learning | hackernoon-top-story | youtube-transcripts | youtubers | web-monetization

TL;DR

German researchers combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically-guided high-quality image synthesis.
If the title and subtitle sound like another language to you, this video was made for you!


Chapters:

0:37 Image-GPT
1:44 Transformers and image generation?
2:38 GANs + Transformers for image synthesis: the paper
5:39 Available pre-trained model and demo
6:16 Conclusion

References:
Taming Transformers for High-Resolution Image Synthesis, Esser et al., 2020: https://arxiv.org/abs/2012.09841
Follow me for more AI content:
Join Our Discord channel, Learn AI Together:
https://discord.gg/learnaitogether

The best courses in AI:
  1. https://www.omologapps.com/whats-ai
  2. https://medium.com/towards-artificial-intelligence/start-machine-learning-in-2020-become-an-expert-from-nothing-for-free-f31587630cf7
  3. https://github.com/louisfb01/start-machine-learning-in-2020
Become a member of the YouTube community:
https://www.youtube.com/channel/UCUzGQrN-lyyc0BWTYoJM_Sg/join

Video Transcript

TL;DR: They combine the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically-guided, high-quality image synthesis. If what I said sounds like another language to you, this video was made for you!

This is What's AI, and I share artificial intelligence news every week. If you are new to the channel and want to stay up to date, please consider subscribing so you don't miss any further news.
You've probably heard of iGPT, or Image-GPT, recently published by OpenAI, which I covered on my channel. It's the state-of-the-art generative transformer model. OpenAI used the transformer architecture on a pixel representation of images to perform image synthesis. In short, they use transformers with half the pixels of an image as input to generate the other half of the image.
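To make that concrete, here is a minimal sketch of the idea, not OpenAI's actual iGPT code: flatten an image into a 1D pixel sequence, condition on the first half, and sample the rest one pixel at a time. The tiny "model" is an untrained stand-in that returns uniform logits.

```python
import torch

H = W = 8                        # toy resolution; the real iGPT worked at 32x32+
VOCAB = 256                      # one token per 8-bit pixel intensity

def next_pixel_logits(seq):
    """Stand-in for a trained transformer: logits for the next pixel given
    the sequence so far. A real model would attend over the whole sequence;
    uniform logits keep the sketch tiny and runnable."""
    return torch.zeros(VOCAB)

image = torch.randint(0, VOCAB, (H, W))        # pretend this is a real image
seq = image.flatten()[: H * W // 2].tolist()   # condition on the first half

# autoregressive completion: sample one pixel, append it, repeat
for _ in range(H * W // 2):
    probs = torch.softmax(next_pixel_logits(seq), dim=-1)
    seq.append(torch.multinomial(probs, 1).item())

completed = torch.tensor(seq).reshape(H, W)    # top half real, bottom half sampled
print(completed.shape)
```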
As you can see here, it is extremely powerful. However, as you know, there are 4K high-resolution images and videos out there, and do you know how many pixels there are in one 4K image? It counts in the millions, even tens of millions, which is a pretty long sequence compared with a single phrase or paragraph in natural language processing applications. Because transformers are designed to learn long-range interactions on sequential data, which in this case means using all the pixels sequentially, their approach is excessively demanding in computation and doesn't scale beyond 192×192 image resolutions. So transformers cannot be used with images, since no one wants to generate a super-low-definition image, right?
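A quick back-of-the-envelope computation shows why pixel-level attention blows up (the token counts here are my own illustration; a 4K frame is 3840×2160 pixels): self-attention compares every token with every other token, so compute grows with the square of the sequence length, and here every pixel is a token.

```python
# sequence lengths and the resulting number of pairwise attention comparisons
for name, (w, h) in {
    "one paragraph of text": (1024, 1),
    "192 x 192 image": (192, 192),
    "4K image (3840 x 2160)": (3840, 2160),
}.items():
    n = w * h
    print(f"{name}: {n:,} tokens -> ~{n**2:,} attention pairs")
```

A 4K frame comes out to roughly 8.3 million tokens and on the order of 10^13 attention pairs, which is why raw-pixel transformers stall at tiny resolutions.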
Well, not really. Researchers from Heidelberg University in Germany recently published a new paper combining the efficiency of convolutional approaches with the expressivity of transformers to produce semantically-guided synthesis of high-quality images. They use a convolutional neural network to obtain a context-rich representation of images, then use this representation, instead of the actual image, to train a transformer model that synthesizes an actual image from it, allowing much higher resolutions than iGPT while conserving the quality of the resulting image. But we will come back to that in a minute with a better explanation.

If you are not familiar with CNNs or transformers, I strongly recommend you watch the videos I made explaining them to get a better understanding of this approach.
This paper is called Taming Transformers for High-Resolution Image Synthesis, and as I said, it enables transformers to synthesize high-resolution images from semantic images, just like you can see here, where the only information needed is an approximate semantic segmentation showing what kind of environment you would like at which position in the image. It will then output a complete high-definition image, filling the segmentation with real mountains, grass, sky, sunsets, and so on.

Now, the question is: why are these researchers and OpenAI using a transformer instead of our typical GAN architectures for image synthesis? Well, the advantages of using transformers for image generation are clear: they continue to show state-of-the-art results on a wide variety of tasks and are extremely promising. Also, they contain none of the inductive bias found in CNNs, where the use of two-dimensional images and filters causes a prioritization of local interactions. This inductive bias is what makes CNNs so efficient, but it may be too restrictive to make the network expressive or original. Now that we know that transformers are more expressive and very powerful, the only thing left is to find a way to make them more efficient.
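To make the inductive-bias point concrete, here is a tiny contrast of my own, not taken from the paper: a 3×3 convolution only mixes each pixel with its immediate neighbors, while self-attention lets every position interact with every other position in a single step.

```python
import torch

x = torch.randn(1, 1, 32, 32)               # one 32x32 single-channel image

# CNN view: each output pixel only sees a local 3x3 neighbourhood
conv = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
local = conv(x)

# Transformer view: flatten to a sequence of 1024 pixel tokens; every
# token can attend to all 1024 others at once (hence the quadratic cost)
tokens = x.flatten(2).transpose(1, 2)        # shape (1, 1024, 1)
attn = torch.nn.MultiheadAttention(embed_dim=1, num_heads=1, batch_first=True)
mixed, _ = attn(tokens, tokens, tokens)

print(local.shape, mixed.shape)
```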
Indeed, their approach manages to combine the high effectiveness that this inductive bias gives CNNs with the expressivity of transformers. As I said, the convolutional neural network architecture, composed of a classic encoder-decoder and an adversarial training process using a discriminator, which they call VQGAN, is used to generate an efficient and rich representation of the images in the form of a codebook. As the name suggests, it's a GAN architecture that is used to train a generator to produce a high-resolution image. If you are not familiar with how GANs work, you can watch the video I made explaining them.

Once this first training is done, the encoded information of the input image, here referred to as a codebook, is used as input for a transformer, while the decoder is kept to turn the transformer's output back into an image. Rather than directly using the pixels of the image, the transformer uses this codebook, which contains a representation of the image as a composition of perceptually rich image constituents. Of course, this codebook is composed of extremely compressed data, made so it can be read semantically by the transformer.
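Here is a minimal sketch of the vector-quantization idea behind that codebook; the sizes and variable names are illustrative, not the paper's exact VQGAN configuration.

```python
import torch

K, D = 512, 64                       # codebook size and entry dimension
codebook = torch.randn(K, D)         # in VQGAN this is learned during training

def quantize(z):
    """Map each D-dim latent vector from the CNN encoder to the index of
    its nearest codebook entry; the transformer then works with these
    indices instead of raw pixels."""
    dists = torch.cdist(z, codebook)          # (N, K) pairwise distances
    idx = dists.argmin(dim=1)                 # nearest entry per vector
    return idx, codebook[idx]

z = torch.randn(16 * 16, D)          # e.g. a 16x16 grid of encoder latents
indices, z_q = quantize(z)
print(indices.shape, z_q.shape)      # 256 indices stand in for the whole image
```

The payoff is exactly the compression the video describes: a short grid of discrete indices replaces millions of pixels as the transformer's sequence.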
Then, using this representation as the training dataset for the transformer, it learns to predict the distribution of possible next indices inside this representation, just like a regular autoregressive model, meaning that it uses the previous elements of the sequence as inputs to predict the values of the future ones, therefore combining CNNs and GANs with transformers to perform high-resolution image synthesis.
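This second stage then looks like ordinary language modeling over image tokens. A hedged sketch, with made-up sizes and a single generic attention layer rather than the authors' architecture:

```python
import torch

K = 512                                    # codebook size = vocabulary size
seq = torch.randint(0, K, (1, 256))        # one 16x16 grid of codebook indices

emb = torch.nn.Embedding(K, 128)
layer = torch.nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
head = torch.nn.Linear(128, K)

# causal mask so each position only sees earlier indices (autoregressive)
mask = torch.triu(torch.full((256, 256), float("-inf")), diagonal=1)
hidden = layer(emb(seq), src_mask=mask)
logits = head(hidden)                      # (1, 256, K) next-index predictions

# standard next-token objective: position t predicts the index at t + 1
loss = torch.nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, K), seq[:, 1:].reshape(-1)
)
print(loss.item())
```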
Here you can see an example using the demo version of their code that we can try right now on Google Colab without having to set anything up. They already made the setup for us, and you just have to run these few lines: it downloads their code from GitHub and installs the required dependencies automatically. Then it loads the model and imports a pre-trained version of it. Finally, you can use their segmented image as a test, or upload your own segmented image, and run a few more lines to encode the segmentation, a necessary step for the transformer to create the specific codebook associated with your image.
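The "encode the segmentation" step can be sketched generically; the demo notebook's own helpers differ, and the class count and shapes below are assumptions of mine: a painted label map is one-hot encoded per class before a conditioning encoder turns it into codebook indices.

```python
import torch

NUM_CLASSES = 182                  # assumed label set size; yours may differ
seg = torch.randint(0, NUM_CLASSES, (256, 256))    # your painted label map

# one-hot encode per class, then reorder to the (batch, channels, H, W)
# layout a conditioning encoder expects
one_hot = torch.nn.functional.one_hot(seg, NUM_CLASSES)   # (256, 256, C)
cond = one_hot.permute(2, 0, 1).unsqueeze(0).float()      # (1, C, 256, 256)
print(cond.shape)
```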
Of course, this was just an overview of this new paper. I strongly recommend reading it for a better technical understanding. Also, as I mentioned earlier, their code is available on GitHub with pre-trained models, so you can try it yourself and even improve it. All the links are in the description below. Please leave a like if you made it this far in the video, and since over 80 percent of you are not subscribed yet, please consider subscribing to the channel so you don't miss any further news. Thank you for watching!



Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2021/02/01