What are Latent Diffusion Models? The Architecture Behind Stable Diffusion

Written by whatsai | Published 2022/08/29
Tech Story Tags: ai | artificial-intelligence | diffusion | technology | innovation | tech | machine-learning | hackernoon-top-story | web-monetization

TLDR: What do all recent super-powerful image models like DALL·E, Imagen, or Midjourney have in common? Other than their high computing costs, huge training time, and shared hype, they are all based on the same mechanism: diffusion. Diffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image with DALL·E, but also many other image generation-related tasks, like image inpainting, style transfer, or image super-resolution. But how do they work? Learn more in the video below.

What do all recent super-powerful image models like DALL·E, Imagen, or Midjourney have in common? Other than their high computing costs, huge training time, and shared hype, they are all based on the same mechanism: diffusion.
Diffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image with DALL·E, but also many other image generation-related tasks, like image inpainting, style transfer, or image super-resolution. But how do they work? Learn more in the video...

References

►Read the full article: https://www.louisbouchard.ai/latent-diffusion-models/
►Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684–10695), https://arxiv.org/pdf/2112.10752.pdf
►Latent Diffusion Code: https://github.com/CompVis/latent-diffusion
►Stable Diffusion Code (text-to-image based on LD): https://github.com/CompVis/stable-diffusion
►Try it yourself: https://huggingface.co/spaces/stabilityai/stable-diffusion
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

What do all recent super-powerful image models like DALL·E, Imagen, or Midjourney have in common? Other than high computing costs, huge training time, and shared hype, they are all based on the same mechanism: diffusion. Diffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image with DALL·E, but also many other image generation-related tasks like image inpainting, style transfer, or image super-resolution.

There are a few downsides, though: they work sequentially on the whole image, meaning that both the training and inference times are extremely expensive. This is why you need hundreds of GPUs to train such a model and why you wait a few minutes to get your results. It's no surprise that only the biggest companies like Google or OpenAI are releasing these models.
But what are they? I've covered diffusion models in a couple of videos, which I invite you to check out for a better understanding. They are iterative models that take random noise as input, which can be conditioned with a text or an image, so it's not completely random. The model iteratively learns to remove this noise by learning which parameters it should apply to the noise to end up with a final image. So a basic diffusion model will take random noise the size of the image and learn to progressively remove it until we get back to a real image. This is possible because the model has access to the real images during training and can learn the right parameters by applying such noise to the image iteratively until it reaches complete noise and is unrecognizable.
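To make that forward, noise-adding step more concrete, here is a minimal PyTorch sketch assuming a standard DDPM-style schedule; the variable names and schedule values are my own illustration, not taken from the paper or its code.

```python
import torch

# Hypothetical DDPM-style noise schedule: T steps of gradually increasing noise.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # per-step noise variance
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # how much of the original signal is left

def add_noise(x0, t):
    """Forward process: jump straight to noise level t of a clean image batch x0."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise  # a denoising model is trained to predict `noise` from (x_t, t)

# Example: the same toy "images" at an early and a late noise level.
x0 = torch.rand(2, 3, 64, 64)
x_early, _ = add_noise(x0, torch.tensor([50, 50]))
x_late, _ = add_noise(x0, torch.tensor([950, 950]))  # close to pure, unrecognizable noise
```

The convenient part is that any noise level of a training image can be reached in a single shot like this, which is what makes training on many random steps practical.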
Then, when we are satisfied with the noise we get from all our images, meaning that they are similar and follow a similar distribution, we are ready to use our model in reverse: we feed it similar noise and apply the learned steps in the reverse order, expecting an image similar to the ones used during training.
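Still as a sketch, assuming the hypothetical schedule defined above and some trained noise-prediction network (here just called `model`), the reverse, generative direction looks roughly like this:

```python
import torch

@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64)):
    """Reverse process: start from pure noise and iteratively denoise it into an image."""
    x = torch.randn(shape)                       # step T: completely unrecognizable noise
    for t in reversed(range(T)):
        eps = model(x, torch.tensor([t]))        # predict the noise present at step t
        alpha, a_bar = 1.0 - betas[t], alphas_cumprod[t]
        # Remove the predicted noise (simplified DDPM update)...
        x = (x - betas[t] / (1.0 - a_bar).sqrt() * eps) / alpha.sqrt()
        if t > 0:                                # ...then re-inject a little randomness
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                                     # should now resemble the training images
```

Notice that the loop runs over every step and over full-resolution tensors, which is exactly the cost problem the rest of this article addresses.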
The main problem here is that you are working directly with the pixels of large inputs like images. Let's see how we can fix this computation issue while keeping the quality of the results the same, as shown here compared with DALL·E.
But first, give me a few seconds to introduce you to my friends at Qwak, who are sponsoring this video. As you most certainly know, the majority of businesses now report AI and ML adoption in their processes, but complex operations such as model deployment, training, testing, and feature store management seem to stand in the way of progress. ML model deployment is one of the most complex processes: it is such a rigorous process that data science teams spend way too much time solving back-end and engineering tasks before being able to push the model into production, something I personally experienced. It also requires very different skill sets, often meaning two different teams working closely together. Fortunately for us, Qwak delivers a fully managed platform that unifies ML engineering and data operations, providing agile infrastructure that enables the continuous productization of ML models at scale. You don't have to learn how to do everything end-to-end anymore thanks to them. Qwak empowers organizations to deliver machine learning models into production at scale. If you want to speed up your model delivery to production, please take a few minutes and click the first link below to check out what they offer, as I'm sure it will be worthwhile. Thanks to anyone taking a look and supporting me and my friends at Qwak.
How can these powerful diffusion models be made computationally efficient? By transforming them into latent diffusion models. This means that Robin Rombach and his colleagues implemented the diffusion approach we just covered within a compressed image representation instead of the image itself, and then worked to reconstruct the image. So they are not working with the pixel space, or regular images, anymore. Working in such a compressed space not only allows for more efficient and faster generations, as the data size is much smaller, but also allows for working with different modalities. Since they are encoding the inputs, you can feed it any kind of input, like images or text, and the model will learn to encode these inputs into the same subspace that the diffusion model will use to generate an image. So yes, just like the CLIP model, one model will work with text or images to guide generations.
The overall model will look like this: you have your initial image, x, and encode it into a compressed information space called the latent space, or z. This is very similar to a GAN, where you use an encoder model to take the image and extract the most relevant information about it into a subspace, which you can see as a downsampling task, reducing its size while keeping as much information as possible. You are now in the latent space with your condensed input.
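To give a sense of the compression, here is a toy sketch of that encode/decode round trip. The 512×512 image and 64×64×4 latent sizes follow the Stable Diffusion configuration, but the convolution stand-ins below are mine, not the model's real learned autoencoder.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the learned autoencoder: only the shapes matter here.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)           # 3x512x512 image -> 4x64x64 latent
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # 4x64x64 latent -> 3x512x512 image

image = torch.rand(1, 3, 512, 512)   # ~786,000 values per image
z = encoder(image)                   # ~16,000 values: roughly 48x fewer to diffuse over
reconstruction = decoder(z)

print(image.shape, z.shape, reconstruction.shape)
# torch.Size([1, 3, 512, 512]) torch.Size([1, 4, 64, 64]) torch.Size([1, 3, 512, 512])
```

The diffusion process itself only ever sees the small z tensor, which is where the speed-up comes from.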
You then do the same thing with your conditioning inputs, whether text, images, or anything else, and merge them with your current image representation using attention, which I described in another video. This attention mechanism will learn the best way to combine the input and the conditioning inputs in this latent space, adding attention, a transformer feature, to diffusion models. These merged inputs are now your initial noise for the diffusion process.
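As a rough sketch of that merging step, here is a simplified cross-attention layer; it assumes the conditioning text has already been turned into embeddings by some encoder, and the dimensions are illustrative only.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal cross-attention: image latents (queries) attend to conditioning tokens (keys/values)."""
    def __init__(self, latent_dim=320, cond_dim=768):
        super().__init__()
        self.to_q = nn.Linear(latent_dim, latent_dim)  # queries come from the image latents
        self.to_k = nn.Linear(cond_dim, latent_dim)    # keys come from the conditioning embeddings
        self.to_v = nn.Linear(cond_dim, latent_dim)    # values come from the conditioning embeddings

    def forward(self, latents, cond):
        q, k, v = self.to_q(latents), self.to_k(cond), self.to_v(cond)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v  # each latent position becomes a learned blend of the conditioning

# Example: a 64x64 latent grid flattened to 4096 tokens, conditioned on 77 text tokens.
latents = torch.rand(1, 64 * 64, 320)
text_emb = torch.rand(1, 77, 768)             # e.g., the output of a CLIP-like text encoder
merged = CrossAttention()(latents, text_emb)  # shape: (1, 4096, 320)
```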
Then, you have the same diffusion model I covered in my Imagen video, but still in this subspace. Finally, you reconstruct the image using a decoder, which you can see as the reverse step of your initial encoder: it takes this modified and denoised input in the latent space and constructs a final high-resolution image, basically upsampling your result. And voilà! This is how you can use diffusion models for a wide variety of tasks like super-resolution, inpainting, and even text-to-image with the recently open-sourced Stable Diffusion model through the conditioning process, while being much more efficient and allowing you to run them on your own GPUs instead of requiring hundreds of them. You heard that right.
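Putting the pieces together, the whole generation loop can be summarized like this. Every component below is a toy stand-in for the real text encoder, U-Net noise predictor, and autoencoder decoder, so treat it as a structural sketch rather than the actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch runs; the real model uses a CLIP-like text encoder,
# a U-Net noise predictor with cross-attention, and the autoencoder's decoder.
text_encoder = lambda prompt: torch.rand(1, 77, 768)
unet = lambda z, t, cond: torch.randn_like(z)
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)

@torch.no_grad()
def generate(prompt, steps=50):
    """Latent diffusion, end to end: condition, denoise in the latent space, then decode."""
    cond = text_encoder(prompt)            # text -> conditioning embeddings
    z = torch.randn(1, 4, 64, 64)          # start from noise, but in the small latent space
    for t in reversed(range(steps)):
        eps = unet(z, t, cond)             # predict the noise, guided by the conditioning
        z = z - eps / steps                # placeholder for the reverse-diffusion update sketched earlier
    return decoder(z)                      # latent -> full-resolution image

image = generate("a painting of a fox")    # shape: (1, 3, 512, 512)
```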
For all the devs out there wanting to have their own text-to-image and image synthesis model running on their own GPUs, the code is available with pretrained models; all the links are below. If you do use the model, please share your tests, ideas, and results, or any feedback you have with me. I'd love to chat about that!
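If you want a quick way to try it from Python, one possible starting point is Hugging Face's diffusers library; the model name and options below reflect typical usage at the time of writing, so check the official repositories in the references for the reference scripts.

```python
import torch
from diffusers import StableDiffusionPipeline

# Download the pretrained Stable Diffusion weights (you may need to accept the model
# license on Hugging Face first) and move the pipeline to your GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate an image from a text prompt and save it.
prompt = "an astronaut riding a horse on mars, photorealistic"
image = pipe(prompt).images[0]
image.save("astronaut.png")
```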
Of course, this was just an overview of the latent diffusion model, and I invite you to read their great paper, linked below as well, to learn more about the model and the approach. Huge thanks to my friends at Qwak for sponsoring this video, and an even bigger thanks to you for watching the whole video. I will see you next week with another amazing paper!




Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2022/08/29