Google Brain's New Model Imagen Is Even More Impressive Than DALL-E 2

Written by whatsai | Published 2022/05/24
Tech Story Tags: artificial-intelligence | ai | machine-learning | youtubers | youtube-transcripts | hackernoon-top-story | google | imagen | web-monetization

TL;DR: If you thought DALL-E 2 had great results, wait until you see what this new model from Google Brain can do. DALL-E is amazing but often lacks realism, and this is what the team attacked with a new model called Imagen. Imagen can not only understand text, it can also understand the images it generates. Learn more in the video... Read the full article: https://www.louisbouchard.ai/Google-brain-imagen/

If you thought DALL-E 2 had great results, wait until you see what this new model from Google Brain can do.
DALL-E is amazing but often lacks realism, and this is what the team attacked with this new model called Imagen.
They share a lot of results on their project page as well as a benchmark, which they introduced for comparing text-to-image models, where they clearly outperform DALL-E 2 and previous image generation approaches. Learn more in the video...

References

►Read the full article: https://www.louisbouchard.ai/google-brain-imagen/
►Paper: Saharia et al., 2022, Imagen - Google Brain, https://gweb-research-imagen.appspot.com/paper.pdf
►Project link: https://gweb-research-imagen.appspot.com/
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video transcript

If you thought DALL-E 2 had great results, wait until you see what this new model from Google Brain can do. DALL-E is amazing but often lacks realism, and this is what the team attacked with this new model called Imagen. They share a lot of results on their project page as well as a benchmark, which they introduced for comparing text-to-image models, where they clearly outperform DALL-E 2 and previous image generation approaches. This benchmark is also super cool, as we see more and more text-to-image models and it's pretty difficult to compare the results, unless we assume the results are really bad, which we often do. But this model and DALL-E 2 definitely defied the odds.
TL;DR: it's a new text-to-image model that you can compare to DALL-E 2, with more realism as per human testers. So just like DALL-E 2, which I covered not even a month ago, this model takes text like "a golden retriever dog wearing a blue checkered beret and a red dotted turtleneck" and tries to generate a photorealistic image out of this weird sentence. The main point here is that Imagen can not only understand text, it can also understand the images it generates, since they are more realistic than all previous approaches. Of course, when I say "understand," I mean its own kind of understanding, which is really different from ours. The model doesn't really understand the text or the image it generates. It definitely has some kind of knowledge about it, but it mainly understands how this particular kind of sentence with these objects should be represented using pixels in an image. Still, I'll concede that it sure looks like it understands what we send it when we see those results. Obviously, you can trick it with some really weird sentences that couldn't look realistic, like this one, but it sometimes beats even your own imagination and just creates something amazing.
Still, what's even more amazing is how it works, using something I have never discussed on the channel: a diffusion model. But before using this diffusion model, we first need to understand the text input, and this is also the main difference with DALL-E. They used a huge text model, similar to GPT-3, to understand the text as well as an AI system can. So instead of training a text model along with the image generation model, they simply use a big pre-trained model and freeze it so that it doesn't change during the training of the image generation model. From their study, this led to much better results, and it seemed like the model understood text better. So this text module is how the model understands text, and this understanding is represented in what we call encodings, which is what the model has been trained to produce on huge datasets: it transfers text inputs into a space of information that it can use and understand.
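To make the "frozen text encoder" idea more concrete, here is a minimal sketch in Python, assuming the Hugging Face transformers library and a small T5 encoder as a stand-in for the much larger frozen language model described in the paper; the model name and sizes are purely illustrative.

import torch
from transformers import T5Tokenizer, T5EncoderModel

# Load a pre-trained text encoder and freeze it: it is never updated while
# the image generation model trains. ("t5-small" is only a lightweight stand-in.)
tokenizer = T5Tokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small")
text_encoder.requires_grad_(False)
text_encoder.eval()

prompt = "a golden retriever dog wearing a blue checkered beret and a red dotted turtleneck"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # The per-token hidden states are the "encodings" that condition the image model.
    text_encodings = text_encoder(**tokens).last_hidden_state  # shape: (1, seq_len, d_model)

Everything downstream only ever sees these text encodings, never the raw text.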
Now we need to use this transformed text data to generate the image, and as I said, they used a diffusion model to achieve that. But what is a diffusion model? Diffusion models are generative models that convert random Gaussian noise like this into images by learning how to reverse the Gaussian noise iteratively. They are powerful models for super-resolution and other image-to-image translations, and in this case they use a modified U-Net architecture, which I have covered numerous times in previous videos, so I won't go into the architectural details here.
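To give a feel for what "reversing Gaussian noise iteratively" means in code, here is a minimal DDPM-style sampling loop. It is a sketch only: the unet noise-prediction network, its call signature, and the simple noise schedule are assumptions for illustration, not the paper's actual implementation.

import torch

@torch.no_grad()
def sample(unet, text_encodings, shape=(1, 3, 64, 64), num_steps=1000):
    # Simple linear noise schedule (illustrative values).
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        # The (hypothetical) U-Net predicts the noise present in x at step t,
        # conditioned on the text encodings.
        eps = unet(x, t, text_encodings)
        # Standard DDPM update: remove the predicted noise...
        x = (x - betas[t] / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            # ...and re-inject a little noise for every step except the last.
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # a small (here 64x64) generated image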
Basically, the model is trained to denoise an image starting from pure noise, which it orients using the text encodings and a technique called classifier-free guidance, which they say is essential and clearly explain in their paper. I'll let you read it for more information on this technique.
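Classifier-free guidance itself fits in a few lines. The sketch below shows the core idea, reusing the same hypothetical unet as above: the network is run once with and once without the text conditioning, and the final noise prediction is pushed toward the conditioned one by a guidance weight (the value used here is illustrative, not the paper's).

def guided_noise_prediction(unet, x, t, text_encodings, guidance_weight=7.0):
    eps_cond = unet(x, t, text_encodings)  # text-conditioned prediction
    eps_uncond = unet(x, t, None)          # unconditional prediction (conditioning dropped)
    # Weights above 1 push the sample toward the text, trading diversity for fidelity.
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

In the sampling loop above, this function would simply replace the direct call to the noise-prediction network.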
So now we have a model able to take random Gaussian noise and our text encodings and denoise the noise, with guidance from the text encodings, to generate our image. But as you can see here, it isn't as simple as it sounds. The image we just generated is a very small one, as a bigger image would require much more computation and a much bigger model, which isn't viable.
Instead, we first generate a photorealistic image using the diffusion model we just discussed, and then use other diffusion models to improve the quality of the image iteratively. I have already covered super-resolution models in past videos, so I won't go into the details here, but let's do a quick overview. Once again, we want to have noise and not an image, so we cover up this initially generated low-resolution image with, again, some Gaussian noise, and we train our second diffusion model to take this modified image and improve it. Then we repeat these two steps with another model, but this time using just patches of the image instead of the full image, to achieve the same upscaling ratio while staying computationally viable. And voilà, we end up with our photorealistic high-resolution image.
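Put together, the whole pipeline reads as a small cascade. The sketch below mirrors the description above; the model objects and their sample methods are hypothetical, while the 64 → 256 → 1024 resolutions follow the paper.

def generate_image(text_encodings, base_model, sr_64_to_256, sr_256_to_1024):
    # 1) The base text-to-image diffusion model produces a small 64x64 image.
    image_64 = base_model.sample(text_encodings)
    # 2) A super-resolution diffusion model upscales it to 256x256,
    #    still guided by the same text encodings.
    image_256 = sr_64_to_256.sample(image_64, text_encodings)
    # 3) A second super-resolution model goes from 256x256 to 1024x1024,
    #    operating on patches to stay computationally viable.
    image_1024 = sr_256_to_1024.sample(image_256, text_encodings)
    return image_1024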
Of course, this was just an overview of this exciting new model with really cool results. I definitely invite you to read their great paper for a deeper understanding of their approach and a detailed analysis of the results. And you, do you think the results are comparable to DALL-E 2? Are they better or worse? I sure think it is DALL-E 2's main competitor as of now. Let me know what you think of this new Google Brain publication and of the explanation. I hope you enjoyed this video, and if you did, please take a second to leave a like and subscribe to stay up to date with exciting AI news. If you are subscribed, I will see you next week with another amazing paper!




Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2022/05/24