Will Transformers Replace CNNs in Computer Vision?

Written by whatsai | Published 2021/04/06
Tech Story Tags: ai | artificial-intelligence | nvidia | gtc | gtc2021 | hackernoon-top-story | youtubers | youtube-transcripts | web-monetization

TLDR The Swin Transformer may well be the next generation of neural network backbone for computer vision applications. This article explains how the transformer architecture can be applied to computer vision through a new paper called the Swin Transformer. Make sure you stay until the end of the video for a giveaway sponsored by NVIDIA GTC: subscribe to the newsletter for a chance to win, and the free NVIDIA GTC event takes place in two weeks. The paper and the code are linked so you can implement it yourself.

In a couple of minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.
As a bonus, make sure you stay until the end of the video for a giveaway sponsored by NVIDIA GTC!

References

►My Newsletter (subscribe here to have a chance to win!): http://eepurl.com/huGLT5
►Paper: Liu, Z. et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", 2021, https://arxiv.org/abs/2103.14030v1

Video transcript

00:00
This video is about most probably the next generation of neural networks for all computer
00:05
vision applications: The transformer architecture.
00:09
You've certainly already heard about this architecture in the field of natural language
00:13
processing, or NLP, mainly with GPT3 that made a lot of noise in 2020.
00:19
Transformers can be used as a general-purpose backbone for many different applications and
00:23
not only NLP.
00:25
In a couple of minutes, you will know how this transformer architecture can be applied
00:29
to computer vision with a new paper called the Swin Transformer by Ze Liu et al. from
00:35
Microsoft Research.
00:37
Before diving into the paper, I just wanted to tell you to stay until the end of the video,
00:41
where I will talk about my newsletter I just created and the next free NVIDIA GTC EVENT
00:47
happening in two weeks.
00:48
You should definitely stay, or skip right to it with the timeline I provide as usual, because
00:53
I will be hosting a giveaway in collaboration with NVIDIA GTC!
00:58
This video may be less flashy than usual as it doesn't really show the actual results
01:03
of a specific application.
01:04
Instead, the researchers showed how to adapt the transformer architecture from text inputs
01:10
to images, surpassing state-of-the-art convolutional neural networks in computer vision, which is much
01:15
more exciting than a small accuracy improvement, in my opinion!
01:19
And of course, they are providing the code for you to implement yourself!
01:23
The link is in the description.
01:25
But why are we trying to replace convolutional neural networks for computer vision applications?
01:30
This is because transformers can efficiently use a lot more memory and are much more powerful
01:36
when it comes to complex tasks.
01:38
This is, of course, assuming that you have the data to train them.
01:43
Transformers also use the attention mechanism introduced in the 2017 paper Attention Is
01:48
All You Need.
01:49
Attention allows the transformer architecture to compute in a parallelized manner.
01:54
It can simultaneously extract all the information we need from the input and the inter-relations within it,
02:00
unlike CNNs.
02:02
CNNs are much more localized, using small filters to compress the information towards
02:07
a general answer.
02:08
While this architecture is powerful for general classification tasks, it does not have the
02:13
spatial information necessary for many tasks like instance recognition.
02:18
This is because convolutions don't consider relations between distant pixels.
02:23
In the case of NLP, a classic type of input is a sentence, while in computer vision it is an
02:29
image.
02:30
To quickly introduce the concept of attention, let's take a simple NLP example: sending a
02:34
sentence into a transformer network to translate it.
02:38
In this case, attention is basically measuring how each word in the input sentence is associated
02:44
with each word of the output translated sentence.
02:47
Similarly, there is also what we call self-attention that could be seen as a measurement of a specific
02:53
word's effect on all other words of the same sentence.
02:57
This same process can be applied to images, calculating the attention between patches of the
03:01
image and their relations to each other, as we will discuss further in the video.
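To make this concrete, here is a minimal NumPy sketch of scaled dot-product self-attention over a handful of tokens (words or image patches). The names and sizes are illustrative assumptions, not taken from the paper or its code.

import numpy as np

def self_attention(tokens, W_q, W_k, W_v):
    # tokens: (n, d) array of token (word or patch) embeddings
    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # how strongly each token attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the sequence
    return weights @ V                                  # each output mixes information from all tokens

rng = np.random.default_rng(0)
n, d = 6, 8                                             # e.g. 6 words, or 6 image patches
tokens = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(tokens, W_q, W_k, W_v)             # (6, 8): every token now "sees" all the others

Every output row is a weighted mixture of all input tokens, which is exactly the parallel, all-pairs behaviour described above.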
03:06
Now that we know transformers are very interesting, there is still a problem when it comes to
03:11
computer vision applications.
03:12
Indeed, just like the popular saying "a picture is worth a thousand words," pictures contain
03:18
much more information than sentences, so we have to adapt the basic transformer architecture
03:23
to process images effectively.
03:26
This is what this paper is all about.
03:28
The adaptation is needed because the computational complexity of self-attention is quadratic
03:33
in the image size,
03:35
which explodes the computation time and memory needs.
03:38
Instead, the researchers replaced this quadratic computational complexity with a linear computational
03:44
complexity with respect to the image size.
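As a rough back-of-the-envelope check, here is a small Python sketch comparing the two costs, following the complexity comparison reported in the paper (global multi-head self-attention versus the windowed version); the resolution, channel count, and window size below are only example values.

h, w = 56, 56        # patches along each side of the image
C = 96               # embedding dimension
M = 7                # window size, in patches

global_msa   = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C   # quadratic in the number of patches
windowed_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C   # linear in the number of patches

print(f"global:   {global_msa:,} multiply-adds")
print(f"windowed: {windowed_msa:,} multiply-adds")        # orders of magnitude cheaper at this size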
03:47
The process to achieve this is quite simple.
03:50
At first, like most computer vision tasks, an RGB image is sent to the network.
03:55
This image is split into patches, and each patch is treated as a token.
04:00
And these tokens' features are the RGB values of the pixels themselves.
04:04
To compare with NLP, you can see the overall image as the sentence, and each patch
04:10
as a word of that sentence.
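As a rough illustration, here is how an image could be cut into non-overlapping 4x4 patches and flattened into tokens, assuming a 224x224 RGB input; this is a sketch of the idea, not the authors' implementation.

import numpy as np

image = np.random.rand(224, 224, 3)                     # H x W x RGB
patch = 4

# reshape into non-overlapping 4x4 patches, then flatten each patch into one token
grid = image.reshape(224 // patch, patch, 224 // patch, patch, 3)
grid = grid.transpose(0, 2, 1, 3, 4)                    # (56, 56, 4, 4, 3)
tokens = grid.reshape(-1, patch * patch * 3)            # (3136, 48): one 48-value "word" per patch

Each of the 3136 tokens is simply the 4 x 4 x 3 = 48 raw RGB values of its patch, mirroring the "image as sentence, patch as word" analogy.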
04:13
Self-attention is applied within local groups of patches, here referred to as windows.
04:17
Then, the windows are shifted, resulting in a new window configuration to apply self-attention
04:22
again.
04:23
This allows the creation of connections between windows while maintaining the computation
04:28
efficiency of this windowed architecture.
04:31
This is very interesting when compared with convolutional neural networks as it allows
04:35
long-range pixel relations to appear.
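Here is a toy sketch of the window partitioning and the shift between consecutive blocks, assuming a square grid of patch features; the cyclic shift via torch.roll stands in for the shifted-window step described in the paper, and is not the authors' official code.

import torch

x = torch.randn(1, 56, 56, 96)        # (batch, height, width, channels) grid of patch features
M = 7                                  # window size, in patches

def window_partition(x, M):
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)   # (num_windows * B, 49, C)

windows = window_partition(x, M)       # self-attention runs independently inside each window

# next block: shift the grid by half a window so the new windows straddle the old borders
shift = M // 2
shifted = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
shifted_windows = window_partition(shifted, M)

Because the shifted windows straddle the borders of the previous ones, information can flow between neighbouring windows from one block to the next without ever paying the cost of global attention.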
04:38
This was only for the first stage.
04:41
The second stage is very similar but concatenates the features of each group of two-by-two neighboring
04:47
patches, downsampling the resolution by a factor of two.
04:50
This procedure is repeated twice in Stages 3 and 4, producing the same feature map resolutions
04:57
as those of typical convolutional networks such as ResNets and VGG.
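A minimal sketch of this patch-merging step, assuming a (batch, H, W, C) feature grid: the four features of every 2x2 neighbourhood are concatenated, halving the resolution, and a linear layer then reduces the channels (the 4C to 2C projection follows the paper; the sizes here are illustrative).

import torch

x = torch.randn(1, 56, 56, 96)                          # stage-1 feature map
x0, x1 = x[:, 0::2, 0::2, :], x[:, 1::2, 0::2, :]       # the four patches of every 2x2 group
x2, x3 = x[:, 0::2, 1::2, :], x[:, 1::2, 1::2, :]
merged = torch.cat([x0, x1, x2, x3], dim=-1)            # (1, 28, 28, 384): half the resolution, 4x the channels

reduce = torch.nn.Linear(4 * 96, 2 * 96)                # project 4C -> 2C before the next stage
stage2_input = reduce(merged)                           # (1, 28, 28, 192)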
05:03
You may say that this is highly similar to a convolutional architecture with filters using
05:07
dot products.
05:08
Well, yes and no.
05:10
The power of convolutions is that the filters use fixed weights globally, enabling the translation-invariance
05:16
property of convolution, making it a powerful generalizer.
05:20
In self-attention, the weights are not fixed globally.
05:23
Instead, they rely on the local context itself.
05:26
Thus, self-attention takes into account each pixel, but also its relation to the other
05:32
pixels.
05:33
Also, their shifted window technique allows long-range pixel relations to appear.
05:38
Unfortunately, these long-range relations only appear with neighboring windows.
05:43
Thus, it loses the very long-range relations, showing that there is still room for improvement
05:47
of the transformer architecture when it comes to computer vision.
05:51
As they state in the paper, "It is our belief that a unified architecture
05:55
across computer vision and natural language processing could benefit
05:59
both fields, since it would facilitate joint modeling of visual and textual signals and
06:04
the modeling knowledge from both domains can be more deeply shared."
06:08
And I completely agree.
06:10
I think using a similar architecture for both NLP and computer vision could significantly
06:15
accelerate the research process.
06:17
Of course, transformers are still highly data-dependent, and nobody can say whether or not they will
06:23
be the future of either NLP or computer vision.
06:26
Still, it is undoubtedly a significant step forward for both fields!
06:31
Now that you've stayed this far, let's talk about an awesome upcoming event for our field:
06:36
GTC.
06:37
So what is GTC2021?
06:38
It is a weeklong event offering over 1,500 talks from AI leaders like Yoshua Bengio,
06:45
Yann LeCun, Geoffrey Hinton, and many more!
06:48
The conference will start on April 12 with a keynote from the CEO of NVIDIA, where he
06:53
will be hosting the three AI pioneers I just mentioned.
06:57
This will be amazing!
06:58
It is an official NVIDIA conference for AI innovators, technologists, and creatives.
07:04
The conferences cover many exciting topics,
07:07
such as automotive, healthcare, data science, energy, deep learning, education, and much
07:11
more.
07:12
You don't want to miss out!
07:14
Oh, and did I forget to mention that the registration is completely free this year?
07:18
So sign up right now and watch it with me.
07:21
The link is in the description!
07:23
What's even cooler is that NVIDIA provided me with 5 Deep Learning Institute credits that
07:28
you can use for an online, self-paced course of your choice, worth around $30 each!
07:34
The Deep Learning Institute offers hands-on training in AI for developers, data scientists,
07:40
students, and researchers to get practical experience powered by GPUs in the cloud!
07:45
I think it's an awesome platform to learn, and it is super cool that they are offering
07:49
credits to give away. Don't miss out on this opportunity!
07:52
To participate in this giveaway, you need to mention your favorite moment from the GTC
07:57
keynote on April 12 at 8:30 am Pacific Time, using the hashtag #GTCWithMe and tagging me
08:05
(@whats_ai) on LinkedIn or Twitter!
08:09
I will also be live-streaming the event on my channel so we can watch it together and discuss
08:13
it in the chat.
08:14
Stay tuned for that, and please let me know what you think of the conference afterward!
08:19
NVIDIA also provided me with two extra codes to give away to the ones subscribing to my
08:24
newsletter!
08:25
This newsletter is about sharing only ONE paper each week.
08:29
There will be a video, an article, the code, and the paper itself.
08:32
I will also add some of the projects I am working on, guides to learning machine learning,
08:37
and other exciting news!
08:39
It's the first link in the description, and I will draw the winners just after the GTC
08:43
event!
08:44
Finally, just a final word, as I wanted to personally thank the four recent YouTube
08:49
members!
08:50
Huge thanks to ebykova, Tonia Spight-Sokoya, Hello Paperspace, and Martin Petrovski for
08:58
your support, and to everyone watching the videos!
09:01
See you in the next one!

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2021/04/06