What Did AI Bring to Computer Vision?

Written by whatsai | Published 2021/05/05
Tech Story Tags: artificial-intelligence | ai | deep-learning | computer-vision | ml | machine-learning | hackernoon-top-story | youtube-transcripts | web-monetization

TLDR In this video, I openly share everything about deep nets for computer vision applications, their successes, and the limitations we have yet to address. I explain Artificial Intelligence terms and news to non-experts. The article is a transcript of a video by Louis Bouchard, who walks through the architecture of deep nets, how convolutions turn images into feature maps, and what these networks actually learn during training. via the TL;DR App

In this video, I will openly share everything about deep nets for computer vision applications, their successes, and the limitations we have yet to address.

Watch the video

References

Yuille, A.L., and Liu, C., 2021. Deep nets: What have they ever done for vision?. International Journal of Computer Vision, 129(3), pp.781–802, https://arxiv.org/abs/1805.04025.

Video Transcript

If you clicked on this video, you are certainly interested in computer vision applications like image classification, image segmentation, and object detection, and more complex tasks like face recognition, image generation, or even style transfer. As you may already know, with the growing power of our computers, most of these applications are now being realized using similar deep neural networks, what we often refer to as artificial intelligence models.
There are, of course, some differences between the deep nets used in these different vision applications, but as of now they all use the same basis of convolutions, introduced in 1989 by Yann LeCun. The major difference today is our computation power, coming from the recent advancements in GPUs.
To quickly go over the architecture: as the name says, convolution is a process where an original image or video frame, which is our input in a computer vision application, is convolved using filters that detect important small features of an image, such as edges. The network will autonomously learn filter values that detect important features to match the output we want, such as the object's name in a specific image sent as input for a classification task. These filters are usually 3x3 or 5x5 pixel squares, allowing them to detect the direction of the edges, left, right, up, or down, just like you can see in this image.
The process of convolution takes a dot product between the filter and the pixels it faces. It's basically just a sum of all the filter's values multiplied with the values of the image's pixels at the corresponding positions. Then the filter moves to the right and does it again, convolving the whole image. Once it's done, these convolved features give us the output of the first convolution layer; we call this output a feature map. We repeat this process with many other filters, giving us multiple feature maps, one for each filter used in the convolution process. Having more than one feature map gives us more information about the image, and especially more information that we can learn during training, since these filters are what we aim to learn for our task.
These feature maps are all sent into the next layer as input to produce many other, smaller feature maps. The deeper we get into the network, the smaller these feature maps get, because of the nature of convolutions, and the more general the information in these feature maps becomes, until it reaches the end of the network with extremely general information about what the image contains, spread over our many feature maps. This is used for classification, or to build a latent code representing the information present in the image in the case of a GAN architecture, generating a new image based on this code, which we refer to as encoded information. In the example of image classification, simply put, we can see that at the end of the network these small feature maps contain information about the presence of each possible class, telling you whether it's a dog, a cat, a person, etc. Of course, this is super simplified and there are other steps, but I feel like this is an accurate summary of what's going on inside a deep convolutional neural network.
If you've been following my channel and posts, you know that deep neural networks have proved to be extremely powerful again and again, but they also have weaknesses, weaknesses that we should not try to hide. As with all things in life, deep nets have strengths and weaknesses, and while the strengths are widely shared, the latter are often omitted or even discarded by companies, and ultimately by some researchers. This paper, by Alan Yuille and Chenxi Liu, aims to openly share everything about deep nets for vision applications: their successes and the limitations we have to address.
Moreover, just like with our brain, we still do not fully understand their inner workings, which makes the use of deep nets even more limited, since we cannot maximize their strengths and limit their weaknesses. As stated by O. Hobart, "it's like a road map that tells you where cars can drive, but doesn't tell you when or where cars are actually driving." This is another point they discuss in their paper, namely: what is the future of computer vision algorithms?
As you may be thinking, one way to improve computer vision applications is to understand our own visual system better, starting with our brain, which is why neuroscience is such an important field for AI. Indeed, current deep nets are surprisingly different from our own vision system. Firstly, humans can learn from very small numbers of examples by exploiting our memory and the knowledge we have already acquired. We can also exploit our understanding of the world and its physical properties to make deductions, something that a deep net cannot do. In 1999, Gopnik et al. explained that babies are more like tiny scientists who understand the world by performing experiments and seeking causal explanations for phenomena, rather than simply receiving stimuli from images like current deep nets do. Also, we humans are much more robust, as we can easily identify an object from any viewpoint, whatever texture it has, whatever occlusions it encounters, and in novel contexts.
As a concrete example, you can just visualize the annoying captchas you always have to fill in when logging into a website. These captchas are used to detect bots, since bots are awful when there are occlusions like this. As you can see here, the deep net got fooled by all the examples because of the jungle context and the fact that a monkey is not typically holding a guitar; this happens because that situation is certainly not in the training data set. Of course, this exact situation might not happen very often in real life, but I will show some more relatable examples that have already happened later on in the video. Deep nets also have strengths that we must highlight.
They can outperform us at face recognition tasks, since humans, until recently, were not used to seeing more than a few thousand people in their whole lifetime. But this strength of deep nets also comes with a limitation: the faces need to be straight, centered, clear, without any occlusions, etc. Indeed, the algorithm could not recognize your best friend at the Halloween party disguised as Harry Potter, wearing only glasses and a lightning bolt on his forehead, whereas you would instantly recognize him and say, "Whoa, that's not very original, it looks like you just put glasses on."
Similarly, such algorithms are extremely precise radiologists: if all the settings are similar to what they have been seeing during their training, they will outperform any human. This is mainly because even the most expert radiologists have only seen a fairly small number of CT scans in their lives. As the authors suggest, the superiority of algorithms may also come from the fact that they are doing a low-priority task for humans. For example, a computer vision app on your phone can identify the hundreds of plants in your garden much better than most of us watching the video can, but a plant expert will surely outperform it, and all of us together as well.
But again, this strength comes with a huge problem related to the data the algorithm needs in order to be this powerful. As the authors mention, and as we often see on Twitter and in article titles, there are biases due to the data sets these deep nets are trained on, since an algorithm is only as good as the data set it is evaluated on and the performance measures used. This data set limitation comes at the price that these deep neural networks are much less general-purpose, flexible, and adaptive than our own visual system. They are less general-purpose and flexible in the sense that, contrary to our visual system, where we automatically perform edge detection, binocular stereo, semantic segmentation, object classification, scene classification, and 3D depth estimation, deep nets can only be trained to achieve one of these tasks. Indeed, simply by looking around, your vision system automatically achieves all these tasks with extreme precision, where deep nets have difficulty achieving similar precision on just one of them.
But even if this seems effortless to us, half of our neurons are at work processing the information and analyzing what's going on. We are still far from mimicking our vision system, even with the current depth of our networks. But is that really the goal of our algorithms? Would it be better to just use them as tools to compensate for our weaknesses? I couldn't say, but I am sure that we want to address the deep nets' limitations that can cause serious consequences, rather than omitting them. I will show some concrete examples of such consequences just after introducing these limitations, but if you are too intrigued, you can skip right to them following the timestamps under the video and come back to the explanation afterwards.

Indeed, the lack of precision of deep nets we previously mentioned arises mainly because of the disparity between the data we use to train our algorithm and what it sees in real life. As you know, an algorithm needs to see a lot of data to iteratively improve at the task it is trained for; this data is often referred to as a training data set. This data disparity between the training data set and the real world is a problem because the real world is too complicated to be accurately represented in a single data set, which is why deep nets are less adaptive than our vision system.
In the paper, they call this the combinatorial complexity explosion of natural images. The combinatorial complexity comes from the multitude of possible variations within a natural image, like the camera pose, lighting, texture, material, background, the position of the objects, etc. Biases can appear at any of these levels of complexity that the data set is missing. You can see how even these large data sets now seem very small due to all these factors: considering only, let's say, 13 of these different parameters, and allowing only 1,000 different values for each of them, we quickly jump to an astronomical number of different images needed to represent only a single object. Current data sets cover only a handful of these multitudes of possible variations for each object, thus missing most real-world situations the model will encounter in production.
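The arithmetic behind that explosion is easy to verify. Taking the figures from the example above, 13 independent scene parameters with 1,000 values each, as rough illustrative assumptions:

```python
# Back-of-the-envelope count of distinct images of a SINGLE object,
# assuming 13 independent scene parameters (pose, lighting, texture, ...)
# with 1,000 distinguishable values each. Both figures are the rough
# illustrative numbers used in the video, not measured quantities.
parameters = 13
values_per_parameter = 1_000
variations = values_per_parameter ** parameters

print(f"{variations:.1e}")  # 1.0e+39 possible images of one object

# For scale: ImageNet, one of the largest benchmarks, holds ~14 million images.
imagenet_size = 14_000_000
print(f"{variations / imagenet_size:.1e}")  # ~7.1e+31 times larger
```

Even a benchmark the size of ImageNet covers a vanishingly small fraction of the variation space of a single object, let alone thousands of object classes.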
It's also worth mentioning that, since the variety of images is very limited, the network may find shortcuts to detecting some objects, as we saw previously with the monkey, where it was detecting a human instead of a monkey because of the guitar in front of it. Similarly, you can see that it's detecting a bird here instead of a guitar, probably because the model has never seen a guitar with a jungle background. This is called overfitting to the background context, where the algorithm does not focus on the right thing and instead finds a pattern in the images themselves rather than in the object of interest. Also, these data sets are all built from images taken by photographers, meaning that they only cover specific angles and poses that do not transfer to all orientation possibilities in the real world.
Currently, we use benchmarks with the most complex data sets possible to compare and rate the current algorithms, benchmarks which, if you recall, are very incomplete compared to the real world. Nonetheless, we are often happy with 99% accuracy for a task on such benchmarks. Firstly, the problem is that this one-percent error is determined on a benchmark data set, meaning that, like our training data set, it doesn't represent the richness of natural images. That's normal, because it's impossible to represent the real world in just a bunch of images; it's way too complicated, and there are too many possible situations. So the benchmarks we use to determine whether or not our models are ready to be deployed in a real-world application are not really accurate at determining how well they will actually perform. This leads to the second problem: how will the model actually perform in the real world? Let's say that the benchmark data set is huge, most cases are covered, and we really do have 99% accuracy. What are the consequences of the one percent of cases where the algorithm fails in the real world?
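To make that one percent concrete, here is a quick back-of-the-envelope sketch; the daily prediction volume below is a made-up number for illustration, not a figure from the paper:

```python
# Illustrative arithmetic only: what "99% accurate" means at deployment scale.
accuracy = 0.99
daily_predictions = 1_000_000  # hypothetical volume for a deployed vision system

daily_failures = round(daily_predictions * (1 - accuracy))
print(daily_failures)  # 10000 failing predictions every single day
```

A "tiny" one-percent error rate turns into thousands of real failures per day once a system runs at scale, which is why the nature of those failures matters so much.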
This number will be represented in misdiagnoses, accidents, financial mistakes, or even worse, deaths. Such a case could be a self-driving car on a heavily rainy day, with the rain heavily affecting the depth sensors used by the vehicle and causing it to fail many depth estimations. Would you trust your life to this partially blind robot taxi? I don't think I would. Similarly, would you trust a self-driving car at night to avoid driving over pedestrians or cyclists, when even you would have difficulty seeing them? These kinds of life-threatening situations are so broad that it's almost impossible for them all to be represented in the training data set. Of course, here I use extreme examples from the most relatable applications, but you can just imagine how harmful this could be when a perfectly trained and tested algorithm misclassifies your CT scan, leading to a misdiagnosis, just because your hospital has different settings on their scanner, or because you didn't drink enough water, or anything else that differs from the training data; it could lead to a major problem in real life, even if the benchmark used to test the model says it's perfect. Also, as has already happened, this can lead to people in underrepresented demographics being unfairly treated by these algorithms, or even worse. This is why I argue that we must focus on the tasks where the algorithms help us, and not where they replace us, as long as they are this dependent on data.
This brings us to the two questions the authors highlight. One: how can we efficiently test these algorithms to ensure that they work on these enormous data sets if we can only test them on a finite subset? And two: how can we train algorithms on finite-size data sets so that they can perform well on the truly enormous data sets required to capture the combinatorial complexity of the real world?
In the paper, they suggest rethinking our methods for benchmarking performance and evaluating vision algorithms, and I agree entirely, especially now that most applications are made for real-life users instead of only academic competitions. It's crucial to move beyond these academic evaluation metrics and create more appropriate evaluation tools. We also have to accept that data bias exists and that it can cause real-world problems. Of course, we need to learn to reduce these biases, but also to accept them: biases are inevitable due to the combinatorial complexity of the real world, which cannot yet be realistically represented in a single data set of images. We should thus focus our attention (no play on words with transformers intended) on better algorithms that can learn to be fair even when trained on such incomplete data sets, rather than on bigger and bigger models trying to represent the most data possible.
Even if it may look like it, this paper was not a criticism of current approaches. Instead, it's an opinion piece motivated by discussions with other researchers in several disciplines. As they state, "we stress that views expressed in the paper are our own and do not necessarily reflect those of the computer vision community." But I must say, this was a very interesting read, and my views are quite similar. They also discuss many important innovations that happened over the last 40 years in computer vision, which are definitely worth reading about. As always, the link to the paper is in the description below.

To end on a more positive note: we are nearly a decade into the revolution of deep neural networks that started in 2012 with AlexNet and the ImageNet competition. Since then, there has been immense progress in our computation power and in deep net architectures, like the use of batch normalization, residual connections, and, more recently, self-attention. Researchers will undoubtedly keep improving the architecture of deep nets, but we should not forget that there are other ways to achieve intelligent models than going deeper and using more data; of course, these ways are yet to be discovered. If this story of deep neural networks sounds interesting to you, I made a video on one of the most interesting architectures, along with a short historical review of deep nets. I'm sure you'll love it. Thank you for watching!

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2021/05/05