How Waymo Combines Lidar and Cameras for 3D Object Detection

Written by whatsai | Published 2022/03/28
Tech Story Tags: autonomous-driving | ai | artificial-intelligence | machine-learning | hackernoon-top-story | waymo | youtubers | youtube-transcripts | web-monetization

TLDR: How do autonomous vehicles see? How do they work, how can they see the world, and what exactly do they see compared to us? Understanding how they work is essential if we want to put them on the road, especially if you work in government or are drafting the next regulations. We previously covered how Tesla Autopilot sees and works, but Teslas are different from conventional autonomous vehicles. We look into this in the video with a new research paper by Waymo and Google Research. Watch the video!

How do autonomous vehicles see?
You’ve probably heard of LiDAR sensors or the other unusual cameras they use. But how do they work, how can they see the world, and what exactly do they see compared to us? Understanding how they work is essential if we want to put them on the road, especially if you work in government or are drafting the next regulations, but also as a client of these services.
We previously covered how Tesla Autopilot sees and works, but Teslas are different from conventional autonomous vehicles. Tesla only uses cameras to understand the world, while most others, like Waymo, use regular cameras together with 3D LiDAR sensors. These LiDAR sensors are pretty simple to understand: instead of images like regular cameras, they produce 3D point clouds. A LiDAR sensor measures the distance to an object by timing how long the laser pulse it projects takes to travel to the object and back.
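To make that time-of-flight idea concrete, here is a minimal sketch (a generic illustration, not code from the paper) of how the round-trip time of a pulse converts to a distance:

```python
# Time-of-flight ranging: the pulse travels to the object and back,
# so the one-way distance is half the round trip at the speed of light.
SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def lidar_range(round_trip_seconds: float) -> float:
    """Distance to the object in meters, given the pulse's round-trip time."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

# Example: a pulse that returns after ~400 nanoseconds hit something ~60 m away.
print(lidar_range(400e-9))  # ~59.96
```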
Still, how can we efficiently combine this information and have the vehicle understand it? And what does the vehicle end up seeing? Only points everywhere? Is it enough for driving on our roads? We will look into this in the video with a new research paper by Waymo and Google Research...

Watch the video!

References

►Read the full article: https://www.louisbouchard.ai/waymo-lidar/
►Piergiovanni, A.J., Casser, V., Ryoo, M.S. and Angelova, A., 2021. 4D-Net for learned multi-modal alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15435–15445). https://openaccess.thecvf.com/content/ICCV2021/papers/Piergiovanni_4D-Net_for_Learned_Multi-Modal_Alignment_ICCV_2021_paper.pdf
►Google Research's blog post: https://ai.googleblog.com/2022/02/4d-net-learning-multi-modal-alignment.html?m=1
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video transcript

How do autonomous vehicles see? You've probably heard of LiDAR sensors or the other weird cameras they are using, but how do they work, how can they see the world, and what do they see compared to us? Understanding how they work is essential if we want to put them on the road, especially if you work in government or build the next regulations, but also as a client of these services. We previously covered how Tesla Autopilot sees and works, but Teslas are different from conventional autonomous vehicles. Tesla only uses cameras to understand the world, while most of them, like Waymo, use regular cameras and 3D LiDAR sensors.
These LiDAR sensors are pretty simple to understand: they won't produce images like regular cameras, but 3D point clouds. A LiDAR sensor measures distances by timing how long the laser pulses it projects take to reach an object and come back. This way, it produces very few data points, each carrying valuable and exact distance information, as you can see here. These data points are called point clouds, which just means that what we see are many points at the right positions, creating some sort of 3D model of the world. Here you can see how LiDAR, on the right, isn't precise enough to tell what it is seeing, but it is pretty good at locating it using very little information, which is perfect for processing the data efficiently in real time, an essential criterion for autonomous vehicles. This minimal amount of data with high spatial precision is perfect because, coupled with RGB images as shown on the left, we get both accurate distance information and the accurate object information we lack with LiDAR data alone, especially for far-away objects or people.
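To give a rough sense of what a point cloud looks like as data (a generic illustration, not from the paper), a single LiDAR sweep is just a short list of 3D coordinates, far smaller than an uncompressed camera frame:

```python
import numpy as np

# A made-up sweep of 20,000 points, each with x, y, z coordinates in meters.
points = np.random.uniform(low=[-50, -50, -2], high=[50, 50, 4], size=(20_000, 3))

distances = np.linalg.norm(points, axis=1)   # exact range from the sensor to each point
nearby = points[distances < 30.0]            # e.g. keep everything within 30 m

# ~0.48 MB for the whole sweep, versus ~6.2 MB for one uncompressed 1080p RGB frame.
print(points.shape, points.nbytes / 1e6, "MB")
print(nearby.shape)
```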
This is why Waymo and other autonomous vehicle companies use both kinds of sensors to understand the world. Still, how can we efficiently combine this information and have the vehicle understand it? And what does the vehicle end up seeing? Only points everywhere? Is it enough for driving on our roads? We will look into this with a new research paper by Waymo and Google Research.
To release many advancements like this one, the researchers needed to run many, many experiments and be super organized. Plus, their code needs to be easily reproducible and near perfection, as many people will depend on it in life-or-death situations. Fortunately, these are two strong points of this episode's sponsor, Weights & Biases. If you want to publish papers in big conferences or create the future of autonomous vehicles, I think using Weights & Biases will certainly help. It changed my life as a researcher and for my work at designstripe. Weights & Biases will automatically track each run, the hyperparameters, the GitHub version, and the hardware and OS used. Then you can easily create your own workspace using filters, groups, and your own panels to display anything you need to analyze. How can you not be well organized with such a tool? It basically contains everything you need for your code to be reproducible without you even trying, for your future colleagues and the people who will implement your amazing work. Please take the time to ensure your work is reproducible, and if you want help with that, try out Weights & Biases.
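For a rough idea of what that tracking looks like in practice, here is a minimal sketch using the wandb Python client; the project name, config values, and metrics are made up for illustration:

```python
import wandb

# Start a tracked run; hyperparameters go in the config so every run is comparable.
run = wandb.init(
    project="3d-object-detection",  # hypothetical project name
    config={"learning_rate": 1e-4, "batch_size": 8, "frames": 16},
)

for step in range(100):
    train_loss = 1.0 / (step + 1)   # stand-in for a real training loss
    wandb.log({"train/loss": train_loss})

run.finish()  # the run page now holds the config, metrics, code version, and hardware info
```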
I think I couldn't summarize the paper better than the sentence they used in their article: "We present 4D-Net, which learns how to combine 3D point clouds in time and RGB camera images in time, for the widespread application of 3D object detection in autonomous driving." I hope you enjoyed the video, please subscribe and... I'm just kidding, let's dive a little deeper into this sentence. This is what the 3D object detection we are talking about looks like, and it's also what the car will end up seeing: a very accurate representation of the world around the vehicle, with all objects appearing and precisely identified. How cool does that look? And more interesting, how did they end up with this result?
They produced this view using LiDAR data, called point clouds in time (PCiT), and regular cameras, here called RGB videos. These are both four-dimensional inputs, just like how we humans see and understand the world. The fourth dimension comes from the video being taken over time, so the vehicle has access to past frames to help it understand context and objects and to guess future behaviors, just like we do; the three others are the 3D space we are familiar with. We call this task scene understanding; it has been widely studied in computer vision and has seen many advancements with the recent progress of the field and of machine learning algorithms. It's also crucial in self-driving vehicles, where we want a near-perfect comprehension of the scenes.
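To picture these two four-dimensional inputs, here is a minimal sketch of their shapes; the counts of 16 frames and 32 point clouds come from the results quoted later in the video, while the image resolution and point counts are made up:

```python
import numpy as np

num_frames = 16   # RGB frames used per prediction (from the results quoted later)
num_sweeps = 32   # LiDAR point clouds in time (from the results quoted later)

# RGB video: time x height x width x color channels (resolution made up here).
rgb_video = np.zeros((num_frames, 192, 320, 3), dtype=np.uint8)

# Point clouds in time: one set of (x, y, z) points per timestep; sizes vary per sweep.
point_clouds_in_time = [
    np.random.uniform(-50.0, 50.0, size=(np.random.randint(15_000, 20_000), 3))
    for _ in range(num_sweeps)
]

print(rgb_video.shape)                                           # (16, 192, 320, 3)
print(len(point_clouds_in_time), point_clouds_in_time[0].shape)  # 32, roughly (N, 3)
```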
Here you can see that the two networks always talk to each other through connections. This is mainly because when we take images, we have objects at various ranges in the shot and with different proportions: the car in front will look much bigger than the car far away, but we still need to consider both. Like us, when we see someone far away and feel like it's our friend but still wait until we're closer to be sure before calling their name, the car will lack details for such far-away objects. To patch for that, we extract and share information from different levels in the network. Sharing information throughout the network is a powerful solution because neural networks use small detectors of fixed size to condense the image the deeper we get into the network, meaning that early layers will be able to detect small objects, or the edges and parts of bigger objects, while deeper layers will lose the small objects but be able to detect large objects with great precision.
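As a minimal sketch of that multi-level idea (a generic CNN backbone, not the paper's architecture), each stage halves the resolution, so early feature maps keep fine detail while deeper ones summarize larger regions:

```python
import torch
import torch.nn as nn

# A tiny backbone: each stage halves the spatial resolution of its input.
backbone = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU()),
])

x = torch.randn(1, 3, 192, 320)   # a dummy RGB frame
features = []
for stage in backbone:
    x = stage(x)
    features.append(x)            # keep every level so later steps can reuse them

for level, feat in enumerate(features):
    print(f"level {level}: {tuple(feat.shape)}")
# level 0: (1, 32, 96, 160)  -> fine detail, useful for small or distant objects
# level 3: (1, 256, 12, 20)  -> coarse summary, useful for large, nearby objects
```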
The main challenge with this approach is combining these two very different kinds of information through these connections: the LiDAR 3D space data and the more regular RGB frames. Using both kinds of information at all network steps, as described earlier, is best for understanding the whole scene, but how can we merge two different streams of information and use the time dimension efficiently? This data translation between the two branches is what the network learns during training, in a supervised way, with a process similar to the self-attention mechanisms I covered in previous videos, by trying to recreate the real model of the world.
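Here is a minimal sketch of one such learned connection in that attention-like spirit (my simplified stand-in, not the paper's exact mechanism): the LiDAR-branch feature produces weights that decide how much of each RGB feature level to blend in:

```python
import torch
import torch.nn as nn

class LearnedConnection(nn.Module):
    """Blend several RGB feature levels into a LiDAR-branch feature with learned weights."""

    def __init__(self, lidar_dim: int, rgb_dim: int, num_levels: int):
        super().__init__()
        self.score = nn.Linear(lidar_dim, num_levels)  # one weight per RGB level
        self.project = nn.Linear(rgb_dim, lidar_dim)   # align channel sizes

    def forward(self, lidar_feat, rgb_levels):
        # lidar_feat: (batch, lidar_dim); rgb_levels: (batch, num_levels, rgb_dim),
        # e.g. globally pooled RGB features taken from several backbone depths.
        weights = torch.softmax(self.score(lidar_feat), dim=-1)       # (batch, levels)
        mixed_rgb = (weights.unsqueeze(-1) * rgb_levels).sum(dim=1)   # (batch, rgb_dim)
        return lidar_feat + self.project(mixed_rgb)                   # fused feature

fusion = LearnedConnection(lidar_dim=128, rgb_dim=256, num_levels=4)
fused = fusion(torch.randn(2, 128), torch.randn(2, 4, 256))
print(fused.shape)  # torch.Size([2, 128])
```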
But to facilitate this data translation, they use a model called PointPillars, which takes point clouds and gives a two-dimensional representation. You can see this as a "pseudo image" of the point cloud, as they call it, creating somewhat of a regular image representing the point cloud with the same properties as the RGB images we have in the other branch. Instead of the pixels being red, green, and blue colors, they simply represent the depth and positions of the objects, the x, y, z coordinates. This pseudo image is also really sparse, meaning that the information in this representation is only dense around important objects and most probably useful for the model. Regarding time, as I said, we simply have the fourth dimension in the inputs to keep track of the frames.
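To make the pseudo-image idea concrete, here is a minimal sketch (a simplified stand-in, not the actual PointPillars implementation) that scatters LiDAR points into a top-down grid so the point cloud can be treated like an image:

```python
import numpy as np

def pseudo_image(points: np.ndarray, x_range=(0.0, 80.0), y_range=(-40.0, 40.0),
                 resolution=0.5) -> np.ndarray:
    """Count LiDAR points per top-down grid cell; points is an (N, 3) array of x, y, z."""
    width = int((x_range[1] - x_range[0]) / resolution)
    height = int((y_range[1] - y_range[0]) / resolution)
    grid = np.zeros((height, width), dtype=np.float32)  # mostly zeros: very sparse

    # Keep only the points that land inside the grid, then accumulate them per cell.
    inside = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
              (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    cols = ((points[inside, 0] - x_range[0]) / resolution).astype(int)
    rows = ((points[inside, 1] - y_range[0]) / resolution).astype(int)
    np.add.at(grid, (rows, cols), 1.0)
    return grid

cloud = np.random.uniform(low=[0, -40, -2], high=[80, 40, 2], size=(20_000, 3))
bev = pseudo_image(cloud)
print(bev.shape, float(bev.sum()))  # (160, 160), 20000.0 points scattered into the grid
```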
These two branches we see are convolutional neural networks that encode the inputs, as described in multiple of my videos, and then decode this encoded information to recreate the 3D representation we have here. So it uses a very similar encoder for both branches, shares information between them, and reconstructs a 3D model of the world using a decoder. And voilà! This is how the Waymo vehicles see our world.
It can process 32 point clouds in time and 16 RGB frames within 164 milliseconds, producing better results than other methods. This might not ring any bell, so we can compare it with the next best approach, which is less accurate and takes 300 milliseconds, almost double the time to process.
Of course, this was just an overview of this new paper by Google Research and Waymo. I'd recommend reading the paper to learn more about their model's architecture and other features I didn't dive into, like the time information's efficiency problem. It's linked in the description below. I hope you enjoyed the video, and if you did, please consider subscribing to the channel and commenting what you think of this summary; I'd love to read what you think. Thank you for watching, and I will see you next week with another amazing paper!

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.
Published by HackerNoon on 2022/03/28