DeepMind May Have Just Created the World's First General AI

Written by whatsai | Published 2022/05/16

TLDR

Gato from DeepMind was just published! It is a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more! Indeed, it is trained once and uses the same weights to achieve all those tasks. Gato is a multi-modal agent, meaning that it can create captions for images or answer questions as a chatbot. It understands words, images, and even physics... Learn more in the video transcript below.

Gato from DeepMind was just published! It is a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more! Indeed, it is trained once and uses the same weights to achieve all those tasks. And as per DeepMind, this is not only a transformer but also an agent. This is what happens when you mix Transformers with progress on multi-task reinforcement learning agents.

As we said, Gato is a multi-modal agent, meaning that it can create captions for images or answer questions as a chatbot. You’d say that GPT-3 can already do that, but Gato can do more. The multi-modality comes from the fact that Gato can also play Atari games at the human level or even do real-world tasks like controlling robotic arms to move objects precisely. It understands words, images, and even physics...

Learn more in the video

References

►Read the full article: https://www.louisbouchard.ai/deepmind-gato/
►DeepMind's blog post: https://www.deepmind.com/publications/a-generalist-agent
►Paper: Reed S. et al., 2022, DeepMind: Gato, https://storage.googleapis.com/deepmind-media/A%20Generalist%20Agent/Generalist%20Agent.pdf
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video transcript

0:00

Gato from deepmind was just published

0:02

it's a single transformer that can play

0:04

atari games caption images chat with

0:07

people control a real robotic arm and

0:09

more indeed it is trained once and uses the

0:12

same weights to achieve all those tasks

0:15

and as per deepmind this is not only a

0:17

transformer but also an agent this is

0:20

what happens when you mix transformers

0:22

with progress on multi-task

0:23

reinforcement learning agents as we said

0:26

gato is a multi-modal agent meaning that

0:29

it can create captions for images or

0:31

answer questions as a chatbot you'd say

0:34

that gpt3 can already do that but gato

0:36

can do more the multimodality comes from

0:39

the fact that gato can also play atari

0:41

games at the human level or even do real

0:44

world tasks like controlling robotic

0:46

arms to move objects precisely it

0:48

understands words images and even

0:51

physics gato is the first generalist

0:54

model that performs so well on so many

0:56

different tasks and it's extremely

0:58

promising for the field it was trained

1:00

on 604 distinct tasks with varying

1:03

modalities observations and action

1:06

specifications making it the perfect

1:08

generalist and as i said it does all

1:11

that with the same network and weights

1:13

and before you ask it only needs 1.2

1:15

billion parameters compared to gpt3 that

1:18

requires

1:19

175 billion of them it's not a trap

1:22

where you have to retrain or fine-tune it

1:24

to all tasks you can send both an image

1:27

and text and it will work you can even

1:29

add in a few movements from a robot arm

1:32

the model can decide which type of

1:34

output to provide based on its context

1:36

ranging from text to discrete actions in

1:38

an environment if you enjoyed the video

1:41

please consider subscribing and let me

1:43

know if you like this kind of news video

1:46

i will definitely do more this is possible

1:48

because of their tokenization process

1:50

tokenization is when you prepare your

1:52

inputs for the model as they do not

1:55

understand text or images by themselves

1:57

language models and gato took the

1:59

total number of subwords for example 32,000

2:02

and each word has a number assigned

2:05

to it for images they follow the vit

2:08

patch embedding using a widely used

2:10

resnet block as we covered in a previous

2:12

video we also tokenized the button

2:14

presses as integer numbers for atari

2:16

games or discrete values finally for

2:19

continuous values like proprioceptive

2:21

inputs we talked about with the robotic

2:23

arms they encoded the different track

2:25

metrics into float numbers and added them

2:27

after the text tokens using all those

2:30

different inputs the agent adapts to the

2:32

current task to generate appropriate

2:34

outputs during training they use prompt

2:36

conditioning as in gpt3 with previously

2:39

sampled actions and observations the

2:42

progress in generalist rl agents in the

2:44

last years has been incredible and came

2:47

mainly from deepmind one could see that

2:49

they are moving the needle closer to

2:51

general ai or human level intelligence

2:55

if we can finally define it i love how

2:57

many details they gave in their paper

2:59

and i'm excited to see what they will do

3:01

or what other people will do using this

3:03

model's architecture the link to the

3:06

paper for more information about the

3:07

model is in the description i hope you

3:09

enjoyed this short video i just saw this

3:12

news when i woke up and i couldn't do

3:13

anything else than make this video

3:15

before starting my day it's just too

3:17

exciting i will see you next week with

another amazing paper
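
To make the tokenization step from the transcript a bit more concrete, here is a minimal sketch of how continuous readings (like the proprioceptive inputs from a robotic arm) could be turned into integer tokens that sit after the 32,000 text subword ids. This is only an illustration under stated assumptions, not DeepMind's code: the number of bins (1,024), the mu-law squashing constants, and every function name below are assumptions made for this example. The text ids are passed in as plain integers because the subword tokenizer itself is out of scope here.

```python
import math

TEXT_VOCAB_SIZE = 32_000  # subword vocabulary size mentioned in the transcript
NUM_BINS = 1_024          # assumption: number of buckets for continuous values
MU, M = 100.0, 256.0      # assumption: mu-law squashing constants


def mu_law(x: float) -> float:
    """Squash a real value so that large magnitudes are compressed."""
    scaled = math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0)
    return math.copysign(scaled, x)


def tokenize_continuous(x: float) -> int:
    """Map one continuous reading to an integer id placed after the text ids."""
    squashed = max(-1.0, min(1.0, mu_law(x)))  # clip to [-1, 1]
    bin_index = min(NUM_BINS - 1, int((squashed + 1.0) / 2.0 * NUM_BINS))
    return TEXT_VOCAB_SIZE + bin_index


def build_sequence(text_ids: list[int], readings: list[float], button: int) -> list[int]:
    """Concatenate text ids, continuous-value tokens, and one discrete action.

    The ordering here is a hypothetical layout chosen for illustration.
    """
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_ids):
        raise ValueError("text ids must fall inside the subword vocabulary")
    # Button presses are already integers, so they are used as-is.
    return list(text_ids) + [tokenize_continuous(r) for r in readings] + [button]


if __name__ == "__main__":
    # Two made-up text ids, three joint readings, and button number 3.
    print(build_sequence([5, 17], [0.0, 0.25, -2.0], 3))
```

The point of the offset is that a single transformer can tell "word number 512" apart from "sensor bucket 512" just by looking at the integer, which is what lets one set of weights consume text, actions, and sensor values in the same sequence.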

Written by whatsai | I explain Artificial Intelligence terms and news to non-experts.

Published by HackerNoon on 2022/05/16