7. E.g. 200 nodes in the hidden layer, so:
[(80*80)*200 + 200] + [200*1 + 1] = ~1.3M parameters
[Figure: an 80 x 80 (height x width) input array feeding a two-layer policy network (Layer 1, Layer 2)]
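As a sanity check on that arithmetic, a minimal sketch (assuming a fully connected net with a 6,400-dim input, 200 hidden units, a single output unit, and biases counted as above):

```python
import numpy as np

# Hypothetical shapes for the two-layer policy network described above.
W1 = np.zeros((200, 80 * 80))  # hidden layer: 200 x 6400 weights
b1 = np.zeros(200)             # hidden layer biases
W2 = np.zeros((1, 200))        # output layer: 1 x 200 weights
b2 = np.zeros(1)               # output layer bias

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 1280401, i.e. ~1.3M
```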
8. The network does not see the rendered game screen; it sees 80*80 = 6,400 numbers.
It gets a reward of +1 or -1, some of the time.
Q: How do we efficiently find a good setting of the 1.3M parameters?
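Where do the 6,400 numbers come from? A sketch of the preprocessing in the spirit of Karpathy's gist (crop the 210x160x3 Atari frame, downsample by 2, binarize); treat the exact crop indices and background pixel values as gist-specific details:

```python
import numpy as np

def prepro(frame):
    """Turn a 210x160x3 uint8 Pong frame into an 80x80 = 6,400 float vector."""
    frame = frame[35:195]                 # crop to the playing field
    frame = frame[::2, ::2, 0].copy()     # downsample by 2, keep one color channel
    frame[frame == 144] = 0               # erase background (type 1)
    frame[frame == 109] = 0               # erase background (type 2)
    frame[frame != 0] = 1                 # paddles and ball become 1
    return frame.astype(np.float64).ravel()
```

The gist actually feeds the difference between the current and previous preprocessed frame, so the network can see motion.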
9. Problem is easy if you want to be inefficient...
1. Repeat Forever:
2. Sample 1.3M random numbers
3. Run the policy for a while
4. If the performance is best so far, save it
5. Return the best policy
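A minimal sketch of this random search, assuming a hypothetical evaluate(params) that runs the policy for a while and returns its total score:

```python
import numpy as np

def random_search(evaluate, n_params=1_280_401, iterations=1000):
    """Guess-and-check: keep whichever random parameter vector scored best."""
    best_params, best_score = None, -np.inf
    for _ in range(iterations):                 # "Repeat Forever" (capped here)
        params = np.random.randn(n_params)      # sample ~1.3M random numbers
        score = evaluate(params)                # run the policy for a while
        if score > best_score:                  # if the performance is best so far, save it
            best_params, best_score = params, score
    return best_params                          # return the best policy
```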
26. Supervised Learning
maximize: sum_i log p(y_i | x_i), for images x_i and their labels y_i.
Reinforcement Learning
maximize: sum_i A_i log p(a_i | x_i), where:
1) we have no labels, so we sample the actions from the policy itself: a_i ~ p(a | x_i)
2) once we collect a batch of rollouts, we score each sampled action with A_i (next slide).
27. We call A_i the advantage: it's a number, like +1.0 or -1.0, based on how this action eventually turned out.
28. A +ve advantage will make that action more likely in the future, for that state.
A -ve advantage will make that action less likely in the future, for that state.
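A minimal sketch of that objective, with illustrative names: for each rollout step i, p_i is the probability the policy assigned to UP, a_i is the action actually sampled (1 = UP, 0 = DOWN), and A_i is the advantage. The quantity we want to increase is sum_i A_i * log p(a_i | x_i):

```python
import numpy as np

def pg_objective(p, a, A):
    """sum_i A_i * log p(a_i | x_i) for a batch of sampled actions.

    p: array of P(UP) values, a: array of taken actions (0/1), A: array of advantages.
    """
    log_p_taken = a * np.log(p) + (1 - a) * np.log(1 - p)  # log prob of the action we took
    return np.sum(A * log_p_taken)
```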
30. Discounting
Blame each action assuming that its effects have exponentially decaying impact into the future.
[Figure: two rollout timelines, one ending in Reward +1.0 and one in Reward -1.0]
31. Discounted rewards, gamma = 0.9: the final reward is spread backward over the preceding actions with exponentially shrinking weight (e.g. -1, -0.9, -0.81, ... leading up to the -1.0 reward).
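A sketch of that computation in the shape the gist uses (the gist sets gamma = 0.99 and, Pong-specifically, resets the running sum at game boundaries; the slide's example uses gamma = 0.9):

```python
import numpy as np

def discount_rewards(r, gamma=0.99):
    """Spread each +1/-1 reward backward with exponentially decaying weight."""
    discounted = np.zeros_like(r, dtype=np.float64)
    running_add = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running_add = 0.0   # Pong-specific: a nonzero reward marks a game boundary
        running_add = running_add * gamma + r[t]
        discounted[t] = running_add
    return discounted
```

With gamma = 0.9 and a final -1 reward this produces exactly the ..., -0.81, -0.9, -1 pattern sketched above.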
32. [Figure: discounted reward plotted against time, from episode start to the -1 reward when the ball gets past our paddle. Annotations: "we screwed up somewhere here" and "this was all hopeless and only contributes noise to the gradient".]
34. A 130-line gist, with numpy as the only dependency:
https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5
35. Nothing too scary over here.
We use OpenAI Gym.
And start the main training loop.
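A sketch of that setup, written against the older Gym API the gist used (env.reset() returning just the observation, env.step() returning four values); prepro() and policy_forward(), and the model dict of weights, are the pieces sketched elsewhere in these notes, and actions 2/3 move the paddle up/down as in the gist:

```python
import gym
import numpy as np

env = gym.make("Pong-v0")              # older Gym API, matching the era of the gist
observation = env.reset()              # raw 210x160x3 frame
prev_x = None                          # previous preprocessed frame (for difference images)

while True:                            # main training loop
    cur_x = prepro(observation)
    x = cur_x - prev_x if prev_x is not None else np.zeros(80 * 80)
    prev_x = cur_x

    aprob, h = policy_forward(x, model)                 # P(UP) and hidden activations
    action = 2 if np.random.uniform() < aprob else 3    # 2 = UP, 3 = DOWN

    observation, reward, done, info = env.step(action)
    if done:                           # episode finished (someone reached 21 points)
        observation = env.reset()
        prev_x = None
```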
38. Bookkeeping so that we can do
backpropagation later. If you were
to use PyTorch or something, this
would not be needed.
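A sketch of that bookkeeping: each step we stash the input, the hidden activations, the gradient of the taken action's log probability, and the reward, so the whole episode can be backpropagated once it ends. Variable names follow the spirit of the gist and assume the loop above:

```python
xs, hs, dlogps, drs = [], [], [], []   # per-episode buffers

# Inside the main loop, after sampling `action` (with P(UP) = aprob) and stepping the env:
xs.append(x)                       # the input (difference frame) at this step
hs.append(h)                       # hidden activations, needed for backprop later
y = 1 if action == 2 else 0        # "fake label": the action we actually took
dlogps.append(y - aprob)           # grad of log prob of the taken action w.r.t. the logit
drs.append(reward)                 # reward observed after this action
```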
39. A small piece of backprop: the derivative of the [log probability of the taken action given this image] with respect to the [output of the network (before the sigmoid)].
40. Recall: p = sigmoid(logit) is the probability of UP, and the log probability of the taken action y (1 = UP, 0 = DOWN) is y*log(p) + (1 - y)*log(1 - p); the loss is its negative.
More compact: the derivative of that log probability with respect to the logit is just (y - p).
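A sketch of a gist-style forward pass plus that one gradient (biases omitted for brevity, model is a plain dict of weight arrays):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def policy_forward(x, model):
    """x: 6,400-dim difference frame. Returns P(UP) and the hidden activations."""
    h = model['W1'] @ x          # hidden layer pre-activations
    h[h < 0] = 0                 # ReLU nonlinearity
    logit = model['W2'] @ h      # scalar output before the sigmoid
    return sigmoid(logit), h

# If y = 1 when UP was actually taken (else 0) and p = P(UP) as computed above, then
#   d(log p(taken action)) / d(logit) = y - p
# which is exactly the per-step number stored as (y - aprob) in the bookkeeping sketch.
```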
42. Once a rollout is done, concatenate together all images, hidden states, etc. that were seen in this batch. Again, if using PyTorch, there is no need to do this.
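A sketch of that end-of-episode step, in the spirit of the gist: stack the per-step buffers into arrays, compute discounted rewards, standardize them into the advantage, and use it to modulate the stored log-prob gradients before backprop:

```python
import numpy as np

# epx: inputs, eph: hidden states, epdlogp: the stored (y - p) values, epr: raw rewards.
epx = np.vstack(xs)
eph = np.vstack(hs)
epdlogp = np.vstack(dlogps)
epr = np.vstack(drs)
xs, hs, dlogps, drs = [], [], [], []        # reset buffers for the next episode

discounted_epr = discount_rewards(epr)      # see the discounting sketch above
discounted_epr -= np.mean(discounted_epr)   # standardize the advantage
discounted_epr /= np.std(discounted_epr)    # (helps control gradient estimator variance)

epdlogp *= discounted_epr                   # modulate each step's gradient by its advantage
```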
47. In summary
1. Initialize a policy network at random
2. Repeat Forever:
3. Collect a bunch of rollouts with the policy
4. Increase the probability of actions that worked well
5. ???
6. Profit.
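Step 4, "increase the probability of actions that worked well," is just a gradient ascent step on the accumulated policy gradient. The gist uses RMSProp; a sketch of that update, assuming model, grad_buffer, and rmsprop_cache are dicts of same-shaped arrays accumulated over a batch of episodes, with illustrative hyperparameters:

```python
import numpy as np

learning_rate = 1e-3
decay_rate = 0.99   # RMSProp memory for past squared gradients

for k in model:                     # e.g. model = {'W1': ..., 'W2': ...}
    g = grad_buffer[k]              # gradient accumulated over the batch
    rmsprop_cache[k] = decay_rate * rmsprop_cache[k] + (1 - decay_rate) * g**2
    model[k] += learning_rate * g / (np.sqrt(rmsprop_cache[k]) + 1e-5)
    grad_buffer[k] = np.zeros_like(model[k])   # reset for the next batch
```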