Pong from Pixels
Deep RL Bootcamp
Andrej Karpathy, Aug 26, 2017
MDP
Policy network
Input: an [80 x 80] array (height x width) of pixels.
Policy network
E.g. 200 nodes in the hidden layer, so:
[(80*80)*200 + 200] + [200*1 + 1] = ~1.3M parameters
(Layer 1 and Layer 2, acting on the [80 x 80] input array)
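To make that count concrete, here is a minimal numpy sketch of such a two-layer policy network (biases are included here to match the slide's arithmetic; the actual gist skips them):

```python
import numpy as np

D, H = 80 * 80, 200                  # input pixels, hidden nodes

W1, b1 = np.random.randn(H, D) * 0.01, np.zeros(H)   # Layer 1
W2, b2 = np.random.randn(H) * 0.01, 0.0              # Layer 2

def policy_forward(x):
    """x: flattened 80*80 image; returns P(action = UP)."""
    h = np.maximum(0, W1 @ x + b1)       # ReLU hidden layer
    logit = float(W2 @ h + b2)           # single output node
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> probability

print(W1.size + b1.size + W2.size + 1)   # 1,280,401 ~= 1.3M parameters
```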
Policy network
The network does not see the rendered game as we do; it sees 80*80 = 6,400 raw numbers.
It gets a reward of +1 or -1, some of the time.
Q: How do we efficiently find a good setting of the 1.3M parameters?
Problem is easy if you want to be inefficient...
1. Repeat Forever:
2. Sample 1.3M random numbers
3. Run the policy for a while
4. If the performance is best so far, save it
5. Return the best policy
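As a literal sketch of that guess-and-check loop (run_policy is a hypothetical helper that plays a few episodes with the given parameters and returns the average reward):

```python
import numpy as np

def random_search(run_policy, n_params=1_280_401, iters=1000):
    """Guess-and-check: sample random parameter settings, keep the best.
    `run_policy(theta)` is an assumed helper, not part of the gist."""
    best_theta, best_reward = None, -np.inf
    for _ in range(iters):                 # "Repeat forever" (ish)
        theta = np.random.randn(n_params)  # sample 1.3M random numbers
        reward = run_policy(theta)         # run the policy for a while
        if reward > best_reward:           # best so far? save it
            best_theta, best_reward = theta, reward
    return best_theta                      # return the best policy
```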
Policy Gradients
Suppose we had the training labels…
(we know what to do in any state)
(x1,UP)
(x2,DOWN)
(x3,UP)
...
maximize: sum_i log p(y_i | x_i)
Except, we don’t have labels...
Should we go UP or DOWN?
Except, we don’t have labels...
“Try a bunch of stuff and see what happens. Do more of the stuff that worked in the future.” -RL
Let’s just act according to our current policy...
Roll out the policy and collect an episode.
4 rollouts:
Collect many rollouts...
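A sketch of collecting one such rollout under the current policy (classic Gym API; policy_forward is from the earlier sketch, preprocess is a gist-style frame preprocessor shown later, and in Pong's Atari action set 2 = UP, 3 = DOWN):

```python
import numpy as np

def collect_rollout(env, policy_forward, preprocess):
    """Play one episode with the current policy, remembering the
    states, sampled actions, and rewards for learning later."""
    xs, ys, rs = [], [], []
    obs, done = env.reset(), False
    while not done:
        x = preprocess(obs)                          # 80*80 float vector
        p_up = policy_forward(x)                     # P(UP | x)
        y = 1 if np.random.uniform() < p_up else 0   # sample the action
        obs, r, done, _ = env.step(2 if y == 1 else 3)
        xs.append(x); ys.append(y); rs.append(r)
    return xs, ys, rs
```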
Not sure what we did here, but apparently it was good.
Not sure what we did here, but it was bad.
Pretend every action we took here was the correct label: maximize sum_i log p(y_i | x_i) over the good rollouts.
Pretend every action we took here was the wrong label: maximize sum_i -log p(y_i | x_i) over the bad rollouts.
Supervised Learning
maximize: sum_i log p(y_i | x_i), for images x_i and their labels y_i.
Reinforcement Learning
1) We have no labels, so we sample: y_i ~ p(y | x_i)
2) Once we collect a batch of rollouts, maximize: sum_i A_i log p(y_i | x_i)
We call A_i the advantage: a number, like +1.0 or -1.0, based on how this action eventually turned out.
A positive (+ve) advantage will make that action more likely in the future, for that state; a negative (-ve) advantage will make it less likely.
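A toy numerical illustration of that advantage-weighted objective, with made-up numbers:

```python
import numpy as np

probs   = np.array([0.7, 0.2, 0.9])     # p(UP | x_i) under the policy
actions = np.array([1,   0,   1  ])     # sampled actions (1=UP, 0=DOWN)
adv     = np.array([+1.0, +1.0, -1.0])  # how each action turned out

# log-probability the policy assigned to the action we actually took:
logp = actions * np.log(probs) + (1 - actions) * np.log(1 - probs)
objective = np.sum(adv * logp)  # maximize: sum_i A_i log p(y_i | x_i)
```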
Discounting
Blame each action assuming that its effects have exponentially decaying impact into the future.
[Figure: two rollouts, one ending in reward +1.0 and one in reward -1.0, with their discounted rewards for gamma = 0.9: the discounted return grows toward the +1.0 reward (..., 0.21, 0.24, 0.27, ...), approaches the -1.0 reward as ..., -0.81, -0.9, -1, and is 0 elsewhere.]
[Figure: discounted reward vs. time over one episode, from episode start until the ball gets past our paddle (-1 reward). Annotations: "we screwed up somewhere here"; "this was all hopeless and only contributes noise to the gradient".]
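A minimal numpy version of this discounting, essentially the gist's discount_rewards (the reset at nonzero rewards is Pong-specific: each point scored is a game boundary, so credit shouldn't leak across it):

```python
import numpy as np

def discount_rewards(r, gamma=0.9):
    """Per-step rewards -> exponentially discounted returns,
    computed backwards through time."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    running = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running = 0.0          # game boundary: reset the sum
        running = running * gamma + r[t]
        out[t] = running
    return out

print(discount_rewards([0, 0, 0, -1]))  # [-0.729 -0.81  -0.9   -1.  ]
```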
130-line gist, numpy as the only dependency.
https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5
Nothing too scary over here.
We use OpenAI Gym.
And start the main training loop.
Get the current image and preprocess it.
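The preprocessing is a few lines of numpy, essentially what the gist's prepro does: crop, downsample, binarize, flatten:

```python
import numpy as np

def prepro(I):
    """210x160x3 uint8 Atari frame -> 6400 (80x80) float vector."""
    I = I[35:195]        # crop to the play area
    I = I[::2, ::2, 0]   # downsample by factor of 2, keep one channel
    I[I == 144] = 0      # erase background (background type 1)
    I[I == 109] = 0      # erase background (background type 2)
    I[I != 0] = 1        # paddles and ball just become 1
    return I.astype(float).ravel()
```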
Bookkeeping so that we can do backpropagation later. If you were to use PyTorch or something, this would not be needed.
A small piece of backprop: the derivative of the [log probability of the taken action given this image] with respect to the [output of the network (before the sigmoid)].
Recall p = sigmoid(logit), and the per-step objective is log p(y | x); more compactly, the derivative is simply y - p.
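In code this collapses to one subtraction per step; a minimal sketch (y is the sampled action used as a fake label, p the network's probability of UP):

```python
def dlogp_dlogit(y, p):
    """d/dlogit of log p(y|x), where p = sigmoid(logit) = P(UP|x)
    and y in {0, 1} is the action we actually sampled.
    log p(y|x) = y*log(p) + (1-y)*log(1-p)  =>  derivative = y - p."""
    return y - p

# e.g. we sampled UP (y=1) with p=0.7: gradient 0.3 nudges the logit
# up, making UP a bit more likely next time (before advantage weighting).
print(dlogp_dlogit(1, 0.7))
```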
Step the environment (execute the action, get new state and record the reward).
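With the classic OpenAI Gym API of the time, that step looks roughly like this (assumes an older gym with the Atari environments installed; newer Gymnasium returns five values from step and a tuple from reset):

```python
import gym

env = gym.make("Pong-v0")
observation = env.reset()

action = 2  # Pong's UP in the Atari action set (3 is DOWN)
# Execute the action, get the new state, and record the reward:
observation, reward, done, info = env.step(action)
```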
Once a rollout is done, concatenate together all images, hidden states, etc. that were seen in this batch. Again, if using PyTorch, there is no need to do this.
backprop!!!!!!1
Advantage modulation
Use RMSProp for the parameter update.
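A sketch of that update, close to what the gist does per weight matrix; note the += since we are doing gradient ascent on the objective:

```python
import numpy as np

def rmsprop_update(params, grads, cache, lr=1e-3, decay=0.99, eps=1e-5):
    """RMSProp: scale each gradient by a running RMS of its history,
    so all ~1.3M parameters take comparably sized steps."""
    for k in params:
        cache[k] = decay * cache[k] + (1 - decay) * grads[k] ** 2
        params[k] += lr * grads[k] / (np.sqrt(cache[k]) + eps)
```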
prints etc
In summary
1. Initialize a policy network at random
2. Repeat Forever:
3. Collect a bunch of rollouts with the policy
4. Increase the probability of actions that worked well
5. ???
6. Profit.
Thank you! Questions?
