Thom Lane
14th December 2018
Proximal Policy Optimization
OpenAI, 2017
Policy Approximation
Learn actions. Optionally values too.
i.e. learn the policy directly,
rather than indirectly via values (e.g. Q-values).
Actor Critic methods learn both.
Works for discrete and continuous action spaces.
Can learn probabilities (with a softmax output)
for discrete actions.
Can learn distribution parameters (mean and std. dev.)
for continuous actions.
Can be easier to learn in some cases (e.g. Tetris).
Stochastic Policies
Discrete Action Spaces
Often use Categorical Distribution.
Continuous Action Spaces
Often use Diagonal Gaussian Distribution.
Standard deviation can depend on state.
Use log to remove >0 constraint.
Can learn exploration directly in the policy
Smooths learning
Stochastic Policies can be optimal
Useful for Continuous Control
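A minimal NumPy sketch of the two distribution choices above. The logits, mean and log-std values are placeholders standing in for network outputs, not taken from any real model; note how exponentiating the unconstrained log-std removes the >0 constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete action space: a categorical policy from softmax logits.
def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical network output
probs = softmax(logits)
discrete_action = rng.choice(len(probs), p=probs)

# Continuous action space: a diagonal Gaussian policy.
# The network outputs the mean and the *log* of the standard deviation;
# any real-valued log_std maps to a strictly positive std via exp().
mean = np.array([0.5, -0.2])         # hypothetical network output
log_std = np.array([-1.0, 0.0])      # unconstrained
std = np.exp(log_std)                # always > 0
continuous_action = mean + std * rng.standard_normal(mean.shape)
```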
Policy Gradients
J(θ) = v_{π_θ}(s₀)

∇J(θ) ∝ Σ_s μ(s) Σ_a q_π(s, a) ∇π_θ(a|s)

∇J(θ) = E_π[ Σ_a q_π(S_t, a) ∇π_θ(a|S_t) ]

∇J(θ) = E_π[ q_π(S_t, A_t) ∇π_θ(A_t|S_t) / π_θ(A_t|S_t) ]

∇J(θ) = E_π[ G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t) ]

∇J(θ) ≈ G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)   (single-sample estimate)
Policy Gradients
J(θ) = v_{π_θ}(s₀)

∇J(θ) ≈ G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)

Gradient Ascent:

θ_{t+1} ← θ_t + α ∇J(θ)

θ_{t+1} ← θ_t + α G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)

∇J(θ) ≈ G_t ∇log π_θ(A_t|S_t), because ∇log x = ∇x / x
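A minimal NumPy sketch of this update on a toy two-action softmax policy. The parameters, sampled action, return and learning rate are made-up values, just to show the θ ← θ + α G_t ∇log π update moving probability toward the rewarded action.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.0, 0.0])   # hypothetical policy parameters (logits)
alpha = 0.1                    # learning rate
action, G = 0, 2.0             # sampled action and its return

# Gradient of log pi(action) w.r.t. softmax logits: one_hot(action) - probs
probs = softmax(theta)
grad_log_pi = np.eye(2)[action] - probs

# theta <- theta + alpha * G_t * grad log pi(A_t|S_t)
theta = theta + alpha * G * grad_log_pi
```

After one step the probability of the rewarded action rises above its initial 50%.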
Policy Gradients

θ_{t+1} ← θ_t + α G_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)

[Figure: successive update steps plotted in parameter space, axes θ₁ and θ₂]

Worked example, step 1:
Action Space: left and right
State: A
Sampled action: left
Return: +2
Probability of action: 75%
Update direction: (2 / 0.75) ∇π_θ(left|s)

Worked example, step 2:
Action Space: left and right
State: B
Sampled action: right
Return: -1
Probability of action: 50%
Update direction: (-1 / 0.5) ∇π_θ(right|s)
Problem #1: Updates affect trajectories
Start with ‘on-policy’ learning.
We calculate gradient using trajectories collected from the current policy.
After a single update, we’re already ‘off-policy’.
We calculate gradient using trajectories collected from the previous policy.
Gradient calculation is only valid for ‘on-policy’ trajectories.
Can’t keep updating without collecting new trajectories.
Similar to ‘overfitting’:
We might see the loss fall if we keep training on old data, but…
it’s very likely to perform badly in the next rollouts. So ignore the loss!
Solution #1: Importance Sampling
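Importance sampling lets us keep using trajectories collected under the previous policy: each sample is reweighted by the ratio of new to old action probability, so the expectation is taken under the new policy. A toy NumPy sketch under made-up policies and returns:

```python
import numpy as np

pi_old = np.array([0.5, 0.5])   # behaviour policy (collected the data)
pi_new = np.array([0.8, 0.2])   # current policy we want to evaluate
returns = np.array([1.0, 3.0])  # hypothetical return for each action

# On-policy expectation under pi_new (ground truth here):
true_value = (pi_new * returns).sum()   # 0.8*1.0 + 0.2*3.0 = 1.4

# Off-policy estimate: sample actions from pi_old, then weight each
# sample by the ratio pi_new(a) / pi_old(a).
rng = np.random.default_rng(0)
actions = rng.choice(2, size=100_000, p=pi_old)
weights = pi_new[actions] / pi_old[actions]
estimate = (weights * returns[actions]).mean()
```

The reweighted off-policy estimate converges to the on-policy value of 1.4.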
Problem #2: Similar values
Actions usually have similar expected returns from a given state.
A convoluted path is taken through parameter space using this gradient.
[Figure: parameter space, axes θ₁ and θ₂, with updates for actions a₁, a₂ and a₃]
Action Space: a₁, a₂ and a₃
State: C
Sampled Actions*: a₁, a₂ and a₃
Returns: 0.8, 1.0 and 1.2
Probability of action: equal
* assume we’ve seen State C 3 times in the current trajectories.
Solution #2: Advantage
Shift from absolute to relative: compared to what was expected.

A_π(s, a) = q_π(s, a) − v_π(s)

[Figure: parameter space, axes θ₁ and θ₂, with updates for actions a₁, a₂ and a₃]
Action Space: a₁, a₂ and a₃
State: C
Sampled Actions*: a₁, a₂ and a₃
Returns: 0.8, 1.0 and 1.2
Probability of action: equal (i.e. 1/3)
Estimated State Value: 1.1
Advantages: -0.3, -0.1 and 0.1
* assume we’ve seen State C 3 times in the current trajectories.
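The slide’s numbers can be checked directly: subtracting the critic’s state-value estimate from each return gives the advantages, centring the update around “better or worse than expected”.

```python
import numpy as np

# From the slide: three actions sampled in State C.
returns = np.array([0.8, 1.0, 1.2])
state_value = 1.1                      # estimated v(C)
advantages = returns - state_value     # A = q - v, with returns estimating q
# -> [-0.3, -0.1, 0.1]: only the best action gets a positive update.
```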
Actor Critic Methods

A_π(s, a) = q_π(s, a) − v_π(s)

Advantage Actor Critic (A2C)
1) Use v as a baseline
2) Use v to bootstrap q with 1-step TD

θ_{t+1} ← θ_t + α A_t ∇π_θ(A_t|S_t) / π_θ(A_t|S_t)

w_{t+1} ← w_t + α [R_t + γ v_w(S′) − v_w(S)] ∇v_w(S)

[Diagram: the Actor maps an observation to action probabilities; the Critic maps it to a state value]
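The two update rules above can be sketched for a single transition (s, a, r, s′). The linear critic v_w(s) = w·s and linear-softmax actor are hypothetical choices made purely to keep the sketch self-contained; the TD error serves as the advantage estimate A_t.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

gamma, alpha = 0.9, 0.1
s = np.array([1.0, 0.0])        # current state features
s_next = np.array([0.0, 1.0])   # next state features
a, r = 0, 1.0                   # sampled action and reward
w = np.zeros(2)                 # critic parameters
theta = np.zeros((2, 2))        # actor parameters, one row per action

# Critic: 1-step TD error, using v_w to bootstrap the target.
td_error = r + gamma * (w @ s_next) - (w @ s)
w = w + alpha * td_error * s    # grad of v_w(s) = w.s w.r.t. w is s

# Actor: use the TD error as the advantage estimate A_t.
probs = softmax(theta @ s)
grad_log_pi = np.outer(np.eye(2)[a] - probs, s)   # grad of log pi(a|s)
theta = theta + alpha * td_error * grad_log_pi
```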
Trust Region Policy Optimization (TRPO)

max_θ  Ê_t[ r_t(θ) Â_t ]

subject to  Ê_t[ KL[π_θ_old(·|s_t), π_θ(·|s_t)] ] ≤ δ

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)
Proximal Policy Optimization (PPO)

r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)

max_θ  Ê_t[ r_t(θ) Â_t ]

max_θ  Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]
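The clipped surrogate is easy to state directly in NumPy. The example values of r_t and Â_t are made up, but they show the key property: with a positive advantage, pushing the ratio past 1 + ε gains nothing, while with a negative advantage a large ratio is still fully penalised.

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Mean of min(r*A, clip(r, 1-eps, 1+eps)*A) over a batch."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped).mean()

ratio = np.array([0.5, 1.0, 1.5])   # pi_new / pi_old per timestep
adv = np.array([1.0, 1.0, 1.0])
objective = ppo_clip_objective(ratio, adv)
# Per-element terms: 0.5, 1.0, 1.2 (the 1.5 ratio is clipped to 1.2).
```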
Proximal Policy Optimization (PPO)
Multiple losses in PPO
L_t^CLIP(θ) − c₁ L_t^VF(θ) + c₂ S[π_θ](s_t)
Clipped Policy Gradient
Value Error
Entropy
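The three terms combine into a single objective to maximise. The coefficients c1 and c2 and the input numbers below are illustrative placeholders, with the value error taken as a mean squared error, which is a common but here assumed choice.

```python
import numpy as np

def ppo_loss(clip_term, value_pred, value_target, entropy, c1=0.5, c2=0.01):
    """Clipped policy term, minus weighted value error, plus entropy bonus."""
    value_error = np.mean((value_pred - value_target) ** 2)  # L^VF
    return clip_term - c1 * value_error + c2 * entropy

total = ppo_loss(clip_term=0.9,
                 value_pred=np.array([1.0, 2.0]),
                 value_target=np.array([1.0, 1.0]),
                 entropy=0.69)
# 0.9 - 0.5 * 0.5 + 0.01 * 0.69
```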
Advantages
Sample efficient
A little bit: we can take multiple update steps on a single batch,
which would usually break on-policy rules.
Stable: due to clipping.
Easy to implement.
Thanks!
