Practical RL with TensorFlow
Illia Polosukhin, XIX.ai
Reinforcement Learning Problem
OpenAI Gym
- Library of environments: control, Atari, Doom, etc.
- Same API across all environments
- Provides a way to share and compare results
https://gym.openai.com/
Acting in an Environment
Random Agent
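The code from this slide did not survive extraction. As a rough sketch of the idea, the loop below runs a random agent against a tiny stand-in environment. `CoinFlipEnv` is hypothetical and only mirrors the classic Gym-style `reset()` / `step(action)` interface (observation, reward, done, info); it is not part of Gym, and its rewards are made up for illustration.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment with a Gym-style reset/step interface."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0
        self.coin = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.t = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        """Reward 1.0 if the action matches the current coin, else 0.0."""
        reward = 1.0 if action == self.coin else 0.0
        self.t += 1
        self.coin = self.rng.randint(0, 1)
        done = self.t >= self.episode_length
        return self.coin, reward, done, {}


def run_random_agent(env, num_actions=2, seed=0):
    """Play one episode with uniformly random actions; return total reward."""
    rng = random.Random(seed)
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = rng.randrange(num_actions)  # ignores the observation
        observation, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


env = CoinFlipEnv()
episode_reward = run_random_agent(env)
print("episode reward:", episode_reward)
```

With a real Gym environment the loop keeps the same shape; only the environment construction and the action sampling would change.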
Let’s review some theory
Markov Decision Process
MDP $\langle S, A, P, R, \gamma \rangle$
- S: set of states
- A: set of actions
- P(s, a, s’): probability of transitioning from s to s’ under action a
- R(s): reward function
- 𝛾: discount factor
Trace: $\{\langle s_0, a_0, r_0 \rangle, \ldots, \langle s_n, a_n, r_n \rangle\}$
Definitions
- Return: total discounted reward: $R_t = \sum_{k=0}^{n-t} \gamma^k r_{t+k}$
- Policy: the agent’s behavior
- Deterministic policy: $\pi(s) = a$
- Stochastic policy: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$
- Value function: expected return starting from state s:
- State-value function: $V^{\pi}(s) = E_{\pi}[R_t \mid S_t = s]$
- Action-value function: $Q^{\pi}(s, a) = E_{\pi}[R_t \mid S_t = s, A_t = a]$
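As a small illustration of the return defined above, the sketch below computes the total discounted reward for a list of rewards by accumulating from the last step backwards. The function name and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k], accumulated from the last step back."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(example_return)
```

Averaging this quantity over many episodes that start in the same state gives a Monte Carlo estimate of the state-value function for the policy that generated them.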
Deep Q Learning
- Model-free, off-policy technique to learn the optimal Q(s, a):
- $Q_{i+1}(s, a) \leftarrow Q_i(s, a) + \alpha \left( r + \gamma \max_{a'} Q_i(s', a') - Q_i(s, a) \right)$
- The optimal policy is then $\pi(s) = \arg\max_{a'} Q(s, a')$
- Requires exploration (ε-greedy) to visit a variety of transitions from each state.
- Take a random action with probability ε; start with ε high and decay it to a low value as training progresses.
- Deep Q Learning: approximate Q(s, a) with a neural network: $Q(s, a; \theta)$
- Do stochastic gradient descent using loss
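To make the update rule and ε-greedy exploration concrete, here is a minimal tabular sketch (a table instead of a neural network) on a hypothetical four-state chain where only reaching the rightmost state pays a reward. The environment, the hyperparameters and the decay schedule are all assumptions chosen for illustration. In the deep variant the table is replaced by a network, and a common choice of loss is the squared temporal-difference error.

```python
import random

N_STATES, N_ACTIONS, GOAL = 4, 2, 3  # actions: 0 = left, 1 = right


def step(state, action):
    """Hypothetical chain dynamics: reward 1.0 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    epsilon = 1.0  # start fully random
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)  # explore
            else:
                action = max(range(N_ACTIONS), key=lambda a: q[state][a])  # exploit
            next_state, reward, done = step(state, action)
            # Q-learning target: r + gamma * max_a' Q(s', a'), or just r at the end.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
        epsilon = max(0.1, epsilon * 0.99)  # decay towards a small floor
    return q


q_table = train()
greedy_policy = [max(range(N_ACTIONS), key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

Because the update uses the maximum over next actions rather than the action actually taken, the learned values describe the greedy policy even though the data was collected with random exploration, which is what makes the method off-policy.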
Q-network
Run Optimization
Full example: https://github.com/ilblackdragon/tensorflow-rl/blob/master/examples/atari-rl.py
Monitored Session
- Handles pitfalls of distributed training.
- Saves and restores checkpoints.
- Hooks are a general interface for injecting computation into the TensorFlow training loop.
Original Results on Atari Games
Mnih et al., 2013
Beating Human Level
Mnih et al., 2015
Policy Gradient
- Given a policy $\pi_\theta(a \mid s)$, find the $\theta$ that maximizes expected return:
$J(\theta) = \sum_s d^{\pi}(s) V^{\pi}(s)$
- In Deep RL, we approximate $\pi_\theta(a \mid s)$ with a neural network.
- Usually with a softmax layer on top to estimate the probability of each action.
- We can estimate $J(\theta)$ from samples of observed behavior: $\sum_{k=0}^{T} p_\theta(\tau_k \mid \pi) R(\tau_k)$
- Do stochastic gradient ascent using the update:
$\theta_{i+1} = \theta_i + \alpha \frac{1}{T} \sum_{k=0}^{T} \nabla_\theta \log p_\theta(\tau_k \mid \pi) R(\tau_k)$
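The update above can be sketched on the simplest possible problem: a hypothetical two-armed bandit, where each trajectory is a single action and the policy is a softmax over two parameters. The arm rewards, step size and step count are assumptions for illustration; for a softmax policy, the gradient of log π(a) with respect to parameter j is (1 if j = a else 0) minus π(j).

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(steps=2000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]        # one logit per action
    arm_rewards = [0.0, 1.0]  # hypothetical: arm 1 always pays 1, arm 0 pays 0
    for _ in range(steps):
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1  # sample from the policy
        reward = arm_rewards[action]
        for j in range(len(theta)):
            grad_log_pi = (1.0 if j == action else 0.0) - probs[j]
            theta[j] += alpha * reward * grad_log_pi  # gradient ascent step
    return softmax(theta)


final_probs = reinforce_bandit()
print(final_probs)
```

Each step uses a single sample, so this is the stochastic (one-trajectory) form of the averaged update; the policy shifts probability towards the arm whose sampled return is higher.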
Policy Network
Run Optimization
Async Advantage Actor-Critic (A3C)
- Asynchronous: uses multiple instances of environments and networks.
- Actor-Critic: uses both a policy and an estimate of the value function.
- Advantage: estimates how much better or worse the outcome was than expected.
Image by Arthur Juliani
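One common way to compute the advantage term is as the discounted return observed from each step minus the critic's value estimate for that step. The sketch below assumes that form, with a bootstrap value standing in for the return after the last recorded step; the function name and the example numbers are illustrative and not taken from the slides.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma):
    """Advantage per step: discounted return from that step minus V(s_t).

    bootstrap_value is the critic's estimate for the state after the last
    recorded step (use 0.0 if the episode ended there).
    """
    advantages = []
    running_return = bootstrap_value
    for reward, value in zip(reversed(rewards), reversed(values)):
        running_return = reward + gamma * running_return
        advantages.append(running_return - value)
    advantages.reverse()
    return advantages


advantages = n_step_advantages(
    rewards=[0.0, 0.0, 1.0],
    values=[0.5, 0.5, 0.5],
    bootstrap_value=0.0,
    gamma=0.5,
)
print(advantages)
```

A positive advantage means the outcome was better than the critic expected, so the actor's update raises the probability of that action; a negative one lowers it.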
Policy and Value Networks
Run optimization
A3C Results on Atari Games
Mnih et al., 2016
Mnih et al., 2016
Practical use cases
- Robotics
- Finance
- Industrial optimization
- Predictive assistant
Illia Polosukhin
XIX.ai
@ilblackdragon, illia@xix.ai
Questions?
Full code will be available soon at
https://github.com/ilblackdragon/tensorflow-rl/


Editor's Notes

  1. Let’s start by defining the problem that we are trying to solve. ... Agents divide into model-based and model-free agents. A model-based agent tries to simulate the environment internally and makes decisions based on that simulation. A model-free agent just takes an observation and chooses an action. This is interesting because it is very close to how animals and people learn: from limited feedback from the environment or a teacher. Animals get positive reinforcement when developing reflexes, and children get positive or negative reinforcement from their parents for their behaviour.
  2. Let’s review some theory around RL. The set of states and actions, together with the rules for transitioning from one state to another, makes up a Markov decision process. One episode of this process (e.g. one game) forms a finite sequence of states, actions and rewards. An additional term: the sequence [(s, a), ..] is called a trajectory.
  3. Model-free means that the agent does not approximate or learn the MDP internally. Observations are stored in a replay buffer and used as training data for the model. Off-policy means that the optimal policy is learned independently of the actions the agent actually takes. Because the greedy policy is deterministic, we force the agent to explore by taking a random action with probability ε, where ε starts high and slowly decays as training progresses. An Atari game, for example, has a huge number of possible states (the number of colors raised to the number of pixels). E.g. Breakout on an 84x84 pixel screen with 256 colors has 256^(84*84) possible screens, and it would take a very long time to visit each of them even once. So we approximate Q with a neural network, which can learn to handle states based on their similarity. Deep Q Learning, popularized by DeepMind, was the first Deep RL model that worked.
  4. Expected return can be defined in a few ways. One way is to define it as the sum of the state-value function over all states, weighted by how often we end up in each state under the current policy (this weighting is also called the stationary distribution). It can be estimated from observations (trajectories) as a sum of the probability of each trajectory under the policy multiplied by the reward from that trajectory.
  5. Asynchronous: Unlike DQN, where a single agent represented by a single neural network interacts with a single environment, A3C utilizes multiple incarnations of the above in order to learn more efficiently. In A3C there is a global network and multiple worker agents, each with its own set of network parameters. Each of these agents interacts with its own copy of the environment at the same time as the other agents are interacting with theirs. The reason this works better than having a single agent (beyond the speedup of getting more work done) is that the experience of each agent is independent of the experience of the others. In this way the overall experience available for training becomes more diverse. Actor-Critic: Actor-Critic combines the benefits of both approaches (value-based and policy-based). In the case of A3C, our network will estimate both a value function V(s) (how good a certain state is to be in) and a policy π(s) (a set of action probability outputs). These will each be separate fully-connected layers sitting at the top of the network. Critically, the agent uses the value estimate (the critic) to update the policy (the actor) more intelligently than traditional policy gradient methods. Advantage: The insight of using advantage estimates rather than just discounted returns is to allow the agent to determine not just how good its actions were, but how much better they turned out to be than expected.
  6. Mean and median human-normalized scores on 57 Atari games using the human starts evaluation metric. D-DQN - double DQN. A3C paper - https://arxiv.org/pdf/1602.01783.pdf
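
The state-count estimate in note 3 can be checked with a few lines of arithmetic. This sketch assumes, as the note does, an 84x84 screen with 256 possible values per pixel, and counts every combination of pixel values as a distinct state.

```python
import math

width, height, colors = 84, 84, 256

num_pixels = width * height        # pixels per frame
num_states = colors ** num_pixels  # every combination of pixel values

# One frame takes 8 bits per pixel, so the count is 2 ** (8 * num_pixels).
bits_per_frame = num_pixels * 8
# Number of decimal digits, computed without converting the huge integer to text.
decimal_digits = math.floor(num_pixels * math.log10(colors)) + 1

print("pixels:", num_pixels)
print("bits per frame:", bits_per_frame)
print("decimal digits in the state count:", decimal_digits)
```

A number with this many digits is far beyond anything a table could hold, which is the motivation for approximating Q with a neural network.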