Reinforcement Learning
Dr. Subrat Panda
Head of AI and Data Sciences,
Capillary Technologies
About Me
● BTech (2002), PhD (2009) – CSE, IIT Kharagpur
● Synopsys (EDA), IBM (CPU), NVIDIA (GPU), Taro (Full Stack Engineer), Capillary (Principal
Architect - AI)
● Applying AI to Retail
● Co-Founded IDLI (for social good) with Prof. Amit Sethi (IIT Bombay), Jacob Minz (Synopsys)
and Biswa Gourav Singh (Capillary)
● https://www.facebook.com/groups/idliai/
● LinkedIn - https://www.linkedin.com/in/subratpanda/
● Facebook - https://www.facebook.com/subratpanda
● Twitter - @subratpanda | Email - subrat.panda@capillarytech.com
Overview
• Supervised Learning: Immediate feedback (labels provided for every input).
• Unsupervised Learning: No feedback (No labels provided).
• Reinforcement Learning: Delayed scalar feedback (a number called reward).
• RL deals with agents that must sense & act upon their environment. This combines
classical AI and machine learning techniques. It is probably the most comprehensive problem
setting.
• Examples:
• Robot-soccer
• Share Investing
• Learning to walk/fly/ride a vehicle
• Scheduling
• AlphaGo / Super Mario
(Figure: overview of machine learning approaches – http://techleer.com)
Some Definitions and Assumptions
MDP - Markov Decision Problem
The framework of the MDP has the following elements:
1. state of the system,
2. actions,
3. transition probabilities,
4. transition rewards,
5. a policy, and
6. a performance metric.
We assume that the system is modeled by an abstract stochastic process called a Markov
chain.
RL is generally used to solve the so-called Markov decision problem (MDP).
The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI).
The Big Picture
Your action influences the state of the world, which in turn determines your reward
Some Complications
• The outcome of your actions may be uncertain
• You may not be able to perfectly sense the state of the world
• The reward may be stochastic.
• Reward is delayed (e.g. finding food in a maze)
• You may have no clue (model) about how the world responds to your actions.
• You may have no clue (model) of how rewards are being paid off.
• The world may change while you try to learn it.
• How much time do you need to explore uncharted territory before you exploit what you have learned?
• For large-scale systems with millions of states, it is impractical to store all these values. This is called the curse of
dimensionality. DP breaks down on problems that suffer from this curse because it needs all these
values.
The Task
• To learn an optimal policy that maps states of the world to actions of the
agent.
E.g., if this patch of the room is dirty, I clean it; if my battery is empty, I recharge it.
• What is it that the agent tries to optimize?
Answer: the total future discounted reward:
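In the standard formulation, with discount factor 0 ≤ γ < 1, the total future discounted reward collected from time t onwards is

V^π(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{i ≥ 0} γ^i · r_{t+i}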
Note: immediate reward is worth more than future reward.
What would happen to a mouse in a maze with gamma = 0? It would act greedily, caring only about immediate reward.
Value Function
• Let’s say we have access to the optimal value function that computes the
total future discounted reward
• What would be the optimal policy?
• Answer: we choose the action that maximizes the immediate reward plus the discounted value of the next state (written out below):
• We assume that we know what the reward will be if we perform action “a” in
state “s”:
• We also assume we know what the next state of the world will be if we perform
action “a” in state “s”:
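Putting the pieces together in standard notation (a sketch of the relation described here, assuming the deterministic setting used in these slides, with immediate reward r(s, a) and next state δ(s, a)):

π*(s) = argmax_a [ r(s, a) + γ · V*(δ(s, a)) ]

i.e. the optimal action is the one whose immediate reward plus discounted optimal value of the successor state is largest.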
Example I
• Consider some complicated graph in which we would like to find the shortest
path from a node Si to a goal node G.
• Traversing an edge will cost you “edge length” dollars.
• The value function encodes the total remaining
distance to the goal node from any node s, i.e.
V(s) = “1 / distance” to goal from s.
• If you know V(s), the problem is trivial. You simply
choose the node that has highest V(s).
(Figure: example graph with start node Si and goal node G.)
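A minimal Python sketch of this idea, on a hypothetical toy graph (node names and edge lengths are invented for illustration). The remaining-distance table plays the role of the value function, and the greedy step picks the Bellman-consistent neighbour (edge length plus remaining distance):

```python
import heapq

# Hypothetical undirected graph: each entry lists (neighbour, edge_length).
graph = {
    "Si": [("A", 4), ("B", 1)],
    "A":  [("Si", 4), ("B", 1), ("G", 1)],
    "B":  [("Si", 1), ("A", 1), ("G", 6)],
    "G":  [("A", 1), ("B", 6)],
}

def distances_to_goal(graph, goal):
    """Dijkstra from the goal: dist[s] = shortest remaining distance from s to the goal."""
    dist = {s: float("inf") for s in graph}
    dist[goal] = 0.0
    pq = [(0.0, goal)]
    while pq:
        d, s = heapq.heappop(pq)
        if d > dist[s]:
            continue
        for nbr, length in graph[s]:
            if d + length < dist[nbr]:
                dist[nbr] = d + length
                heapq.heappush(pq, (d + length, nbr))
    return dist

dist = distances_to_goal(graph, "G")
# The slide's V(s): larger is better the closer you are to the goal.
V = {s: (float("inf") if d == 0 else 1.0 / d) for s, d in dist.items()}

def greedy_path(graph, dist, start, goal):
    """From each node, step to the neighbour minimizing edge length + remaining distance."""
    path, s = [start], start
    while s != goal:
        s = min(graph[s], key=lambda e: e[1] + dist[e[0]])[0]
        path.append(s)
    return path

print("V:", {s: round(v, 3) for s, v in V.items()})
print("greedy path:", greedy_path(graph, dist, "Si", "G"))  # e.g. ['Si', 'B', 'A', 'G']
```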
Example II
Find your way to the goal.
Q-Function
• One approach to RL is then to try to estimate V*(s).
• However, this approach requires you to know r(s,a) and delta(s,a).
• This is unrealistic in many real problems. What is the reward if a robot is exploring mars
and decides to take a right turn?
• Fortunately we can circumvent this problem by exploring and experiencing how the world
reacts to our actions. We need to learn r & delta.
• We want a function that directly learns good state-action pairs, i.e. what action should I
take in this state. We call this Q(s,a).
• Given Q(s,a) it is now trivial to execute the optimal policy, without knowing
r(s,a) and delta(s,a). We have:
Bellman Equation:
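In standard notation (for the deterministic setting used in these slides, where δ(s, a) is the next state):

Q(s, a) = r(s, a) + γ · max_{a′} Q(δ(s, a), a′)        (Bellman equation for Q)

π*(s) = argmax_a Q(s, a)        (the optimal policy needs only Q, not r or δ)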
Example II (continued)
Check that the values shown in the figure are consistent with the Bellman equation above.
Q-Learning
• This still depends on r(s,a) and delta(s,a).
• However, imagine the robot is exploring its environment, trying new actions as
it goes.
• At every step it receives some reward “r”, and it observes the environment
change into a new state s’ for action a. How can we use these observations,
(s,a,s’,r) to learn a model?
(Here s′ denotes the next state s_{t+1}, observed after taking action a in state s.)
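The rule the next slide discusses is the standard one-step Q-learning update. In the deterministic form used here it reads

Q̂(s, a) ← r + γ · max_{a′} Q̂(s′, a′)

and, more generally, with a learning rate α,

Q̂(s, a) ← Q̂(s, a) + α · [ r + γ · max_{a′} Q̂(s′, a′) − Q̂(s, a) ]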
Q-Learning
• This equation continually estimates Q at state s consistent with an estimate
of Q at state s’, one step in the future: temporal difference (TD) learning.
• Note that s’ is closer to the goal, and hence more “reliable”, but still an estimate itself.
• Updating estimates based on other estimates is called bootstrapping.
• We do an update after each state-action pair. I.e., we are learning online!
• We are learning useful things about explored state-action pairs. These are typically most
useful because they are likely to be encountered again.
• Under suitable conditions, these updates can actually be proved to converge to the real
answer.
Example Q-Learning
Q-learning propagates Q-estimates 1-step backwards
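A minimal tabular Q-learning sketch in Python, on a hypothetical one-dimensional corridor environment made up for illustration (states 0–4; reaching state 4 pays reward 1 and ends the episode). It applies exactly the one-step update above, with ε-greedy exploration:

```python
import random
from collections import defaultdict

# Hypothetical toy environment: a corridor of 5 states. Actions: 0 = left, 1 = right.
N_STATES, GOAL, ACTIONS = 5, 4, (0, 1)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

Q = defaultdict(float)               # Q[(state, action)], implicitly 0 everywhere
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    s = 0
    for _ in range(100):             # per-episode step limit, just as a safety net
        # epsilon-greedy action selection (ties broken at random)
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            best = max(Q[(s, b)] for b in ACTIONS)
            a = random.choice([b for b in ACTIONS if Q[(s, b)] == best])
        s2, r, done = step(s, a)
        # one-step Q-learning (temporal-difference) update
        target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
        if done:
            break

# State values should grow toward the goal, roughly gamma**(steps_to_goal - 1).
print([round(max(Q[(s, b)] for b in ACTIONS), 3) for s in range(N_STATES)])
```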
Exploration / Exploitation
• It is very important that the agent does not simply follow the current policy
when learning Q (off-policy learning). The reason is that you may get stuck
in a suboptimal solution, i.e. there may be other solutions out there that you
have never seen.
• Hence it is good to try new things now and then, e.g. using a temperature parameter T: if T is large there is lots of
exploration; if T is small, mostly follow the current policy. One can decrease T over time.
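One common way to realize this temperature-controlled trade-off is Boltzmann (softmax) exploration; the slide only names T, so the concrete rule below is an assumed, standard choice rather than necessarily the one used in the deck:

```python
import math, random

def softmax_action(q_values, T):
    """Pick an action with probability proportional to exp(Q/T).

    Large T -> nearly uniform choice (lots of exploration);
    small T -> nearly greedy choice (follow the current policy).
    """
    prefs = [math.exp(q / T) for q in q_values]
    total = sum(prefs)
    r, acc = random.random() * total, 0.0
    for a, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return a
    return len(prefs) - 1

# Decreasing T over time shifts the agent from exploration to exploitation.
q = [0.2, 0.5, 0.1]
for T in (10.0, 1.0, 0.1):
    print(T, [softmax_action(q, T) for _ in range(10)])
```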
Improvements
• One can trade off memory and computation by caching (s, a, s’, r) for observed
transitions. After a while, as Q(s’,a’) has changed, you can “replay” the update (see the sketch below):
• One can actively search for state-action pairs for which Q(s,a) is expected to
change a lot (prioritized sweeping).
• One can do updates along the sampled path much further back than just
one step (TD(λ) learning).
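A minimal sketch of the caching / replay idea, using hypothetical helper names (not from the slides): transitions are stored once, and the one-step Q-learning update is re-applied to random samples later, so improvements in Q(s′, a′) propagate backwards without new interaction with the world.

```python
import random
from collections import defaultdict, deque

replay = deque(maxlen=10_000)        # cached (s, a, r, s2, done) transitions

def remember(s, a, r, s2, done):
    replay.append((s, a, r, s2, done))

def replay_updates(Q, actions, n=32, alpha=0.5, gamma=0.9):
    """Re-apply the one-step Q-learning update to randomly sampled stored transitions."""
    for s, a, r, s2, done in random.sample(replay, min(n, len(replay))):
        target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
        Q[(s, a)] += alpha * (target - Q[(s, a)])

# Tiny usage demo with made-up transitions:
Q = defaultdict(float)
remember(0, 1, 0.0, 1, False)
remember(1, 1, 1.0, 2, True)
for _ in range(5):
    replay_updates(Q, actions=(0, 1))
print(dict(Q))
```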
Extensions
• To deal with stochastic environments, we need to maximize the
expected future discounted reward.
• Often the state space is too large to deal with all states. In this case we
need to learn a function that approximates the value (both points are sketched in notation below).
• Neural networks trained with back-propagation have been quite successful.
• For instance, TD-Gammon is a backgammon program that plays at expert
level: its state space is very large, it was trained by playing against itself, it uses a neural network to approximate
the value function, and it uses TD(λ) for learning.
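In standard notation, the two points above read roughly as follows (a sketch, not the deck's exact formulas). In a stochastic environment one maximizes the expected discounted return,

maximize  E[ Σ_{t ≥ 0} γ^t · r_t ],   with   Q(s, a) = E[ r | s, a ] + γ · Σ_{s′} P(s′ | s, a) · max_{a′} Q(s′, a′),

and when the state space is too large for a table one learns a parameterized approximation

Q(s, a) ≈ Q_θ(s, a),

for example a neural network with weights θ trained by back-propagation, as in TD-Gammon.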
Real Life Examples Using RL
Deep RL - Deep Q Learning
Tools and EcoSystem
● OpenAI Gym – Python-based, rich AI simulation environment
● TensorFlow – TensorLayer provides popular RL modules
● PyTorch – open-sourced by Facebook
● Keras – runs on top of TensorFlow
● DeepMind Lab – Google's 3D platform
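A minimal OpenAI Gym sketch, assuming the classic `gym` API (`reset()` returning an observation and `step()` returning a 4-tuple; newer `gymnasium` releases return slightly different tuples): a random agent on the standard CartPole-v1 task, just to show the interaction loop an RL algorithm plugs into.

```python
import gym

env = gym.make("CartPole-v1")
for episode in range(3):
    obs = env.reset()                       # initial observation
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # random policy; an RL agent would choose here
        obs, reward, done, info = env.step(action)
        total_reward += reward
    print(f"episode {episode}: return = {total_reward}")
env.close()
```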
Small Demo – on YouTube
Application Areas of RL
- Resource management in computer clusters
- Traffic Light Control
- Robotics
- Personalized Recommendations
- Chemistry
- Bidding and Advertising
- Games
What do you need to know to apply RL?
- Understanding your problem
- A simulated environment
- MDP
- Algorithms
Learning how to run? – a good example by deepsense.ai
Conclusion
• Reinforcement learning addresses a very broad and relevant question: How can we learn
to survive in our environment?
• We have looked at Q-learning, which simply learns from experience. No model of the world
is needed.
• We made simplifying assumptions: e.g., the state of the world depends only on the last state and
action. This is the Markov assumption. The model is called a Markov Decision Process
(MDP).
• We assumed deterministic dynamics and a deterministic reward function, but the real world is stochastic.
• There are many extensions to speed up learning.
• There have been many successful real world applications.
References
• Thanks to Google and the AI research community
• https://www.slideshare.net/AnirbanSantara/an-introduction-to-reinforcement-learning-the-doors-to-agi - Good Intro PPT
• https://skymind.ai/wiki/deep-reinforcement-learning - Good blog
• https://medium.com/@BonsaiAI/why-reinforcement-learning-might-be-the-best-ai-technique-for-complex-industrial-systems-fde8b0ebd5fb
• https://deepsense.ai/learning-to-run-an-example-of-reinforcement-learning/
• https://en.wikipedia.org/wiki/Reinforcement_learning
• https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT which I have adopted in whole
• https://people.eecs.berkeley.edu/~jordan/MLShortCourse/reinforcement-learning.ppt
• https://www.cse.iitm.ac.in/~ravi/courses/Reinforcement%20Learning.html - Detailed Course
• https://web.mst.edu/~gosavia/tutorial.pdf - short Tutorial
• Book – Sutton, R. S., and Barto, A. G.: Reinforcement Learning: An Introduction.
• https://www.techleer.com/articles/203-machine-learning-algorithm-backbone-of-emerging-technologies/
• http://ruder.io/transfer-learning/
• Exploration and Exploitation - https://artint.info/html/ArtInt_266.html ,
http://www.cs.cmu.edu/~rsalakhu/10703/Lecture_Exploration.pdf
• https://hub.packtpub.com/tools-for-reinforcement-learning/
• http://adventuresinmachinelearning.com/reinforcement-learning-tensorflow/
• https://towardsdatascience.com/applications-of-reinforcement-learning-in-real-world-1a94955bcd12
Acknowledgements
- Anirban Santara - For his Intro Session on IDLI and support
- Prof. Ravindran - Inspiration
- Palash Arora from Capillary, Ashwin Krishna.
- Techgig Team for the opportunity
- Sumandeep and Biswa from Capillary for review and discussions.
- https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT
which I have adopted in whole
Thank You!!