Reinforcement Learning
Dr. Subrat Panda
Head of AI and Data Sciences,
Capillary Technologies
About me
● BTech (2002), PhD (2009) – CSE, IIT Kharagpur
● Synopsys (EDA), IBM (CPU), NVIDIA (GPU), Taro (Full Stack Engineer), Capillary (Principal
Architect - AI)
● Applying AI to Retail
● Co-Founded IDLI (for social good) with Prof. Amit Sethi (IIT Bombay), Jacob Minz (Synopsys)
and Biswa Gourav Singh (Capillary)
● https://www.facebook.com/groups/idliai/
● LinkedIn - https://www.linkedin.com/in/subratpanda/
● Facebook - https://www.facebook.com/subratpanda
● Twitter - @subratpanda Email - subrat.panda@capillarytech.com
Overview
• Supervised Learning: Immediate feedback (labels provided for every input).
• Unsupervised Learning: No feedback (No labels provided).
• Reinforcement Learning: Delayed scalar feedback (a number called reward).
• RL deals with agents that must sense & act upon their environment. This combines
classical AI and machine learning techniques. It is probably the most comprehensive problem
setting.
• Examples:
• Robot-soccer
• Share Investing
• Learning to walk/fly/ride a vehicle
• Scheduling
• AlphaGo / Super Mario
Machine Learning – http://techleer.com
Some Definitions and assumptions
MDP - Markov Decision Problem
The framework of the MDP has the following elements:
1. state of the system,
2. actions,
3. transition probabilities,
4. transition rewards,
5. a policy, and
6. a performance metric.
We assume that the system is modeled by an abstract stochastic process known as a Markov
chain.
RL is generally used to solve the so-called Markov decision problem (MDP).
The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI).
The Big Picture
Your actions influence the state of the world, which in turn determines your reward.
Some Complications
• The outcome of your actions may be uncertain
• You may not be able to perfectly sense the state of the world
• The reward may be stochastic.
• Reward is delayed (e.g. finding food at the end of a maze)
• You may have no clue (model) about how the world responds to your actions.
• You may have no clue (model) of how rewards are being paid off.
• The world may change while you try to learn it.
• How much time do you need to explore uncharted territory before you exploit what you have learned?
• For large-scale systems with millions of states, it is impractical to store all these values. This is called the curse of
dimensionality. DP breaks down on such problems because it needs all these values.
The Task
• To learn an optimal policy that maps states of the world to actions of the
agent.
e.g., if this patch of room is dirty, I clean it; if my battery is empty, I recharge it.
• What is it that the agent tries to optimize?
Answer: the total future discounted reward:
V(s_t) = r_t + γ·r_{t+1} + γ²·r_{t+2} + … = Σ_{i≥0} γ^i·r_{t+i}, with 0 ≤ γ < 1
Note: immediate reward is worth more than future reward.
What would happen to a mouse in a maze with γ = 0? It becomes greedy: only the immediate reward counts.
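As a minimal illustration, the discounted return can be computed by folding the reward sequence backwards (a hypothetical helper, not from the slides):

```python
def discounted_return(rewards, gamma):
    """Total future discounted reward: r0 + gamma*r1 + gamma^2*r2 + ..."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# With gamma = 0 only the immediate reward counts (the greedy mouse);
# with gamma close to 1, future rewards matter almost as much as immediate ones.
print(discounted_return([1, 1, 1], 0.0))  # 1.0
print(discounted_return([1, 1, 1], 0.9))  # 1 + 0.9 + 0.81 = 2.71
```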
Value Function
• Let’s say we have access to the optimal value function V*(s) that computes the
total future discounted reward.
• What would be the optimal policy?
• Answer: we choose the action that maximizes: π*(s) = argmax_a [ r(s,a) + γ·V*(delta(s,a)) ]
• We assume that we know the reward r(s,a) we will receive if we perform action “a” in
state “s”.
• We also assume we know the next state of the world, delta(s,a), if we perform
action “a” in state “s”.
Example I
• Consider some complicated graph in which we would like to find the shortest
path from a node Si to a goal node G.
• Traversing an edge will cost you “edge length” dollars.
• The value function encodes the total remaining
distance to the goal node from any node s, i.e.
V(s) = “1 / distance” to the goal from s.
• If you know V(s), the problem is trivial: you simply
move to the neighbouring node with the highest V(s).
(Figure: a graph with start node Si and goal node G.)
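A sketch of this example (graph data and helper names are mine): Dijkstra run backwards from the goal plays the role of computing the value function, and the greedy move minimises edge cost plus remaining distance, which with unit-cost edges is exactly "go to the neighbour with the highest V(s)".

```python
import heapq

def distance_to_goal(graph, goal):
    """Remaining shortest distance to the goal from every node
    (Dijkstra from the goal; graph[u] = list of (neighbour, edge_length),
    with symmetric entries for an undirected graph)."""
    dist = {goal: 0.0}
    pq = [(0.0, goal)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

def greedy_step(graph, dist, s):
    """Greedy move: minimise edge cost plus remaining distance to the goal."""
    return min(graph[s], key=lambda vw: vw[1] + dist[vw[0]])[0]

graph = {
    "Si": [("A", 2), ("B", 5)],
    "A":  [("Si", 2), ("G", 3)],
    "B":  [("Si", 5), ("G", 1)],
    "G":  [("A", 3), ("B", 1)],
}
dist = distance_to_goal(graph, "G")
# V("Si") = 1 / dist["Si"] = 1/5; the greedy path is Si -> A -> G
```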
Example II
Find your way to the goal.
Q-Function
• One approach to RL is then to try to estimate V*(s).
• However, this approach requires you to know r(s,a) and delta(s,a).
• This is unrealistic in many real problems. What is the reward if a robot exploring Mars
decides to take a right turn?
• Fortunately, we can circumvent this problem by exploring and experiencing how the world
reacts to our actions, without needing to learn r and delta.
• We want a function that directly learns good state-action pairs, i.e. what action should I
take in this state. We call this Q(s,a).
• Given Q(s,a) it is now trivial to execute the optimal policy, without knowing
r(s,a) and delta(s,a). We have: π*(s) = argmax_a Q(s,a) and V*(s) = max_a Q(s,a).
Bellman Equation: Q(s,a) = r(s,a) + γ·max_a' Q(delta(s,a), a')
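In code, executing the policy from a learned Q-table really is trivial (a sketch with a dict-based table; the names are illustrative):

```python
def policy_from_q(Q, actions, s):
    """pi*(s) = argmax_a Q(s,a): no r(s,a) or delta(s,a) required."""
    return max(actions, key=lambda a: Q[(s, a)])

Q = {("s0", "left"): 0.2, ("s0", "right"): 0.9}
policy_from_q(Q, ["left", "right"], "s0")  # "right"
```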
Example II
Check that V*(s) = max_a Q(s,a) holds at every state.
Q-Learning
• This still depends on r(s,a) and delta(s,a).
• However, imagine the robot is exploring its environment, trying new actions as
it goes.
• At every step it receives some reward “r” and observes the environment
change into a new state s’ for action a. How can we use these observations
(s, a, s’, r) to learn a model?
Q(s,a) ← r + γ·max_a' Q(s’, a’), where s’ = s_{t+1}
Q-Learning
• This equation continually estimates Q at state s consistent with an estimate
of Q at state s’, one step in the future: temporal difference (TD) learning.
• Note that s’ is closer to goal, and hence more “reliable”, but still an estimate itself.
• Updating estimates based on other estimates is called bootstrapping.
• We do an update after each state-action pair. I.e., we are learning online!
• We are learning useful things about explored state-action pairs. These are typically most
useful because they are likely to be encountered again.
• Under suitable conditions, these updates can actually be proved to converge to the real
answer.
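Putting the update rule together with ε-greedy exploration gives a complete tabular Q-learning loop. The toy corridor environment and all names below are illustrative, not from the slides:

```python
import random
from collections import defaultdict

random.seed(0)

class Chain:
    """Toy 1-D corridor: states 0..4, reward 1 on reaching the goal state 4."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                       # a is -1 (left) or +1 (right)
        self.s = min(4, max(0, self.s + a))
        done = self.s == 4
        return self.s, (1.0 if done else 0.0), done

def greedy(Q, s, actions):
    """Greedy action with random tie-breaking."""
    best = max(Q[(s, a)] for a in actions)
    return random.choice([a for a in actions if Q[(s, a)] == best])

def q_learning(env, actions, episodes=500, gamma=0.9, alpha=0.5, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability eps, else exploit
            a = random.choice(actions) if random.random() < eps else greedy(Q, s, actions)
            s2, r, done = env.step(a)
            # TD update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            target = r + (0.0 if done else gamma * max(Q[(s2, a2)] for a2 in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learning(Chain(), actions=[-1, +1])
# The learned greedy action in every state points toward the goal; the estimate
# propagates backwards one step at a time, so Q(3,+1) ~ 1 and Q(0,+1) ~ 0.9^3.
```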
Example Q-Learning
Q-learning propagates Q-estimates 1-step backwards
Exploration / Exploitation
• It is very important that the agent does not simply follow the current policy
when learning Q (off-policy learning). The reason is that you may get stuck
in a suboptimal solution, i.e. there may be other solutions out there that you
have never seen.
• Hence it is good to try new things every now and then, e.g. by picking actions
according to a temperature T: if T is large, lots of exploring; if T is small, follow
the current policy. One can decrease T over time.
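The temperature-based scheme referred to here is Boltzmann (softmax) exploration; a minimal sketch (helper names are mine):

```python
import math
import random

def boltzmann_probs(q_values, T):
    """Softmax over Q-values: high T -> near-uniform (explore),
    low T -> near-greedy (exploit)."""
    m = max(q_values)                        # subtract max for numerical stability
    prefs = [math.exp((q - m) / T) for q in q_values]
    z = sum(prefs)
    return [p / z for p in prefs]

def boltzmann_action(q_values, T):
    """Sample an action index according to the Boltzmann distribution."""
    return random.choices(range(len(q_values)), weights=boltzmann_probs(q_values, T))[0]
```

With Q-values [1, 0], T = 100 gives roughly a 50/50 split between the actions, while T = 0.01 picks the better action almost surely; annealing T from large to small shifts the agent from exploring to exploiting.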
Improvements
• One can trade off memory and computation by caching observed transitions
(s, a, s’, r). After a while, as Q(s’,a’) has changed, you can “replay” the update.
• One can actively search for state-action pairs for which Q(s,a) is expected to
change a lot (prioritized sweeping).
• One can do updates along the sampled path much further back than just
one step (TD(λ) learning).
Extensions
• To deal with stochastic environments, we need to maximize the
expected future discounted reward: E[ Σ_{i≥0} γ^i·r_{t+i} ]
• Often the state space is too large to deal with all states. In this case we
need to learn a function approximation: Q(s,a) ≈ f_θ(s,a)
• Neural networks trained with back-propagation have been quite successful.
• For instance, TD-Gammon is a backgammon program that plays at expert
level: its state space is very large, it was trained by playing against itself, it uses a
neural network to approximate the value function, and it uses TD(λ) for learning.
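When the transition probabilities and rewards are known, maximizing the expected discounted reward can be solved directly with value iteration, the DP method RL builds on. A minimal sketch with hypothetical data structures:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[s][a] = list of (prob, next_state); R[s][a] = expected immediate reward.
    Iterate V(s) <- max_a [ R[s][a] + gamma * sum_s' P(s'|s,a) * V(s') ]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in P[s])
            delta, V[s] = max(delta, abs(v - V[s])), v
        if delta < tol:
            return V

# Two-state example: from s0, action "go" reaches absorbing state s1 with reward 1.
P = {"s0": {"go": [(1.0, "s1")]}, "s1": {"stay": [(1.0, "s1")]}}
R = {"s0": {"go": 1.0}, "s1": {"stay": 0.0}}
V = value_iteration(P, R)   # V["s0"] = 1.0, V["s1"] = 0.0
```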
Real Life Examples Using RL
Deep RL - Deep Q Learning
Tools and EcoSystem
● OpenAI Gym - Python-based rich AI simulation environment
● TensorFlow - with TensorLayer providing popular RL modules
● PyTorch - open-sourced by Facebook
● Keras - on top of TensorFlow
● DeepMind Lab - Google’s 3D learning platform
Small Demo - on YouTube
Application Areas of RL
- Resource management in computer clusters
- Traffic Light Control
- Robotics
- Personalized Recommendations
- Chemistry
- Bidding and Advertising
- Games
What do you need to know to apply RL?
- Understanding your problem
- A simulated environment
- MDP
- Algorithms
Learning how to run - a good example by deepsense.ai
Conclusion
• Reinforcement learning addresses a very broad and relevant question: How can we learn
to survive in our environment?
• We have looked at Q-learning, which simply learns from experience. No model of the world
is needed.
• We made simplifying assumptions: e.g. state of the world only depends on last state and
action. This is the Markov assumption. The model is called a Markov Decision Process
(MDP).
• We assumed deterministic dynamics and a deterministic reward function, but the real world is stochastic.
• There are many extensions to speed up learning.
• There have been many successful real world applications.
References
• Thanks to Google and the AI research community
• https://www.slideshare.net/AnirbanSantara/an-introduction-to-reinforcement-learning-the-doors-to-agi - Good Intro PPT
• https://skymind.ai/wiki/deep-reinforcement-learning - Good blog
• https://medium.com/@BonsaiAI/why-reinforcement-learning-might-be-the-best-ai-technique-for-complex-industrial-systems-fde8b0ebd5fb
• https://deepsense.ai/learning-to-run-an-example-of-reinforcement-learning/
• https://en.wikipedia.org/wiki/Reinforcement_learning
• https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT which I have adopted in whole
• https://people.eecs.berkeley.edu/~jordan/MLShortCourse/reinforcement-learning.ppt
• https://www.cse.iitm.ac.in/~ravi/courses/Reinforcement%20Learning.html - Detailed Course
• https://web.mst.edu/~gosavia/tutorial.pdf - short Tutorial
• Book - Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
• https://www.techleer.com/articles/203-machine-learning-algorithm-backbone-of-emerging-technologies/
• http://ruder.io/transfer-learning/
• Exploration and Exploitation - https://artint.info/html/ArtInt_266.html ,
http://www.cs.cmu.edu/~rsalakhu/10703/Lecture_Exploration.pdf
• https://hub.packtpub.com/tools-for-reinforcement-learning/
• http://adventuresinmachinelearning.com/reinforcement-learning-tensorflow/
• https://towardsdatascience.com/applications-of-reinforcement-learning-in-real-world-1a94955bcd12
Acknowledgements
- Anirban Santara - For his Intro Session on IDLI and support
- Prof. Ravindran - Inspiration
- Palash Arora from Capillary, Ashwin Krishna.
- Techgig Team for the opportunity
- Sumandeep and Biswa from Capillary for review and discussions.
- https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT
which I have adopted in whole
Thank You!!

GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 

An introduction to reinforcement learning

  • 1. Reinforcement Learning Dr. Subrat Panda Head of AI and Data Sciences, Capillary Technologies
  • 2. Introduction about me ● BTech (2002), PhD (2009) – CSE, IIT Kharagpur ● Synopsys (EDA), IBM (CPU), NVIDIA (GPU), Taro (Full Stack Engineer), Capillary (Principal Architect - AI) ● Applying AI to Retail ● Co-Founded IDLI (for social good) with Prof. Amit Sethi (IIT Bombay), Jacob Minz (Synopsys) and Biswa Gourav Singh (Capillary) ● https://www.facebook.com/groups/idliai/ ● LinkedIn - https://www.linkedin.com/in/subratpanda/ ● Facebook - https://www.facebook.com/subratpanda ● Twitter - @subratpanda ● Email - subrat.panda@capillarytech.com
  • 3. Overview • Supervised Learning: Immediate feedback (labels provided for every input). • Unsupervised Learning: No feedback (No labels provided). • Reinforcement Learning: Delayed scalar feedback (a number called reward). • RL deals with agents that must sense & act upon their environment. This combines classical AI and machine learning techniques. It is probably the most comprehensive problem setting. • Examples: • Robot-soccer • Share Investing • Learning to walk/fly/ride a vehicle • Scheduling • AlphaGo / Super Mario
  • 4. Machine Learning – http://techleer.com
  • 5.
  • 6. Some Definitions and Assumptions MDP - Markov Decision Problem. The MDP framework has the following elements: 1. state of the system, 2. actions, 3. transition probabilities, 4. transition rewards, 5. a policy, and 6. a performance metric. We assume the system is modeled by an abstract stochastic process called a Markov chain. RL is generally used to solve the so-called Markov decision problem (MDP). The theory of RL relies on dynamic programming (DP) and artificial intelligence (AI).
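As a concrete illustration, the six MDP elements listed above can be written out for a tiny two-state system. This is a minimal sketch; the states, actions, and numbers here are invented for illustration and do not come from the slides:

```python
# 1. States and 2. actions of a tiny two-state system.
states = ["s0", "s1"]
actions = ["stay", "move"]

# 3. Transition probabilities: P[s][a] is a list of (next_state, probability).
P = {
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 0.9), ("s0", 0.1)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}

# 4. Transition rewards: R[s][a] is the reward for taking action a in state s.
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.0, "move": 0.0}}

# 5. A (deterministic) policy maps each state to an action.
policy = {"s0": "move", "s1": "stay"}

# 6. The performance metric (e.g. expected discounted reward) is what the
#    learner optimizes; it is computed from the elements above.

# Sanity check: outgoing probabilities sum to 1 for every (state, action).
for s in states:
    for a in actions:
        assert abs(sum(p for _, p in P[s][a]) - 1.0) < 1e-9
```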
  • 7. The Big Picture Your action influences the state of the world which determines its reward
  • 8. Some Complications • The outcome of your actions may be uncertain • You may not be able to perfectly sense the state of the world • The reward may be stochastic. • Reward is delayed (e.g., finding food in a maze) • You may have no clue (model) about how the world responds to your actions. • You may have no clue (model) of how rewards are being paid off. • The world may change while you try to learn it. • How much time do you need to explore uncharted territory before you exploit what you have learned? • For large-scale systems with millions of states, it is impractical to store these values. This is called the curse of dimensionality. DP breaks down on problems which suffer from any one of these curses because it needs all these values.
  • 9. The Task • To learn an optimal policy that maps states of the world to actions of the agent, e.g., if this patch of room is dirty, I clean it; if my battery is empty, I recharge it. • What is it that the agent tries to optimize? Answer: the total future discounted reward, r_t + γ·r_{t+1} + γ²·r_{t+2} + … Note: immediate reward is worth more than future reward. What would happen to a mouse in a maze with gamma = 0? It becomes greedy, chasing only the immediate reward.
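The total future discounted reward above can be sketched in a few lines of Python; the function name and the reward sequence are illustrative:

```python
def discounted_return(rewards, gamma):
    """Total future discounted reward: sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 2.0, 4.0]
# gamma = 0 keeps only the immediate reward -- the greedy mouse.
assert discounted_return(rewards, 0.0) == 1.0
# gamma = 0.5 discounts the future: 1 + 0.5*2 + 0.25*4 = 3.0
assert discounted_return(rewards, 0.5) == 3.0
```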
  • 10. Value Function • Let’s say we have access to the optimal value function that computes the total future discounted reward • What would be the optimal policy ? • Answer: we choose the action that maximizes: • We assume that we know what the reward will be if we perform action “a” in state “s”: • We also assume we know what the next state of the world will be if we perform action “a” in state “s”:
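The greedy rule above — choose the action maximizing immediate reward plus discounted value of the next state — can be sketched as follows, assuming r(s,a) and delta(s,a) are known. All names and values are illustrative:

```python
def greedy_action(s, actions, r, delta, V, gamma=0.9):
    # Pick the action a maximizing r(s, a) + gamma * V(delta(s, a)).
    return max(actions, key=lambda a: r(s, a) + gamma * V[delta(s, a)])

# Tiny two-state example: from "s0", "go" leads to the valuable state "s1".
V = {"s0": 0.0, "s1": 10.0}                       # assumed optimal values
r = lambda s, a: 1.0 if a == "go" else 0.0        # known reward function
delta = lambda s, a: "s1" if a == "go" else "s0"  # known dynamics
best = greedy_action("s0", ["go", "stay"], r, delta, V)
```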
  • 11. Example I • Consider some complicated graph in which we would like to find the shortest path from a node Si to a goal node G. • Traversing an edge costs you its length in dollars. • The value function encodes the total remaining distance to the goal node from any node s, i.e., V(s) = “1 / distance” to the goal from s. • If you know V(s), the problem is trivial: you simply choose the neighboring node that has the highest V(s).
  • 12. Example II Find your way to the goal.
  • 13. Q-Function • One approach to RL is then to try to estimate V*(s). • However, this approach requires you to know r(s,a) and delta(s,a). • This is unrealistic in many real problems: what is the reward if a robot exploring Mars decides to take a right turn? • Fortunately, we can circumvent this problem by exploring and experiencing how the world reacts to our actions; we need to learn r & delta. • We want a function that directly learns good state-action pairs, i.e., what action I should take in this state. We call this Q(s,a). • Given Q(s,a) it is now trivial to execute the optimal policy without knowing r(s,a) and delta(s,a): π*(s) = argmax_a Q(s,a) and V*(s) = max_a Q(s,a). Bellman equation: Q(s,a) = r(s,a) + γ · max_a’ Q(delta(s,a), a’).
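To see why Q makes acting model-free, here is a minimal sketch: once Q(s,a) is stored (a plain dict here, purely illustrative), both the optimal action and V*(s) fall out without ever touching r(s,a) or delta(s,a):

```python
def policy_from_q(Q, s, actions):
    # Greedy policy: argmax over a of Q(s, a) -- no model of r or delta needed.
    return max(actions, key=lambda a: Q[(s, a)])

def value_from_q(Q, s, actions):
    # Recover the optimal value via V*(s) = max over a of Q(s, a).
    return max(Q[(s, a)] for a in actions)

# Illustrative learned Q-values for one state.
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
```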
  • 15. Q-Learning • This still depends on r(s,a) and delta(s,a). • However, imagine the robot is exploring its environment, trying new actions as it goes. • At every step it receives some reward “r” and observes the environment change into a new state s’ (s’ = s_{t+1}) after action a. How can we use these observations (s, a, s’, r) to learn a model?
  • 16. Q-Learning • The update Q(s,a) ← r + γ · max_a’ Q(s’, a’) continually estimates Q at state s consistent with an estimate of Q at state s’, one step in the future: temporal difference (TD) learning. • Note that s’ is closer to the goal, and hence more “reliable”, but still an estimate itself. • Updating estimates based on other estimates is called bootstrapping. • We do an update after each state-action pair, i.e., we are learning online! • We are learning useful things about explored state-action pairs. These are typically most useful because they are likely to be encountered again. • Under suitable conditions, these updates can actually be proved to converge to the real answer.
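The TD update described above can be sketched as follows, with a learning rate alpha added as used in the general stochastic setting. The toy chain and all names are illustrative:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    # TD update: move Q(s, a) toward the bootstrapped target
    # r + gamma * max over a' of Q(s', a').
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

# Toy chain s0 -> s1 -> goal, reward 1 only on the final step. Each sweep
# of online updates propagates value one more step back from the goal.
Q = defaultdict(float)
actions = ["fwd"]
for _ in range(2):
    q_learning_update(Q, "s1", "fwd", 1.0, "goal", actions)
    q_learning_update(Q, "s0", "fwd", 0.0, "s1", actions)
```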
  • 17. Example Q-Learning Q-learning propagates Q-estimates 1-step backwards
  • 18. Exploration / Exploitation • It is very important that the agent does not simply follow the current policy when learning Q (off-policy learning). The reason is that you may get stuck in a suboptimal solution, i.e., there may be other solutions out there that you have never seen. • Hence it is good to try new things every now and then, e.g., with a softmax over Q-values at temperature T: if T is large there is lots of exploring; if T is small, the current policy is followed. One can decrease T over time.
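The temperature-based exploration above is commonly realized as a Boltzmann (softmax) rule over Q-values. A minimal sketch, with illustrative names and Q-values:

```python
import math
import random

def boltzmann_action(Q, s, actions, T):
    """Sample an action with probability proportional to exp(Q(s, a) / T).
    Large T -> near-uniform (explore); small T -> near-greedy (exploit)."""
    weights = [math.exp(Q[(s, a)] / T) for a in actions]
    return random.choices(actions, weights=weights)[0]

Q = {("s0", "left"): 0.0, ("s0", "right"): 1.0}
# Very small T makes the choice effectively greedy.
greedy_pick = boltzmann_action(Q, "s0", ["left", "right"], T=0.01)
```

In practice T is decayed over time so the agent explores early and exploits later.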
  • 19. Improvements • One can trade off memory and computation by caching (s, a, s’, r) for observed transitions. After a while, as Q(s’,a’) has changed, you can “replay” the update. • One can actively search for state-action pairs for which Q(s,a) is expected to change a lot (prioritized sweeping). • One can do updates along the sampled path much further back than just one step (TD(λ) learning).
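The caching-and-replay idea above can be sketched as a simple experience-replay buffer. The class name, capacity, and uniform sampling are illustrative choices; prioritized sweeping would instead favour pairs whose Q-values are expected to change the most:

```python
import random
from collections import deque

class ReplayBuffer:
    """Cache observed (s, a, s', r) transitions so their Q-updates can be
    'replayed' later, after Q(s', a') has changed."""

    def __init__(self, capacity=10_000):
        # Oldest transitions are evicted automatically once full.
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Uniform sampling over cached transitions.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=2)
buf.add("s0", "a", "s1", 0.0)
buf.add("s1", "a", "s2", 1.0)
buf.add("s2", "a", "goal", 5.0)  # oldest entry is evicted
```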
  • 20. Extensions • To deal with stochastic environments, we need to maximize the expected future discounted reward. • Often the state space is too large to deal with all states. In this case we need to learn a function approximation. • Neural networks with back-propagation have been quite successful. • For instance, TD-Gammon is a backgammon program that plays at expert level. Its state space is very large; it was trained by playing against itself, uses a NN to approximate the value function, and uses TD(lambda) for learning.
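Maximizing the expected discounted return under stochastic transitions means the greedy choice averages over possible next states. A sketch, where the tiny P, R, and V tables are invented for illustration:

```python
def greedy_action_expected(s, actions, P, R, V, gamma=0.9):
    # Expected one-step lookahead: R(s, a) + gamma * sum_s' P(s'|s,a) * V(s').
    def q(a):
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
    return max(actions, key=q)

# "risky" reaches a great or terrible state with equal probability;
# "safe" reaches a mildly good state for sure.
P = {"s0": {"risky": [("win", 0.5), ("lose", 0.5)], "safe": [("ok", 1.0)]}}
R = {"s0": {"risky": 0.0, "safe": 0.0}}
V = {"win": 10.0, "lose": -10.0, "ok": 1.0}
choice = greedy_action_expected("s0", ["risky", "safe"], P, R, V)
```

Here the expected value of "risky" is 0, so the agent prefers "safe" despite the higher possible payoff.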
  • 21. Real Life Examples Using RL
  • 22. Deep RL - Deep Q Learning
  • 23. Tools and EcoSystem ● OpenAI Gym- Python based rich AI simulation environment ● TensorFlow - TensorLayer providing popular RL modules ● Pytorch - Open Sourced by Facebook ● Keras - On top of TensorFlow ● DeepMind Lab - Google 3D platform
  • 24. Small Demo - on Youtube
  • 25. Application Areas of RL - Resources management in computer clusters - Traffic Light Control - Robotics - Personalized Recommendations - Chemistry - Bidding and Advertising - Games
  • 26. What you need to know to apply RL? - Understanding your problem - A simulated environment - MDP - Algorithms
  • 27. Learning how to run? - Good example by deepsense.ai
  • 28. Conclusion • Reinforcement learning addresses a very broad and relevant question: How can we learn to survive in our environment? • We have looked at Q-learning, which simply learns from experience. No model of the world is needed. • We made simplifying assumptions, e.g., that the state of the world only depends on the last state and action. This is the Markov assumption. The model is called a Markov Decision Process (MDP). • We assumed deterministic dynamics and a deterministic reward function, but the real world is stochastic. • There are many extensions to speed up learning. • There have been many successful real-world applications.
  • 29. References • Thanks to Google and the AI research community • https://www.slideshare.net/AnirbanSantara/an-introduction-to-reinforcement-learning-the-doors-to-agi - Good intro PPT • https://skymind.ai/wiki/deep-reinforcement-learning - Good blog • https://medium.com/@BonsaiAI/why-reinforcement-learning-might-be-the-best-ai-technique-for-complex-industrial-systems-fde8b0ebd5fb • https://deepsense.ai/learning-to-run-an-example-of-reinforcement-learning/ • https://en.wikipedia.org/wiki/Reinforcement_learning • https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good intro PPT which I have adopted in whole • https://people.eecs.berkeley.edu/~jordan/MLShortCourse/reinforcement-learning.ppt • https://www.cse.iitm.ac.in/~ravi/courses/Reinforcement%20Learning.html - Detailed course • https://web.mst.edu/~gosavia/tutorial.pdf - Short tutorial • Book - Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto • https://www.techleer.com/articles/203-machine-learning-algorithm-backbone-of-emerging-technologies/ • http://ruder.io/transfer-learning/ • Exploration and Exploitation - https://artint.info/html/ArtInt_266.html , http://www.cs.cmu.edu/~rsalakhu/10703/Lecture_Exploration.pdf • https://hub.packtpub.com/tools-for-reinforcement-learning/ • http://adventuresinmachinelearning.com/reinforcement-learning-tensorflow/ • https://towardsdatascience.com/applications-of-reinforcement-learning-in-real-world-1a94955bcd12
  • 30. Acknowledgements - Anirban Santara - For his Intro Session on IDLI and support - Prof. Ravindran - Inspiration - Palash Arora from Capillary, Ashwin Krishna. - Techgig Team for the opportunity - Sumandeep and Biswa from Capillary for review and discussions. - https://www.ics.uci.edu/~welling/teaching/ICS175winter12/RL.ppt - Good Intro PPT which I have adopted in whole Thank You!!