- 1. Reinforcement Learning. By: Chandra Prakash, IIITM Gwalior
- 2. Outline: Introduction; Elements of reinforcement learning; The reinforcement learning problem; Problem-solving methods for RL
- 3. Introduction. Machine learning: definition. Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to learn from data, such as sensor data or databases. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.
- 4. Machine learning types, with respect to the feedback given to the learner: Supervised learning: task driven (classification). Unsupervised learning: data driven (clustering). Reinforcement learning: closest to human learning; the algorithm learns a policy for how to act in a given environment. Every action has some impact on the environment, and the environment provides rewards that guide the learning algorithm.
- 5. Supervised Learning vs Reinforcement Learning. Supervised learning: Step 1. Teacher: Does picture 1 show a car or a flower? Learner: A flower. Teacher: No, it's a car. Step 2. Teacher: Does picture 2 show a car or a flower? Learner: A car. Teacher: Yes, it's a car. Step 3. ...
- 6. Supervised Learning vs Reinforcement Learning (cont.). Reinforcement learning: Step 1. World: You are in state 9. Choose action A or C. Learner: Action A. World: Your reward is 100. Step 2. World: You are in state 32. Choose action B or E. Learner: Action B. World: Your reward is 50. Step 3. ...
- 7. Introduction (cont.). Meaning of reinforcement: the occurrence of an event, in the proper relation to a response, that tends to increase the probability that the response will occur again in the same situation. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. Reinforcement learning is learning how to act in order to maximize a numerical reward.
- 8. Introduction (cont.). Reinforcement learning is not a type of neural network, nor is it an alternative to neural networks; rather, it is an orthogonal approach to machine learning. Reinforcement learning emphasizes learning from feedback that evaluates the learner's performance without providing standards of correctness in the form of behavioral targets. Example: learning to ride a bicycle.
- 9. Elements of reinforcement learning: agent, state, reward, action, environment, policy. Agent: the intelligent program. Environment: the external conditions. Policy: defines the agent's behavior at a given time; a mapping from states to actions; may be a lookup table or a simple function.
- 10. Elements of reinforcement learning (cont.). Reward function: defines the goal in an RL problem; the policy is altered to achieve this goal. Value function: the reward function indicates what is good in an immediate sense, while a value function specifies what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Model of the environment: predicts (mimics) the behavior of the environment; used for planning; given the current state and action, it predicts the resultant next state and next reward.
- 11. Agent-Environment Interface
- 12. Steps in reinforcement learning: 1. The agent observes an input state. 2. An action is determined by a decision-making function (the policy). 3. The action is performed. 4. The agent receives a scalar reward or reinforcement from the environment. 5. Information about the reward given for that state/action pair is recorded.
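The five steps above can be sketched as a generic interaction loop. This is a minimal illustration, not from the slides: the environment, its `env_step` function, and the toy walk-right task are all invented for the example.

```python
def run_episode(env_step, policy, start_state, max_steps=100):
    """Generic RL interaction loop: observe the state, pick an action
    via the policy, perform it, receive a scalar reward, and record
    the (state, action, reward) experience."""
    state = start_state
    history = []
    for _ in range(max_steps):
        action = policy(state)                              # step 2
        next_state, reward, done = env_step(state, action)  # steps 3-4
        history.append((state, action, reward))             # step 5
        state = next_state
        if done:
            break
    return history

# Hypothetical toy environment: walk right from state 0; reaching state 3 pays reward 1.
def env_step(state, action):
    next_state = max(0, state + (1 if action == "right" else -1))
    return next_state, (1.0 if next_state == 3 else 0.0), next_state == 3

history = run_episode(env_step, lambda s: "right", start_state=0)
print(history)  # [(0, 'right', 0.0), (1, 'right', 0.0), (2, 'right', 1.0)]
```

Any real policy would of course choose actions from the recorded experience rather than always moving right.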
- 13. Salient features of reinforcement learning: RL is a set of problems rather than a set of techniques; the task is specified without specifying how it is to be achieved. "RL as a tool" point of view: RL is training by rewards and punishments. The learning agent's point of view: RL is learning from trial and error through interaction with the world, e.g. "how much reward do I get if I take this action?"
- 14. Reinforcement learning (cont.). Reinforcement learning uses evaluative feedback. Purely evaluative feedback indicates how good the action taken is, but does not tell whether it is the best or the worst action possible; it is the basis of methods for function optimization. Purely instructive feedback indicates the correct action to take, independently of the action actually taken, e.g. supervised learning. Tasks may be associative or non-associative.
- 15. Associative and non-associative tasks. Associative: situation dependent; a mapping from situations to the actions that are best in those situations. Non-associative: situation independent; no need to associate different actions with different situations. The learner either tries to find a single best action when the task is stationary, or tries to track the best action as it changes over time when the task is nonstationary.
- 16. Reinforcement learning (cont.). Exploration and exploitation. Greedy action: the action with the greatest estimated value; choosing it is a case of exploitation. Non-greedy action: a case of exploration, as it lets us improve our estimate of that action's value. The n-armed bandit problem: we must choose repeatedly among n different options or actions so as to maximize the expected total reward.
- 17. The bandit problem: pull one of the bandit's arms sequentially so as to maximize the total expected reward. The task is non-associative and the feedback is evaluative.
- 18. Action selection policies. Greedy: always choose the action with the highest estimated reward. Ɛ-greedy: most of the time the greediest action is chosen; every once in a while, with a small probability Ɛ, an action is selected at random, uniformly and independently of the action-value estimates. Ɛ-soft: the best action is selected with probability (1 - Ɛ) and the rest of the time a random action is chosen uniformly.
- 19. Ɛ-greedy action selection. Let a*t be the greedy action at time t and Qt(a) the estimated value of action a at time t. Greedy action selection: at = a*t = argmax_a Qt(a). Ɛ-greedy selection: at = a*t with probability 1 - Ɛ, and a random action with probability Ɛ.
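A minimal sketch of Ɛ-greedy selection over a list of action-value estimates; the example values and the 0.1 setting for Ɛ are illustrative, not from the slides.

```python
import random

def epsilon_greedy(Q, epsilon):
    """With probability epsilon explore (uniform random action index);
    otherwise exploit: return argmax_a Q[a]."""
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

random.seed(0)
Q = [0.2, 0.9, 0.5]                      # hypothetical action-value estimates
picks = [epsilon_greedy(Q, 0.1) for _ in range(1000)]
print(picks.count(1) / 1000)             # greedy arm chosen roughly 93% of the time
```

Note that even the greedy arm's share exceeds 1 - Ɛ slightly, because the uniform random choice can also land on it.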
- 20. Action selection policies (cont.). Softmax. Drawback of Ɛ-greedy and Ɛ-soft: they select random actions uniformly. Softmax remedies this by assigning a weight to each action according to its action-value estimate; a random action is then selected with regard to these weights, so the worst actions are unlikely to be chosen. This is a good approach where the worst actions are very unfavorable.
- 21. Softmax action selection (cont.). Problem with Ɛ-greedy: its exploration neglects the action values. Softmax idea: grade the action probabilities by the estimated values. Gibbs (Boltzmann) action selection, or exponential weights: P(a) = e^(Qt(a)/τ) / Σ_b e^(Qt(b)/τ), where τ is the "computational temperature". As τ → 0, softmax action selection becomes the same as greedy action selection.
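The Boltzmann weighting can be sketched as follows; the action values and temperatures are illustrative.

```python
import math

def softmax_probs(Q, tau):
    """Gibbs/Boltzmann weights: P(a) = exp(Q[a]/tau) / sum_b exp(Q[b]/tau).
    High tau -> nearly uniform; low tau -> nearly greedy."""
    weights = [math.exp(q / tau) for q in Q]
    total = sum(weights)
    return [w / total for w in weights]

Q = [0.2, 0.9, 0.5]                  # hypothetical action-value estimates
print(softmax_probs(Q, tau=1.0))     # graded preference for the best arm
print(softmax_probs(Q, tau=0.05))    # almost all probability mass on the best arm
```

For numerical robustness with large values one would subtract max(Q) before exponentiating; it is omitted here for brevity and does not change the probabilities.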
- 22. Incremental implementation. The sample average can be computed incrementally. Common update rule form: NewEstimate = OldEstimate + StepSize * [Target - OldEstimate]. The expression [Target - OldEstimate] is the error in the estimate; it is reduced by taking a step toward the target. In processing the (t+1)-st reward for action a, the step-size parameter is 1/(t+1).
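A small sketch of the update rule, showing that a step size of 1/(t+1) reproduces the sample average exactly; the reward sequence is made up for the example.

```python
def update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

rewards = [4.0, 2.0, 6.0]      # hypothetical rewards for one action
Q = 0.0
for t, r in enumerate(rewards):
    Q = update(Q, r, 1.0 / (t + 1))   # step size 1/(t+1)
print(Q)  # 4.0, the sample mean of the rewards
```

A constant step size instead of 1/(t+1) gives an exponentially weighted average, which is preferred for nonstationary problems.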
- 23. Some terms in reinforcement learning. The agent learns a policy: the policy at step t, πt, is a mapping from states to action probabilities: πt(s, a) is the probability that at = a when st = s. Agents change their policy with experience. Objective: get as much reward as possible over the long run. Goals and rewards: a goal should specify what we want to achieve, not how we want to achieve it.
- 24. Some terms in RL (cont.). Return: the reward accumulated over the long term. Episode: a subsequence of interaction between agent and environment, e.g. plays of a game or trips through a maze. Discounted return: the geometrically discounted model of return, Rt = rt+1 + γ rt+2 + γ² rt+3 + ..., with discount rate 0 ≤ γ ≤ 1. It is used to determine the present value of future rewards, giving more weight to earlier rewards.
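The discounted return is a one-liner; the reward sequence and γ below are illustrative.

```python
def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Three unit rewards discounted at gamma = 0.9: 1 + 0.9 + 0.81
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```

With γ < 1 and bounded rewards the infinite sum converges, which is why discounting is the standard return model for continuing (non-episodic) tasks.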
- 25. The Markov property: the environment's response at time (t+1) depends only on the state s and action a at time t. Markov decision processes: a finite MDP is defined by its transition probabilities and expected rewards. A transition graph summarizes the dynamics of a finite MDP.
- 26. Value functions: functions of states (or state-action pairs) that estimate how good it is for the agent to be in a given state (or to take a given action in that state). Types of value function: the state-value function V(s) and the action-value function Q(s, a).
- 27. Backup diagrams: the basis of the update (backup) operations; backup diagrams provide graphical summaries of the algorithms, for both the state-value and action-value functions. Bellman equation for a policy π: Vπ(s) = Σ_a π(s, a) Σ_s' P(s' | s, a) [R(s, a, s') + γ Vπ(s')].
- 28. Model-based and model-free learning. Model-based learning: learn from a model instead of interacting with the world; can visit arbitrary parts of the model; also called indirect methods, e.g. dynamic programming. Model-free learning: sample the reward and transition functions by interacting with the world; also called direct methods.
- 29. Problem-solving methods for RL: 1) dynamic programming, 2) Monte Carlo methods, 3) temporal-difference learning.
- 30. 1. Dynamic programming. The classical solution method; requires a complete and accurate model of the environment. Its two core operations: Policy evaluation: iterative computation of the value function for a given policy (the prediction problem), V(st) ← Eπ{rt+1 + γ V(st+1)}. Policy improvement: computation of an improved policy for a given value function.
- 31. Dynamic programming backup diagram: the update V(st) ← Eπ{rt+1 + γ V(st+1)} backs up an expectation over all possible next states, using the full model.
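Iterative policy evaluation can be sketched as below. The two-state MDP, the array layout P[s][a][s'] and R[s][a][s'], and the convergence threshold are all invented for illustration.

```python
def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation for a finite MDP:
    V(s) <- sum_a pi(s,a) * sum_s' P[s][a][s'] * (R[s][a][s'] + gamma*V(s')),
    swept over all states until the largest change falls below theta."""
    n = len(P)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            v = sum(policy[s][a] *
                    sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                        for s2 in range(n))
                    for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# Hypothetical two-state MDP, one action: state 0 moves to state 1 with
# reward 1; state 1 loops on itself with reward 0.
P = [[[0.0, 1.0]], [[0.0, 1.0]]]   # P[s][a][s']
R = [[[0.0, 1.0]], [[0.0, 0.0]]]   # R[s][a][s']
V = policy_evaluation(P, R, policy=[[1.0], [1.0]])
print(V)  # [1.0, 0.0]
```

In-place updating (overwriting V[s] within the sweep) converges at least as fast as keeping two separate arrays.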
- 34. NewEstimate = OldEstimate + StepSize * [Target - OldEstimate]
- 35. Policy improvement: computation of an improved policy for a given value function.
- 36. Policy evaluation and policy improvement together lead to policy iteration and value iteration.
- 37. Policy iteration: alternates policy evaluation and policy improvement steps until the policy is stable.
- 39. Value iteration: combines evaluation and improvement in one update.
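Value iteration replaces the expectation under a fixed policy with a max over actions. The sketch below mirrors the policy-evaluation layout; the two-state MDP with a "stay" and a "move" action is invented for the example.

```python
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Value iteration: V(s) <- max_a sum_s' P[s][a][s'] * (R[s][a][s'] + gamma*V(s')).
    Evaluation and improvement are combined in a single backup."""
    n = len(P)
    V = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            v = max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                        for s2 in range(n))
                    for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# Hypothetical MDP: in state 0, action 0 stays (reward 0) and action 1
# moves to state 1 (reward 1); state 1 has a single self-loop action.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]]
R = [[[0.0, 0.0], [0.0, 1.0]], [[0.0, 0.0]]]
V = value_iteration(P, R)
print(V)  # [1.0, 0.0]: moving is better than staying in state 0
```

A greedy policy read off from the converged V (argmax over the same backup) is the optimal policy.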
- 40. Generalized policy iteration (GPI) consists of two interacting processes: policy evaluation, which makes the value function consistent with the current policy, and policy improvement, which makes the policy greedy with respect to the current value function.
- 41. 2. Monte Carlo methods. Features: no need for complete knowledge of the environment; based on averaging sample returns observed after visits to a state; experience is divided into episodes, and value estimates and policies are changed only after an episode completes; no model required; not suited for step-by-step incremental computation.
- 42. MC and DP methods compute the same value function and follow the same steps as DP: policy evaluation (computation of Vπ and Qπ for a fixed arbitrary policy π), policy improvement, and generalized policy iteration.
- 43. To find the value of a state, estimate it from experience: average the returns observed after visits to that state. The more returns are observed, the closer the average converges to the expected value.
- 44. First-visit Monte Carlo policy evaluation: average the returns following the first visit to each state in each episode.
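A sketch of first-visit MC prediction. The episode encoding as (state, reward) pairs and the two toy episodes are assumptions made for the example.

```python
def first_visit_mc(episodes, gamma=1.0):
    """First-visit MC prediction: for each episode, record the return
    following the FIRST visit to each state, then average per state."""
    returns = {}                       # state -> list of observed returns
    for episode in episodes:           # episode = [(state, reward), ...]
        G = 0.0
        first = {}
        # Walk backwards accumulating the return; repeated assignment
        # means the earliest (first) visit's return survives.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            first[s] = G
        for s, g in first.items():
            returns.setdefault(s, []).append(g)
    return {s: sum(g) / len(g) for s, g in returns.items()}

# Two hypothetical episodes through states A -> B with different final rewards.
episodes = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("B", 3.0)]]
values = first_visit_mc(episodes)
print(sorted(values.items()))  # [('A', 2.0), ('B', 2.0)]
```

Every-visit MC differs only in appending a return for every occurrence of a state, not just the first; both converge to Vπ as the number of visits grows.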
- 46. In Monte Carlo with exploring starts (MCES), all returns for each state-action pair are accumulated and averaged, irrespective of what policy was in force when they were observed.
- 48. Monte Carlo methods (cont.). All actions should be selected infinitely often for optimization; MC with exploring starts cannot converge to any suboptimal policy. Agents have two approaches: on-policy and off-policy.
- 50. On-policy MC algorithm
- 51. Off-policy MC algorithm
- 52. Monte Carlo vs dynamic programming. MC has several advantages over DP: it can learn from interaction with the environment; it needs no full model; it need not learn about all states; it does no bootstrapping.
- 53. 3. Temporal-difference (TD) methods. Like MC, they learn from experience and can learn directly from interaction with the environment, with no need for full models. Like DP, they estimate values based on the estimated values of next states (bootstrapping). Issue to watch for: maintaining sufficient exploration.
- 54. TD prediction. Policy evaluation (the prediction problem): for a given policy π, compute the state-value function Vπ. Simple every-visit Monte Carlo method: V(st) ← V(st) + α [Rt - V(st)], where the target Rt is the actual return after time t. The simplest TD method, TD(0): V(st) ← V(st) + α [rt+1 + γ V(st+1) - V(st)], where the target rt+1 + γ V(st+1) is an estimate of the return.
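The TD(0) update can be sketched as below. The transition encoding (s, r, s') with None marking a terminal successor, and the two-state chain, are assumptions for the example.

```python
def td0(episodes, alpha=0.1, gamma=1.0, sweeps=1):
    """TD(0): V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)],
    applied to every observed transition, repeated for several sweeps."""
    V = {}
    for _ in range(sweeps):
        for episode in episodes:       # episode = [(s, r, s'), ...]
            for s, r, s2 in episode:
                v_next = 0.0 if s2 is None else V.get(s2, 0.0)
                td_error = r + gamma * v_next - V.get(s, 0.0)
                V[s] = V.get(s, 0.0) + alpha * td_error
    return V

# Hypothetical chain A -> B -> terminal, reward 1 on the final transition.
episodes = [[("A", 0.0, "B"), ("B", 1.0, None)]]
V = td0(episodes, alpha=0.5, sweeps=50)
print(V)  # both values converge toward 1.0
```

Unlike MC, each update uses only the next transition, so learning can proceed before the episode's final outcome is known.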
- 55. Simple Monte Carlo backup diagram: the update V(st) ← V(st) + α [Rt - V(st)] backs up the actual return of one complete sample episode.
- 56. Simplest TD method backup diagram: the update V(st) ← V(st) + α [rt+1 + γ V(st+1) - V(st)] backs up from a single sampled successor state.
- 57. Dynamic programming backup diagram: the update V(st) ← Eπ{rt+1 + γ V(st+1)} backs up an expectation over all possible successor states.
- 58. TD methods bootstrap and sample. Bootstrapping: the update involves an estimate; MC does not bootstrap, DP bootstraps, TD bootstraps. Sampling: the update does not involve an expected value; MC samples, DP does not sample, TD samples.
- 59. Learning an action-value function: estimate Qπ for the current behavior policy π. After every transition from a nonterminal state st, do: Q(st, at) ← Q(st, at) + α [rt+1 + γ Q(st+1, at+1) - Q(st, at)]. If st+1 is terminal, then Q(st+1, at+1) = 0.
- 60. Q-learning: off-policy TD control. One-step Q-learning: Q(st, at) ← Q(st, at) + α [rt+1 + γ max_a Q(st+1, a) - Q(st, at)].
- 61. Sarsa: on-policy TD control. S-A-R-S-A: State, Action, Reward, State, Action. Turn the action-value update into a control method by always updating the policy to be greedy with respect to the current estimate.
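The Sarsa and Q-learning updates differ only in how the target is formed, which the sketch below makes explicit. The state/action names, values, and parameters are invented for the comparison.

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """Sarsa (on-policy): the target uses the action a2 actually taken in s2."""
    target = r + gamma * Q.get((s2, a2), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    """Q-learning (off-policy): the target maximizes over next actions,
    regardless of which action the behavior policy takes next."""
    target = r + gamma * max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

base = {("s1", "left"): 0.0, ("s2", "left"): 0.2, ("s2", "right"): 0.8}
Qs, Qq = dict(base), dict(base)
sarsa_update(Qs, "s1", "left", 1.0, "s2", "left", alpha=0.5, gamma=0.9)
q_learning_update(Qq, "s1", "left", 1.0, "s2", ["left", "right"], alpha=0.5, gamma=0.9)
# Sarsa target: 1 + 0.9*Q(s2,left); Q-learning target: 1 + 0.9*max_a Q(s2,a)
print(Qs[("s1", "left")], Qq[("s1", "left")])
```

Because Sarsa's target follows the (possibly exploratory) behavior policy while Q-learning's always takes the max, Q-learning here backs up the larger value.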
- 62. Advantages of TD learning. TD methods do not require a model of the environment, only experience. TD methods, unlike MC, can be fully incremental: you can learn before knowing the final outcome (less memory, less peak computation) and you can learn without the final outcome, from incomplete sequences. Both MC and TD converge.
- 63. References: Sutton and Barto, "Reinforcement Learning: An Introduction." Univ. of Alberta: http://www.cs.ualberta.ca/~sutton/book/ebook/node1.html Univ. of New South Wales: http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html
- 64. Thank You