Planning and Learning with Tabular Methods
Reviewed by D. Lee, 2018
Reinforcement Learning: An Introduction
Richard S. Sutton and Andrew G. Barto
Outline
1. Introduction
2. Models and Planning
3. Dyna: Integrating Planning, Acting, and Learning
4. When the Model Is Wrong
5. Prioritized Sweeping
6. Expected vs. Sample Updates
7. Trajectory Sampling
8. Planning at Decision Time
9. Heuristic Search
10. Rollout Algorithms
11. Monte Carlo Tree Search
12. Summary
Introduction
Introduction
• In this chapter we develop a unified view of reinforcement learning methods
– that require a model of the environment, such as dynamic programming and heuristic search.
– that can be used without a model, such as Monte Carlo and Temporal Difference methods.
• These are respectively called model-based and model-free reinforcement learning methods.
– Model-based methods rely on planning as their primary component.
– While model-free methods primarily rely on learning.
• The heart of both kinds of methods is
– the computation of value functions.
– based on looking ahead to future events, computing a backed-up value.
– then using the computed value as an update target for an approximate value function.
Models and Planning
Models and Planning (1)
• By a model of the environment we mean anything that an agent can use to predict how the
environment will respond to its actions.
– In other words, a model is built so that simulations can be run.
• deterministic model vs. stochastic model
– deterministic model : Given a state and an action, a model produces a prediction of the resultant next state and next
reward.
– stochastic model : There are several possible next states and next rewards, each with some probability of occurring.
Models and Planning (2)
• distribution model vs. sample model
– distribution model
• produce a description of all possibilities and their probabilities.
• The kind of model assumed in dynamic programming - estimates of the MDP's dynamics, 𝑝(𝑠′, 𝑟|𝑠, 𝑎) - is a distribution model.
• Stronger than sample models in that they can always be used to produce samples.
– sample model
• produce just one of the possibilities, sampled according to the probabilities.
• for example, for the sum of several dice, produce an individual sum drawn according to the probability distribution of possible sums.
• The kind of model used in the blackjack example (by Monte Carlo Method) in Chapter 5.
• Models can be used to mimic or simulate experience.
– In either case, distribution model or sample model, the model is used to simulate the environment
and produce simulated experience.
Source : Planning and Learning with Tabular Methods - 1
Image source : https://wonseokjung.github.io//reinforcementlearning/update/PaL/
Models and Planning (3)
• Planning
– Any computational process that takes a model as input and produces or improves a policy for interacting with the
modeled environment:
• Two distinct approaches to planning
– state-space planning
• A search through the state space for an optimal policy or an optimal path to a goal.
• Actions cause transitions from state to state.
• Value functions are computed over states.
– plan-space planning
• A search through the space of plans.
• Operators transform one plan into another plan.
• Value functions are defined over the space of plans.
• Plan-space methods are difficult to apply efficiently to stochastic sequential decision problems, so we do not consider them further.
Models and Planning (4)
• state-space planning’s two basic ideas:
– All state-space planning methods involve computing value functions toward improving the policy.
– They compute value functions by updates or backup operations applied to simulated experience.
• This structure can be diagrammed as follows: model → simulated experience → updates → values → policy.
– Dynamic programming methods clearly fit this structure.
Models and Planning (5)
• The heart of both learning and planning methods is the estimation of value functions by backing-up
update operations.
• The difference is that whereas planning uses simulated experience generated by a model, learning
methods use real experience generated by the environment.
• In many cases a learning algorithm can be substituted for the key update step of a planning method.
– Random-sample one-step tabular Q-planning
• A simple example of a planning method based on one-step tabular Q-learning and on random samples from a
sample model.
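Below is a minimal Python sketch of random-sample one-step tabular Q-planning, not the book's exact pseudocode; the sample_model interface, the observed_pairs set, and the hyperparameter values are assumptions made here for illustration.

```python
import random
from collections import defaultdict

def make_q_table():
    """Tabular action-value function Q[s][a], defaulting to 0.0."""
    return defaultdict(lambda: defaultdict(float))

def q_planning(Q, sample_model, observed_pairs, n_updates, alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning (sketch).

    sample_model(s, a) -> (r, s_next) is assumed to return a sampled reward
    and next state; observed_pairs holds previously experienced (s, a) pairs.
    """
    for _ in range(n_updates):
        s, a = random.choice(list(observed_pairs))      # 1. select a state-action pair at random
        r, s_next = sample_model(s, a)                  # 2. query the sample model
        best_next = max(Q[s_next].values(), default=0.0)
        # 3. apply one-step tabular Q-learning to the simulated transition
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q
```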
Dyna: Integrating Planning, Acting,
and Learning
Dyna: Integrating Planning, Acting, and Learning (1)
• When planning is done online, while interacting with the environment, a number of interesting
issues arise.
• New information gained from interacting with the environment may change the model.
• If decision making and model learning are both computation-intensive processes, then the available
computational resources may need to be divided between them.
– Dyna-Q
• A simple architecture integrating the major functions needed in an online planning agent.
Dyna: Integrating Planning, Acting, and Learning (2)
• Within a planning agent, there are at least two roles for real experience:
– It can be used to improve the model. (to make it more accurately match the real environment)
• model-learning
– It can be used to directly improve the value function and policy.
• direct reinforcement learning (direct RL)
• Experience can improve value functions and policies either directly or indirectly via the model.
– It is the latter, which is sometimes called indirect reinforcement learning, that is involved in planning.
13Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto
Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Dyna: Integrating Planning, Acting, and Learning (3)
• Dyna-Q includes all of the processes shown in the diagram above - planning, acting, model-learning,
and direct RL - all occurring continually.
– The planning method is the random-sample one-step tabular Q-planning method.
– The direct RL method is one-step tabular Q-learning.
– The model-learning method is also table-based and assumes the environment is deterministic.
• After each transition 𝑆𝑡, 𝐴𝑡 → 𝑅𝑡+1, 𝑆𝑡+1, the model records in its table entry for 𝑆𝑡, 𝐴𝑡 the prediction that 𝑅𝑡+1, 𝑆𝑡+1 will deterministically follow.
• Thus, if the model is queried with a state-action pair that has been experienced before, it simply returns the last-
observed next state and next reward as its prediction.
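As a small illustrative sketch (the class and method names are made up here, not taken from the book), the deterministic table model described above could simply be a dictionary keyed by state-action pairs:

```python
class DeterministicTableModel:
    """Table model for a deterministic environment, as assumed by Dyna-Q."""

    def __init__(self):
        self.table = {}  # (state, action) -> (reward, next_state)

    def update(self, state, action, reward, next_state):
        # Record that (reward, next_state) deterministically follows (state, action),
        # overwriting any earlier observation for this pair.
        self.table[(state, action)] = (reward, next_state)

    def query(self, state, action):
        # A previously experienced pair returns the last-observed reward and
        # next state as the prediction.
        return self.table[(state, action)]

    def observed_pairs(self):
        return list(self.table.keys())
```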
Dyna: Integrating Planning, Acting, and Learning (4)
• The overall architecture of Dyna agents, of which the Dyna-Q algorithm is one example, is
– The central column represents the basic interaction between agent and environment, giving rise to a trajectory of real
experience.
– The arrow on the left of the figure represents direct reinforcement learning operating on real experience to improve the
value function and the policy.
Dyna: Integrating Planning, Acting, and Learning (5)
• The overall architecture of Dyna agents, of which the Dyna-Q algorithm is one example, is
– On the right are model-based processes. The model is learned from real experience and gives rise to simulated
experience.
• We use the term search control to refer to the process that selects the starting states and actions for the simulated
experiences generated by the model.
• The planning update is achieved by applying reinforcement learning methods to the simulated experiences just as if they had really happened.
Dyna: Integrating Planning, Acting, and Learning (6)
• Typically, as in Dyna-Q, the same reinforcement learning method is
– used both for learning from real experience and for planning from simulated experience.
– thus the “final common path” for both learning and planning.
• Conceptually, planning, acting, model-learning, and direct RL occur simultaneously and in parallel in Dyna agents; for concreteness, however, we fully specify the order in which they occur within a time step.
• The pseudocode algorithm for Dyna-Q is
Dyna: Integrating Planning, Acting, and Learning (7)
• The pseudocode algorithm for Dyna-Q is
– Let us assume that there is time in each step, after acting, model-learning, and direct RL, to complete n iterations (Steps
1-3) of the Q-planning algorithm.
– Model(𝑠, 𝑎) denotes the contents of the predicted next state and reward for state-action pair (𝑠, 𝑎).
– Direct reinforcement learning, model-learning, and planning are implemented by steps (d), (e), and (f), respectively.
• If (e) and (f) were omitted, the remaining algorithm would be one-step tabular Q-learning.
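A compact Python sketch of the Dyna-Q loop in the spirit of steps (a)-(f) is given below; the env interface (reset()/step() returning next state, reward, and a done flag), the single shared action set, and the hyperparameter values are assumptions for illustration, not part of the book's pseudocode.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def dyna_q(env, actions, episodes, n_planning, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q (sketch): direct RL, model-learning, and planning each step."""
    Q = defaultdict(lambda: defaultdict(float))
    model = {}  # deterministic table model: (s, a) -> (r, s')
    for _ in range(episodes):
        s = env.reset()                                          # (a) current (nonterminal) state
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)           # (b) epsilon-greedy action selection
            s_next, r, done = env.step(a)                        # (c) act, observe reward and next state
            best = max(Q[s_next][a2] for a2 in actions)
            Q[s][a] += alpha * (r + gamma * best - Q[s][a])      # (d) direct RL: one-step Q-learning
            model[(s, a)] = (r, s_next)                          # (e) model-learning (deterministic)
            for _ in range(n_planning):                          # (f) planning: n Q-planning updates
                ps, pa = random.choice(list(model))
                pr, ps_next = model[(ps, pa)]
                pbest = max(Q[ps_next][a2] for a2 in actions)
                Q[ps][pa] += alpha * (pr + gamma * pbest - Q[ps][pa])
            s = s_next
    return Q
```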
Dyna: Integrating Planning, Acting, and Learning (8)
• Example 8.1: Dyna Maze
– Figure 8.2:
• This is a discounted, episodic task with 𝛾 = 0.95.
• The initial action values were zero, the step-size parameter was 𝛼 = 0.1, and the exploration parameter was 𝜀 = 0.1.
• The agents varied in the number of planning steps, n, they performed per real step.
Dyna: Integrating Planning, Acting, and Learning (9)
• Example 8.1: Dyna Maze
– Figure 8.3:
• Figure 8.3 shows why the planning agents found the solution so much faster than the non-planning agent.
• Shown are the policies found by the n = 0 and n = 50 agents halfway through the second episode.
– Without planning (n = 0), each episode adds only one additional step to the policy, and so only one step (the last) has been
learned so far.
– With planning, again only one step is learned during the first episode, but here during the second episode an extensive policy
has been developed that by the end of the episode will reach almost back to the start state.
Dyna: Integrating Planning, Acting, and Learning (10)
• In Dyna-Q, learning and planning are accomplished by exactly the same algorithm, operating on real
experience for learning and on simulated experience for planning.
• Ongoing in the background is the model-learning process.
– As new information is gained, the model is updated to better match reality.
– As the model changes, the ongoing planning process will gradually compute a different way of behaving to match the
new model.
When the Model Is Wrong
When the Model Is Wrong (1)
• Models may be incorrect.
– because the environment is stochastic and only a limited number of samples have been observed.
– because the model was learned using function approximation that has generalized imperfectly.
– simply because the environment has changed and its new behavior has not yet been observed.
• When the model is incorrect, the planning process is likely to compute a suboptimal policy.
• In some cases, the suboptimal policy computed by planning quickly leads to the discovery and
correction of the modeling error.
When the Model Is Wrong (2)
• Example 8.2: Blocking Maze (Positive example)
– Figure 8.4:
• Average performance of Dyna agents on a blocking task. The left environment was used for the first 1000 steps, the
right environment for the rest. Dyna-Q+ is Dyna-Q with an exploration bonus that encourages exploration.
• Initially, there is a short path from start to goal, to the right of the barrier, as shown in the upper left of the figure.
• After 1000 time steps, the short path is “blocked,” and a longer path is opened up along the left-hand side of the
barrier, as shown in upper right of the figure.
When the Model Is Wrong (3)
• Example 8.2: Blocking Maze (Positive example)
– Figure 8.4:
• When the environment changed, the graphs became flat, indicating a period during which the agents obtained no reward because they were wandering around behind the barrier.
• After a while, however, they were able to find the new opening and the new optimal behavior.
When the Model Is Wrong (4)
• Example 8.3: Shortcut Maze (Negative example)
– Figure 8.5:
• Average performance of Dyna agents on a shortcut task. The left environment was used for the first 3000 steps, the
right environment for the rest.
• Initially, the optimal path is to go around the left side of the barrier (upper left).
• After 3000 steps, however, a shorter path is opened up along the right side, without disturbing the longer path
(upper right).
When the Model Is Wrong (5)
• Example 8.3: Shortcut Maze (Negative example)
– Figure 8.5:
• Greater difficulties arise when the environment changes to become better than it was before, and yet the formerly
correct policy does not reveal the improvement.
• The graph shows that the regular Dyna-Q agent never switched to the shortcut; in fact, it never realized that the shortcut existed.
• The Dyna-Q agent's model said that there was no shortcut, so the more it planned, the less likely it was to step to the right and discover it.
When the Model Is Wrong (6)
• The general problem here is another version of the conflict between exploration and exploitation.
– exploration means trying actions that improve the model.
– whereas exploitation means behaving in the optimal way given the current model.
• We want the agent to explore to find changes in the environment, but not so much that
performance is greatly degraded.
• The Dyna-Q+ agent that did solve the shortcut maze uses one such heuristic.
– To encourage behavior that tests long-untried actions, a special “bonus reward” is given on simulated experiences
involving these actions.
– In particular, if the modeled reward for a transition is 𝑟, and the transition has not been tried in 𝜏 time steps, then planning updates are done as if that transition produced a reward of 𝑟 + 𝜅√𝜏, for some small 𝜅.
– This encourages the agent to keep testing all accessible state transitions and even to find long sequences of actions in
order to carry out such tests.
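A tiny sketch of a Dyna-Q+ style bonus, assuming some bookkeeping of when each transition was last tried (the function and variable names are illustrative, not from the book):

```python
import math

def bonus_reward(modeled_reward, tau, kappa=1e-3):
    """Exploration bonus applied during planning in Dyna-Q+ (sketch).

    If the modeled reward is r and the transition has not been tried in tau
    time steps, the planning update treats its reward as r + kappa * sqrt(tau).
    """
    return modeled_reward + kappa * math.sqrt(tau)

# Inside the planning loop one might use (last_tried is assumed bookkeeping):
#   tau = current_time - last_tried[(s, a)]
#   r_plan = bonus_reward(r_model, tau)
```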
Prioritized Sweeping
Prioritized Sweeping (1)
• In the Dyna agents presented in the preceding sections, simulated transitions are started in state-
action pairs selected uniformly at random from all previously experienced pairs.
• But a uniform selection is usually not the best.
– planning can be much more efficient if simulated transitions and updates are focused on particular state-action pairs.
• If simulated transitions are generated uniformly, then many wasteful updates will be made before
stumbling onto one of these useful updates.
– Is there really a need to use uniform selection when planning?
• In the much larger problems that are our real objective, the number of states is so large that an
unfocused search would be extremely inefficient.
– Let us look into focused search.
Prioritized Sweeping (2)
• This example suggests that search might be usefully focused by working backward from goal states.
• In general, we want to work back not just from goal states but from any state whose value has
changed.
• Suppose
– that the values are initially correct given the model, as they were in the maze example prior to discovering the goal.
– that the agent discovers a change in the environment and changes its estimated value of one state, either up or down.
• Typically, this will imply that the values of many other states should also be changed, but the only
useful one-step updates are those of actions that lead directly into the one state whose value has
been changed.
• This general idea might be termed backward focusing of planning computations.
Prioritized Sweeping (3)
• But not all of these updates will be equally useful.
– The values of some states may have changed a lot, whereas others may have changed little.
• In a stochastic environment, variations in estimated transition probabilities also contribute to
variations in the sizes of changes and in the urgency with which pairs need to be updated.
• It is natural to prioritize the updates according to a measure of their urgency, and perform them in
order of priority.
• This is the idea behind prioritized sweeping.
Prioritized Sweeping (4)
• Prioritized Sweeping for a deterministic environment
– A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change.
– When the top pair in the queue is updated, the effect on each of its predecessor pairs is computed.
– If the effect is greater than some small threshold 𝜃, then the pair is inserted in the queue with the new priority.
– In this way the effects of changes are efficiently propagated backward until quiescence.
– The full algorithm for the case of deterministic environments is given in the box.
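The box itself is not reproduced here; the following is a hedged Python sketch of prioritized sweeping for a deterministic environment, with assumed model, predecessors, and queue-seeding interfaces (all names are illustrative):

```python
import heapq
from itertools import count

def prioritized_sweeping(Q, model, predecessors, actions, seed_pairs,
                         n_updates, alpha=0.1, gamma=0.95, theta=1e-4):
    """Prioritized sweeping for a deterministic environment (sketch).

    model[(s, a)] -> (r, s_next) is a learned deterministic table model;
    predecessors[s] is the set of (s_pre, a_pre) pairs known to lead to s;
    seed_pairs provides the initial candidates for the priority queue.
    """
    tie = count()          # tie-breaker so heap entries never compare states
    pqueue = []            # min-heap of (-priority, tie, s, a)

    def priority(s, a):
        r, s_next = model[(s, a)]
        best = max(Q[s_next][a2] for a2 in actions)
        return abs(r + gamma * best - Q[s][a])

    def push_if_significant(s, a):
        p = priority(s, a)
        if p > theta:
            heapq.heappush(pqueue, (-p, next(tie), s, a))

    for s, a in seed_pairs:
        push_if_significant(s, a)

    for _ in range(n_updates):
        if not pqueue:
            break
        _, _, s, a = heapq.heappop(pqueue)
        r, s_next = model[(s, a)]
        best = max(Q[s_next][a2] for a2 in actions)
        Q[s][a] += alpha * (r + gamma * best - Q[s][a])   # update the highest-priority pair
        # Propagate backward: re-prioritize pairs predicted to lead into s.
        for s_pre, a_pre in predecessors[s]:
            push_if_significant(s_pre, a_pre)
    return Q
```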
Prioritized Sweeping (5)
• Prioritized Sweeping for a stochastic environment
Source : Prioritized Sweeping in a Stochastic Environment
Image source : https://lenabayeva.files.wordpress.com/2015/03/aamasreport2011.pdf
Prioritized Sweeping (6)
• Prioritized sweeping is just one way of distributing computations to improve planning efficiency, and
probably not the best way.
• One of prioritized sweeping's limitations is that it uses expected updates, which in stochastic
environments may waste lots of computation on low-probability transitions.
• As we show in the following section, sample updates can in many cases get closer to the true value
function with less computation despite the variance introduced by sampling.
Expected vs. Sample Updates
Expected vs. Sample Updates (1)
• In the rest of this chapter, we analyze some of the component
ideas involved, starting with the relative advantages of expected
and sample updates.
• Three binary dimensions
– First : whether value-function updates update state values or action values.
– Second : whether value-function updates estimate the value for the optimal
policy or for an arbitrary given policy.
– Third : whether the updates are expected updates, considering all possible
events that might happen, or sample updates, considering a single sample of
what might happen.
• The first and second dimensions give rise to four classes of updates for approximating the four value functions, 𝑞∗, 𝑣∗, 𝑞𝜋, and 𝑣𝜋.
• These three binary dimensions give rise to eight cases, seven of
which correspond to specific algorithms, as shown in the figure.
Expected vs. Sample Updates (2)
• Sample updates can be done using sample transitions from the environment or a sample model, whereas expected updates require a distribution model.
• Implicit in this view is that expected updates are preferable to sample updates when both are possible.
• Expected updates certainly yield a better estimate because they are uncorrupted by sampling error,
but they also require more computation, and computation is often the limiting resource in planning.
• To properly assess the relative merits of expected and sample updates for planning we must control
for their different computational requirements.
Expected vs. Sample Updates (3)
• Consider
– the expected and sample updates for approximating 𝑞∗,
– the special case of discrete states and actions,
– a table-lookup representation of the approximate value function, Q,
– a model in the form of estimated dynamics, p̂(s′, r|s, a).
• The expected update for a state-action pair, s, a, is:
– Q(s, a) ← Σ_{s′, r} p̂(s′, r|s, a) [r + γ max_{a′} Q(s′, a′)]
• The corresponding sample update for s, a, given a sample next state and reward, S′ and R (from the model), is the Q-learning-like update:
– Q(s, a) ← Q(s, a) + α [R + γ max_{a′} Q(S′, a′) − Q(s, a)]
– α is the usual positive step-size parameter.
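A minimal sketch of the two updates just written, assuming Q[s][a] is a nested dict table and the distribution model is stored as a list of (probability, reward, next state) triples per pair; both representations are assumptions chosen for illustration.

```python
def expected_update(Q, model_dist, s, a, gamma=0.95):
    """Expected update: back up over all possible next states and rewards.

    model_dist[(s, a)] is assumed to be a list of (prob, r, s_next) triples
    whose probabilities sum to 1.
    """
    Q[s][a] = sum(p * (r + gamma * max(Q[s_next].values(), default=0.0))
                  for p, r, s_next in model_dist[(s, a)])

def sample_update(Q, r, s_next, s, a, alpha=0.1, gamma=0.95):
    """Sample (Q-learning-like) update from one sampled next state and reward."""
    target = r + gamma * max(Q[s_next].values(), default=0.0)
    Q[s][a] += alpha * (target - Q[s][a])
```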
Expected vs. Sample Updates (4)
• The difference between these expected and sample updates is significant to the extent that the
environment is stochastic.
– Specifically, to the extent that, given a state and action, many possible next states may occur with various probabilities.
• For a particular starting pair, s, a, let b be the branching factor (i.e., the number of possible next states, s′, for which p̂(s′, r|s, a) > 0).
– A red-black tree with branching factor example:
Source : Branching factor
Image source : https://en.wikipedia.org/wiki/Branching_factor
Expected vs. Sample Updates (5)
• Expected update
– The expected update is an exact computation, resulting in a new Q(s, a) whose correctness is limited only by the
correctness of the Q(s’, a’) at successor states.
– If there is enough time to complete an expected update, then the resulting estimate is generally better than that of b
sample updates because of the absence of sampling error.
– The expected update reduces the error to zero upon its completion.
Expected vs. Sample Updates (6)
• Sample update
– The sample update is in addition affected by sampling error.
– The sample update is cheaper computationally because it considers only one next state, not all possible next states.
– If there is insufficient time to complete an expected update, then sample updates are always preferable because they at
least make some improvement in the value estimate with fewer than b updates.
– Sample updates reduce the error according to √((𝑏−1)/(𝑏𝑡)), where t is the number of sample updates that have been performed (assuming sample averages, i.e., 𝛼 = 1/𝑡).
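A quick numeric check of this expression (purely illustrative):

```python
import math

def sample_error_fraction(b, t):
    """Fraction of the initial error remaining after t sample updates,
    for branching factor b (assuming sample-average step sizes)."""
    return math.sqrt((b - 1) / (b * t))

# For b = 100: after t = 10 sample updates roughly 31% of the initial error
# remains; after t = b = 100 updates (the computational cost of one expected
# update) roughly 10% remains.
for t in (1, 10, 100):
    print(t, round(sample_error_fraction(100, t), 3))
```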
Expected vs. Sample Updates (7)
• These results suggest that sample updates are likely to be superior to expected updates on problems
with large stochastic branching factors and too many states to be solved exactly.
Trajectory Sampling
Trajectory Sampling (1)
• In this section we compare two ways of distributing updates.
– Exhaustive Sweep
– Trajectory Sampling
• Exhaustive Sweep
– The classical approach, from dynamic programming, is to perform sweeps through the entire state (or state-action)
space, updating each state (or state-action pair) once per sweep.
– This is problematic on large tasks because there may not be time to complete even one sweep.
– In principle, updates can be distributed any way one likes (to assure convergence, all states or state-action pairs must be
visited in the limit an infinite number of times.)
Trajectory Sampling (2)
• Trajectory Sampling
– The second approach is to sample from the state or state-action space according to some distribution.
– More appealing is to distribute updates according to the on-policy distribution, that is, according to the distribution
observed when following the current policy.
• In an episodic task, one starts in a start state (or according to the starting-state distribution) and simulates until the
terminal state.
• In a continuing task, one starts anywhere and just keeps simulating.
• In either case, sample state transitions and rewards are given by the model, and sample actions are given by the
current policy.
– In other words, one simulates explicit individual trajectories and performs updates at the state or state-action pairs encountered along the way.
– We call this way of generating experience and updates trajectory sampling.
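A sketch of on-policy trajectory sampling with one-step sample updates along each simulated trajectory; the sample_model, policy, and is_terminal interfaces are assumptions made for illustration.

```python
def trajectory_sampling(Q, sample_model, policy, start_state, is_terminal,
                        n_trajectories, max_steps=1000, alpha=0.1, gamma=0.95):
    """Distribute updates along simulated on-policy trajectories (sketch).

    sample_model(s, a) -> (r, s_next); policy(Q, s) -> a. Updates are made
    only at the state-action pairs actually encountered along each trajectory.
    """
    for _ in range(n_trajectories):
        s = start_state
        for _ in range(max_steps):
            a = policy(Q, s)                        # sample actions from the current policy
            r, s_next = sample_model(s, a)          # transitions and rewards come from the model
            best = max(Q[s_next].values(), default=0.0)
            Q[s][a] += alpha * (r + gamma * best - Q[s][a])
            if is_terminal(s_next):
                break
            s = s_next
    return Q
```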
Planning at Decision Time
Planning at Decision Time (1)
• Planning can be used in at least two ways.
– background planning
– decision-time planning
• background planning
– The one we have considered so far in this chapter, typified by dynamic programming and Dyna, is to use planning to
gradually improve a policy or value function on the basis of simulated experience obtained from a model (either a sample
or a distribution model).
– Well ‘before’ an action is selected for any current state 𝑆𝑡, planning has played a part in improving the table entries, or
the mathematical expression, needed to select the action for many states, including 𝑆𝑡.
– Used this way, planning is not focused on the current state.
• decision-time planning
– The other way to use planning is to begin and complete it ‘after’ encountering each new state 𝑆𝑡, as a computation
whose output is the selection of a single action 𝐴 𝑡.
– More generally, planning used in this way can look much deeper than one-step-ahead and evaluate action choices
leading to many different predicted state and reward trajectories.
– Unlike the first use of planning, here planning focuses on a particular state.
– Examples: Heuristic Search, Rollout Algorithms, and Monte Carlo Tree Search.
Planning at Decision Time (2)
• Let us now take a closer look at decision-time planning.
• The values and policy created by the planning process are typically discarded after being used to
select the current action.
• In general, one may want to do a mix of both: focus planning on the current state and store the results of planning so as to be that much farther along should the agent return to the same state later.
• Decision-time planning is most useful in applications in which fast responses are not required.
• In chess-playing programs, for example, decision-time planning may be permitted seconds or minutes of computation for each move, and strong programs may plan dozens of moves ahead within this time.
Heuristic Search
Heuristic Search (1)
• The classical state-space planning methods in artificial intelligence are decision-time
planning methods collectively known as heuristic search. (or Informed Search)
– Heuristic: experiential; enabling one to discover things for oneself (Naver English dictionary).
– It is similar to breadth-first search, except that the search does not expand uniformly outward from the start node. Instead, guided by a heuristic function that encodes information about the problem, it first visits the nodes judged to lie on the best path to the goal. Such a search method is called best-first or heuristic search.
– Heuristic Search Algorithms
• Best-first Search
• Greedy best-first search
• Route-finding problem
• A*
• IDA* (Iterative Deepening A*)
• Memory-Bounded Heuristic Search
– Reference Link
• AI Study, Heuristic Search (Link)
• Informed Search (Heuristic Search): Explanation and Examples (Link)
• Artificial Intelligence: Various Search Techniques for Informed Search (Heuristic Search) (Link)
Source : Informed (Heuristic) Search
Image source : http://slideplayer.com/slide/7981191/
Heuristic Search (2)
• Backing up in heuristic search
– In heuristic search, for each state encountered, a large tree of possible continuations is considered.
– The approximate value function is applied to the leaf nodes and then backed up ‘toward the current state’ at the root.
– The best of the backed-up values at the root is chosen as the current action, and then all backed-up values are ‘discarded’.
• In conventional heuristic search no effort is made to save the backed-up values by changing the
approximate value function.
• Our greedy, 𝜀-greedy, and UCB action-selection methods are like heuristic search.
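A hedged sketch of this kind of decision-time lookahead: values of an approximate (heuristic) value function at the leaves are backed up toward the root, the best root action is selected, and nothing is saved. The model and v_hat interfaces and the fixed depth are assumptions for illustration (terminal states are ignored for simplicity).

```python
def backed_up_value(model, v_hat, state, depth, actions, gamma=0.95):
    """Depth-limited lookahead (sketch of heuristic-search style backups).

    model(s, a) -> (r, s_next) is an assumed deterministic model and v_hat(s)
    an approximate (heuristic) state-value function applied at the leaves;
    values are backed up toward the root and then discarded.
    """
    if depth == 0:
        return v_hat(state)
    return max(r + gamma * backed_up_value(model, v_hat, s_next, depth - 1, actions, gamma)
               for r, s_next in (model(state, a) for a in actions))

def select_action(model, v_hat, state, depth, actions, gamma=0.95):
    """Choose the action whose backed-up value at the root is largest."""
    def action_value(a):
        r, s_next = model(state, a)
        return r + gamma * backed_up_value(model, v_hat, s_next, depth - 1, actions, gamma)
    return max(actions, key=action_value)
```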
Heuristic Search (3)
• The distribution of updates can be altered in similar ways to focus on the current state and its likely
successors.
• As a limiting case we might use exactly the methods of heuristic search to construct a search tree,
and then perform the individual, one-step updates from bottom up, as suggested by Figure.
– Heuristic search can be implemented as a sequence of one-step updates (shown here outlined in blue) backing up values
from the leaf nodes ‘toward the root’. The ordering shown here is for a selective depth-first search.
Heuristic Search (4)
• The performance improvement observed with deeper searches is not due to the use of multi-step
updates as such.
• Instead, it is due to the focus and concentration of updates on states and actions immediately
downstream from the current state.
Rollout Algorithms
Rollout Algorithms (1)
• Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to
simulated trajectories that all begin at the current environment state.
– It is like unrolling a red carpet all the way to the end.
– The fast rollout algorithm used in AlphaGo:
• From the selected node, the game of Go is played out in a random fashion until it ends.
• Because it is fast, it can be run many times, although the quality of its individual moves is lower.
• A fast simulation (fast rollout) is run from the last node to the end of the game.
• Rollout algorithms estimate action values for a given policy by averaging the returns of many
simulated trajectories.
Rollout Algorithms (2)
• Unlike the Monte Carlo control algorithms, the goal of a rollout algorithm is not to estimate a
complete optimal action-value function, 𝑞∗, or a complete action-value function, 𝑞 𝜋, for a given
policy.
• Instead, they produce Monte Carlo estimates of action values only for each current state and for a
given policy usually called the rollout policy.
• As decision-time planning algorithms, rollout algorithms make immediate use of these action-value
estimates, then discard them.
Rollout Algorithms (3)
• The aim of a rollout algorithm is to improve upon the rollout policy; not to find an optimal policy.
• The better the rollout policy and the more accurate the value estimates, the better the policy produced by a rollout algorithm is likely to be.
• This involves important tradeoffs because better rollout policies typically mean that more time is
needed to simulate enough trajectories to obtain good value estimates.
– As decision-time planning methods, rollout algorithms usually have to meet strict time constraints.
• The computation time needed by a rollout algorithm depends on
– the number of actions that have to be evaluated for each decision.
– the number of time steps in the simulated trajectories needed to obtain useful sample returns.
– the time it takes the rollout policy to make decisions.
– the number of simulated trajectories needed to obtain good Monte Carlo action-value estimates.
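A sketch of a rollout algorithm, under the assumptions that sample_model(s, a) returns a reward and next state and rollout_policy(s) returns an action; the action values estimated here are used once to pick the current action and then discarded.

```python
def rollout_action(state, actions, sample_model, rollout_policy, is_terminal,
                   n_sims=100, max_steps=200, gamma=1.0):
    """Rollout algorithm (sketch): Monte Carlo estimates for the current state only."""
    def simulate(s, a):
        # Simulate one trajectory: the given first action, then the rollout policy.
        total, discount = 0.0, 1.0
        for _ in range(max_steps):
            r, s = sample_model(s, a)
            total += discount * r
            discount *= gamma
            if is_terminal(s):
                break
            a = rollout_policy(s)
        return total

    estimates = {a: sum(simulate(state, a) for _ in range(n_sims)) / n_sims
                 for a in actions}
    return max(estimates, key=estimates.get)   # act greedily w.r.t. the rollout estimates
```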
Rollout Algorithms (4)
• We do not ordinarily think of rollout algorithms as learning algorithms because they do not maintain
long-term memories of values or policies.
• However, these algorithms take advantage of some of the features of reinforcement learning that
we have emphasized in this book.
• Finally, rollout algorithms take advantage of the policy improvement property by acting greedily with respect to the estimated action values.
Monte Carlo Tree Search
Monte Carlo Tree Search (1)
• Monte Carlo Tree Search (MCTS) is a recent and strikingly successful example of decision-time
planning.
• At its base, MCTS is a rollout algorithm as described above, but enhanced by the addition of a means
for accumulating value estimates obtained from the Monte Carlo simulations.
• MCTS is largely responsible for the improvement in computer Go from a weak amateur level in 2005
to a grandmaster level (6 dan or more) in 2015.
• MCTS has proved to be effective in a wide variety of competitive settings, including general game
playing.
Monte Carlo Tree Search (2)
• But MCTS is not limited to games; it can be effective for single-agent sequential decision problems
– if there is an environment model simple enough for fast multi-step simulation.
• MCTS is executed after encountering each new state to select the agent's action for that state.
• The core idea of MCTS is to successively focus multiple simulations starting at the current state
– by extending the initial portions of trajectories that have received high evaluations from earlier simulations.
• MCTS does not have to retain approximate value functions or policies from one action selection to
the next, though in many implementations it retains selected action values likely to be useful for its
next execution.
– In other words, the approximate value functions or policies (𝑣𝜋, 𝑞𝜋, or 𝜋) are not kept from one action selection to the next; only selected action values are retained, because they are useful for the next execution.
Monte Carlo Tree Search (3)
• Monte Carlo value estimates form a tree rooted at the current state, as illustrated in Figure 8.10.
• For the states in this tree we have value estimates for at least some of the actions, so we can pick among them using an informed policy, called the tree policy, that balances exploration and exploitation.
• For example, the tree policy could select actions using an 𝜀-greedy or UCB selection rule.
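For example, a UCB-style tree-policy rule over a node's child edges might look like the following sketch (the child bookkeeping and the exploration constant c are assumptions for illustration):

```python
import math

def ucb_select(children, c=1.4):
    """UCB-style tree-policy selection over a node's children (sketch).

    children: dict mapping action -> (total_return, visit_count).
    Untried actions (visit_count == 0) are given infinite priority.
    """
    total_visits = sum(n for _, n in children.values())

    def ucb(item):
        _, (w, n) = item
        if n == 0:
            return float("inf")
        return w / n + c * math.sqrt(math.log(total_visits) / n)

    return max(children.items(), key=ucb)[0]
```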
Monte Carlo Tree Search (4)
• In more detail, each iteration of a basic version of MCTS consists of the following four steps, as illustrated in Figure 8.10:
– 1. Selection: Starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node.
– 2. Expansion: On some iterations (depending on details of the application), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions.
– 3. Simulation: From the selected node, or from one of its newly-added child nodes (if any), simulation of a complete episode is run with actions selected by the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and beyond the tree by the rollout policy.
– 4. Backup: The return generated by the simulated episode is backed up to update, or to initialize, the action values attached to the edges of the tree traversed by the tree policy in this iteration of MCTS. No values are saved for the states and actions visited by the rollout policy beyond the tree.
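The following is a compact, hedged sketch of these four steps for a problem with a sample model; the Node class, the step/rollout_policy/is_terminal interfaces, and the UCB constant are assumptions, and several practical details (e.g., how returns are attributed to interior nodes) are simplified.

```python
import math
import random

class Node:
    """Search-tree node; each node stores the reward on the edge leading into it."""
    def __init__(self, state, parent=None, action=None, reward=0.0):
        self.state, self.parent, self.action, self.reward = state, parent, action, reward
        self.children = {}                 # action -> Node
        self.visits, self.total_return = 0, 0.0

def mcts(root_state, legal_actions, step, rollout_policy, is_terminal,
         n_iterations=1000, c=1.4, gamma=1.0, max_rollout=200):
    """Basic MCTS sketch: selection, expansion, simulation, backup.

    step(s, a) -> (r, s_next) is an assumed sample model; rollout_policy(s) -> a.
    """
    root = Node(root_state)
    for _ in range(n_iterations):
        node, g, discount = root, 0.0, 1.0
        # 1. Selection: follow a UCB tree policy through fully expanded nodes to a leaf.
        while node.children and len(node.children) == len(legal_actions(node.state)):
            node = max(node.children.values(),
                       key=lambda ch: ch.total_return / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
            g += discount * node.reward
            discount *= gamma
        # 2. Expansion: add one child via an unexplored action (unless terminal).
        if not is_terminal(node.state):
            a = random.choice([a for a in legal_actions(node.state) if a not in node.children])
            r, s_next = step(node.state, a)
            child = Node(s_next, parent=node, action=a, reward=r)
            node.children[a] = child
            node = child
            g += discount * r
            discount *= gamma
        # 3. Simulation: run the rollout policy from here to the end of the episode.
        s = node.state
        for _ in range(max_rollout):
            if is_terminal(s):
                break
            a = rollout_policy(s)
            r, s = step(s, a)
            g += discount * r
            discount *= gamma
        # 4. Backup: update the tree nodes traversed in this iteration; nothing is
        #    saved for the states visited by the rollout policy beyond the tree.
        while node is not None:
            node.visits += 1
            node.total_return += g
            node = node.parent
    # Act at the root, e.g., greedily by visit count.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```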
Monte Carlo Tree Search (5)
• At its base, MCTS is a decision-time planning algorithm based on Monte Carlo control applied to
simulations that start from the root state
– that is, it is a kind of rollout algorithm as described in the previous section.
• It therefore benefits from online, incremental, sample-based value estimation and policy
improvement.
Summary
Summary (1)
• Planning requires a model of the environment.
– distribution model
• consists of the probabilities of next states and rewards for possible actions.
• Dynamic programming requires a distribution model because it uses expected updates.
– sample model
• produces single transitions and rewards generated according to these probabilities.
• is what is needed to simulate interacting with the environment, during which sample updates can be performed.
– Sample models are generally much easier to obtain than distribution models.
• The chapter emphasized a perspective stressing the surprisingly close relationships between planning optimal behavior and learning optimal behavior.
– Both involve estimating the same value functions, and in both cases it is natural to update the estimates incrementally.
• Planning methods with acting and model-learning.
– Planning, acting, and model-learning interact in a circular fashion. (The diagram on the right)
Source : Planning and Learning with Tabular Methods – 1 / Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto
Image source : https://wonseokjung.github.io//reinforcementlearning/update/PaL/ / https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
Summary (2)
• A number of dimensions of variation among state-space planning methods.
– One dimension is the variation in the size of updates.
• The smaller the updates, the more incremental the planning methods can be.
– Another important dimension is the distribution of updates, that is, of the focus of search.
• Prioritized sweeping focuses backward on the predecessors of states whose values have recently changed.
• On-policy trajectory sampling focuses on states or state-action pairs that the agent is likely to encounter when
controlling its environment.
• Planning can also focus forward from pertinent states.
– Classical heuristic search as studied in artificial intelligence is an example of this.
– Other examples are rollout algorithms and Monte Carlo Tree Search that benefit from online, incremental, sample-based
value estimation and policy improvement.
Reference
• Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto (Link)
• Planning and Learning with Tabular Methods – 1, Wonseok Jung (Link)
• Planning and Learning with Tabular Methods – 2, Wonseok Jung (Link)
• Planning and Learning with Tabular Methods – 3, Wonseok Jung (Link)
• Prioritized Sweeping in a Stochastic Environment (Link)
• Branching factor (Link)
• Informed (Heuristic) Search (Link)
• AI Study, Heuristic Search (Link)
• Informed Search (Heuristic Search): Explanation and Examples (Link)
• Artificial Intelligence: Various Search Techniques for Informed Search (Heuristic Search) (Link)
“Reinforcement Learning is LOVE♥”
Thank you
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptxkhadijarafiq2012
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Patrick Diehl
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 

Recently uploaded (20)

G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptx
 
Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?Is RISC-V ready for HPC workload? Maybe?
Is RISC-V ready for HPC workload? Maybe?
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 

  • 9. Models and Planning (4) • state-space planning’s two basic ideas: – All state-space planning methods involve computing value functions toward improving the policy. – They compute value functions by updates or backup operations applied to simulated experience. • This structure can be diagrammed as follows: – Dynamic programming methods clearly fit this structure. 9Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 10. Models and Planning (5) • The heart of both learning and planning methods is the estimation of value functions by backing-up update operations. • The difference is that whereas planning uses simulated experience generated by a model, learning methods use real experience generated by the environment. • In many cases a learning algorithm can be substituted for the key update step of a planning method. – Random-sample one-step tabular Q-planning • A simple example of a planning method based on one-step tabular Q-learning and on random samples from a sample model. 10Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
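The following is a minimal Python sketch, not taken from the slides, of one iteration of random-sample one-step tabular Q-planning as described above. The names sample_model, observed_pairs, and actions are assumed interfaces: the sample model returns one sampled (reward, next state) pair, and Q is a defaultdict(float) keyed by (state, action).

```python
import random

def q_planning_step(Q, sample_model, observed_pairs, actions, alpha=0.1, gamma=0.95):
    """One iteration of random-sample one-step tabular Q-planning (sketch).
    Q: defaultdict(float) keyed by (state, action); sample_model(s, a) -> (r, s');
    observed_pairs: previously experienced (state, action) pairs."""
    s, a = random.choice(list(observed_pairs))        # 1. pick a previously seen pair at random
    r, s_next = sample_model(s, a)                    # 2. query the sample model
    best_next = max(Q[(s_next, x)] for x in actions)  # 3. one-step tabular Q-learning update
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```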
  • 11. Dyna: Integrating Planning, Acting, and Learning
  • 12. Dyna: Integrating Planning, Acting, and Learning (1) • When planning is done online, while interacting with the environment, a number of interesting issues arise. • New information gained from interacting with the environment may change the model. • If decision making and model learning are both computation-intensive processes, then the available computational resources may need to be divided between them. – Dyna-Q • A simple architecture integrating the major functions needed in an online planning agent. 12
  • 13. Dyna: Integrating Planning, Acting, and Learning (2) • Within a planning agent, there are at least two roles for real experience: – It can be used to improve the model. (to make it more accurately match the real environment) • model-learning – It can be used to directly improve the value function and policy. • direct reinforcement learning (direct RL) • Experience can improve value functions and policies either directly or indirectly via the model. – It is the latter, which is sometimes called indirect reinforcement learning, that is involved in planning. 13Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 14. Dyna: Integrating Planning, Acting, and Learning (3) • Dyna-Q includes all of the processes shown in the diagram above - planning, acting, model-learning, and direct RL - all occurring continually. – The planning method is the random-sample one-step tabular Q-planning method. – The direct RL method is one-step tabular Q-learning. – The model-learning method is also table-based and assumes the environment is deterministic. • After each transition 𝑆𝑡, 𝐴𝑡 → 𝑅𝑡+1, 𝑆𝑡+1, the model records in its table entry for 𝑆𝑡, 𝐴𝑡 the prediction that 𝑅𝑡+1, 𝑆𝑡+1 will deterministically follow. • Thus, if the model is queried with a state-action pair that has been experienced before, it simply returns the last-observed next state and next reward as its prediction. Source: Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 15. Dyna: Integrating Planning, Acting, and Learning (4) • The overall architecture of Dyna agents, of which the Dyna-Q algorithm is one example, is – The central column represents the basic interaction between agent and environment, giving rise to a trajectory of real experience. – The arrow on the left of the figure represents direct reinforcement learning operating on real experience to improve the value function and the policy. 15Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 16. Dyna: Integrating Planning, Acting, and Learning (5) • The overall architecture of Dyna agents, of which the Dyna-Q algorithm is one example, is – On the right are model-based processes. The model is learned from real experience and gives rise to simulated experience. • We use the term search control to refer to the process that selects the starting states and actions for the simulated experiences generated by the model. • planning update is achieved by applying reinforcement learning methods to the simulated experiences just as if they had really happened. 16Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 17. Dyna: Integrating Planning, Acting, and Learning (6) • Typically, as in Dyna-Q, the same reinforcement learning method is – used both for learning from real experience and for planning from simulated experience. – thus the “final common path” for both learning and planning. • Conceptually, planning, acting, model-learning, and direct RL occur simultaneously and in parallel in Dyna agents. However, for concreteness, the pseudocode below fully specifies the order in which they occur within a time step. • The pseudocode algorithm for Dyna-Q is Source: Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 18. Dyna: Integrating Planning, Acting, and Learning (7) • The pseudocode algorithm for Dyna-Q is – Let us assume that there is time in each step, after acting, model-learning, and direct RL, to complete n iterations (Steps 1-3) of the Q-planning algorithm. – Model(𝑠, 𝑎) denotes the contents of the predicted next state and reward for state-action pair (𝑠, 𝑎). – Direct reinforcement learning, model-learning, and planning are implemented by steps (d), (e), and (f), respectively. • If (e) and (f) were omitted, the remaining algorithm would be one-step tabular Q-learning. 18Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
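A minimal Python sketch of the Dyna-Q loop described above, mirroring steps (b)-(f); this is illustrative only, not the authors' code. The environment interface env.reset() / env.step(a) -> (next_state, reward, done) and the actions list are assumptions made for the example.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_planning=10, episodes=50, alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q sketch: direct RL (d), model-learning (e), planning (f).
    env.reset() -> s and env.step(a) -> (s', r, done) are assumed interfaces."""
    Q = defaultdict(float)
    model = {}                                        # (s, a) -> (r, s'), deterministic model
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # (b) epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            # (c) act in the real environment
            s_next, r, done = env.step(a)
            # (d) direct RL: one-step tabular Q-learning
            target = r + gamma * max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (e) model-learning (environment assumed deterministic)
            model[(s, a)] = (r, s_next)
            # (f) planning: n random-sample one-step tabular Q-planning updates
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps_next, x)] for x in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```

If (e) and (f) were removed from this loop, what remains is plain one-step tabular Q-learning, matching the remark on the slide.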
  • 19. Dyna: Integrating Planning, Acting, and Learning (8) • Example 8.1: Dyna Maze – Figure 8.2: • This is a discounted, episodic task with 𝛾 = 0.95. • The initial action values were zero, the step-size parameter was 𝛼 = 0.1, and the exploration parameter was 𝜀 = 0.1. • The agents varied in the number of planning steps, n, they performed per real step. 19Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 20. Dyna: Integrating Planning, Acting, and Learning (9) • Example 8.1: Dyna Maze – Figure 8.3: • Figure 8.3 shows why the planning agents found the solution so much faster than the non-planning agent. • Shown are the policies found by the n = 0 and n = 50 agents halfway through the second episode. – Without planning (n = 0), each episode adds only one additional step to the policy, and so only one step (the last) has been learned so far. – With planning, again only one step is learned during the first episode, but here during the second episode an extensive policy has been developed that by the end of the episode will reach almost back to the start state. 20Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 21. Dyna: Integrating Planning, Acting, and Learning (10) • In Dyna-Q, learning and planning are accomplished by exactly the same algorithm, operating on real experience for learning and on simulated experience for planning. • ongoing in the background is the model-learning process. – As new information is gained, the model is updated to better match reality. – As the model changes, the ongoing planning process will gradually compute a different way of behaving to match the new model. 21
  • 22. When the Model Is Wrong
  • 23. When the Model Is Wrong (1) • Models may be incorrect. – because the environment is stochastic and only a limited number of samples have been observed. – because the model was learned using function approximation that has generalized imperfectly. – simply because the environment has changed and its new behavior has not yet been observed. • When the model is incorrect, the planning process is likely to compute a suboptimal policy. • In some cases, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. 23
  • 24. When the Model Is Wrong (2) • Example 8.2: Blocking Maze (Positive example) – Figure 8.4: • Average performance of Dyna agents on a blocking task. The left environment was used for the first 1000 steps, the right environment for the rest. Dyna-Q+ is Dyna-Q with an exploration bonus that encourages exploration. • Initially, there is a short path from start to goal, to the right of the barrier, as shown in the upper left of the figure. • After 1000 time steps, the short path is “blocked,” and a longer path is opened up along the left-hand side of the barrier, as shown in upper right of the figure. 24Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 25. When the Model Is Wrong (3) • Example 8.2: Blocking Maze (Positive example) – Figure 8.4: • When the environment changed, the graphs become flat, indicating a period during which the agents obtained no reward because they were wandering around behind the barrier. • After a while, however, they were able to find the new opening and the new optimal behavior. 25Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 26. When the Model Is Wrong (4) • Example 8.3: Shortcut Maze (Negative example) – Figure 8.5: • Average performance of Dyna agents on a shortcut task. The left environment was used for the first 3000 steps, the right environment for the rest. • Initially, the optimal path is to go around the left side of the barrier (upper left). • After 3000 steps, however, a shorter path is opened up along the right side, without disturbing the longer path (upper right). 26Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 27. When the Model Is Wrong (5) • Example 8.3: Shortcut Maze (Negative example) – Figure 8.5: • Greater difficulties arise when the environment changes to become better than it was before, and yet the formerly correct policy does not reveal the improvement. • The graph shows that the regular Dyna-Q agent never switched to the shortcut. In fact, Dyna-Q agent never realized that it existed. • Dyna-Q model said that there was no shortcut, so the more it planned, the less likely it was to step to the right and discover it. 27Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 28. When the Model Is Wrong (6) • The general problem here is another version of the conflict between exploration and exploitation. – exploration means trying actions that improve the model. – whereas exploitation means behaving in the optimal way given the current model. • We want the agent to explore to find changes in the environment, but not so much that performance is greatly degraded. • The Dyna-Q+ agent that did solve the shortcut maze uses one such heuristic. – To encourage behavior that tests long-untried actions, a special “bonus reward” is given on simulated experiences involving these actions. – In particular, if the modeled reward for a transition is 𝑟, and the transition has not been tried in 𝜏 time steps, then planning updates are done as if that transition produced a reward of 𝑟 + 𝜅√𝜏, for some small 𝜅. – This encourages the agent to keep testing all accessible state transitions and even to find long sequences of actions in order to carry out such tests.
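A small sketch of how the Dyna-Q+ bonus changes the planning update, assuming a tau table that counts the time steps since each pair was last tried in the real environment; the bookkeeping names and kappa value are hypothetical, chosen only for illustration.

```python
import math

def dyna_q_plus_planning_update(Q, model, tau, actions, s, a,
                                alpha=0.1, gamma=0.95, kappa=1e-3):
    """Dyna-Q+ planning update (sketch): add the exploration bonus kappa * sqrt(tau)
    to the modeled reward before the usual one-step Q-learning update."""
    r, s_next = model[(s, a)]
    r_bonus = r + kappa * math.sqrt(tau[(s, a)])      # r + kappa * sqrt(tau)
    target = r_bonus + gamma * max(Q[(s_next, x)] for x in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```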
  • 30. Prioritized Sweeping (1) • In the Dyna agents presented in the preceding sections, simulated transitions are started in state-action pairs selected uniformly at random from all previously experienced pairs. • But a uniform selection is usually not the best. – planning can be much more efficient if simulated transitions and updates are focused on particular state-action pairs. • If simulated transitions are generated uniformly, then many wasteful updates will be made before stumbling onto one of these useful updates. – Is uniform selection really necessary when planning? • In the much larger problems that are our real objective, the number of states is so large that an unfocused search would be extremely inefficient. – Let us look at focused search instead.
  • 31. Prioritized Sweeping (2) • This example suggests that search might be usefully focused by working backward from goal states. • In general, we want to work back not just from goal states but from any state whose value has changed. • Suppose – that the values are initially correct given the model, as they were in the maze example prior to discovering the goal. – that the agent discovers a change in the environment and changes its estimated value of one state, either up or down. • Typically, this will imply that the values of many other states should also be changed, but the only useful one-step updates are those of actions that lead directly into the one state whose value has been changed. • This general idea might be termed backward focusing of planning computations. 31
  • 32. Prioritized Sweeping (3) • But not all of useful updates will be equally useful. – The values of some states may have changed a lot, whereas others may have changed little. • In a stochastic environment, variations in estimated transition probabilities also contribute to variations in the sizes of changes and in the urgency with which pairs need to be updated. • It is natural to prioritize the updates according to a measure of their urgency, and perform them in order of priority. • This is the idea behind prioritized sweeping. 32
  • 33. Prioritized Sweeping (4) • Prioritized Sweeping for a deterministic environment – A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change. – If the effect is greater than some small threshold 𝜃, then the pair is inserted in the queue with the new priority. – In this way the effects of changes are efficiently propagated backward until quiescence. – The full algorithm for the case of deterministic environments is given in the box. Source: Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
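A Python sketch of the planning loop of prioritized sweeping for a deterministic model, under assumptions made for the example: a predecessors mapping from each state to the (state, action) pairs the model predicts lead into it, and a queue seeded from all modeled pairs rather than from real transitions as in the book's box. Duplicate queue entries are tolerated for simplicity.

```python
import heapq
import itertools

def prioritized_sweeping_planning(Q, model, predecessors, actions, n_updates,
                                  theta=1e-4, alpha=0.1, gamma=0.95):
    """Prioritized sweeping planning loop, deterministic model (sketch).
    model[(s, a)] = (r, s'); predecessors[s] yields (s_bar, a_bar) pairs that
    the model predicts lead into s. The counter keeps heap entries comparable."""
    pqueue, tie = [], itertools.count()

    def push(s, a):
        r, s_next = model[(s, a)]
        p = abs(r + gamma * max(Q[(s_next, x)] for x in actions) - Q[(s, a)])
        if p > theta:                                   # only nontrivial changes enter the queue
            heapq.heappush(pqueue, (-p, next(tie), s, a))

    for (s, a) in model:                                # one simple way to seed the queue
        push(s, a)

    for _ in range(n_updates):
        if not pqueue:
            break
        _, _, s, a = heapq.heappop(pqueue)              # pair with the largest priority
        r, s_next = model[(s, a)]
        target = r + gamma * max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        for (s_bar, a_bar) in predecessors.get(s, []):  # propagate the change backward
            push(s_bar, a_bar)
```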
  • 34. Prioritized Sweeping (5) • Prioritized Sweeping for a stochastic environment 34Source : Prioritized Sweeping in a Stochastic Environment Image source : https://lenabayeva.files.wordpress.com/2015/03/aamasreport2011.pdf
  • 35. Prioritized Sweeping (6) • Prioritized sweeping is just one way of distributing computations to improve planning efficiency, and probably not the best way. • One of prioritized sweeping's limitations is that it uses expected updates, which in stochastic environments may waste lots of computation on low-probability transitions. • As we show in the following section, sample updates can in many cases get closer to the true value function with less computation despite the variance introduced by sampling. 35
  • 37. Expected vs. Sample Updates (1) • In the rest of this chapter, we analyze some of the component ideas involved, starting with the relative advantages of expected and sample updates. • Three binary dimensions – First : whether value-function updates update state values or action values. – Second : whether value-function updates estimate the value for the optimal policy or for an arbitrary given policy. – Third : whether the updates are expected updates, considering all possible events that might happen, or sample updates, considering a single sample of what might happen. • The first two dimensions give rise to four classes of updates for approximating the four value functions, 𝑞∗, 𝑣∗, 𝑞𝜋, and 𝑣𝜋. • These three binary dimensions give rise to eight cases, seven of which correspond to specific algorithms, as shown in the figure. Source: Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 38. Expected vs. Sample Updates (2) • sample updates can be done using sample transitions from the environment or a sample model. • Implicit in that point of view is that expected updates are preferable to sample updates. • Expected updates certainly yield a better estimate because they are uncorrupted by sampling error, but they also require more computation, and computation is often the limiting resource in planning. • To properly assess the relative merits of expected and sample updates for planning we must control for their different computational requirements. 38
  • 39. Expected vs. Sample Updates (3) • Consider – the expected and sample updates for approximating 𝑞∗ – the special case of discrete states and actions – a table-lookup representation of the approximate value function, Q – a model in the form of estimated dynamics, 𝑝̂(𝑠′, 𝑟|𝑠, 𝑎). • The expected update for a state-action pair, s, a, is shown below. • The corresponding sample update for s, a, given a sample next state and reward, S′ and R (from the model), is the Q-learning-like update also shown below. – 𝛼 is the usual positive step-size parameter.
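The two updates referenced on this slide appear only as images in the original deck; reconstructed here in the book's notation:

```latex
% Expected update for a state-action pair (s, a):
Q(s,a) \leftarrow \sum_{s', r} \hat{p}(s', r \mid s, a)\,
        \bigl[\, r + \gamma \max_{a'} Q(s', a') \,\bigr]

% Corresponding sample update, given S' and R from the sample model (Q-learning-like):
Q(s,a) \leftarrow Q(s,a) + \alpha
        \bigl[\, R + \gamma \max_{a'} Q(S', a') - Q(s,a) \,\bigr]
```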
  • 40. Expected vs. Sample Updates (4) • The difference between these expected and sample updates is significant to the extent that the environment is stochastic. – Specifically, to the extent that, given a state and action, many possible next states may occur with various probabilities. • For a particular starting pair, s, a, let b be the branching factor (i.e., the number of possible next states, s′, for which 𝑝̂(𝑠′, 𝑟|𝑠, 𝑎) > 0). – Example: a red-black tree and its branching factor. Source: Branching factor Image source : https://en.wikipedia.org/wiki/Branching_factor
  • 41. Expected vs. Sample Updates (5) • Expected update – The expected update is an exact computation, resulting in a new Q(s, a) whose correctness is limited only by the correctness of the Q(s’, a’) at successor states. – If there is enough time to complete an expected update, then the resulting estimate is generally better than that of b sample updates because of the absence of sampling error. – The expected update reduces the error to zero upon its completion. 41Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 42. Expected vs. Sample Updates (6) • Sample update – The sample update is in addition affected by sampling error. – The sample update is cheaper computationally because it considers only one next state, not all possible next states. – If there is insufficient time to complete an expected update, then sample updates are always preferable because they at least make some improvement in the value estimate with fewer than b updates. – Sample updates reduce the error in proportion to √((𝑏 − 1)/(𝑏𝑡)), where t is the number of sample updates that have been performed (assuming sample averages, i.e., 𝛼 = 1/𝑡). Source: Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
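A tiny numeric illustration, not in the slides, of the expression above, taking the initial error to be 1: with roughly the computational budget of a single expected update (about b sample updates, so t = b), the remaining error is already about 1/√b.

```python
import math

def sample_update_error(b, t):
    """Error remaining after t sample updates (sample-average step sizes),
    relative to an initial error of 1, per the expression above."""
    return math.sqrt((b - 1) / (b * t))

# Budget of one expected update ~= b sample updates -> error ~ 1/sqrt(b):
for b in (2, 10, 100, 1000):
    print(b, round(sample_update_error(b, t=b), 3))   # 0.5, 0.3, 0.099, 0.032
```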
  • 43. Expected vs. Sample Updates (7) • These results suggest that sample updates are likely to be superior to expected updates on problems with large stochastic branching factors and too many states to be solved exactly. 43Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 45. Trajectory Sampling (1) • In this section we compare two ways of distributing updates. – Exhaustive Sweep – Trajectory Sampling • Exhaustive Sweep – The classical approach, from dynamic programming, is to perform sweeps through the entire state (or state-action) space, updating each state (or state-action pair) once per sweep. – This is problematic on large tasks because there may not be time to complete even one sweep. – In principle, updates can be distributed any way one likes (to assure convergence, all states or state-action pairs must be visited in the limit an infinite number of times.) 45Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
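For contrast with trajectory sampling on the next slide, here is a sketch of a single exhaustive sweep using expected updates, under the assumption of a distribution model p_hat(s, a) that returns (probability, reward, next_state) triples; the function name and interface are illustrative, not from the slides.

```python
def exhaustive_sweep(Q, states, actions, p_hat, alpha=1.0, gamma=0.95):
    """One sweep over the entire state-action space with expected updates (sketch).
    p_hat(s, a) is an assumed distribution model returning (prob, r, s') triples;
    alpha=1.0 makes each update the full expected (value-iteration-style) update."""
    for s in states:
        for a in actions:
            target = sum(prob * (r + gamma * max(Q[(s2, x)] for x in actions))
                         for prob, r, s2 in p_hat(s, a))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
```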
  • 46. Trajectory Sampling (2) • Trajectory Sampling – The second approach is to sample from the state or state-action space according to some distribution. – More appealing is to distribute updates according to the on-policy distribution, that is, according to the distribution observed when following the current policy. • In an episodic task, one starts in a start state (or according to the starting-state distribution) and simulates until the terminal state. • In a continuing task, one starts anywhere and just keeps simulating. • In either case, sample state transitions and rewards are given by the model, and sample actions are given by the current policy. – In other words, the on-policy distribution simulates explicit individual trajectories and performs updates at the state or state-action pairs encountered along the way. – We call this way of generating experience and updates trajectory sampling. 46
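A minimal sketch of on-policy trajectory sampling, assuming sample_model, policy, and is_terminal interfaces: transitions and rewards come from the model, actions come from the current policy, and only the state-action pairs actually encountered along the simulated trajectory are updated.

```python
def trajectory_sampling_updates(Q, sample_model, policy, start_state, actions,
                                is_terminal, alpha=0.1, gamma=0.95, max_steps=1000):
    """Distribute updates by simulating one on-policy trajectory with the model.
    sample_model(s, a) -> (r, s'), policy(Q, s) -> a, is_terminal(s) -> bool
    are assumed interfaces."""
    s = start_state
    for _ in range(max_steps):
        if is_terminal(s):
            break
        a = policy(Q, s)                              # actions from the current policy
        r, s_next = sample_model(s, a)                # transitions from the model
        target = r + gamma * max(Q[(s_next, x)] for x in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])     # update only pairs encountered
        s = s_next
```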
  • 48. Planning at Decision Time (1) • Planning can be used in at least two ways. – background planning – decision-time planning • background planning – The one we have considered so far in this chapter, typified by dynamic programming and Dyna, is to use planning to gradually improve a policy or value function on the basis of simulated experience obtained from a model (either a sample or a distribution model). – Well ‘before’ an action is selected for any current state 𝑆𝑡, planning has played a part in improving the table entries, or the mathematical expression, needed to select the action for many states, including 𝑆𝑡. – Used this way, planning is not focused on the current state. • decision-time planning – The other way to use planning is to begin and complete it ‘after’ encountering each new state 𝑆𝑡, as a computation whose output is the selection of a single action 𝐴𝑡. – More generally, planning used in this way can look much deeper than one-step-ahead and evaluate action choices leading to many different predicted state and reward trajectories. – Unlike the first use of planning, here planning focuses on a particular state. – Heuristic Search, Rollout Algorithms, Monte Carlo Tree Search
  • 49. Planning at Decision Time (2) • Let us now take a closer look at decision-time planning. • The values and policy created by the planning process are typically discarded after being used to select the current action. • In general, one may want to do a mix of both: focus planning on the current state and also store the results of planning, so as to be that much farther along should the agent return to the same state later. • Decision-time planning is most useful in applications in which fast responses are not required. • In chess-playing programs, for example, decision-time planning may be permitted seconds or minutes of computation for each move, and strong programs may plan dozens of moves ahead within this time.
  • 51. Heuristic Search (1) • The classical state-space planning methods in artificial intelligence are decision-time planning methods collectively known as heuristic search. (or Informed Search) – Heuristic: serving to discover; enabling a person to find things out for themselves (Naver English dictionary). – It is similar to breadth-first search, except that the search does not expand uniformly outward from the start node. Instead, guided by a ‘heuristic function’ that encodes information about the problem, it visits first the nodes judged to lie on the best path to the goal. This kind of search is called best-first or heuristic search. – Heuristic Search Algorithms • Best-first Search • Greedy best-first search • Route-finding problem • A* • IDA* (Iterative Deepening A*) • Memory-Bounded Heuristic Search – Reference Link • AI Study, Heuristic Search (Link) • Informed Search (Heuristic Search): Explanation and Examples (Link) • Artificial Intelligence: Various Informed (Heuristic) Search Techniques (Link) Source: Informed (Heuristic) Search Image source : http://slideplayer.com/slide/7981191/
  • 52. Heuristic Search (2) • Backing up in heuristic search – In heuristic search, for each state encountered, a large tree of possible continuations is considered. – The approximate value function is applied to the leaf nodes and then backed up ‘toward the current state’ at the root. – The best of backed-up values of these nodes is chosen as the current action, and then all backed-up values are ‘discarded’. • In conventional heuristic search no effort is made to save the backed-up values by changing the approximate value function. • Our greedy, 𝜀-greedy, and UCB action-selection methods are like heuristic search. 52
  • 53. Heuristic Search (3) • The distribution of updates can be altered in similar ways to focus on the current state and its likely successors. • As a limiting case we might use exactly the methods of heuristic search to construct a search tree, and then perform the individual, one-step updates from bottom up, as suggested by Figure. – Heuristic search can be implemented as a sequence of one-step updates (shown here outlined in blue) backing up values from the leaf nodes ‘toward the root’. The ordering shown here is for a selective depth-first search. 53Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 54. Heuristic Search (4) • The performance improvement observed with deeper searches is not due to the use of multi-step updates as such. • Instead, it is due to the focus and concentration of updates on states and actions immediately downstream from the current state. 54
  • 56. Rollout Algorithms (1) • Rollout algorithms are decision-time planning algorithms based on Monte Carlo control applied to simulated trajectories that all begin at the current environment state. – Like unrolling a red carpet all the way to the end. – The fast rollout algorithm used in AlphaGo: • From the selected node, the game of Go is played out with random moves until it ends. • Because it is fast, it can be run many times, although each individual playout is of low quality. • A fast rollout is simulated from the last node to the end of the game. • Rollout algorithms estimate action values for a given policy by averaging the returns of many simulated trajectories.
  • 57. Rollout Algorithms (2) • Unlike the Monte Carlo control algorithms, the goal of a rollout algorithm is not to estimate a complete optimal action-value function, 𝑞∗, or a complete action-value function, 𝑞 𝜋, for a given policy. • Instead, they produce Monte Carlo estimates of action values only for each current state and for a given policy usually called the rollout policy. • As decision-time planning algorithms, rollout algorithms make immediate use of these action-value estimates, then discard them. 57
  • 58. Rollout Algorithms (3) • The aim of a rollout algorithm is to improve upon the rollout policy, not to find an optimal policy. • The better the rollout policy and the more accurate the value estimates, the better the policy produced by a rollout algorithm is likely to be. • This involves important tradeoffs because better rollout policies typically mean that more time is needed to simulate enough trajectories to obtain good value estimates. – As decision-time planning methods, rollout algorithms usually have to meet strict time constraints. • The computation time needed by a rollout algorithm depends on – the number of actions that have to be evaluated for each decision. – the number of time steps in the simulated trajectories needed to obtain useful sample returns. – the time it takes the rollout policy to make decisions. – the number of simulated trajectories needed to obtain good Monte Carlo action-value estimates.
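A sketch of a basic rollout algorithm under the assumptions of a sample model, a fixed rollout policy, and a terminal-state test (all hypothetical interfaces introduced for this example): each candidate action is evaluated by averaging the returns of simulated playouts, the greedy action is taken, and the estimates are then discarded, as described above.

```python
def rollout_action(state, actions, sample_model, rollout_policy, is_terminal,
                   n_rollouts=50, gamma=1.0, max_depth=200):
    """Decision-time rollout sketch. sample_model(s, a) -> (r, s'),
    rollout_policy(s) -> a, and is_terminal(s) -> bool are assumed interfaces."""
    def simulate(s, a):
        # one simulated trajectory: take a, then follow the rollout policy to the end
        ret, discount = 0.0, 1.0
        r, s = sample_model(s, a)
        ret += discount * r
        for _ in range(max_depth):
            if is_terminal(s):
                break
            discount *= gamma
            a = rollout_policy(s)
            r, s = sample_model(s, a)
            ret += discount * r
        return ret

    estimates = {a: sum(simulate(state, a) for _ in range(n_rollouts)) / n_rollouts
                 for a in actions}
    return max(estimates, key=estimates.get)          # act greedily; estimates are discarded
```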
  • 59. Rollout Algorithms (4) • We do not ordinarily think of rollout algorithms as learning algorithms because they do not maintain long-term memories of values or policies. • However, these algorithms take advantage of some of the features of reinforcement learning that we have emphasized in this book. • Finally, rollout algorithms take advantage of the policy improvement by acting greedily with respect to the estimated action values. 59
  • 61. Monte Carlo Tree Search (1) • Monte Carlo Tree Search (MCTS) is a recent and strikingly successful example of decision-time planning. • At its base, MCTS is a rollout algorithm as described above, but enhanced by the addition of a means for accumulating value estimates obtained from the Monte Carlo simulations. • MCTS is largely responsible for the improvement in computer Go from a weak amateur level in 2005 to a grandmaster level (6 dan or more) in 2015. • MCTS has proved to be effective in a wide variety of competitive settings, including general game playing. 61
  • 62. Monte Carlo Tree Search (2) • But MCTS is not limited to games; it can be effective for single-agent sequential decision problems – if there is an environment model simple enough for fast multi-step simulation. • MCTS is executed after encountering each new state to select the agent's action for that state. • The core idea of MCTS is to successively focus multiple simulations starting at the current state – by extending the initial portions of trajectories that have received high evaluations from earlier simulations. • MCTS does not have to retain approximate value functions or policies from one action selection to the next, though in many implementations it retains selected action values likely to be useful for its next execution. – In other words, the approximate value functions or policies (𝑣𝜋, 𝑞𝜋, or 𝜋) are not kept around; only the selected action values (𝑞1, 𝑞2, …) are retained, because they are useful the next time MCTS is executed.
  • 63. Monte Carlo Tree Search (3) • Monte Carlo value estimates form a tree rooted at the current state, as illustrated in Figure 8.10. • For these states we have value estimates for at least some of the actions, so we can pick among them using an informed policy, called the tree policy, that balances exploration and exploitation. • For example, the tree policy could select actions using an 𝜀-greedy or UCB selection rule. Source: Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 64. Monte Carlo Tree Search (4) • In more detail, each iteration of a basic version of MCTS consists of the following four steps as illustrated in Figure 8.10: – 1. Selection: Starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node. 64Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 65. Monte Carlo Tree Search (5) • In more detail, each iteration of a basic version of MCTS consists of the following four steps as illustrated in Figure 8.10: – 1. Selection: Starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node. – 2. Expansion: On some iterations (depending on details of the application), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions. 65Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 66. Monte Carlo Tree Search (6) • In more detail, each iteration of a basic version of MCTS consists of the following four steps as illustrated in Figure 8.10: – 1. Selection: Starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node. – 2. Expansion: On some iterations (depending on details of the application), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions. – 3. Simulation: From the selected node, or from one of its newly-added child nodes (if any), simulation of a complete episode is run with actions selected by the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and beyond the tree by the rollout policy. 66Source : Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 67. Monte Carlo Tree Search (7) • In more detail, each iteration of a basic version of MCTS consists of the following four steps as illustrated in Figure 8.10: – 1. Selection: Starting at the root node, a tree policy based on the action values attached to the edges of the tree traverses the tree to select a leaf node. – 2. Expansion: On some iterations (depending on details of the application), the tree is expanded from the selected leaf node by adding one or more child nodes reached from the selected node via unexplored actions. – 3. Simulation: From the selected node, or from one of its newly-added child nodes (if any), simulation of a complete episode is run with actions selected by the rollout policy. The result is a Monte Carlo trial with actions selected first by the tree policy and beyond the tree by the rollout policy. – 4. Backup: The return generated by the simulated episode is backed up to update, or to initialize, the action values attached to the edges of the tree traversed by the tree policy in this iteration of MCTS. No values are saved for the states and actions visited by the rollout policy beyond the tree. Source: Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
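A compact Python sketch of the four steps above, under simplifying assumptions made for this example: the tree stores a single sampled successor per action (so it is most natural for deterministic models), UCB is used as the tree policy, and sample_model, rollout_policy, and is_terminal are assumed interfaces. It is illustrative only, not the implementation used in computer Go programs.

```python
import math
import random

class Node:
    """One tree node: per-action visit counts and mean-return estimates."""
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> Node
        self.N = {}          # action -> visit count
        self.Q = {}          # action -> mean return

def mcts(root_state, actions, sample_model, rollout_policy, is_terminal,
         n_iterations=200, c=1.4, gamma=1.0, max_depth=200):
    root = Node(root_state)

    def ucb_action(node):
        total = sum(node.N.values()) + 1
        return max(node.children,
                   key=lambda a: node.Q[a] + c * math.sqrt(math.log(total) / node.N[a]))

    def rollout(state, depth):
        ret, discount = 0.0, 1.0
        while depth < max_depth and not is_terminal(state):
            r, state = sample_model(state, rollout_policy(state))
            ret += discount * r
            discount *= gamma
            depth += 1
        return ret

    for _ in range(n_iterations):
        node, path, depth = root, [], 0
        # 1. Selection: follow the tree policy (UCB) through fully expanded nodes
        while node.children and len(node.children) == len(actions):
            a = ucb_action(node)
            r, _ = sample_model(node.state, a)   # child stores its own successor state
            path.append((node, a, r))
            node = node.children[a]
            depth += 1
        # 2. Expansion: add one child reached via an unexplored action (unless terminal)
        if not is_terminal(node.state):
            a = random.choice([x for x in actions if x not in node.children])
            r, s_next = sample_model(node.state, a)
            child = Node(s_next)
            node.children[a], node.N[a], node.Q[a] = child, 0, 0.0
            path.append((node, a, r))
            node = child
            depth += 1
        # 3. Simulation: follow the rollout policy to the end of the episode
        ret = rollout(node.state, depth)
        # 4. Backup: update action values along the tree path only
        for parent, a, r in reversed(path):
            ret = r + gamma * ret
            parent.N[a] += 1
            parent.Q[a] += (ret - parent.Q[a]) / parent.N[a]

    return max(root.Q, key=root.Q.get)   # action at the root with the highest estimate
```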
  • 68. Monte Carlo Tree Search (7) • At its base, MCTS is a decision-time planning algorithm based on Monte Carlo control applied to simulations that start from the root state – that is, it is a kind of rollout algorithm as described in the previous section. • It therefore benefits from online, incremental, sample-based value estimation and policy improvement. 68
  • 70. Summary (1) • Planning requires a model of the environment. – distribution model • consists of the probabilities of next states and rewards for possible actions. • Dynamic programming requires a distribution model because it uses expected updates. – sample model • produces single transitions and rewards generated according to these probabilities. • is what is needed to simulate interacting with the environment, during which sample updates are done. – Sample models are generally much easier to obtain than distribution models. • A perspective emphasizing the surprisingly close relationships between planning optimal behavior and learning optimal behavior. – Both involve estimating the same value functions, and in both cases it is natural to update the estimates incrementally. • Planning methods interact with acting and model-learning. – Planning, acting, and model-learning interact in a circular fashion. (The diagram on the right) Source: Planning and Learning with Tabular Methods – 1 / Reinforcement Learning : An Introduction, Richard S. Sutton and Andrew G. Barto Image source : https://wonseokjung.github.io//reinforcementlearning/update/PaL/ / https://drive.google.com/file/d/1xeUDVGWGUUv1-ccUMAZHJLej2C7aAFWY/view
  • 71. Summary (2) • A number of dimensions of variation among state-space planning methods. – One dimension is the variation in the size of updates. • The smaller the updates, the more incremental the planning methods can be. – Another important dimension is the distribution of updates, that is, of the focus of search. • Prioritized sweeping focuses backward on the predecessors of states whose values have recently changed. • On-policy trajectory sampling focuses on states or state-action pairs that the agent is likely to encounter when controlling its environment. • Planning can also focus forward from pertinent states. – Classical heuristic search as studied in artificial intelligence is an example of this. – Other examples are rollout algorithms and Monte Carlo Tree Search that benefit from online, incremental, sample-based value estimation and policy improvement. 71
  • 72. Reference • Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto (Link) • Planning and Learning with Tabular Methods – 1, Wonseok Jung (Link) • Planning and Learning with Tabular Methods – 2, Wonseok Jung (Link) • Planning and Learning with Tabular Methods – 3, Wonseok Jung (Link) • Prioritized Sweeping in a Stochastic Environment (Link) • Branching factor (Link) • Informed (Heuristic) Search (Link) • AI Study, Heuristic Search (Link) • Informed Search (Heuristic Search): Explanation and Examples (Link) • Artificial Intelligence: Various Informed (Heuristic) Search Techniques (Link)