Introduction to Artificial Intelligence
Reinforcement Learning
Andres Mendez-Vazquez
April 26, 2019
Outline
1 Reinforcement Learning
Introduction
Example
A K-Armed Bandit Problem
Exploration vs Exploitation
The Elements of Reinforcement Learning
Limitations
2 Formalizing Reinforcement Learning
Introduction
Markov Process and Markov Decision Process
Return, Policy and Value function
3 Solution Methods
Multi-armed Bandits
Problems of Implementations
General Formula in Reinforcement Learning
A Simple Bandit Algorithm
Dynamic Programming
Policy Evaluation
Policy Iteration
Policy and Value Improvement
Example
Temporal-Difference Learning
A Small Review of Monte Carlo Methods
The Main Idea
The Monte Carlo Principle
Going Back to Temporal Difference
Q-learning: Off-Policy TD Control
4 The Neural Network Approach
Introduction, TD-Gammon
TD(λ) by Back-Propagation
Conclusions
Reinforcement Learning is learning what to do
What does Reinforcement Learning want?
To maximize a numerical reward signal.
The learner is not told which actions to take.
Instead, it must discover which actions yield the most reward by trying them.
These ideas come from Dynamical Systems Theory
Specifically
From the optimal control of incompletely-known Markov Decision
Processes (MDPs).
Thus, the device must do the following
Interact with the environment and learn from the punishments and
rewards it receives.
A mobile robot
It decides whether it should enter a new room
in search of more trash to collect,
or go back to its base to recharge its battery.
It needs to take this decision based on
the current charge level of its battery,
and how quickly it can get back to its base.
A K-Armed Bandit Problem
Definition
You are faced repeatedly with a choice among k different options, or
actions.
After each choice you receive a numerical reward
drawn from a stationary probability distribution that depends on the action you selected.
Definition (Stationary Probability)
A stationary distribution of a Markov chain is a probability distribution
that remains unchanged as the chain progresses in time.
Example
A Two State System
Therefore
Each of the k actions has an expected value
q∗ (a) = E [Rt|At = a]
Something Notable
If you knew the value of each action,
It would be trivial to solve the k-armed bandit problem.
What do we want?
To find the estimated value of action a at time t, Qt (a), such that
|Qt (a) − q∗ (a)| < ε, with ε as small as possible
Greedy Actions
Intuitions
To choose the action with the greatest E [Rt|At = a]
This is known as Exploitation
You are exploiting your current knowledge of the values of the actions.
Contrasting with Exploration
When you select the non-greedy actions in order to improve your
estimates of their values.
An Interesting Conundrum
It has been observed that
Exploitation maximizes the expected reward on the single next step,
while Exploration may produce a greater total reward in the long run.
Thus, the idea
The value of a state s is the total amount of reward an agent can
expect to accumulate over the future, starting from s.
Thus, Values
They indicate the long-term profit of states after taking into account
the states that follow and their rewards.
Therefore, the parts of a Reinforcement Learning system
A Policy
It defines the learning agent’s way of behaving at a given time.
The policy is the core of a reinforcement learning system.
In general, policies may be stochastic.
A Reward Signal
It defines the goal of a reinforcement learning problem.
The system’s sole objective is to maximize the total reward over the
long run.
This encourages Exploitation
It is immediate... you might want something with a longer-term view.
A Value Function
It specifies what is good in the long run.
This encourages Exploration
Finally
A Model of the Environment
This simulates the behavior of the environment,
allowing us to make inferences about how the environment will behave.
Something Notable
Models are used for planning,
meaning we make decisions before they are executed.
Thus, Reinforcement Learning methods that use models are called
model-based.
Limitations
Given that
Reinforcement learning relies heavily on the concept of state,
it has the same problem as Markov Decision Processes:
defining the state so as to obtain a good representation of the search
space.
This is where the creative part of the problem comes in
As we have seen, solutions in AI rely a lot on the creativity of the
solution...
The System–Environment Interaction
In a Markov Decision Process
Markov Process and Markov Decision Process
Definition of Markov Process
A sequence of states is Markov if and only if the probability of moving
to the next state st+1 depends only on the present state st:
P [st+1 | st] = P [st+1 | s1, ..., st]
Something Notable
We always talk about time-homogeneous Markov chains in
Reinforcement Learning:
the probability of the transition is independent of t,
P [st+1 = s′ | st = s] = P [st = s′ | st−1 = s]
Formally
Definition (Markov Process)
A Markov Process (or Markov Chain) is a tuple (S, P), where
1 S is a finite set of states
2 P is a state transition probability matrix, Pss′ = P [st+1 = s′ | st = s]
Thus, the dynamic of the Markov process is as follows:
s0 −(Ps0s1)→ s1 −(Ps1s2)→ s2 −(Ps2s3)→ s3 −→ · · ·
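As a small illustration (not from the slides), here is a minimal Python sketch of sampling a trajectory from a Markov chain defined by a transition matrix P; the two-state matrix below is an assumed example.

```python
import numpy as np

# Assumed example: two-state transition matrix, P[s, s'] = P[s_{t+1} = s' | s_t = s]
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def sample_chain(P, s0, steps, rng=np.random.default_rng(0)):
    """Sample s0 -> s1 -> s2 -> ... by drawing each next state from row s of P."""
    states = [s0]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_chain(P, s0=0, steps=10))
```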
Then, we get the following familiar definition
Definition
An MDP is a tuple (S, A, P, R, α), where
1 S is a finite set of states
2 A is a finite set of actions
3 P is a state transition probability matrix,
Pass′ = P [st+1 = s′ | st = s, At = a]
4 α ∈ [0, 1] is called the discount factor
5 R : S × A → R is a reward function
Thus, the dynamic of the Markov Decision Process is as follows, using
a probability to pick each action:
s0 −(a0)→ s1 −(a1)→ s2 −(a2)→ s3 −→ · · ·
The Goal of Reinforcement Learning
We want to maximize the expected value of the return
i.e. To obtain the optimal policy!!!
Thus, as in our previous study of MDP’s:
Gt = Rt+1 + α Rt+2 + · · · = Σ_{k=0}^{∞} α^k R_{t+k+1}
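To make the return concrete, here is a small sketch (the reward sequence is an assumed example) computing Gt by the recursion Gt = Rt+1 + α Gt+1:

```python
def discounted_return(rewards, alpha):
    """G_t = sum_k alpha^k R_{t+k+1}, accumulated backwards as G_t = R_{t+1} + alpha * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + alpha * g
    return g

# Assumed example: rewards R_{t+1}, R_{t+2}, ...
print(discounted_return([1.0, 0.0, 2.0, 1.0], alpha=0.9))
```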
Remarks
Interpretation
The discounted future rewards can be interpreted as the current
value of future rewards.
The discount factor α
α close to 0 leads to “short-sighted” evaluation... you are only
looking for immediate rewards!!!
α close to 1 leads to “long-sighted” evaluation... you are willing to
wait for a better reward!!!
Why use this discount factor?
Basically
1 The discount factor avoids infinite returns in cyclic Markov processes,
given the properties of the geometric series.
2 There is some uncertainty about future rewards.
3 If the reward is financial, immediate rewards may earn more interest
than delayed rewards.
4 Animal/human behavior shows a preference for immediate reward.
A Policy
Using our MDP ideas...
A policy π is a distribution over actions given states:
π (a|s) = P [At = a|St = s]
A policy guides the choice of action at a given state,
and it is time independent.
Then, we introduce two value functions
State-value function.
Action-value function.
The state-value function vπ of an MDP
It is defined as
vπ (s) = Eπ [Gt | St = s]
This gives the long-term value of the state s:
vπ (s) = Eπ [Gt | St = s]
       = Eπ [ Σ_{k=0}^{∞} α^k R_{t+k+1} | St = s ]
       = Eπ [Rt+1 + α Gt+1 | St = s]
       = Eπ [Rt+1 | St = s] (Immediate Reward) + Eπ [α vπ (St+1) | St = s] (Discounted value of the successor state)
The Action-Value Function qπ (s, a)
It is the expected return starting from state s,
taking action a, and then following policy π:
qπ (s, a) = Eπ [Gt | St = s, At = a]
We can also obtain the following decomposition:
qπ (s, a) = Eπ [Rt+1 + α qπ (St+1, At+1) | St = s, At = a]
Now...
To simplify notation, we define
R^a_s = E [Rt+1 | St = s, At = a]
Then, we have
vπ (s) = Σ_{a∈A} π (a|s) qπ (s, a)
qπ (s, a) = R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′)
Then
Expressing qπ (s, a) in terms of vπ (s) in the expression of vπ (s) gives
The Bellman Equation
vπ (s) = Σ_{a∈A} π (a|s) [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′) ]
The Bellman equation relates the state-value function
of one state with that of other states.
Also
Bellman equation for qπ (s, a):
qπ (s, a) = R^a_s + α Σ_{s′∈S} P^a_{ss′} Σ_{a′∈A} π (a′|s′) qπ (s′, a′)
How do we solve this problem?
We have the following solutions
Action-Value Methods
One natural way to estimate this is by averaging the rewards actually
received:
Qt (a) = Σ_{i=1}^{t−1} Ri · 1{Ai = a} / Σ_{i=1}^{t−1} 1{Ai = a}
Nevertheless
If the denominator is zero, we define Qt (a) as some default value.
Now, as the denominator goes to infinity, by the law of large numbers,
Qt (a) −→ q∗ (a)
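A direct, if inefficient, translation of the sample-average formula into Python; the lists of past actions and rewards and the default value are assumptions made for illustration:

```python
def sample_average(actions, rewards, a, default=0.0):
    """Q_t(a): average of the rewards received on the steps where action a was taken."""
    picked = [r for act, r in zip(actions, rewards) if act == a]
    return sum(picked) / len(picked) if picked else default  # default when a was never taken

print(sample_average(actions=[0, 1, 0, 2], rewards=[1.0, 0.5, 3.0, 2.0], a=0))  # 2.0
```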
Furthermore
We call this the sample-average method
Not the best, but it allows us to introduce a way to select actions...
A classic one is the greedy action selection:
At = arg max_a Qt (a)
Therefore
Greedy action selection always exploits current knowledge to
maximize immediate reward.
It spends no time on inferior actions to see if they might really be
better.
And Here a Heuristic
A simple alternative is to act greedily most of the time, but
every once in a while, say with small probability ε, select randomly
from among all the actions with equal probability.
This is called the ε-greedy method
In the limit of infinitely many steps, this method ensures:
Qt (a) −→ q∗ (a)
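A minimal sketch of ε-greedy selection over the current estimates Q (an array indexed by action); breaking ties via argmax is an assumption:

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore
    return int(np.argmax(Q))               # exploit
```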
It is like the Mean Filter
We have a nice result
Example
In the following example
The q∗ (a) were selected according to a Gaussian Distribution with
mean 0 and variance 1.
Then, when an action is selected, the reward Rt was selected from a
normal distribution with mean q∗ (At) and variance 1.
We have then
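A sketch of this testbed, assuming k = 10 arms and the distributions stated above; the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
k = 10
q_star = rng.normal(0.0, 1.0, size=k)      # true action values q*(a) ~ N(0, 1), drawn once

def bandit(action):
    """Reward for the chosen arm: R_t ~ N(q*(A_t), 1)."""
    return rng.normal(q_star[action], 1.0)
```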
Now, as the sampling goes up
We get
Imagine the following
Let Ri denote the reward received after the i-th selection of this action. Then
Qn = (R1 + R2 + · · · + Rn−1) / (n − 1)
The problem with this technique
The amount of memory and computation needed to keep updating the
estimate grows as more rewards arrive.
Therefore
We need something more efficient
We can do something like that by realizing the following:
Qn+1 = (1/n) Σ_{i=1}^{n} Ri
     = (1/n) [ Rn + Σ_{i=1}^{n−1} Ri ]
     = (1/n) [ Rn + (n − 1) Qn ]
     = Qn + (1/n) [ Rn − Qn ]
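The incremental form needs only the current estimate and the count, as in this small sketch:

```python
def update_estimate(q_n, n, reward):
    """Q_{n+1} = Q_n + (1/n) * (R_n - Q_n): same result as averaging all n rewards."""
    return q_n + (reward - q_n) / n
```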
General Formula in Reinforcement Learning
This pattern appears almost all the time in RL:
New Estimate = Old Estimate + Step Size × [Target − Old Estimate]
It looks a lot like gradient descent if we assume a first-order Taylor expansion
f (x) ≈ f (a) + f′(a) (x − a) =⇒ f′(a) ≈ (f (x) − f (a)) / (x − a)
Then
xn+1 = xn + γ f′(a) = xn + γ (f (x) − f (a)) / (x − a) = xn + γ̃ [f (x) − f (a)],
absorbing the factor (x − a) into the step size γ̃.
A Simple Bandit Algorithm
Initialize, for a = 1 to k:
1 Q (a) ← 0
2 N (a) ← 0
Loop until a certain threshold γ is reached:
1 A ← arg max_a Q (a) with probability 1 − ε, or a random action with probability ε
2 R ← bandit (A)
3 N (A) ← N (A) + 1
4 Q (A) ← Q (A) + (1 / N (A)) [R − Q (A)]
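A hedged Python sketch of the loop above; the number of steps, ε, and the bandit(A) reward function (for example the testbed sketched earlier) are assumptions:

```python
import numpy as np

def run_bandit(bandit, k, epsilon=0.1, steps=1000, rng=np.random.default_rng(0)):
    """epsilon-greedy action selection with incremental sample-average updates."""
    Q = np.zeros(k)                # Q(a) <- 0
    N = np.zeros(k, dtype=int)     # N(a) <- 0
    for _ in range(steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))       # explore with probability epsilon
        else:
            A = int(np.argmax(Q))          # exploit with probability 1 - epsilon
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]          # Q(A) <- Q(A) + (1/N(A)) [R - Q(A)]
    return Q, N
```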
Remarks
Remember
This works well for a stationary problem.
For more on non-stationary problems, see
“Reinforcement Learning: An Introduction” by Sutton and Barto,
chapter 2.
Dynamic Programming
As with many things in Computer Science
We love Divide et Impera...
Then
In Reinforcement Learning, the Bellman equation gives a recursive
decomposition, and the value function stores and reuses solutions.
Policy Evaluation
For this, we assume as always
The problem can be solved with the Bellman Equation.
We start from an initial guess v1,
then we apply the Bellman equation iteratively:
v1 → v2 → · · · → vπ
For this, we use synchronous updates
We update the value function of the present iteration for all states at the
same time, based on the values of the previous iteration.
Then
At each iteration k + 1, for all states s ∈ S
We update vk+1 (s) from vk (s′) according to the Bellman equation, where
s′ is a successor state of s:
vk+1 (s) = Σ_{a∈A} π (a|s) [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vk (s′) ]
Something Notable
The iterative process stops when the maximum difference between the
value function at the current step and that at the previous step is
smaller than some small positive constant.
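A sketch of synchronous iterative policy evaluation for a small finite MDP; the array layout (P[a, s, s'], R[s, a], pi[s, a]) is an assumption made for illustration:

```python
import numpy as np

def policy_evaluation(P, R, pi, alpha=0.9, theta=1e-6):
    """Iterate v_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + alpha * sum_s' P(s'|s,a) v_k(s') ].

    P[a, s, t]: probability of moving from s to t under action a.
    R[s, a]:    expected immediate reward for taking a in s.
    pi[s, a]:   probability of taking a in s.
    """
    v = np.zeros(P.shape[1])
    while True:
        q = R + alpha * np.einsum('ast,t->sa', P, v)   # q(s, a) under the current v_k
        v_new = np.sum(pi * q, axis=1)                 # synchronous update for every state
        if np.max(np.abs(v_new - v)) < theta:          # stop when the maximum change is small
            return v_new
        v = v_new
```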
However
Our ultimate goal is to find the optimal policy
For this, we can use the policy iteration process
Algorithm
1 Initialize π randomly
2 Repeat until the previous policy and the current policy are the same:
1 Evaluate vπ by policy evaluation.
2 Using synchronous updates, for each state s, let
π′ (s) = arg max_{a∈A} qπ (s, a) = arg max_{a∈A} [ R^a_s + α Σ_{s′∈S} P^a_{ss′} vπ (s′) ]
Now
Something Notable
This policy iteration algorithm will improve the policy at each step.
Assume that
We have a deterministic policy π, and after one improvement step we get π′.
We know that the value improves in one step because
qπ (s, π′ (s)) = max_{a∈A} qπ (s, a) ≥ qπ (s, π (s)) = vπ (s)
Then
We can see that
vπ (s) ≤ qπ (s, π′ (s)) = Eπ′ [Rt+1 + α vπ (St+1) | St = s]
      ≤ Eπ′ [Rt+1 + α qπ (St+1, π′ (St+1)) | St = s]
      ≤ Eπ′ [Rt+1 + α Rt+2 + α² qπ (St+2, π′ (St+2)) | St = s]
      ≤ · · · ≤ Eπ′ [Rt+1 + α Rt+2 + · · · | St = s] = vπ′ (s)
Then
If improvements stop, then
qπ (s, π′ (s)) = max_{a∈A} qπ (s, a) = qπ (s, π (s)) = vπ (s)
This means that the Bellman optimality equation has been satisfied:
vπ (s) = max_{a∈A} qπ (s, a)
Therefore, vπ (s) = v∗ (s) for all s ∈ S and π is an optimal policy.
Once the policy has been improved
We can iteratively build a sequence of monotonically improving policies and
values:
π0 −E→ vπ0 −I→ π1 −E→ vπ1 −I→ π2 −E→ · · · −I→ π∗ −E→ v∗
with E = evaluation and I = improvement.
Policy Iteration (using iterative policy evaluation)
Step 1. Initialization
V (s) ∈ R and π (s) ∈ A (s) arbitrarily for all s ∈ S
Step 2. Policy Evaluation
1 Loop:
2 ∆ ← 0
3 Loop for each s ∈ S:
4 v ← V (s)
5 V (s) ← Σ_{s′,r} p (s′, r | s, π (s)) [r + αV (s′)]
6 ∆ ← max (∆, |v − V (s)|)
7 until ∆ < θ (a small threshold for accuracy)
Further
Step 3. Policy Improvement
1 policy-stable ← true
2 For each s ∈ S:
3 old_action ← π (s)
4 π (s) ← arg max_a Σ_{s′,r} p (s′, r | s, a) [r + αV (s′)]
5 if old_action ≠ π (s), then policy-stable ← false
6 If policy-stable, then stop and return V ≈ v∗ and π ≈ π∗; else go to Step 2.
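A sketch of the complete loop, reusing the policy_evaluation sketch above and the same assumed array layout; ties in the arg max are broken arbitrarily:

```python
import numpy as np

def policy_iteration(P, R, alpha=0.9, theta=1e-6):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)        # deterministic policy: state -> action
    while True:
        pi = np.eye(n_actions)[policy]            # one-hot pi[s, a] for evaluation
        v = policy_evaluation(P, R, pi, alpha, theta)
        q = R + alpha * np.einsum('ast,t->sa', P, v)
        new_policy = np.argmax(q, axis=1)         # pi'(s) = argmax_a [ R(s,a) + alpha sum_t P v ]
        if np.array_equal(new_policy, policy):    # policy stable: v ~ v*, policy ~ pi*
            return policy, v
        policy = new_policy
```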
Jack’s Car Rental
Setup
We have a car rental business with two locations (s1, s2).
Each time Jack has a car available and rents it out, he is credited $10 by
the company.
If Jack is out of cars at a location, that business is lost.
Cars become available for renting the day after they are returned.
Jack has options
He can move cars between the two locations overnight, at a cost of
$2 per car, and at most 5 cars can be moved.
We choose a probability for modeling requests and returns
Additionally, renting at each location follows a Poisson distribution:
P (n | λ) = (λ^n / n!) e^{−λ}
We simplify the problem by having an upper bound n = 20
Basically, any additional car generated by the returns simply disappears!!!
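A small sketch of the truncated Poisson probabilities used in this model; the cap n = 20 comes from the slide, while lumping the leftover tail mass at the cap is an assumption about how to handle it:

```python
import math

def truncated_poisson(lam, n_max=20):
    """P(n | lambda) = lambda^n e^{-lambda} / n! for n < n_max, with the tail mass placed at n_max."""
    probs = [lam ** n * math.exp(-lam) / math.factorial(n) for n in range(n_max)]
    probs.append(1.0 - sum(probs))     # any count beyond the cap is treated as n_max
    return probs

requests_at_s1 = truncated_poisson(3)  # lambda = 3 for requests at the first location
```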
Finally
We have the following λ’s for requests and returns
1 Requests at s1: λ = 3
2 Requests at s2: λ = 4
3 Returns at s1: λ = 3
4 Returns at s2: λ = 2
Each λ defines how many cars are requested or returned in a time
interval (one day).
69 / 111
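A minimal sketch of how these rates translate into expected daily rental revenue at one location, assuming the $10 credit per rental from the setup; the helper `expected_rentals` and the cap at 20 requests are illustrative choices, not part of the slides:

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    return lam ** n * exp(-lam) / factorial(n)

def expected_rentals(cars_available, lam_request, cap=20):
    """Expected number of cars actually rented at one location in a day:
    requests beyond the available cars are lost business.
    (The Poisson tail beyond the cap is ignored here.)"""
    expected = 0.0
    for n in range(cap + 1):
        expected += poisson_pmf(n, lam_request) * min(n, cars_available)
    return expected

if __name__ == "__main__":
    # location s1 with request rate 3, location s2 with request rate 4
    for cars in (0, 5, 10, 20):
        r1 = 10.0 * expected_rentals(cars, 3)
        r2 = 10.0 * expected_rentals(cars, 4)
        print(f"{cars:2d} cars: expected revenue s1 = ${r1:5.2f}, s2 = ${r2:5.2f}")
```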
Thus, we have the following run of the algorithm (policy iteration on this problem)
With a discount factor α = 0.9 (the resulting sequence of policies was shown as a figure)
70 / 111
History
There is original earlier work
By Arthur Samuel, which Sutton later built on to develop TD(λ)
(Temporal-Difference learning with eligibility traces)
It was famously applied by Gerald Tesauro
To build TD-Gammon
A program that learned to play backgammon at the level of expert
human players
72 / 111
Temporal-Difference (TD) Learning
Something Notable
It is possibly the central and novel idea of reinforcement learning
It has similarities with Monte Carlo Methods
Both use experience to solve the prediction problem (for more, look
at Sutton's book)
Monte Carlo methods wait until the return Gt is known

V(St) ← V(St) + γ [Gt − V(St)]  with  Gt = Σ_{k=0}^{∞} α^k R_{t+k+1}

Let us call this method constant-α Monte Carlo (a small sketch is given below).
73 / 111
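A minimal sketch of the constant step-size Monte Carlo update above, keeping this deck's notation (γ as the step size, α as the discount factor); the toy episode data is made up for illustration:

```python
def constant_step_monte_carlo(episodes, gamma_step=0.1, alpha_discount=0.9, V=None):
    """Constant step-size Monte Carlo: V(S_t) <- V(S_t) + gamma * (G_t - V(S_t)),
    where G_t is the full discounted return observed after visiting S_t."""
    V = {} if V is None else V
    for episode in episodes:            # episode: list of (state, reward) pairs
        G = 0.0
        # walk backwards so G_t = R_{t+1} + alpha * G_{t+1} is easy to accumulate
        for state, reward in reversed(episode):
            G = reward + alpha_discount * G
            V.setdefault(state, 0.0)
            V[state] += gamma_step * (G - V[state])
    return V

if __name__ == "__main__":
    # a toy 3-state episode: the reward is received on leaving each state
    episodes = [[("A", 0.0), ("B", 0.0), ("C", 1.0)]] * 200
    print(constant_step_monte_carlo(episodes))
```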
Fundamental Theorem of Simulation
Cumulative Distribution Function
With every random variable X, we associate a function called the
Cumulative Distribution Function (CDF), defined as follows:

F_X(x) = P(X ≤ x)

With properties:
0 ≤ F_X(x) ≤ 1
F_X(x) is a non-decreasing function of x
(an inverse-transform sampling sketch based on the CDF is given below)
75 / 111
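One standard consequence of the CDF definition is inverse-transform sampling: if U ∼ Uniform(0, 1), then F⁻¹(U) has CDF F. A minimal sketch for the exponential distribution, where F⁻¹ has a closed form (the exponential target is an illustrative choice, not from the slides):

```python
import math
import random

def sample_exponential(lam, n, seed=0):
    """Inverse-transform sampling: F(x) = 1 - exp(-lam * x), so
    F^{-1}(u) = -ln(1 - u) / lam.  If U ~ Uniform(0,1), then F^{-1}(U) ~ Exp(lam)."""
    rng = random.Random(seed)
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

if __name__ == "__main__":
    lam = 2.0
    xs = sample_exponential(lam, 100_000)
    print("empirical mean:", sum(xs) / len(xs), " expected mean:", 1.0 / lam)
```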
Graphically
Example: Uniform distribution (figure)
76 / 111
Graphically
Example: Gaussian distribution (figure)
77 / 111
Fundamental Theorem of Simulation
Theorem
Simulating

X ∼ f(x)

is equivalent to simulating

(X, U) ∼ U({(x, u) : 0 < u < f(x)})
79 / 111
Geometrically
We have the region {(x, u) : 0 < u < f(x)} under the graph of f (figure)
Proof
We have that f(x) = ∫_0^{f(x)} du, so f is the marginal density of X under the
uniform joint distribution of (X, U).
80 / 111
Therefore
One thing made clear by this theorem is that we can generate
X in three ways
1 First generate X ∼ f, and then U | X = x ∼ U{u | u ≤ f(x)}; but
this is useless, because we already have X and the extra U adds nothing.
2 First generate U, and then X | U = u ∼ U{x | u ≤ f(x)}.
3 Generate (X, U) jointly, a smart idea:
it allows us to generate the pair on a larger set where simulation is easier
(a rejection-sampling sketch of this route is given below).
It is not as simple as it looks.
81 / 111
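A minimal sketch of route 3: sample (X, U) uniformly on an easy enclosing box and keep only the pairs with u < f(x); the accepted x's are then draws from f. The Beta(2, 2) target and the bound f_max = 1.5 are illustrative assumptions:

```python
import random

def f(x):
    """Target density on [0, 1]: Beta(2, 2), f(x) = 6 x (1 - x), bounded by 1.5."""
    return 6.0 * x * (1.0 - x)

def sample_under_f(n_accept, f_max=1.5, seed=0):
    """Generate (X, U) uniformly on the box [0,1] x [0, f_max] (a larger, easy set)
    and keep the pairs with U < f(X); the accepted X's are draws from f."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_accept:
        x = rng.random()
        u = rng.uniform(0.0, f_max)
        if u < f(x):
            accepted.append(x)
    return accepted

if __name__ == "__main__":
    xs = sample_under_f(50_000)
    print("empirical mean:", sum(xs) / len(xs), " (Beta(2,2) mean is 0.5)")
```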
Markov Chains
The random process Xt ∈ S for t = 1, 2, ..., T has the Markov property
iff
p(X_T | X_{T−1}, X_{T−2}, ..., X_1) = p(X_T | X_{T−1})
Finite-state Discrete-Time Markov Chains, |S| < ∞
The chain can be completely specified by the transition matrix

P = [p_ij]  with  p_ij = P[Xt = j | Xt−1 = i]

For irreducible chains, the stationary distribution π is the long-term
proportion of time that the chain spends in each state.
It satisfies π = πP (a sketch of computing π is given below).
82 / 111
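A minimal sketch of computing the stationary distribution π = πP by repeatedly applying the transition matrix (power iteration); the 3-state transition matrix is made up for illustration:

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Find pi with pi = pi P for a row-stochastic matrix P (list of rows),
    by repeatedly applying the transition matrix (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n                     # start from the uniform distribution
    for _ in range(iters):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

if __name__ == "__main__":
    # a small irreducible chain on 3 states (made-up numbers)
    P = [[0.5, 0.4, 0.1],
         [0.2, 0.5, 0.3],
         [0.1, 0.3, 0.6]]
    pi = stationary_distribution(P)
    print("pi   =", [round(p, 4) for p in pi])
    print("pi P =", [round(sum(pi[i] * P[i][j] for i in range(3)), 4) for j in range(3)])
```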
The Idea
Draw an i.i.d. set of samples {x^(i)}_{i=1}^N
From a target density p(x) defined on a high-dimensional space X
(the set of possible configurations of a system).
Thus, we can use those samples to approximate the target density

p_N(x) = (1/N) Σ_{i=1}^N δ_{x^(i)}(x)

where δ_{x^(i)}(x) denotes the Dirac delta mass located at x^(i).
84 / 111
Thus
Given the Dirac delta

δ(t) = 0 for t ≠ 0, and ∞ for t = 0,  with  ∫_{t1}^{t2} δ(t) dt = 1  (for any t1 < 0 < t2)

Therefore

δ(t) = lim_{σ→0} (1 / (√(2π) σ)) e^{−t² / (2σ²)}
85 / 111
Consequently
We can approximate the integrals by sample averages

I_N(f) = (1/N) Σ_{i=1}^N f(x^(i))  →  I(f) = ∫_X f(x) p(x) dx  almost surely as N → ∞

Where the variance of f(x) is

σ_f² = E_{p(x)}[f²(x)] − I²(f) < ∞

Therefore the variance of the estimator I_N(f) is

var(I_N(f)) = σ_f² / N

(a sketch of this estimator and its standard error is given below)
86 / 111
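A minimal sketch of the estimator I_N(f) and its estimated standard error σ_f/√N, using p(x) = standard normal and f(x) = x² (so the true value is 1); both choices are illustrative, not from the slides:

```python
import math
import random

def mc_integral(f, sampler, N, seed=0):
    """I_N(f) = (1/N) * sum f(x_i) for x_i ~ p, plus the estimated standard error
    sqrt(sample variance / N)."""
    rng = random.Random(seed)
    values = [f(sampler(rng)) for _ in range(N)]
    mean = sum(values) / N
    var = sum((v - mean) ** 2 for v in values) / (N - 1)
    return mean, math.sqrt(var / N)

if __name__ == "__main__":
    # p(x): standard normal, f(x) = x^2, so I(f) = E[X^2] = 1
    est, stderr = mc_integral(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0), N=100_000)
    print(f"I_N(f) = {est:.4f} +/- {stderr:.4f}  (true value 1)")
```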
Then
By Robert & Casella (1999), Section 3.2

√N (I_N(f) − I(f))  ⇒  N(0, σ_f²)  as N → ∞
87 / 111
What if we do not wait for Gt?
TD methods need to wait only until the next time step
At time t + 1, they use the observed reward Rt+1 and the estimate
V(St+1)
The simplest TD method

V(St) ← V(St) + γ [Rt+1 + αV(St+1) − V(St)]

The update is made immediately on the transition to St+1, upon receiving Rt+1
(a sketch of this update is given below).
89 / 111
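A minimal sketch of tabular TD(0) with the update above, keeping the deck's notation (γ step size, α discount); the small random-walk environment is made up for illustration:

```python
import random

def td0(env_step, start_state, episodes=500, gamma_step=0.1, alpha_discount=0.9, seed=0):
    """Tabular TD(0): V(S_t) <- V(S_t) + gamma * [R_{t+1} + alpha * V(S_{t+1}) - V(S_t)],
    applied immediately on every transition."""
    rng = random.Random(seed)
    V = {}
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            s_next, r, done = env_step(s, rng)
            V.setdefault(s, 0.0)
            target = r + (0.0 if done else alpha_discount * V.get(s_next, 0.0))
            V[s] += gamma_step * (target - V[s])
            s = s_next
    return V

def random_walk(s, rng):
    """5-state random walk on 1..5; terminating left pays 0, right pays 1."""
    s_next = s + rng.choice([-1, 1])
    if s_next == 0:
        return s_next, 0.0, True
    if s_next == 6:
        return s_next, 1.0, True
    return s_next, 0.0, False

if __name__ == "__main__":
    print(td0(random_walk, start_state=3))
```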
This method is called TD(0), or one-step TD
Actually
There are also n-step TD methods
Differences with the Monte Carlo methods
In the Monte Carlo update, the target is Gt
In the TD update, the target is Rt+1 + αV(St+1)
90 / 111
TD as a Bootstrapping Method
It is quite similar to Dynamic Programming
Additionally, we know that

vπ(s) = Eπ[Gt | St = s]
      = Eπ[Rt+1 + αGt+1 | St = s]
      = Eπ[Rt+1 + αvπ(St+1) | St = s]

Monte Carlo methods use an estimate of the first expression, Eπ[Gt | St = s], as a target
Instead, Dynamic Programming methods use an estimate of the last one,
Eπ[Rt+1 + αvπ(St+1) | St = s].
91 / 111
Furthermore
The Monte Carlo target is an estimate
because the expected value of Gt is not known; a sample return is
used instead
The DP target is an estimate
because vπ(St+1) is not known and the current estimate, V(St+1), is
used instead.
Here TD goes beyond and combines both ideas
It combines the sampling of Monte Carlo with the bootstrapping of
DP.
92 / 111
As in many methods, there is an error
And it is used to improve the policy

δt = Rt+1 + αV(St+1) − V(St)

Because the TD error depends on the next state and next reward
It is not actually available until one time step later:
δt is the error in V(St), available at time t + 1.
93 / 111
Finally
If the array V does not change during the episode,
the Monte Carlo error can be written as a sum of TD errors:

Gt − V(St) = Rt+1 + αGt+1 − V(St) + αV(St+1) − αV(St+1)
           = δt + α (Gt+1 − V(St+1))
           = δt + αδt+1 + α² (Gt+2 − V(St+2))
           = ...
           = Σ_{k=t}^{T−1} α^{k−t} δk

(a numerical check of this identity is sketched below)
94 / 111
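A quick numerical check of this identity, under the stated assumption that V is held fixed during the episode and that V(terminal) = 0; the episode and the values in V are made up:

```python
def check_identity(rewards, states, V, alpha=0.9):
    """For a fixed V with V(terminal) = 0, check that
    G_t - V(S_t) == sum_{k=t}^{T-1} alpha^{k-t} * delta_k,
    where delta_k = R_{k+1} + alpha * V(S_{k+1}) - V(S_k)."""
    T = len(rewards)                               # states has length T+1, last one terminal
    for t in range(T):
        # left-hand side: the Monte Carlo error
        G = sum(alpha ** (k - t) * rewards[k] for k in range(t, T))
        lhs = G - V[states[t]]
        # right-hand side: discounted sum of TD errors
        rhs = sum(alpha ** (k - t) * (rewards[k] + alpha * V[states[k + 1]] - V[states[k]])
                  for k in range(t, T))
        print(f"t={t}: lhs={lhs:+.6f}  rhs={rhs:+.6f}")

if __name__ == "__main__":
    states = ["A", "B", "C", "terminal"]           # S_0 .. S_T
    rewards = [1.0, 0.0, 2.0]                      # R_1 .. R_T
    V = {"A": 0.3, "B": -0.2, "C": 0.5, "terminal": 0.0}
    check_identity(rewards, states, V)
```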
Optimality of TD(0)
Under batch updating
TD(0) converges deterministically to a single answer, independent of the
step size, as long as γ is chosen to be sufficiently small.
The constant-α Monte Carlo method, V(St) ← V(St) + γ [Gt − V(St)],
also converges deterministically under the same conditions, but to a different answer.
95 / 111
Q-learning: Off-Policy TD Control
One of the early breakthroughs in reinforcement learning
The development of an off-policy TD control algorithm, Q-learning
(Watkins, 1989)
It is defined as

Q(St, At) ← Q(St, At) + γ [Rt+1 + α max_a Q(St+1, a) − Q(St, At)]

Here
The learned action-value function, Q, directly approximates q∗, the
optimal action-value function,
independent of the policy being followed.
97 / 111
Furthermore
The policy still has an effect
in that it determines which state-action pairs are visited and updated.
Therefore
All that is required for correct convergence is that all pairs continue
to be updated.
Under this assumption and a variant of the usual stochastic
approximation conditions
Q has been shown to converge with probability 1 to q∗.
98 / 111
Q-learning (Off-Policy TD control) for estimating π ≈ π∗
Algorithm parameters
Step size γ ∈ (0, 1] and a small ε > 0
Initialize
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily, except that
Q(terminal, ·) = 0
99 / 111
Then
Loop for each episode
1 Initialize S
2 Loop for each step of the episode:
3 Choose A from S using a policy derived from Q (for example, ε-greedy)
4 Take action A, observe R, S′

Q(S, A) ← Q(S, A) + γ [R + α max_a Q(S′, a) − Q(S, A)]

5 S ← S′
6 Until S is terminal
(a runnable sketch of this loop is given below)
100 / 111
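A minimal runnable sketch of this loop for a tabular problem, keeping the deck's symbols (γ step size, α discount, ε-greedy exploration); the tiny chain environment is an illustrative assumption, not part of the slides:

```python
import random
from collections import defaultdict

def q_learning(env_step, actions, start_state, episodes=5000,
               gamma_step=0.1, alpha_discount=0.9, eps=0.2, seed=0):
    """Tabular Q-learning with an eps-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = defaultdict(float)                      # Q[(state, action)], zero-initialised

    def eps_greedy(s):
        if rng.random() < eps:
            return rng.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env_step(s, a, rng)
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += gamma_step * (r + alpha_discount * best_next - Q[(s, a)])
            s = s_next
    return Q

def chain_env(s, a, rng):
    """States 0..4; 0 and 4 are terminal. Action +1 moves right, -1 moves left;
    reaching state 4 pays +1, reaching state 0 pays 0."""
    s_next = s + a
    if s_next <= 0:
        return 0, 0.0, True
    if s_next >= 4:
        return 4, 1.0, True
    return s_next, 0.0, False

if __name__ == "__main__":
    Q = q_learning(chain_env, actions=[-1, +1], start_state=2)
    print({k: round(v, 3) for k, v in sorted(Q.items())})
```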
One of the earliest successes
Backgammon is a major game
with numerous tournaments and regular world championship matches.
It is a stochastic game (through the dice) in which the agent has a full
description of the world: the board state is fully observable.
102 / 111
What did they use?
TD-Gammon used a nonlinear form of TD(λ)
An estimated value v̂(s, w) is used to estimate the probability of
winning from state s.
Then
To achieve this, rewards were defined as zero for all time steps except
those on which the game is won.
103 / 111
Implementation of the value function in TD-Gammon
They decided to use a standard multilayer ANN (shown as a figure in the original slides)
104 / 111
The interesting part
Tesauro's representation
For each point on the backgammon board,
four units indicate the number of white pieces on the point.
Basically
If there are no white pieces, all four units take the value zero;
if there is one white piece, the first unit takes the value one;
and so on...
Therefore, with four units for white and another four for black at
each of the 24 points,
we have 4·24 + 4·24 = 192 units.
105 / 111
Then
We have that
Two additional units encoded the number of white and black pieces
on the bar (each took the value n/2, with n the number on the bar).
Two others represented the number of black and white pieces already
removed from the board (these took the value n/15).
Two units indicated, in a binary fashion, whether it was white's or black's turn,
giving 192 + 2 + 2 + 2 = 198 input units in total (an encoding sketch is given below).
106 / 111
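A minimal sketch of this input encoding. The slide only says "and so on" for the four per-point units, so the rule used for the fourth unit ((n − 3)/2 for n > 3, following Tesauro's usual description) is an assumption of this sketch:

```python
def point_units(n):
    """Four units for one colour on one point (the fourth-unit rule is an
    assumption beyond the slide's 'and so on'): units 1-3 are truncated unary,
    unit 4 encodes the excess beyond three pieces."""
    return [1.0 if n >= 1 else 0.0,
            1.0 if n >= 2 else 0.0,
            1.0 if n >= 3 else 0.0,
            (n - 3) / 2.0 if n > 3 else 0.0]

def encode_board(white_points, black_points, white_bar, black_bar,
                 white_off, black_off, white_to_move):
    """198 inputs: 4 units x 24 points x 2 colours = 192, plus bar counts (n/2),
    borne-off counts (n/15) and the two turn-indicator units."""
    assert len(white_points) == len(black_points) == 24
    units = []
    for w, b in zip(white_points, black_points):
        units += point_units(w) + point_units(b)
    units += [white_bar / 2.0, black_bar / 2.0]
    units += [white_off / 15.0, black_off / 15.0]
    units += [1.0, 0.0] if white_to_move else [0.0, 1.0]
    return units

if __name__ == "__main__":
    # opening position for white only, as a quick shape check
    white = [0] * 24
    white[0], white[11], white[16], white[18] = 2, 5, 3, 5
    x = encode_board(white, [0] * 24, 0, 0, 0, 0, True)
    print(len(x), "input units")   # 198
```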
TD(λ) by Back-Propagation
TD-Gammon uses a semi-gradient form of TD(λ)
TD(λ) has the following general update rule:

w_{t+1} = w_t + γ [Rt+1 + αv̂(St+1, w_t) − v̂(St, w_t)] z_t

where w_t is the vector of all modifiable parameters
and z_t is a vector of eligibility traces.
Eligibility traces unify and generalize TD and Monte Carlo methods.
They work as a kind of short-term memory:
The rough idea is that when a component of w_t participates in producing an
estimate, the corresponding component of z_t is bumped up and then begins to fade away.
108 / 111
How is this done?
Update Rule

z_t = αλ z_{t−1} + ∇_w v̂(St, w_t),  with  z_0 = 0

Something Notable
The gradient in this equation can be computed efficiently by the
back-propagation procedure.
For this, we set the error at the output of the ANN to

v̂(St+1, w_t) − v̂(St, w_t)

(a sketch of these two updates for a linear v̂ is given below)
109 / 111
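A minimal sketch of these two updates for a linear value function v̂(s, w) = w·x(s), where the gradient is simply the feature vector; TD-Gammon itself used a multilayer network with back-propagated gradients, so the linear choice here is only to keep the sketch short:

```python
def td_lambda_linear(episodes, n_features, gamma_step=0.05, alpha_discount=0.9, lam=0.7):
    """Semi-gradient TD(lambda) with a linear value function v_hat(s, w) = w . x(s).
    For a linear v_hat, the gradient with respect to w is just the feature vector x(s)."""
    w = [0.0] * n_features
    for episode in episodes:           # episode: list of (features, reward, features_next, done)
        z = [0.0] * n_features         # eligibility traces, reset at the start of each episode
        for x, r, x_next, done in episode:
            v = sum(wi * xi for wi, xi in zip(w, x))
            v_next = 0.0 if done else sum(wi * xi for wi, xi in zip(w, x_next))
            delta = r + alpha_discount * v_next - v
            # z_t = alpha * lambda * z_{t-1} + grad v_hat(S_t, w)   (grad = x for linear v_hat)
            z = [alpha_discount * lam * zi + xi for zi, xi in zip(z, x)]
            # w_{t+1} = w_t + gamma * delta_t * z_t
            w = [wi + gamma_step * delta * zi for wi, zi in zip(w, z)]
    return w

if __name__ == "__main__":
    # two-state toy problem with one-hot features; reaching the end pays 1
    A, B = [1.0, 0.0], [0.0, 1.0]
    episode = [(A, 0.0, B, False), (B, 1.0, B, True)]
    print(td_lambda_linear([episode] * 500, n_features=2))
```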
Conclusions
With new tools such as Reinforcement Learning,
the new Artificial Intelligence has a lot to do in the world.
Thank you for being in this class
As they say, we have a lot to do...
111 / 111
Images/cinvestav
Conclusions
With this new tools as Reinforcement Learning
The New Artificial Intelligence has a lot to do in the world
Thanks to be in this class
As they say we have a lot to do...
111 / 111

More Related Content

What's hot

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningJungyeol
 
Multi-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsMulti-Armed Bandit and Applications
Multi-Armed Bandit and Applications
Sangwoo Mo
 
ddpg seminar
ddpg seminarddpg seminar
ddpg seminar
민재 정
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)
Taehoon Kim
 
An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)
pauldix
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
Bill Liu
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
Jie-Han Chen
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
DongHyun Kwak
 
Policy gradient
Policy gradientPolicy gradient
Policy gradient
Jie-Han Chen
 
Reinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based PoliciesReinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based Policies
Sangwoo Mo
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
Jie-Han Chen
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
Ding Li
 
Lec3 dqn
Lec3 dqnLec3 dqn
Lec3 dqn
Ronald Teo
 
[1312.5602] Playing Atari with Deep Reinforcement Learning
[1312.5602] Playing Atari with Deep Reinforcement Learning[1312.5602] Playing Atari with Deep Reinforcement Learning
[1312.5602] Playing Atari with Deep Reinforcement Learning
Seung Jae Lee
 
강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction
Taehoon Kim
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
Dongmin Lee
 
파이썬으로 나만의 강화학습 환경 만들기
파이썬으로 나만의 강화학습 환경 만들기파이썬으로 나만의 강화학습 환경 만들기
파이썬으로 나만의 강화학습 환경 만들기
정주 김
 
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Chris Ohk
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
Dongmin Lee
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
Shahan Ali Memon
 

What's hot (20)

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Multi-Armed Bandit and Applications
Multi-Armed Bandit and ApplicationsMulti-Armed Bandit and Applications
Multi-Armed Bandit and Applications
 
ddpg seminar
ddpg seminarddpg seminar
ddpg seminar
 
Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)Continuous control with deep reinforcement learning (DDPG)
Continuous control with deep reinforcement learning (DDPG)
 
An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)An introduction to reinforcement learning (rl)
An introduction to reinforcement learning (rl)
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Policy gradient
Policy gradientPolicy gradient
Policy gradient
 
Reinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based PoliciesReinforcement Learning with Deep Energy-Based Policies
Reinforcement Learning with Deep Energy-Based Policies
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Lec3 dqn
Lec3 dqnLec3 dqn
Lec3 dqn
 
[1312.5602] Playing Atari with Deep Reinforcement Learning
[1312.5602] Playing Atari with Deep Reinforcement Learning[1312.5602] Playing Atari with Deep Reinforcement Learning
[1312.5602] Playing Atari with Deep Reinforcement Learning
 
강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
 
파이썬으로 나만의 강화학습 환경 만들기
파이썬으로 나만의 강화학습 환경 만들기파이썬으로 나만의 강화학습 환경 만들기
파이썬으로 나만의 강화학습 환경 만들기
 
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
Continuous Control with Deep Reinforcement Learning, lillicrap et al, 2015
 
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 

Similar to 25 introduction reinforcement_learning

Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Hung Le
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
Chandra Meena
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
VaishnavGhadge1
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
DongHyun Kwak
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
YasutoTamura1
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
DataScienceLab
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
MLconf
 
Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Shanghai deep learning meetup 4
Shanghai deep learning meetup 4
Xiaohu ZHU
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
Salem-Kabbani
 
Dp
DpDp
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Universitat Politècnica de Catalunya
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
IDEAS - Int'l Data Engineering and Science Association
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNING
pradiprahul
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
NAVER Engineering
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game Learning
Prabhu Kumar
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2
Flavian Vasile
 
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design
Dan Elton
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and Tricks
Ben Ball
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx
RithikRaj25
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Ashwin Rao
 

Similar to 25 introduction reinforcement_learning (20)

Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
 
Shanghai deep learning meetup 4
Shanghai deep learning meetup 4Shanghai deep learning meetup 4
Shanghai deep learning meetup 4
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Dp
DpDp
Dp
 
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intel...
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNING
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game Learning
 
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2
 
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and Tricks
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx
 
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
Stanford CME 241 - Reinforcement Learning for Stochastic Control Problems in ...
 

More from Andres Mendez-Vazquez

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
Andres Mendez-Vazquez
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
Andres Mendez-Vazquez
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
Andres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
Andres Mendez-Vazquez
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
Andres Mendez-Vazquez
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
Andres Mendez-Vazquez
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
Andres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
Andres Mendez-Vazquez
 
Zetta global
Zetta globalZetta global
Zetta global
Andres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
Andres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
Andres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
Andres Mendez-Vazquez
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
Andres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
Andres Mendez-Vazquez
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
Andres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
Andres Mendez-Vazquez
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
Andres Mendez-Vazquez
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
Andres Mendez-Vazquez
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
Andres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
Andres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
 

Recently uploaded

AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
Fundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptxFundamentals of Electric Drives and its applications.pptx
Fundamentals of Electric Drives and its applications.pptx
manasideore6
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
Basic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparelBasic Industrial Engineering terms for apparel
Basic Industrial Engineering terms for apparel
top1002
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Student information management system project report ii.pdf
Student information management system project report ii.pdfStudent information management system project report ii.pdf
Student information management system project report ii.pdf
Kamal Acharya
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS
  • 1. Introduction to Artificial Intelligence: Reinforcement Learning. Andres Mendez-Vazquez, April 26, 2019.
  • 2. Outline: 1 Reinforcement Learning, 2 Formalizing Reinforcement Learning, 3 Solution Methods, 4 The Neural Network Approach.
  • 4. Reinforcement Learning is learning what to do. What does Reinforcement Learning want? To maximize a numerical reward signal. The learner is not told which actions to take; it must discover which actions yield the most reward.
  • 6. These ideas come from Dynamical Systems Theory, specifically the optimal control of incompletely known Markov Decision Processes (MDPs). Thus, the device must interact with the environment and learn by using punishments and rewards.
  • 8. Outline: Example.
  • 9. Example: a mobile robot decides whether it should enter a new room in search of more trash to collect or go back to recharge its battery. It needs to make the decision based on the current charge level of its battery and how quickly it can get back to its base.
  • 11. Outline: A K-Armed Bandit Problem.
  • 12. A K-Armed Bandit Problem. Definition: you are faced repeatedly with a choice among $k$ different options, or actions; after each choice you receive a numerical reward drawn from a stationary probability distribution that depends on the chosen action. Definition (stationary probability): a stationary distribution of a Markov chain is a probability distribution that remains unchanged as the chain progresses in time.
  • 16. Therefore, each of the $k$ actions has an expected value $q_*(a) = E[R_t \mid A_t = a]$. Something notable: if you knew the value of each action, it would be trivial to solve the $k$-armed bandit problem.
  • 18. What do we want? To find an estimate $Q_t(a)$ of the value of action $a$ at time $t$ such that $|Q_t(a) - q_*(a)| < \epsilon$, with $\epsilon$ as small as possible.
  • 19. Outline: Exploration vs Exploitation.
  • 20. Greedy Actions. Intuition: choose the action with the greatest $E[R_t \mid A_t = a]$. This is known as exploitation: you are exploiting your current knowledge of the values of the actions. Contrast this with exploration: selecting non-greedy actions to obtain better knowledge of their values.
  • 23. An Interesting Conundrum. It has been observed that exploitation maximizes the expected reward on a single step, whereas exploration may produce the greater total reward in the long run. Thus the idea: the value of a state $s$ is the total amount of reward an agent can expect to accumulate over the future, starting from $s$. Values therefore indicate the long-term profit of states after taking into account the states that follow and their rewards.
  • 26. Outline: The Elements of Reinforcement Learning.
  • 27. Therefore, the parts of a Reinforcement Learning system. A policy: it defines the learning agent's way of behaving at a given time; the policy is the core of a reinforcement learning system, and in general policies may be stochastic. A reward signal: it defines the goal of the reinforcement learning problem; the system's sole objective is to maximize the total reward over the long run. The reward is immediate and encourages exploitation, but you might want something more long-term. A value function: it specifies what is good in the long run, which encourages exploration.
  • 36. Finally, a model of the environment: it simulates the behavior of the environment, allowing the agent to make inferences about how the environment will behave. Something notable: models are used for planning, meaning decisions are evaluated before they are executed. Thus, reinforcement learning methods that use models are called model-based.
  • 38. Outline: Limitations.
  • 39. Limitations. Reinforcement learning relies heavily on the concept of state, so it has the same problem as Markov Decision Processes: defining the state so as to obtain a good representation of the search space. This is where the creative part of the problem comes in; as we have seen, solutions in AI rely a lot on the creativity of the designer.
  • 42. Outline: Formalizing Reinforcement Learning, Introduction.
  • 43. The System–Environment Interaction in a Markov Decision Process (figure of the agent–environment loop).
  • 44. Outline: Markov Process and Markov Decision Process.
  • 45. Markov Process and Markov Decision Process. Definition of a Markov process: a sequence of states is Markov if and only if the probability of moving to the next state $s_{t+1}$ depends only on the present state $s_t$: $P[s_{t+1} \mid s_t] = P[s_{t+1} \mid s_1, \ldots, s_t]$. Something notable: in Reinforcement Learning we always talk about time-homogeneous Markov chains, where the transition probability is independent of $t$: $P[s_{t+1} = s' \mid s_t = s] = P[s_t = s' \mid s_{t-1} = s]$.
  • 47. Formally. Definition (Markov Process): a Markov process (or Markov chain) is a tuple $(S, P)$, where (1) $S$ is a finite set of states and (2) $P$ is a state transition probability matrix, $P_{ss'} = P[s_{t+1} = s' \mid s_t = s]$. Thus, the dynamics of the Markov process are $s_0 \xrightarrow{P_{s_0 s_1}} s_1 \xrightarrow{P_{s_1 s_2}} s_2 \xrightarrow{P_{s_2 s_3}} s_3 \rightarrow \cdots$
  • 49. Then, we get the following familiar definition. Definition: an MDP is a tuple $(S, A, P, R, \alpha)$, where (1) $S$ is a finite set of states, (2) $A$ is a finite set of actions, (3) $P$ is a state transition probability matrix with $P^a_{ss'} = P[s_{t+1} = s' \mid s_t = s, A_t = a]$, (4) $\alpha \in [0, 1]$ is called the discount factor, and (5) $R: S \times A \rightarrow \mathbb{R}$ is a reward function. Thus, the dynamics of the Markov Decision Process, using a probability (the policy) to pick each action, are $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \rightarrow \cdots$
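To make the tuple concrete, here is a minimal sketch (the two-state example, the dictionary layout, and all names are illustrative assumptions, not taken from the slides) of how a small finite MDP can be stored in plain Python.

```python
# Hypothetical toy MDP (S, A, P, R, alpha) used only for illustration.
states = ["s0", "s1"]
actions = ["stay", "move"]

# P[s][a] maps each successor s' to P^a_{ss'}; each row sums to 1.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0},            "move": {"s0": 0.7, "s1": 0.3}},
}

# R[s][a] is the expected immediate reward R^a_s.
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": 0.0},
}

alpha = 0.9  # discount factor
```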
  • 56. Outline: Return, Policy and Value function.
  • 57. The Goal of Reinforcement Learning: we want to maximize the expected value of the return, i.e., to obtain the optimal policy. Thus, as in our previous study of MDPs, $G_t = R_{t+1} + \alpha R_{t+2} + \cdots = \sum_{k=0}^{\infty} \alpha^k R_{t+k+1}$.
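As a quick numeric illustration of the return formula, the following sketch (the reward sequence is hypothetical) accumulates the discounted return for a finite episode.

```python
def discounted_return(rewards, alpha):
    """G_t = sum_k alpha^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (alpha ** k) * r
    return g

# Rewards R_{t+1}, ..., R_{t+4} = 1, 0, 0, 10 with alpha = 0.9:
print(discounted_return([1, 0, 0, 10], alpha=0.9))  # 1 + 0.9**3 * 10 = 8.29
```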
  • 59. Remarks. Interpretation: the discounted future rewards can be read as the present value of future rewards. The discount factor $\alpha$: a value close to 0 leads to a "short-sighted" evaluation, where you only look for immediate rewards; a value close to 1 leads to a "long-sighted" evaluation, where you are willing to wait for a better reward.
  • 61. Why use this discount factor? Basically: (1) the discount factor avoids infinite returns in cyclic Markov processes, thanks to the geometric series; (2) there is some uncertainty about future rewards; (3) if the reward is financial, immediate rewards may earn more interest than delayed rewards; (4) animal and human behavior shows a preference for immediate reward.
  • 65. A Policy. Using our MDP ideas, a policy $\pi$ is a distribution over actions given states: $\pi(a \mid s) = P[A_t = a \mid S_t = s]$. A policy guides the choice of action at a given state, and it is time independent. We then introduce two value functions: the state-value function and the action-value function.
  • 67. The state-value function $v_\pi$ of an MDP. It is defined as $v_\pi(s) = E_\pi[G_t \mid S_t = s]$ and gives the long-term value of the state $s$: $v_\pi(s) = E_\pi[G_t \mid S_t = s] = E_\pi\left[\sum_{k=0}^{\infty} \alpha^k R_{t+k+1} \mid S_t = s\right] = E_\pi[R_{t+1} + \alpha G_{t+1} \mid S_t = s] = E_\pi[R_{t+1} \mid S_t = s]$ (immediate reward) $+\, E_\pi[\alpha v_\pi(S_{t+1}) \mid S_t = s]$ (discounted value of the successor state).
  • 71. The Action-Value Function $q_\pi(s, a)$. It is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$. We can also obtain the following decomposition: $q_\pi(s, a) = E_\pi[R_{t+1} + \alpha q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$.
  • 73. Now, to simplify notation we define $R^a_s = E[R_{t+1} \mid S_t = s, A_t = a]$. Then we have $v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\, q_\pi(s, a)$ and $q_\pi(s, a) = R^a_s + \alpha \sum_{s' \in S} P^a_{ss'}\, v_\pi(s')$.
  • 75. Then, expressing $q_\pi(s, a)$ in terms of $v_\pi(s)$ inside the expression for $v_\pi(s)$ gives the Bellman equation: $v_\pi(s) = \sum_{a \in A} \pi(a \mid s)\left[R^a_s + \alpha \sum_{s' \in S} P^a_{ss'}\, v_\pi(s')\right]$. The Bellman equation relates the state-value function of one state with that of other states.
  • 77. Also, the Bellman equation for $q_\pi(s, a)$: $q_\pi(s, a) = R^a_s + \alpha \sum_{s' \in S} P^a_{ss'} \sum_{a' \in A} \pi(a' \mid s')\, q_\pi(s', a')$. How do we solve these equations? We have the following solution methods.
  • 79. Outline: Solution Methods, Multi-armed Bandits.
  • 80. Action-Value Methods. One natural way to estimate the action values is by averaging the rewards actually received: $Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i\, I_{A_i = a}}{\sum_{i=1}^{t-1} I_{A_i = a}}$. Nevertheless, if the denominator is zero we define $Q_t(a)$ to be some default value. As the denominator goes to infinity, by the law of large numbers, $Q_t(a) \rightarrow q_*(a)$.
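A minimal sketch of this sample-average estimate, written directly from the indicator formula above (the action and reward histories are hypothetical).

```python
def sample_average(actions_taken, rewards, a, default=0.0):
    """Q_t(a): average of the rewards received on the steps where action a was taken."""
    selected = [r for a_i, r in zip(actions_taken, rewards) if a_i == a]
    return sum(selected) / len(selected) if selected else default

# Hypothetical history of chosen actions and the rewards they produced.
history_a = [0, 1, 0, 0, 2]
history_r = [1.0, 0.5, 2.0, 0.0, 1.5]
print(sample_average(history_a, history_r, a=0))  # (1.0 + 2.0 + 0.0) / 3 = 1.0
```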
  • 83. Furthermore, we call this the sample-average method. It is not the best estimator, but it allows us to introduce a way of selecting actions. A classic one is the greedy selection $A_t = \arg\max_a Q_t(a)$. Therefore, greedy action selection always exploits current knowledge to maximize the immediate reward; it spends no time on apparently inferior actions to see if they might really be better.
  • 87. And here a heuristic. A simple alternative is to act greedily most of the time, but every once in a while, say with small probability $\epsilon$, select an action at random from among all the actions with equal probability. This is called the $\epsilon$-greedy method. In the limit, this method ensures $Q_t(a) \rightarrow q_*(a)$ for every action.
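A sketch of $\epsilon$-greedy action selection under these assumptions (the function name is hypothetical and ties in the arg max are broken by the lowest index).

```python
import random

def epsilon_greedy(Q, epsilon):
    """With probability epsilon pick a random arm, otherwise pick argmax_a Q[a]."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                # explore
    return max(range(len(Q)), key=lambda a: Q[a])      # exploit

# Current estimates for a hypothetical 4-armed bandit.
print(epsilon_greedy([0.2, 1.3, 0.7, 0.1], epsilon=0.1))
```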
  • 89. It is like the mean filter: we obtain a nice result (figure).
  • 90. Example. In the following example, the $q_*(a)$ were drawn from a Gaussian distribution with mean 0 and variance 1. Then, when an action is selected, the reward $R_t$ is drawn from a normal distribution with mean $q_*(A_t)$ and variance 1. We then have the following behavior (figure).
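A sketch of how such a testbed can be generated; the number of arms $k = 10$ and the helper names are assumptions made only for illustration.

```python
import random

k = 10  # number of arms (assumed)
q_star = [random.gauss(0.0, 1.0) for _ in range(k)]  # true values q_*(a) ~ N(0, 1)

def bandit(a):
    """Sample a reward R_t ~ N(q_*(a), 1) for the chosen arm a."""
    return random.gauss(q_star[a], 1.0)

print(bandit(3))  # one noisy reward from arm 3
```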
  • 92. Now, as the number of samples goes up, we get the following behavior (figure).
  • 93. Outline: Problems of Implementations.
  • 94. Imagine the following. Let $R_i$ denote the reward received after the $i$-th selection of this action, so that $Q_n = \frac{R_1 + R_2 + \cdots + R_{n-1}}{n - 1}$. The problem with this technique is the amount of memory and computational power needed to keep the estimates updated as more rewards arrive.
  • 96. Therefore, we need something more efficient. We can get it by rewriting the average incrementally: $Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\left[R_n + \sum_{i=1}^{n-1} R_i\right] = \frac{1}{n}\left[R_n + (n - 1) Q_n\right] = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$.
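The incremental form in code; a minimal sketch (names hypothetical) showing that the running update reproduces the batch average.

```python
def update_mean(q_n, r_n, n):
    """Q_{n+1} = Q_n + (1/n) * (R_n - Q_n): incremental sample average."""
    return q_n + (r_n - q_n) / n

rewards = [2.0, 4.0, 9.0]
q = 0.0
for n, r in enumerate(rewards, start=1):
    q = update_mean(q, r, n)
print(q, sum(rewards) / len(rewards))  # both print 5.0
```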
  • 100. Outline: General Formula in Reinforcement Learning.
  • 101. General Formula in Reinforcement Learning. The pattern appears almost all the time in RL: NewEstimate = OldEstimate + StepSize [Target - OldEstimate]. This looks a lot like a gradient-descent step if we reason as follows. Assume a first-order Taylor expansion $f(x) = f(a) + f'(a)(x - a)$, so that $f'(a) = \frac{f(x) - f(a)}{x - a}$. Then $x_{n+1} = x_n + \gamma f'(a) = x_n + \gamma \frac{f(x) - f(a)}{x - a} = x_n + \gamma' \left[f(x) - f(a)\right]$, where the factor $1/(x - a)$ has been absorbed into the step size.
  • 104. Outline: A Simple Bandit Algorithm.
  • 105. A Simple Bandit Algorithm. Initialize, for $a = 1$ to $k$: $Q(a) \leftarrow 0$, $N(a) \leftarrow 0$. Loop until some stopping criterion: (1) $A \leftarrow \arg\max_a Q(a)$ with probability $1 - \epsilon$, or a random action with probability $\epsilon$; (2) $R \leftarrow bandit(A)$; (3) $N(A) \leftarrow N(A) + 1$; (4) $Q(A) \leftarrow Q(A) + \frac{1}{N(A)}\left[R - Q(A)\right]$.
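Putting the pieces together, here is a runnable sketch of this simple bandit algorithm; the value of $\epsilon$, the number of steps, and the Gaussian testbed are assumptions made for the example.

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q_star = [rng.gauss(0.0, 1.0) for _ in range(k)]   # hidden true action values
    Q = [0.0] * k                                      # estimates Q(a)
    N = [0] * k                                        # counts N(a)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                       # random action with probability epsilon
        else:
            a = max(range(k), key=lambda i: Q[i])      # greedy action otherwise
        r = rng.gauss(q_star[a], 1.0)                  # R <- bandit(A)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                      # incremental update
        total += r
    return Q, total / steps

Q, avg_reward = run_bandit()
print(avg_reward)
```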
  • 107. Remarks. Remember: this works well for a stationary problem. For more on non-stationary problems, see "Reinforcement Learning: An Introduction" by Sutton and Barto, chapter 2.
  • 109. Outline: Dynamic Programming.
  • 110. Dynamic Programming. As with many things in Computer Science, we love divide et impera. Then, in Reinforcement Learning, the Bellman equation gives a recursive decomposition and the value function stores and reuses solutions.
  • 112. Outline: Policy Evaluation.
  • 113. Policy Evaluation. For this we assume, as always, that the problem can be solved with the Bellman equation. We start from an initial guess $v_1$, then apply the Bellman backup iteratively: $v_1 \rightarrow v_2 \rightarrow \cdots \rightarrow v_\pi$. For this we use synchronous updates: the value of every state at the present iteration is updated at the same time, based only on the values from the previous iteration.
  • 116. Then, at each iteration $k + 1$ and for all states $s \in S$, we update $v_{k+1}(s)$ from $v_k(s')$ according to the Bellman equation, where $s'$ is a successor state of $s$: $v_{k+1}(s) = \sum_{a \in A} \pi(a \mid s)\left[R^a_s + \alpha \sum_{s' \in S} P^a_{ss'}\, v_k(s')\right]$. Something notable: the iterative process stops when the maximum difference between the value function at the current step and that at the previous step is smaller than some small positive constant.
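A sketch of iterative policy evaluation over the dictionary-style MDP introduced earlier; the data layout, the stochastic policy format, and the threshold theta are assumptions.

```python
def policy_evaluation(states, P, R, policy, alpha=0.9, theta=1e-6):
    """Iterate v_{k+1}(s) = sum_a pi(a|s) [R^a_s + alpha * sum_s' P^a_{ss'} v_k(s')]."""
    V = {s: 0.0 for s in states}
    while True:
        delta, V_new = 0.0, {}
        for s in states:
            v = 0.0
            for a, pi_as in policy[s].items():   # policy[s] is {action: probability}
                exp_next = sum(p * V[s2] for s2, p in P[s][a].items())
                v += pi_as * (R[s][a] + alpha * exp_next)
            V_new[s] = v
            delta = max(delta, abs(v - V[s]))
        V = V_new                                # synchronous update of all states
        if delta < theta:
            return V

# Example with the toy MDP above and a uniform random policy:
# policy = {"s0": {"stay": 0.5, "move": 0.5}, "s1": {"stay": 0.5, "move": 0.5}}
# print(policy_evaluation(states, P, R, policy, alpha))
```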
  • 118. Outline: Policy Iteration.
  • 119. However, our ultimate goal is to find the optimal policy. For this we can use the policy iteration process. Algorithm: (1) initialize $\pi$ randomly; (2) repeat until the previous policy and the current policy are the same: (a) evaluate $v_\pi$ by policy evaluation; (b) using synchronous updates, for each state $s$ let $\pi'(s) = \arg\max_{a \in A} q(s, a) = \arg\max_{a \in A}\left[R^a_s + \alpha \sum_{s' \in S} P^a_{ss'}\, v_\pi(s')\right]$.
  • 121. Now, something notable: this policy iteration algorithm improves the policy at each step. Assume we have a deterministic policy $\pi$ and after one improvement step we get $\pi'$. We know the value improves in one step because $q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)$.
  • 124. Then, we can see that $v_\pi(s) \leq q_\pi(s, \pi'(s)) = E_{\pi'}[R_{t+1} + \alpha v_\pi(S_{t+1}) \mid S_t = s] \leq E_{\pi'}[R_{t+1} + \alpha q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s] \leq E_{\pi'}[R_{t+1} + \alpha R_{t+2} + \alpha^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s] \leq \cdots \leq E_{\pi'}[R_{t+1} + \alpha R_{t+2} + \cdots \mid S_t = s] = v_{\pi'}(s)$.
  • 128. Then, if improvements stop, $q_\pi(s, \pi'(s)) = \max_{a \in A} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$. This means the Bellman optimality equation has been satisfied: $v_\pi(s) = \max_{a \in A} q_\pi(s, a)$. Therefore $v_\pi(s) = v_*(s)$ for all $s \in S$, and $\pi$ is an optimal policy.
  • 130. Outline: Policy and Value Improvement.
  • 131. Once the policy has been improved, we can iterate to obtain a sequence of monotonically improving policies and value functions: $\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$, where $E$ denotes evaluation and $I$ denotes improvement.
  • 132. Policy Iteration (using iterative policy evaluation). Step 1, initialization: $V(s) \in \mathbb{R}$ and $\pi(s) \in A(s)$ arbitrarily for all $s \in S$. Step 2, policy evaluation: loop: $\Delta \leftarrow 0$; for each $s \in S$: $v \leftarrow V(s)$; $V(s) \leftarrow \sum_{s', r} p(s', r \mid s, \pi(s))\left[r + \alpha V(s')\right]$; $\Delta \leftarrow \max(\Delta, |v - V(s)|)$; until $\Delta < \theta$ (a small threshold controlling accuracy).
  • 134. Images/cinvestav Further
Step 3. Policy Improvement
1 policy-stable ← True
2 For each $s \in \mathcal{S}$:
3   old_action ← $\pi(s)$
4   $\pi(s) \leftarrow \arg\max_{a} \sum_{s', r} p\left(s', r \mid s, a\right)\left[r + \alpha V(s')\right]$
5   if old_action $\neq \pi(s)$, then policy-stable ← False
6 If policy-stable, then stop and return $V \approx v_*$ and $\pi \approx \pi_*$; else go to Step 2.
65 / 111
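To make the two steps above concrete, here is a minimal policy-iteration sketch in Python. The model representation P[s][a] = list of (probability, next_state, reward) tuples and the parameter names alpha and theta are assumptions of this sketch, not part of the slides.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, alpha=0.9, theta=1e-6):
    """Policy iteration for a finite MDP.
    P[s][a] is a list of (prob, next_state, reward) tuples (assumed format)."""
    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy

    def q(s, a):
        # One-step look-ahead: sum_{s',r} p(s',r|s,a) [r + alpha V(s')]
        return sum(p * (r + alpha * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Step 2: iterative policy evaluation
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                V[s] = q(s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # Step 3: policy improvement (greedy with respect to the current V)
        policy_stable = True
        for s in range(n_states):
            old_action = pi[s]
            pi[s] = max(range(n_actions), key=lambda a: q(s, a))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi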
  • 136. Images/cinvestav Jack's Car Rental Setup: Jack manages a car rental business with two locations $(s_1, s_2)$. Each time a car is available at a location, Jack rents it out and is credited $10 by the company. If Jack is out of cars at that location, the business is lost. Returned cars become available for renting the next day. Jack has options: he can move cars between the two locations overnight, at a cost of $2 per car moved, and at most 5 cars can be moved. 67 / 111
  • 141. Images/cinvestav We choose a probability distribution for modeling requests and returns: the number of rentals requested at each location follows a Poisson distribution,
$$P(n \mid \lambda) = \frac{\lambda^n}{n!} e^{-\lambda}$$
We simplify the problem with an upper bound of $n = 20$ cars per location; any additional car produced by returns simply disappears. 68 / 111
  • 143. Images/cinvestav Finally, we have the following $\lambda$'s for requests and returns:
1 Requests at $s_1$: $\lambda = 3$
2 Requests at $s_2$: $\lambda = 4$
3 Returns at $s_1$: $\lambda = 3$
4 Returns at $s_2$: $\lambda = 2$
Each $\lambda$ defines how many cars are requested or returned in a time interval (one day). 69 / 111
  • 148. Images/cinvestav Thus, running policy iteration on this problem with a discount factor $\alpha = 0.9$ yields a sequence of improving car-moving policies (shown as a figure in the original slides). 70 / 111
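As a rough illustration of how the rental example can be set up, the sketch below precomputes expected rentals with truncated Poisson distributions. The 20-car cap and the reward constants follow the problem statement above, but the helper names and the exact decomposition are assumptions of this sketch.

```python
from math import exp, factorial

MAX_CARS = 20            # cars above this bound are assumed to disappear
RENT_REWARD = 10.0
MOVE_COST = 2.0

def poisson(n, lam):
    """Truncated Poisson probability P(n | lambda)."""
    return lam**n / factorial(n) * exp(-lam)

def expected_rentals(n_cars, lam):
    """Expected number of cars actually rented at a location that starts
    the day with n_cars, when requests are Poisson(lam)."""
    return sum(poisson(k, lam) * min(k, n_cars) for k in range(MAX_CARS + 1))

def expected_reward(n1, n2, moved, lam_req1=3, lam_req2=4):
    """Expected immediate reward when `moved` cars go from s1 to s2 overnight."""
    n1, n2 = min(n1 - moved, MAX_CARS), min(n2 + moved, MAX_CARS)
    rentals = expected_rentals(n1, lam_req1) + expected_rentals(n2, lam_req2)
    return RENT_REWARD * rentals - MOVE_COST * abs(moved)
```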
  • 150. Images/cinvestav History: there is earlier original work by Arthur Samuel, which Sutton built on to invent Temporal-Difference learning, TD(λ). It was famously applied by Gerald Tesauro to build TD-Gammon, a program that learned to play backgammon at the level of expert human players. 72 / 111
  • 152. Images/cinvestav Temporal-Difference (TD) Learning. Something Notable: it is possibly the central and novel idea of reinforcement learning. It has similarities with Monte Carlo methods: both use experience to solve the prediction problem (for more, look at Sutton's book). Monte Carlo methods wait until the return $G_t$ is known:
$$V(S_t) \leftarrow V(S_t) + \gamma\left[G_t - V(S_t)\right], \qquad G_t = \sum_{k=0}^{\infty} \alpha^k R_{t+k+1}$$
Following Sutton, this is called constant-α Monte Carlo (note that these slides write the step size as $\gamma$ and the discount as $\alpha$). 73 / 111
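A minimal sketch of this constant-step-size Monte Carlo prediction, assuming episodes are given as lists of (state, reward) pairs and V is a defaultdict(float); the explicit names step_size and discount are used to avoid the slides' swapped symbols.

```python
def constant_step_mc(episodes, V, step_size=0.1, discount=0.9):
    """Every-visit constant-step-size Monte Carlo prediction.
    `episodes` is a list of trajectories [(S_t, R_{t+1}), ...] -- an assumed
    format for this sketch; V maps state -> value (e.g. defaultdict(float))."""
    for episode in episodes:
        G = 0.0
        # Work backwards so that G_t = R_{t+1} + discount * G_{t+1}
        for state, reward in reversed(episode):
            G = reward + discount * G
            V[state] = V[state] + step_size * (G - V[state])
    return V
```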
  • 156. Images/cinvestav Fundamental Theorem of Simulation. Cumulative Distribution Function: with every random variable $X$ we associate a function called the Cumulative Distribution Function (CDF), defined as
$$F_X(x) = P(X \leq x)$$
with properties: $F_X(x) \geq 0$ and $F_X(x)$ is a non-decreasing function of $x$. 75 / 111
  • 162. Images/cinvestav Fundamental Theorem of Simulation. Theorem: simulating $X \sim f(x)$ is equivalent to simulating
$$(X, U) \sim \mathcal{U}\left(\{(x, u) : 0 < u < f(x)\}\right)$$
79 / 111
  • 164. Images/cinvestav Geometrically, the pair $(X, U)$ is uniform on the region under the graph of $f$ (shown as a figure in the original slides). Proof: we have that
$$f(x) = \int_0^{f(x)} du$$
is the marginal pdf of $(X, U)$. 80 / 111
  • 166. Images/cinvestav Therefore, one thing made clear by this theorem is that we can generate $X$ in three ways:
1 First generate $X \sim f$, and then $U \mid X = x \sim \mathcal{U}\{u : u \leq f(x)\}$; but this is useless, because we already have $X$ and do not need $U$.
2 First generate $U$, and then $X \mid U = u \sim \mathcal{U}\{x : u \leq f(x)\}$.
3 Generate $(X, U)$ jointly, a smart idea: it allows us to generate the pair on a larger set where simulation is easier (see the accept-reject sketch below). Not as simple as it looks. 81 / 111
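Option 3 is essentially what accept-reject sampling does: draw (X, U) uniformly on an easier superset (here a bounding box) and keep only the pairs that fall under the graph of f. A minimal sketch, assuming f is supported on [a, b] and bounded above by M:

```python
import random

def sample_under_graph(f, a, b, M, n_samples):
    """Draw samples from a density f on [a, b] with f(x) <= M by sampling
    (X, U) uniformly on the box [a, b] x [0, M] and keeping the pairs that
    lie in the set {(x, u) : 0 < u < f(x)} (accept-reject)."""
    samples = []
    while len(samples) < n_samples:
        x = random.uniform(a, b)
        u = random.uniform(0.0, M)
        if u < f(x):                  # the pair falls under the graph of f
            samples.append(x)
    return samples

# Example: a triangular density on [0, 1], f(x) = 2x, bounded by M = 2
xs = sample_under_graph(lambda x: 2 * x, 0.0, 1.0, 2.0, 1000)
```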
  • 170. Images/cinvestav Markov Chains: the random process $X_t \in S$ for $t = 1, 2, \ldots, T$ has the Markov property iff
$$p\left(X_T \mid X_{T-1}, X_{T-2}, \ldots, X_1\right) = p\left(X_T \mid X_{T-1}\right)$$
A finite-state discrete-time Markov chain ($|S| < \infty$) can be completely specified by the transition matrix $P = [p_{ij}]$, with $p_{ij} = P[X_t = j \mid X_{t-1} = i]$. For irreducible chains, the stationary distribution $\pi$ is the long-term proportion of time that the chain spends in each state, computed by solving $\pi = \pi P$ (see the sketch below). 82 / 111
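A minimal sketch of computing the stationary distribution by iterating π ← πP until it stops changing; the tolerance and iteration cap are assumptions of this sketch.

```python
import numpy as np

def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Stationary distribution of an irreducible finite Markov chain,
    found by iterating pi <- pi P from the uniform distribution."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_pi = pi @ P
        if np.max(np.abs(new_pi - pi)) < tol:
            break
        pi = new_pi
    return pi

# Example: a two-state chain
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
print(stationary_distribution(P))   # approximately [0.833, 0.167]
```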
  • 175. Images/cinvestav The Idea: draw an i.i.d. set of samples $\{x^{(i)}\}_{i=1}^{N}$ from a target density $p(x)$ on a high-dimensional space $\mathcal{X}$ (the set of possible configurations of a system). Thus, we can use those samples to approximate the target density by
$$p_N(x) = \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{(i)}}(x)$$
where $\delta_{x^{(i)}}(x)$ denotes the Dirac delta mass located at $x^{(i)}$. 84 / 111
  • 178. Images/cinvestav Thus, recall the Dirac delta,
$$\delta(t) = \begin{cases} 0 & t \neq 0 \\ \infty & t = 0 \end{cases}, \qquad \int_{t_1}^{t_2} \delta(t)\, dt = 1 \text{ whenever } t_1 < 0 < t_2$$
Therefore it can be seen as a limit of Gaussians,
$$\delta(t) = \lim_{\sigma \to 0} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{t^2}{2\sigma^2}}$$
85 / 111
  • 180. Images/cinvestav Consequently, we can approximate integrals (expectations) by
$$I_N(f) = \frac{1}{N} \sum_{i=1}^{N} f\left(x^{(i)}\right) \xrightarrow[N \to \infty]{a.s.} I(f) = \int_{\mathcal{X}} f(x)\, p(x)\, dx$$
If the variance of $f(x)$,
$$\sigma_f^2 = \mathbb{E}_{p(x)}\left[f^2(x)\right] - I^2(f) < \infty,$$
then the variance of the estimator $I_N(f)$ is
$$\operatorname{var}\left(I_N(f)\right) = \frac{\sigma_f^2}{N}$$
86 / 111
  • 183. Images/cinvestav Then, by Robert & Casella (1999, Section 3.2), a central limit theorem holds:
$$\sqrt{N}\left(I_N(f) - I(f)\right) \underset{N \to \infty}{\Longrightarrow} \mathcal{N}\left(0, \sigma_f^2\right)$$
87 / 111
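A minimal Monte Carlo integration sketch illustrating the estimator I_N(f) and the σ_f/√N standard error suggested by the result above; the example integrand and sampler are assumptions of this sketch.

```python
import numpy as np

def mc_integral(f, sampler, N=100_000):
    """Monte Carlo estimate of I(f) = E_p[f(X)] together with the
    standard error sigma_f / sqrt(N)."""
    x = sampler(N)                        # i.i.d. draws x^(1), ..., x^(N) from p(x)
    fx = f(x)
    I_N = fx.mean()
    std_err = fx.std(ddof=1) / np.sqrt(N)
    return I_N, std_err

# Example: E[X^2] for X ~ N(0, 1) equals 1
rng = np.random.default_rng(0)
estimate, err = mc_integral(lambda x: x**2, lambda n: rng.standard_normal(n))
print(estimate, err)
```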
  • 186. Images/cinvestav What if we do not wait for $G_t$? TD methods need to wait only until the next time step: at time $t+1$, they use the observed reward $R_{t+1}$ and the current estimate $V(S_{t+1})$. The simplest TD method is
$$V(S_t) \leftarrow V(S_t) + \gamma\left[R_{t+1} + \alpha V(S_{t+1}) - V(S_t)\right]$$
applied immediately on the transition to $S_{t+1}$ and on receiving $R_{t+1}$. 89 / 111
  • 188. Images/cinvestav This method is called TD(0), or one-step TD. Actually, there are also $n$-step TD methods. Differences with the Monte Carlo methods: in the Monte Carlo update the target is $G_t$, while in the TD update the target is $R_{t+1} + \alpha V(S_{t+1})$ (a sketch of TD(0) prediction follows below). 90 / 111
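A minimal sketch of tabular TD(0) prediction. The Gym-like env.reset()/env.step() interface and the parameter names are assumptions of this sketch; recall that these slides write the step size as γ and the discount as α.

```python
def td0_prediction(env, policy, V, n_episodes, step_size=0.1, discount=0.9):
    """Tabular TD(0) prediction for a fixed policy.
    `env` is assumed to expose reset() -> s and step(a) -> (s', r, done);
    V maps state -> value (e.g. a defaultdict(float))."""
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r if done else r + discount * V[s_next]
            # V(S_t) <- V(S_t) + step_size [R_{t+1} + discount V(S_{t+1}) - V(S_t)]
            V[s] = V[s] + step_size * (target - V[s])
            s = s_next
    return V
```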
  • 190. Images/cinvestav TD as a Bootstrapping Method, quite similar to Dynamic Programming. Additionally, we know that
$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \alpha G_{t+1} \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \alpha v_\pi(S_{t+1}) \mid S_t = s\right]$$
Monte Carlo methods use an estimate of $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ as their target; Dynamic Programming methods instead use an estimate of $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \alpha v_\pi(S_{t+1}) \mid S_t = s]$. 91 / 111
  • 192. Images/cinvestav Furthermore, the Monte Carlo target is an estimate because the expected value above is not known; a sample return is used instead. The DP target is an estimate because $v_\pi(S_{t+1})$ is not known and the current estimate $V(S_{t+1})$ is used instead. Here TD goes beyond and combines both ideas: it combines the sampling of Monte Carlo with the bootstrapping of DP. 92 / 111
  • 195. Images/cinvestav As in many methods, there is an error, and it is used to drive the update:
$$\delta_t = R_{t+1} + \alpha V(S_{t+1}) - V(S_t)$$
Because the TD error depends on the next state and the next reward, it is not actually available until one time step later: $\delta_t$ is the error in $V(S_t)$, available at time $t + 1$. 93 / 111
  • 198. Images/cinvestav Finally, if the array $V$ does not change during the episode, the Monte Carlo error can be written as a sum of TD errors:
$$G_t - V(S_t) = R_{t+1} + \alpha G_{t+1} - V(S_t) + \alpha V(S_{t+1}) - \alpha V(S_{t+1}) = \delta_t + \alpha\left(G_{t+1} - V(S_{t+1})\right) = \delta_t + \alpha \delta_{t+1} + \alpha^2\left(G_{t+2} - V(S_{t+2})\right) = \cdots = \sum_{k=t}^{T-1} \alpha^{k-t} \delta_k$$
94 / 111
  • 199. Images/cinvestav Optimality of TD(0): under batch updating, TD(0) converges deterministically to a single answer, independent of the step size, as long as $\gamma$ is chosen to be sufficiently small. The same holds for constant-step-size Monte Carlo, $V(S_t) \leftarrow V(S_t) + \gamma\left[G_t - V(S_t)\right]$, although the two methods converge to different answers. 95 / 111
  • 201. Images/cinvestav Q-learning: Off-Policy TD Control. One of the early breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm, Q-learning (Watkins, 1989). It is defined as
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \gamma\left[R_{t+1} + \alpha \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]$$
Here, the learned action-value function $Q$ directly approximates $q_*$, the optimal action-value function, independent of the policy being followed. 97 / 111
  • 204. Images/cinvestav Furthermore, the policy still has an effect: it determines which state-action pairs are visited and updated. Therefore, all that is required for correct convergence is that all pairs continue to be updated. Under this assumption, and a variant of the usual stochastic approximation conditions, $Q$ has been shown to converge with probability 1 to $q_*$. 98 / 111
  • 207. Images/cinvestav Q-learning (Off-Policy TD control) for estimating $\pi \approx \pi_*$. Algorithm parameters: step size $\gamma \in (0, 1]$ and small $\epsilon > 0$. Initialize $Q(s, a)$ for all $s \in \mathcal{S}^+$, $a \in \mathcal{A}$, arbitrarily, except that $Q(\text{terminal}, \cdot) = 0$. 99 / 111
  • 209. Images/cinvestav Then, loop for each episode:
1 Initialize $S$
2 Loop for each step of the episode:
3   Choose $A$ from $S$ using a policy derived from $Q$ (for example, $\epsilon$-greedy)
4   Take action $A$, observe $R$, $S'$, and update
$$Q(S, A) \leftarrow Q(S, A) + \gamma\left[R + \alpha \max_a Q(S', a) - Q(S, A)\right]$$
5   $S \leftarrow S'$
6 Until $S$ is terminal
A minimal sketch of this loop follows below. 100 / 111
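A minimal tabular Q-learning sketch following the loop above. The Gym-like environment interface (reset(), step(a) returning (s', r, done), and a finite env.actions list) is an assumption of this sketch.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes, step_size=0.1, discount=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)                # Q[(s, a)], implicitly zero everywhere

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action derived from the current Q
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions)
            # Q(S,A) <- Q(S,A) + step_size [R + discount max_a Q(S',a) - Q(S,A)]
            Q[(s, a)] += step_size * (r + discount * best_next - Q[(s, a)])
            s = s_next
    return Q
```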
  • 211. Images/cinvestav One of the earliest successes Backgammon is a major game with numerous tournaments and regular world championship matches. It is a stochastic game with a full description of the world 102 / 111
  • 213. Images/cinvestav What did they use? TD-Gammon used a nonlinear form of TD(λ): an estimated value $\hat{v}(s, \mathbf{w})$ is used to estimate the probability of winning from state $s$. To achieve this, rewards were defined as zero for all time steps except those on which the game is won. 103 / 111
  • 215. Images/cinvestav Implementation of the value function in TD-Gammon They took the decision to use a standard multilayer ANN 104 / 111
  • 216. Images/cinvestav The interesting part: Tesauro's representation. For each point on the backgammon board, four units indicate the number of white pieces on the point: if there are no white pieces, all four units take the value zero; if there is one white piece, the first unit takes the value one; and so on. Therefore, with four units for white pieces and another four for black pieces at each of the 24 points, we have 4·24 + 4·24 = 192 units. 105 / 111
  • 219. Images/cinvestav Then, two additional units encode the number of white and black pieces on the bar (each took the value n/2, with n the number on the bar), two other units represent the number of black and white pieces already borne off the board (these took the value n/15), and two units indicate, in a binary fashion, whether it is white's or black's turn (an encoding sketch follows below). 106 / 111
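A rough sketch of the 198-unit input encoding described above. The simple thermometer coding of the four per-point units is an assumption of this sketch; Tesauro's actual fourth unit encodes (n − 3)/2 when a point holds more than three pieces.

```python
import numpy as np

def encode_board(white_on_point, black_on_point, white_bar, black_bar,
                 white_off, black_off, white_to_move):
    """Sketch of a 198-unit TD-Gammon-style input encoding.
    white_on_point / black_on_point are length-24 sequences of piece counts."""
    features = []
    for n in list(white_on_point) + list(black_on_point):
        # four units per colour per point (thermometer coding, simplified)
        features += [1.0 if n >= k else 0.0 for k in (1, 2, 3, 4)]   # 192 units
    features += [white_bar / 2.0, black_bar / 2.0]                   # 2 bar units
    features += [white_off / 15.0, black_off / 15.0]                 # 2 borne-off units
    features += [1.0, 0.0] if white_to_move else [0.0, 1.0]          # 2 turn units
    return np.array(features)                                        # length 198
```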
  • 221. Images/cinvestav TD(λ) by Back-Propagation: TD-Gammon uses a semi-gradient form of TD(λ), whose general update rule is
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \gamma\left[R_{t+1} + \alpha \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\right] \mathbf{z}_t$$
where $\mathbf{w}_t$ is the vector of all modifiable parameters and $\mathbf{z}_t$ is a vector of eligibility traces. Eligibility traces unify and generalize TD and Monte Carlo methods; they work like a long/short-term memory: the rough idea is that when a component of $\mathbf{w}_t$ contributes to an estimate, the corresponding component of $\mathbf{z}_t$ is bumped up and then begins to fade away. 108 / 111
  • 224. Images/cinvestav How is this done? Update rule for the trace:
$$\mathbf{z}_t = \alpha \lambda \mathbf{z}_{t-1} + \nabla \hat{v}(S_t, \mathbf{w}_t), \qquad \mathbf{z}_0 = \mathbf{0}$$
Something Notable: the gradient in this equation can be computed efficiently by the back-propagation procedure; for this, we set the error at the output of the ANN to $\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$ (see the sketch below for the linear case). 109 / 111
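A minimal sketch of semi-gradient TD(λ) with a linear value function, where the gradient of v̂(s, w) is simply the feature vector of s. The Gym-like environment interface and the parameter names are assumptions of this sketch; TD-Gammon itself used a multilayer ANN and obtained the gradient by back-propagation.

```python
import numpy as np

def semi_gradient_td_lambda(env, policy, features, n_weights, n_episodes,
                            step_size=0.01, discount=0.9, lam=0.7):
    """Semi-gradient TD(lambda) with v_hat(s, w) = w . features(s),
    so grad v_hat(s, w) = features(s)."""
    w = np.zeros(n_weights)
    for _ in range(n_episodes):
        s = env.reset()
        z = np.zeros(n_weights)                    # eligibility trace, z_0 = 0
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            x = features(s)
            v = w @ x
            v_next = 0.0 if done else w @ features(s_next)
            z = discount * lam * z + x             # z_t = discount*lambda*z_{t-1} + grad v_hat
            delta = r + discount * v_next - v      # TD error
            w = w + step_size * delta * z
            s = s_next
    return w
```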
  • 228. Images/cinvestav Conclusions: with new tools such as Reinforcement Learning, the new Artificial Intelligence has a lot to do in the world. Thanks for being in this class; as they say, we have a lot to do... 111 / 111