2. RL Objectives
• State (s) - The situation the agent is in right now.
Examples:
1. A position on a chess board
2. A potential customer on a sales website
• Action (a) - An action that the agent can take while in a state.
Examples:
1. A knight captures a bishop
2. The user buys a ticket
• Reward (r) - The reward obtained as a result of the action
Examples:
1. A better or worse position
2. More money or more clicks
3. Basic Concepts
• Policy (π) - The “strategy” by which the agent decides which action to
take. Abstractly speaking, the policy is simply a probability distribution
over actions, defined for each state
• Episode – A sequence of states and their actions
The Objective:
Predicting the expected future reward given the current state (s):
1. Which actions should we take in order to maximize our gain
2. Which actions should we take in order to maximize the click rate
Value function (Vπ(s)) – The expected return of the episode, given a policy
π and a state s
=>Bellman Equation
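Written out for a Markov decision process with transition probabilities P and reward r (the standard form, stated here only for reference), the Bellman equation for Vπ is:

```latex
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V^{\pi}(s') \,\bigr]
```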
5. Optimal Function
We wish to find the optimal policy:
V*(s) = max_π Vπ(s)
Our aim:
1. Finding the optimal policy
2. Finding the value of this policy for each state.
6. Dynamic Programing
V(s_t) = E[r_{t+1} + γV(s_{t+1}) | s_t = s]
• Learning is based on perfect model of the environment
(assuming Markov decision model )
• We update Vπ(s_t) using the value of Vπ(s_{t+1}), where both values are
estimates: bootstrapping
• Exploration mostly follows the ε-greedy method
• Policy improvement
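The DP update above can be sketched as iterative policy evaluation on a toy two-state MDP (the MDP and policy below are hypothetical, chosen only to illustrate the sweep):

```python
# Iterative policy evaluation (DP): repeatedly apply the Bellman update
# V(s) <- E[r + gamma * V(s')] under a fixed policy, using a known model.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
gamma = 0.9
policy = {0: "go", 1: "stay"}          # deterministic policy to evaluate

V = {s: 0.0 for s in P}
for _ in range(1000):                  # sweep until (near) convergence
    for s in P:
        a = policy[s]
        V[s] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
# Here V[1] converges to 2/(1-0.9) = 20 and V[0] to 1 + 0.9*20 = 19
```

Note how each sweep overwrites V(s) using the current estimate of V(s'): this is exactly the bootstrapping mentioned above.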
7. Estimating 𝑅𝑡 using Monte-Carlo
We estimate the value function by averaging the
obtained returns. These methods are used in episodic
tasks.
We must complete the episode before updating the value
function, so learning is not “step by step” but episode by
episode.
8. Monte Carlo (Cont)
• Algorithm:
1. Choose a policy
2. Run an episode under the policy.
At each step compute the return R_t obtained from state s_t and add it to the list Returns(s_t)
3. V(s) = average(Returns(s))
The basic formula
V(s_t) = V(s_t) + α[R_t − V(s_t)]
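A minimal sketch of this averaging scheme in Python (the episode data is invented for illustration):

```python
from collections import defaultdict

# First-visit Monte Carlo value estimation.
# An episode is a list of (state, reward) pairs; the reward follows the state.
gamma = 1.0
returns = defaultdict(list)          # Returns(s) from the slide
V = {}

def mc_update(episode):
    # Compute the return G_t backwards through the episode.
    G, Gs = 0.0, [0.0] * len(episode)
    for t in range(len(episode) - 1, -1, -1):
        G = episode[t][1] + gamma * G
        Gs[t] = G
    # First visit: record the return from each state's first occurrence only.
    seen = set()
    for t, (s, _) in enumerate(episode):
        if s not in seen:
            seen.add(s)
            returns[s].append(Gs[t])
            V[s] = sum(returns[s]) / len(returns[s])

mc_update([("A", 0.0), ("B", 1.0)])  # after this episode V["A"] = 1.0
mc_update([("A", 0.0), ("B", 3.0)])  # after this episode V["A"] = 2.0
```

The update happens only once the whole episode is available, matching the "episode by episode" learning described above.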
9. Summary
1. Dynamic programming - bootstrapping: estimators are
learned from other estimators.
Requires assumptions on the model
2. Monte Carlo - learning from experience, episode by
episode. The update target is R_t
10. TD Methods
• The motivation is to combine sampling with bootstrapping (DP with MC). The simplest formula is
TD(0):
V(s_t) = V(s_t) + α[r_t + γV(s_{t+1}) − V(s_t)]
We do both bootstrapping (using the estimated values of V) and sampling (using r_t)
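The TD(0) update is one line of code; the values and transition below are hypothetical, chosen so the arithmetic is easy to follow:

```python
# TD(0): bootstrap from the estimated V(s_{t+1}) while sampling the reward r_t.
alpha, gamma = 0.5, 0.9
V = {"A": 0.0, "B": 10.0}

def td0(s, r, s_next):
    # V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

td0("A", 1.0, "B")   # V(A) <- 0 + 0.5 * (1 + 0.9*10 - 0) = 5.0
```

Unlike the Monte Carlo update, this can run online, after every single step.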
Advantages of TD over DP
• No model of the environment (reward probabilities) is needed
Advantages of TD over MC
• They can work online
• They are independent of the length of the episode
• They don't need to ignore or discount episodes containing exploratory actions, since they learn from every
visit
• Convergence is guaranteed
11. The Q Function
• We discussed value functions V which are defined on
the episode states
• In some applications we wish to learn these functions
on the state-action space (s, a).
• The analysis for DP & MC is similar (you can check)
12. SARSA
• We learn each state from the sequence:
S → A → R → S → A :
Q(s_t, a_t) = Q(s_t, a_t) + α[r_t + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
We use the action at time t+1 to learn the action at time t
• We learn Qπ online and use the visits it produces to update.
The policy moves toward greediness.
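A sketch of one SARSA step (the states, actions, and Q values are invented): the key point is that the target uses the action a_{t+1} actually chosen by the ε-greedy policy.

```python
import random

# SARSA: on-policy TD control.
alpha, gamma, eps = 0.5, 0.9, 0.1
Q = {("s0", "a"): 0.0, ("s0", "b"): 0.0, ("s1", "a"): 4.0, ("s1", "b"): 0.0}

def eps_greedy(s, actions, rng):
    # With probability eps explore, otherwise pick the greedy action.
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(s, a, r, s_next, a_next):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

rng = random.Random(0)                      # seeded for reproducibility
a_next = eps_greedy("s1", ["a", "b"], rng)  # the policy's actual next action
sarsa("s0", "a", 1.0, "s1", a_next)
```

Because the next action is drawn from the ε-greedy policy itself, SARSA evaluates the policy it is actually following.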
13. Off-Policy QLearning
• The basic formula:
Q(s_t, a_t) = Q(s_t, a_t) + α[r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
• The idea is that we update the previous state not from the action the
policy actually takes next (it does not follow the policy) but from the best available action
• Action selection itself works in the same way as before, namely it follows
ε-greedy.
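The only change from the SARSA sketch is the target: Q-learning maximizes over the next state's actions instead of using the action the policy takes (values below are again hypothetical):

```python
# Q-learning: off-policy TD control.
alpha, gamma = 0.5, 0.9
Q = {("s1", "a"): 4.0, ("s1", "b"): 7.0, ("s0", "a"): 0.0}

def q_update(s, a, r, s_next, actions):
    # Target uses max_a Q(s', a), regardless of the behaviour policy.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

q_update("s0", "a", 1.0, "s1", ["a", "b"])
# Q(s0,a) <- 0 + 0.5 * (1 + 0.9*7 - 0) = 3.65
```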
14. Game Theory
• What is a game?
1. Two or more players
2. Strategies
3. Rewards
18. Nash Equilibrium
• Let f be the payoff function and x_1, x_2, …, x_n the players' strategies. Then x_i* is an
equilibrium if for every player i and every strategy x_i:
f_i(x_i*, x_-i*) ≥ f_i(x_i, x_-i*)
Existence Theorem
Every game with a finite number of players and finite strategy sets has a Nash equilibrium in
mixed strategies.
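For pure strategies in a small bimatrix game the definition can be checked mechanically; the sketch below uses the standard Prisoner's Dilemma payoffs as an illustration:

```python
from itertools import product

# Finding pure-strategy Nash equilibria in a 2-player bimatrix game.
# payoffs[(x1, x2)] = (payoff to player 1, payoff to player 2)
payoffs = {
    ("C", "C"): (-1, -1), ("C", "D"): (-3, 0),
    ("D", "C"): (0, -3),  ("D", "D"): (-2, -2),
}
strategies = ["C", "D"]

def is_nash(x1, x2):
    # No player can gain by deviating unilaterally.
    u1, u2 = payoffs[(x1, x2)]
    best1 = all(payoffs[(y, x2)][0] <= u1 for y in strategies)
    best2 = all(payoffs[(x1, y)][1] <= u2 for y in strategies)
    return best1 and best2

equilibria = [p for p in product(strategies, strategies) if is_nash(*p)]
# Only (D, D) survives: mutual defection is the unique pure equilibrium.
```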
21. Impartial Games
• Games in which, in each state, both players can take the same actions
and receive the same rewards
Chess, Go?
No! In those games each player has his own pieces
What is the difference between the players in an impartial game?
• Nothing! Simply one player moves first and the other does not!
• Both players have complete information
22. Impartial Games (Cont.)
Basic Concepts
Normal Games - The last player to move wins
Misère Games - The last player to move loses
N- and P-Positions
N-position - The set of states from which the first player wins
P-position - The set of states from which the second player wins
A word on terminology: We assume that each state is a starting state, thus the first
player is the Next player and the second player is the Previous player
23. Sprague Grundy Function
• Let G be the graph of the game.
1. Every N-position has at least one follower that is a P-position
2. All followers of a P-position are N-positions
The Sum of Games
If G1 = (V1, E1), G2 = (V2, E2) are games, then their sum is
G = (V1 × V2, E), where for v1, w1 ∈ V1 and v2, w2 ∈ V2:
E = {(v1v2, w1v2) | (v1, w1) ∈ E1} ∪ {(v1v2, v1w2) | (v2, w2) ∈ E2}
24. Sprague Grundy Function
What can we say about the sum?
P + P → P
N + P → N
N + N → ?
We need a sharper tool to fully understand the game!
25. The mex Function
• Let N the integers and S s.t. S⊊ N
mex(S)= min(N/ S)
• Let the graph: G={𝑉,E} we define a function
g: V-> N
g (v) = mex{g(w) |(v,w) ∊E}
• g (v) =0 ⇔ v is P-position
• For 𝐺1, 𝐺2 𝑠𝑢b − games and G their sum we have
g (v)=g(𝑣1) ⊕ g(𝑣2)
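The mex recursion can be implemented directly on a game DAG; the small graph below is hypothetical, with terminal position "t":

```python
from functools import lru_cache

# Sprague-Grundy values via mex on a game DAG.
# followers[v] lists the positions reachable from v; a terminal position
# has no followers, so g(v) = mex({}) = 0 and it is a P-position.
followers = {
    "t": [],            # terminal
    "a": ["t"],
    "b": ["t", "a"],
    "c": ["a", "b"],
}

def mex(s):
    # Minimum excluded non-negative integer.
    m = 0
    while m in s:
        m += 1
    return m

@lru_cache(maxsize=None)
def g(v):
    return mex({g(w) for w in followers[v]})

# g("c") = mex{g("a"), g("b")} = mex{1, 2} = 0, so "c" is a P-position.
```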
26. Impartial Games (Cont.)
Sprague Grundy Theorem
Every normal impartial game is isomorphic to Nim !!
(Hence Nim is not only a game but a class of games)
27. What is Nim?
• A two-player game, originally from China.
• It has been played in Europe since the 16th century
• In 1901 Bouton gave the game its name and...
a nice mathematical study.
28. How do we play?
• Two players
• N heaps of items, where heap i contains k_i items
• On each turn a player must pick some items:
1. He must pick at least one item!
2. He may pick from only a single heap!
• Winning Strategy – simply XOR the heap sizes (proof by induction)
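The XOR strategy above is a few lines of Python: a position is a P-position iff the nim-sum of the heaps is 0, and from any N-position there is a move restoring nim-sum 0.

```python
from functools import reduce
from operator import xor

def nim_sum(heaps):
    # XOR of all heap sizes.
    return reduce(xor, heaps, 0)

def winning_move(heaps):
    """Return (heap_index, new_size) restoring nim-sum 0, or None."""
    x = nim_sum(heaps)
    if x == 0:
        return None          # P-position: no winning move exists
    for i, k in enumerate(heaps):
        if (k ^ x) < k:      # heap i can legally be reduced to k XOR x
            return i, k ^ x

# From (3,4,5) the nim-sum is 2, and the winning move is 3 -> 1,
# reaching the P-position (1,4,5).
```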
29. Examples
• A single heap with M items.
Who is going to win?
• Two heaps, one of K items and one of 1.
• Two heaps, one of M items and one of K (M > K)
30. The Three Heaps Example
• We have three heaps of 3, 4, 5 items
Let’s Play
(1,4,5)
(1,2,5)
(1,2,3)
(1,2,2)
(0,2,2)
…
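The nim-sums along this line of play can be checked directly: after each move of the first (winning) player the XOR of the heaps is 0, while the second player is always forced to break it.

```python
from functools import reduce
from operator import xor

# Positions after each move, starting from the first player's reply to (3,4,5).
trace = [(1, 4, 5), (1, 2, 5), (1, 2, 3), (1, 2, 2), (0, 2, 2)]
sums = [reduce(xor, pos, 0) for pos in trace]
# Even indices (first player's moves) give nim-sum 0; odd indices do not.
```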
31. Nim & QLearning
• There is a natural correspondence between games and reinforcement learning:
1. A collection of states S
2. For each s ∈ S there are potential actions that a player can take
3. Every step provides a reward
32. QLearning for Nim-Design & implementation
We build a Nim engine for three heaps (following Erik Järleberg)
State
• A set of actions that are available
• A Boolean variable that indicates whether this is the goal state (true for (0,0,0))
• A function that computes the XOR of the heaps
• A function that performs an action and thus moves to another state
Actions
• A number that indicates the heap in which we perform the action
• A number that indicates the amount of items
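The state and action design above can be sketched as follows (a minimal illustration; the class and method names are mine, not the actual engine's):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    heap: int      # index of the heap we take from
    amount: int    # number of items removed

@dataclass(frozen=True)
class State:
    heaps: tuple   # e.g. (3, 4, 5)

    def actions(self):
        # All legal moves: take 1..k items from any non-empty heap.
        return [Action(i, n) for i, k in enumerate(self.heaps)
                for n in range(1, k + 1)]

    def is_goal(self):
        return all(k == 0 for k in self.heaps)   # true for (0, 0, 0)

    def xor(self):
        x = 0
        for k in self.heaps:
            x ^= k
        return x

    def apply(self, a):
        # Perform an action and jump to the resulting state.
        h = list(self.heaps)
        h[a.heap] -= a.amount
        return State(tuple(h))
```

Frozen dataclasses make states hashable, which is convenient as keys of a Q table.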
33. QLearning for Nim-Design & implementation
Agent
Agent Types
1. Random Agent- Simply gets the state and randomly picks an action
2. Optimal Agent –The one that plays the optimal strategy
3. Qlearning Agent – An agent that learns to play through Q functions
Agent Implementation
1. A function that given a state recommends an action for the agent
2. A function that provides a feedback
34. QLearning for Nim-Design & implementation
• What is our objective?
1. We wish to learn as fast as possible
2. We wish to have an accurate Q function
Remarks
1. The learning process is sensitive to the opponent
2. We may obtain an optimal policy without having a good Q function
35. QLearning for Nim
• Loss function:
We learn a policy π, and for each game state s we have:
Uπ(s) = Σ_{i=0}^∞ γ^i R(S_i)
• Originally each s has its true utility U*, and we estimate the learning by
minimizing:
|Uπ − U*| (i.e. min L1).
36. How to Estimate?
1. L1 – Simply minimize |Uπ − U*|
2. For every state there is only one “right action”; we can thus measure
the number of failures rather than the utility function. What we do
is count the wrong actions for each s ∈ N. (Due to Remark 2)
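The second estimate can be sketched as follows: count, over a set of N-positions, how often a policy fails to move to a P-position (nim-sum 0). The `naive` policy below is a hypothetical stand-in for a learned agent.

```python
from functools import reduce
from operator import xor

def nim_sum(heaps):
    return reduce(xor, heaps, 0)

def wrong_actions(policy, states):
    """policy(s) -> (heap_index, new_size); states: iterable of N-positions."""
    wrong = 0
    for s in states:
        i, new = policy(s)
        t = list(s)
        t[i] = new
        if nim_sum(t) != 0:       # failed to reach a P-position
            wrong += 1
    return wrong

# A naive policy that always takes one item from the first non-empty heap.
naive = lambda s: next((i, k - 1) for i, k in enumerate(s) if k > 0)

errors = wrong_actions(naive, [(3, 4, 5), (1, 1, 1), (2, 2, 1)])
# naive happens to be right only on (1,1,1), so errors == 2
```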
43. Remarks
1. The most important factor is the opponent
2. The Q function maps positive values to states in N,
i.e.: learning to play is equivalent to classifying states into N & P
Corollary: We can apply QLearning to any impartial game
3. An interesting next step would be to use actor-critic and compare
policies through KL-divergence.
44. Good Sources
• https://github.com/jamesbroadhead/Axelrod/blob/master/axelrod/strategies/qlearner.py
• https://github.com/NimQ/NimRL
• http://www.csc.kth.se/utbildning/kth/kurser/ – RL for Nim, Erik Järleberg
• https://papers.nips.cc/paper/2171-reinforcement-learning-to-play-an-optimal-nash-equilibrium-in-team-markov-
games.pdf
• https://www.researchgate.net/profile/J_Enrique_Agudo/publication/221174538_Reinforcement_Learning_for_the_N-
Persons_Iterated_Prisoners'_Dilemma/links/5535ec0c0cf268fd0015f0ac/Reinforcement-Learning-for-the-N-Persons-
Iterated-Prisoners-Dilemma.pdf
• Gabriel Nivasch –Ariel University
• Sutton & Barto